copy-scat: deconvoluting single-cell chromatin accessibility of genetic subclones in cancer

ana nikolic, divya singhal, katrina ellestad, michael johnston, aaron gillmor, sorana morrissy, jennifer a chan, paola neri, nizar bahlis, marco gallo*

arnie charbonneau cancer institute, alberta children’s hospital research institute, department of biochemistry and molecular biology, department of oncology, cumming school of medicine, university of calgary, calgary, ab, canada

*corresponding author: marco gallo, marco.gallo@ucalgary.ca

abstract

single-cell epigenomic assays have tremendous potential to illuminate mechanisms of transcriptional control in functionally diverse cancer cell populations. however, application of these techniques to clinical tumor specimens has been hampered by the current inability to distinguish malignant from non-malignant cells, which potently confounds data analysis and interpretation. here we describe copy-scat, an r package that uses single-cell epigenomic data to infer copy number variants (cnvs) that define cancer cells. copy-scat enables studies of subclonal chromatin dynamics in complex tumors like glioblastoma. by deploying copy-scat, we uncovered potent influences of genetics on chromatin accessibility profiles in individual subclones. consequently, some genetic subclones were predisposed to acquire stem-like or more differentiated molecular phenotypes, reminiscent of developmental paradigms. copy-scat is ideal for studies of the relationships between genetics and epigenetics in malignancies with high levels of intratumoral heterogeneity and to investigate how cancer cells interface with their microenvironment.

introduction

single-cell genomic technologies have made enormous contributions to the deconvolution of complex cellular systems, including cancer ( ).
single-cell rna sequencing (scrna-seq) in particular has been widely employed to understand the implications of intratumoral transcriptional heterogeneity for tumor growth, response to therapy and patient prognosis ( – ). this field has hugely benefited from an emerging ecosystem of computational tools that have enabled complex analyses of scrna data. since copy number variants (cnvs) mostly accrue in malignant cells and are rare in non-malignant tissues, computational platforms that use scrna data to call cnvs have resulted in improved understanding of the behavior of genetic subclones in tumors ( – ). on the other hand, the application of single-cell epigenomic techniques, including the assay for transposase accessible chromatin (scatac) ( , ), to study cancer has been slowed by computational bottlenecks. for instance, unlike for scrna-seq, no dedicated tool currently exists to call cnvs using scatac data. this technical gap has prevented scatac studies of clinical tumor specimens, which often are surgical resections that include both malignant and non-malignant cells. inability to deconvolute these cell populations after the generation of scatac datasets would confound downstream analyses and interpretation of this data type.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted february , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

in this report, we describe copy-scat (copy number inference using scatac-seq data), a new computational tool that uses scatac datasets to call cnvs at the single-cell level.
using scatac datasets from adult glioblastoma (agbm), pediatric gbm (pgbm) and multiple myeloma (mm), we demonstrate the effectiveness of copy-scat in calling (a) focal amplifications, (b) segmental gains and losses and (c) chromosome arm-level gains and losses. at the most basic level, copy-scat can therefore discriminate between malignant and non-malignant cells in scatac datasets based on the presence or absence, respectively, of cnvs. this distinction is fundamental to ensure that downstream analyses include only the appropriate tumor or microenvironment cell populations. at a more sophisticated level, we show that implementation of copy-scat allows investigations of the relationship between genetic and epigenetic principles governing the behavior of individual subclones. in this regard, we show that each genetic subclone has characteristic accessible chromatin profiles, indicating that genetics imparts information that determines key epigenetic features. strong influence of genetics on chromatin states is demonstrated by the predisposition of genetic subclones to have stem-like or more differentiated molecular profiles in gbm.

results

design and implementation of copy-scat

we designed copy-scat, an r package that uses scatac-seq information to infer copy number alterations. copy-scat uses fragment files generated by cellranger-atac (10x genomics) as input to generate chromatin accessibility pileups, keeping only barcodes with a minimum number of fragments (defaulting to , fragments). it then generates a pileup of total coverage (number of reads × read lengths) over bins of determined length ( million bp as default) (fig. a). binned read counts then undergo linear normalization over the total signal in each cell to account for differences in read depth, and chromosomal bins which consist predominantly of zeros (at least % zero values) are discarded from further analysis.
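the pileup, per-cell normalization and sparse-bin filtering described above can be sketched in a few lines. this is a minimal python illustration, not code from the copy-scat r package: the function names, the tuple layout of a fragment, the rule of assigning each fragment to the bin containing its start coordinate, and all threshold arguments (`bin_size`, `min_fragments`, `max_zero_fraction`, `scale`) are assumptions made for the example, and no default values from the package are implied.

```python
from collections import Counter, defaultdict


def pileup(fragments, bin_size, min_fragments):
    """Sum fragment lengths per (chromosome, bin) for each retained barcode.

    `fragments` is an iterable of (chrom, start, end, barcode) tuples.
    Each fragment is assigned to the bin containing its start coordinate,
    which is a simplification chosen for this sketch.
    """
    fragments = list(fragments)
    per_barcode = Counter(barcode for _, _, _, barcode in fragments)
    coverage = defaultdict(lambda: defaultdict(float))
    for chrom, start, end, barcode in fragments:
        if per_barcode[barcode] >= min_fragments:
            coverage[barcode][(chrom, start // bin_size)] += end - start
    return {barcode: dict(bins) for barcode, bins in coverage.items()}


def normalize_per_cell(coverage, scale=1e6):
    """Divide each cell's bins by that cell's total signal, times `scale`."""
    normalized = {}
    for barcode, bins in coverage.items():
        total = sum(bins.values())
        if total > 0:  # skip barcodes that carry no signal at all
            normalized[barcode] = {k: v / total * scale for k, v in bins.items()}
    return normalized


def drop_sparse_bins(coverage, max_zero_fraction):
    """Remove bins that are empty in more than `max_zero_fraction` of cells."""
    n_cells = len(coverage)
    nonzero = Counter(k for bins in coverage.values() for k in bins)
    keep = {k for k, n in nonzero.items() if 1 - n / n_cells <= max_zero_fraction}
    return {
        barcode: {k: v for k, v in bins.items() if k in keep}
        for barcode, bins in coverage.items()
    }
```

a sparse dictionary per barcode is used here only to keep the example dependency-free; a dense barcode-by-bin matrix would be the more natural structure for real datasets.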
all parameters, including reference genome, bin size, and minimum length cut-off, are user-customizable. copy-scat then implements different algorithms to detect focal amplifications and larger-scale copy number variation.

to call focal amplifications (fig. b), copy-scat generates a linear scaled profile of density over normalized mbp bins along each chromosome on a single-cell basis, centering on the median and scaling using the range. copy-scat then uses changepoint analysis ( ) (see methods) to identify segments of abnormally high signal (z score > ) along each chromosome in each single cell. these calls are then pooled together to generate consensus regions of amplification, in order to identify putative double minutes and extrachromosomal amplifications. each cell is scored as positive or negative for each amplified genomic region. segmental losses are called in a similar fashion, by calculating a quantile for each bin on a chromosome, running changepoint analysis to identify regions with abnormally low average signal, and then using gaussian decomposition of total signal in that region to identify distinct clusters of cells.

for larger copy number alterations, copy-scat pools the bins further at the chromosome arm level using a trimmed mean, while normalizing the data on the basis of length of cpg islands contained in each bin (fig. c). data is then scaled for each chromosome arm, compared to a pseudodiploid control (expected signal distribution for a diploid genotype) that is modeled for each sample, and cluster assignments are generated using gaussian decomposition. cluster assignments are then normalized to get an estimate of copy number for each cell (fig. d). these assignments can be optionally combined with clustering information to generate consensus genotypes for each cluster of cells and further filter false positives (fig. e). for full details regarding the execution of copy-scat, see methods. a step-by-step tutorial for copy-scat is available on github (see methods).

fig. . copy-scat workflow. (a) copy-scat accepts barcode fragment matrices generated by cellranger (10x genomics) as input. (b) large peaks in normalized coverage matrices can be used to infer focal cnvs. (c) normalized matrices can be used to infer segmental and chromosome-arm level cnvs. (d) example of chromosome-arm level cnv (chromosome p loss) called by copy-scat. (e) consensus clustering is used to finalize cell assignment.

copy-scat effectively calls cnvs in diverse malignancies

we have tested the ability of copy-scat to use scatac data to call cnvs with three different approaches and with different tumor types. first, we benchmarked copy-scat against cnv calls made with whole-genome sequencing (wgs) data for adult gbm (agbm) surgical resections (n = samples, , cells). this approach consisted of isolating nuclei from flash-frozen agbm samples, mixing nuclei in suspension, and then using these nuclei for either scatac or wgs library construction (fig. a). this was meant to ensure similar representation of genetic subclones, which are usually regionally contiguous in this solid tumor, in both scatac and wgs libraries.
second, we benchmarked copy-scat against cnv calls made using pediatric gbm (pgbm) surgical resections (n = patient-matched diagnostic-relapse samples, , cells). in this case, scatac and wgs libraries were generated from separate geographical regions of the same tumor (fig. b). third, we benchmarked copy-scat against cnv calls made with the single-cell cnv (sccnv) assay (10x genomics) using multiple myeloma (mm) clinical samples (n = samples, , cells). overall, we observed that copy-scat correctly inferred all or most of the cnvs that were called with wgs (figs. a,b; figs. s , s ) or sccnv data (fig. c; fig. s ). in total, we profiled , cells from malignancies from patients, and were able to infer cnv status for a total of , cells (table s ). on average, we were able to call cnvs for . % of cells in each sample (range: . – . %) (table s ). for chromosome-arm level cnv gains, sensitivity ranged from . for mm to . for agbm and specificity ranged from . to . (table s ). for chromosome arm-level losses, sensitivity ranged from . to . and specificity from . to . . the sensitivity and specificity of focal amplifications were very high (> . , table s ). the variation observed may reflect technical differences between the strategies used for benchmarking. as expected, the calls of copy-scat for agbm were the most accurate, likely because scatac and wgs datasets were generated from relatively homogeneous starting material, as described above. because of its design, it is also possible that copy-scat is more sensitive than wgs at inferring cnvs that occur in relatively rare subclones, potentially explaining (in addition to genuine false positives) why the number of cnvs inferred by our new tool is sometimes higher than the number inferred with wgs.
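for readers who want to reproduce this kind of benchmark on their own calls, sensitivity and specificity can be computed as in the following minimal python sketch. the function name and the choice of keying each event by a hypothetical (sample, chromosome arm) pair are assumptions for illustration; the sketch does not reproduce the values in the supplementary tables.

```python
def sensitivity_specificity(called, truth):
    """Score boolean CNV calls against a reference call set.

    `called` and `truth` map an event key, e.g. a hypothetical
    (sample, chromosome_arm) tuple, to True when the event is present.
    Only keys found in both mappings are scored.
    """
    tp = fp = tn = fn = 0
    for key in called.keys() & truth.keys():
        if truth[key] and called[key]:
            tp += 1
        elif truth[key]:
            fn += 1
        elif called[key]:
            fp += 1
        else:
            tn += 1
    # Sensitivity: fraction of reference events recovered.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    # Specificity: fraction of reference-negative events left uncalled.
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

gains and losses would be scored separately by passing one pair of mappings per event type.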
scatac data can be used to distinguish malignant from non-malignant cells

tumor cells often harbor cnvs, and we reasoned that copy-scat should make it possible to infer cnvs from scatac data and therefore to distinguish between malignant and non-malignant cells. to test this hypothesis, we overlaid cnvs called by copy-scat onto scatac datasets displayed in uniform manifold approximation and projection (umap) plots. this exercise led to the identification of cells that were clearly positive for multiple cnvs and others that appeared to have a normal genome. as an illustrative example, we found that the agbm sample cgy was composed of discrete cell populations that harbored focal amplifications at the mdm (fig. d), pdgfra (fig. e) and egfr (fig. f) loci, as well as chromosome p deletion (fig. g) and chromosome gain (fig. h,i). copy-scat results suggest specific lineage relationships between subclones. for instance, chromosome amplifications are clonal in this sample (fig. h,i), whereas the chromosome deletion is subclonal (fig. g). in addition, our computational tool predicts that pdgfra (fig. e) and egfr (fig. f) focal amplifications are mutually exclusive, a phenomenon that has been reported in agbm ( ). altogether, these results illustrate one specific population of cells (shaded green in fig. i) whose members harbor several cnvs and are therefore putative cancer cells. at the same time, we also identified cells (labeled in dark blue in fig. i) that did not appear to have any cnvs and are therefore likely to be cells from the tumor microenvironment. equivalent results were obtained for pgbm (fig. s ) and mm samples (fig. s ). since the cnv-negative cells appear as multiple scatac clusters, it is possible that our strategy detects multiple distinct non-neoplastic cell clusters. differential motif analysis with chromvar confirmed high scores for neural progenitor cell-associated motifs like nfic and ascl in cnv+ cells (fig.
j,k), while the putative non-neoplastic clusters showed increased occupancy at transcription factor motifs associated with hematopoietic lineages, such as ikzf (fig. l). another cnv-negative cluster showed enrichment of foxg binding motifs in accessible chromatin, in keeping with a non-neoplastic neural cell identity (fig. ). using this approach, it was possible to discriminate between malignant cells and cells from the tumor microenvironment in all tumor samples analyzed (extended figs. s -s ). copy-scat therefore effectively uses scatac data to infer cnvs, which can then be used to distinguish malignant from non-malignant cells and to infer lineage relationships between genetic subclones that coexist in a tumor.

fig. . benchmarking of copy-scat with three methods involving clinical samples from three distinct malignancies. (a) banked frozen agbm samples were used for both scatac and wgs. nuclei were isolated from the samples, mixed, and used for both scatac and wgs. number of chromosome-arm level gains detected in adult gbm samples identified using both methods, versus total numbers of gains detected by scatac or wgs.
(b) surgical pgbm resections were split, and one section was used for scatac and the other for wgs. number of chromosome-arm level gains detected in pgbm samples identified using both methods, versus total numbers of gains detected by scatac or wgs. (c) multiple myeloma samples were profiled by both scatac and the single-cell cnv assay. number of chromosome-arm level gains detected in mm samples identified using both methods, versus total numbers of gains detected by scatac or sccnv assay. (d) mdm amplification in an adult gbm sample (cgy ). amplified cells are coloured dark blue, and normal cells in pale blue. (e) pdgfra amplification in an adult gbm sample (cgy ). amplified cells are coloured dark blue, and normal cells in pale blue. (f) egfr amplification in an adult gbm sample (cgy ). amplified cells are coloured dark blue, and normal cells in pale blue. (g) chromosome p loss in an adult gbm sample. (h) chromosome p gain in an adult gbm sample. (i) chromosome q gain in an adult gbm sample. (j) chromvar activity score for ascl . (k) chromvar activity score for nfic. (l) chromvar activity score for ikzf . (m) chromvar activity score for foxg .

subclonal genetics shapes chromatin accessibility profiles in agbm

we noticed that in most tumors we analyzed, cells harboring a given cnv had a tendency to cluster together (fig. d-i). individual clusters were in fact defined by the presence of specific cnvs (fig. a-c). this was an unexpected observation, because it is widely assumed that clustering of scatac data reflects the global patterns of chromatin accessibility. one possible explanation for this observation could be that chromosomal regions affected by a cnv display imbalances in the fragment depth distribution of scatac datasets, and that these patterns have a dominant effect on cluster assignment.
most scatac-seq workflows rely on some variant of term-frequency inverse document frequency (tf-idf) normalization rather than feature scaling, and this may amplify the effects of cnv-driven dna content imbalances. for instance, it is possible that focal amplifications of the pdgfra locus result in increased frequency of transposition events that are mapped to this site. a dominant effect of chromatin accessibility at this amplified locus could result in pdgfra-amplified cells clustering together in umap representations of scatac data (fig. d,e). indeed, we found that, compared to a random selection of peaks, the chromosomes which carried cnvs had significantly different numbers of peaks ranked as highly variant than chromosomes that did not have cnvs, leading to a markedly uneven distribution of top peaks (p < . e- ; chi-squared test; fig. s a). this was not seen in non-neoplastic cells, which had relatively even top fragment distribution patterns (p = . , chi-squared test; fig. s b). to test whether cnv-driven imbalances dominate cluster assignment, we used copy-scat to call cnvs in our tumor samples, then removed all peaks mapping to chromosomes predicted to harbor cnvs, and finally re-clustered all cells in each sample (fig. f). we found that although removing chromosomes with cnvs from our analyses changed the overall cluster structure of a sample (fig. g), pdgfra-amplified cells still clustered close to each other (fig. h). in fact, our results indicate that clustering after cnv removal is more granular but overall very stable (fig. i). in this case, pdgfra-amplified cells localized to a single cluster before removing chromosomes affected by cnvs. following removal of cnv+ chromosomes and re-clustering, most pdgfra-amplified cells still clustered together, with only a few cells merging into a cluster that included both amplified and non-amplified cells. comparing the most variable peaks after chromosome cnv removal showed a distribution closer to normal, supporting the marked effect of the cnvs on the identification of variant peaks (p = . e- ; fig. s c). contrary to current views of cancer epigenomics, these data indicate that genetic subclones may have characteristic patterns of chromatin accessibility, and that a cell’s genetic background has significant influence on its likelihood of attaining specific epigenetic states.

fig. . subclonal genetics influences clustering of scatac-seq data. (a-c) cnvs in adult gbm cgy segregate within specific scatac clusters. (d, e) pdgfra-amplified cells cluster together in adult gbm cgy . (f) diagram summarizing our strategy to remove cnvs from clustering of scatac data. all chromosomes or regions with putative cnvs were removed from downstream analyses, and cells were re-clustered. (g) reclustering of (d) following removal of chromosomes and regions affected by cnvs in cgy . (h) distribution of pdgfra-amplified cells following re-clustering. (i) cluster assignments of cells in cgy (agbm specimen) before and after removal of cnv-containing regions (purple: pdgfra-amplified cells).
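the comparison of top-peak distributions across chromosomes reported in this section relies on a chi-squared test. the text does not specify the exact contingency layout, so the following python sketch shows one plausible form only: a goodness-of-fit statistic that asks whether the most variable peaks are spread across chromosomes in proportion to the total number of peaks on each chromosome. the function name and input layout are assumptions for illustration.

```python
def chi_squared_statistic(top_counts, all_counts):
    """Goodness-of-fit statistic for top peaks versus all peaks.

    `top_counts` maps chromosome -> number of highly variable peaks;
    `all_counts` maps chromosome -> total number of peaks. Under the null,
    top peaks fall on each chromosome in proportion to its total peak count.
    Returns (statistic, degrees_of_freedom).
    """
    chroms = [c for c, n in all_counts.items() if n > 0]
    n_top = sum(top_counts.get(c, 0) for c in chroms)
    n_all = sum(all_counts[c] for c in chroms)
    statistic = 0.0
    for chrom in chroms:
        expected = n_top * all_counts[chrom] / n_all
        observed = top_counts.get(chrom, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(chroms) - 1
```

the p value would then be obtained from the chi-squared distribution with the returned degrees of freedom, using a statistics library of the reader's choice.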
genetic events predispose subclones to the acquisition of developmental chromatin states

we further explored the notion that cnvs may shape chromatin accessibility profiles and its possible implications for cell fate determination. as an illustrative example, we focused on an agbm sample (cgy ) where cnvs at chromosome p characterized three genetic subclones, as determined with copy-scat: (i) a subclone with two copies of chromosome p; (ii) a subclone with loss of p; (iii) a subclone with gain of p (fig. a). we were interested in determining whether the major genetic subclones in this tumor had similar cycling properties. we found that, unlike with scrna-seq data, it is not possible to use scatac profiles at cell cycle genes to determine whether a cell is proliferating. we reasoned that cells that are actively going through cell division have to replicate their dna. given that cancer cells have numerous cnvs on autosomes, which could lead to noisy data, we decided to use copy-scat to identify cells that have doubled the number of their x chromosomes and defined them as actively cycling cells. to validate this approach, we determined the number of cells with double the number of expected x chromosomes (i.e., putative cycling cells) in previously published scatac datasets for mouse brain and peripheral blood mononuclear cells (pbmcs). we hypothesized that we should be able to identify cycling cells in fetal mouse brain, but not in pbmcs. indeed, we detected numerous cycling cells (with twice the expected number of x chromosomes) in brain tissue but not in pbmcs (fig. s ). this method detected putative cycling cells in our datasets (fig. b). we used scatac data to arrange cells from this tumor along pseudotime with the package stream ( ) (fig. c) and then superimposed cell cycle status determined with our x chromosome doubling method (fig. d). the results show that cells along branch , which is strongly enriched for cells with chromosome p gains, are also the most proliferative (fig.
e), with over % of the cells actively going through replication (p = . × - ; chi-squared test). on the other hand, ~ % of cells along branch and ~ % of cells along branch were cycling. these data therefore indicate functional differences between cells with gain or loss of chromosome p. we then used chromvar ( ) and stream-atac to calculate scores for transcription factor (tf) binding motifs that are associated with neurodevelopmental processes. this analysis revealed that motifs bound by tfs that are associated with stem-like phenotypes, including olig and hoxa , are enriched in accessible chromatin regions in cells that have one copy of chromosome p (fig. f). motifs bound by tfs associated with progenitor (fig. g) and differentiated states (fig. h) were enriched in the branch with more cells showing gain of chromosome p. this was associated with a significant shift in the overall distribution of enrichment of these motifs in cells along the different branches of the trajectory (fig. i-k). a distribution of genetic subclones along developmental chromatin accessibility states was observed in other tumor samples we studied (fig. s -s ). overall, the data support the notion that tumor cells sample a discrete number of chromatin states, but their transition probabilities differ based on genotype. consequently, chromatin states associated with each genetic subclone manifest as different functional properties, here demonstrated at the level of cell proliferation and stemness profiles.

fig. . subclonal genetic alterations predispose cells to adopt developmental chromatin states.
(a) cells were clustered based on scatac chromvar motif scores, then shaded based on the presence of , or copies of chromosome p. (b) cells were shaded based on their predicted cycling properties. (c) data shown in (a) projected onto pseudotime. the resulting three branches are populated preferentially by cells with gain or loss of chromosome p respectively. (d) proliferation status as shown in (b), overlaid onto pseudotime. (e) branches enriched for p gain show greater proportions of proliferative cells (statistics: chi-squared test). (f) scaled chromatin accessibility at binding motifs for olig and hoxa , two tfs associated with stemness. (g) scaled chromatin accessibility at binding motifs for rfx and nfix, two tfs associated with progenitor-like phenotypes. (h) scaled chromatin accessibility at binding motifs for rara::rxra and stat , two tfs associated with differentiated phenotypes. (i) enrichment plot for motif z scores for olig and hoxa . (j) enrichment plot for motif z scores for rfx and nfix. (k) enrichment plot for motif z scores for rara::rxra and stat . p values calculated by kruskal-wallis test.

discussion

here we describe copy-scat, the first computational tool dedicated to inferring cnvs using scatac data. copy-scat resolves a computational bottleneck that has restricted the application of single-cell epigenomic techniques to the study of clinical tumor samples, which are often mixtures of malignant and non-malignant cells. the presence of non-malignant cells can severely confound the analyses of these samples and downstream data interpretation. cell admixture is a particular problem for scatac data because of the inherent sparsity of these datasets and because they do not provide direct information on the expression status of cell lineage markers that could be used to resolve cellular identities. because most tumor types harbor cnvs, copy-scat provides a simple way of solving this problem. it is important to note that copy-scat enables users to perform analyses on both malignant and non-malignant cells from a tumor sample, because cell barcodes associated with either presence or absence of cnvs can be selected for downstream analyses. implementation of copy-scat will therefore be beneficial to groups interested in defining the epigenomes of both tumor cells and their microenvironment. because chromatin accessibility datasets provide information on mechanisms of transcriptional regulation by distal and proximal enhancer and super enhancer elements, copy-scat could be useful in clarifying epigenetic mechanisms involved in immune suppression and t cell exhaustion, for instance. copy-scat also allows scatac studies of frozen banked cancer specimens (see methods), because it requires no prior knowledge of cell composition. we show that the underlying cnv architecture plays a significant role in clustering of scatac data, a problem that is amplified by the use of tf-idf algorithms for normalization. these effects are less pronounced when clustering is based on motif activity scores (e.g.
chromvar), likely because this approach incorporates data from multiple chromosomes, thus dampening the effect of variation at any one specific locus. further studies are needed to identify the optimal way to address the effects of cnvs in downstream analyses, as they may present a significant confounder and potentially mask significant biological relationships. in this report, we provide evidence that copy-scat can be used to shed new light on how genetics and epigenetics interface in cancer. we show that genetic subclones tend to have unique chromatin accessibility landscapes that can promote or antagonize stem-like phenotypes. consequently, we report that some genetic subclones have greater proportions of stem-like cells, and others appear more differentiated. these results offer a radically different view of functional hierarchies in gbm, where stem-like properties were thought to be programmed by epigenetic factors, independently of genotype. these findings provide a simple explanation for the observed intra-tumoral transcriptional heterogeneity in gbm ( , ), by suggesting that each genetic subclone achieves specific chromatin accessibility profiles, which in turn result in subclone-specific transcriptional outcomes. copy-scat will enable future studies of subclonal chromatin dynamics in complex tumor types and may be an important tool in better understanding the functional relationships between subclones, their microenvironment and therapy response.

materials and methods

ethics and consent statement

all samples were collected and used for research with appropriate informed consent and with approval by the health research ethics board of alberta.

scatac-seq sample processing

gbm samples were either frozen surgical resections (pediatric gbm) or cells dissociated from fresh surgical specimens and cryopreserved (adult gbm). samples were dissociated in a . ml microcentrifuge tube, using a wide-bore p pipette followed by a narrow bore p pipette in nuclear resuspension buffer ( mm tris-hcl; mm nacl; mm mgcl ; . % igepal, . % tween- , . % digitonin, % bsa in pbs), then vortexed briefly, chilled on ice for minutes, then pipetted again, and spun at °c, g for minutes. this step was repeated, and the sample was then resuspended in tween wash buffer ( mm tris-hcl; mm nacl; mm mgcl ; . % igepal, . % tween- ; % bsa in pbs), then strained through a μm cell strainer facs tube (fisher scientific - - ) to remove debris. nuclei were then quantified by trypan blue on the countess ii (invitrogen), spun down at g at °c for minutes, resuspended in the nuclear isolation buffer (10x genomics), and the rest of the scatac was performed as per the 10x genomics protocol. mm samples were from bone marrow aspirates collected from patients; tumor cells were isolated from mononuclear cell fractions through ficoll gradients coupled with magnetic bead sorting of cd + cells. scatac libraries were prepared from gbm and mm samples using a chromium controller (10x genomics). libraries were sequenced on nextseq or novaseq instruments (illumina) at the centre for health genomics and informatics (chgi; university of calgary) using the recommended settings.

scatac-seq initial data analysis

the raw sequencing data was demultiplexed using cellranger-atac mkfastq (cell ranger atac, version . . , 10x genomics). single cell atac-seq reads were aligned to the hg reference genome (grch , version . . , 10x genomics) and quantified using the cellranger-atac count function with default parameters (cell ranger atac, version . . , 10x genomics).
single-cell cnv analysis

fragment pileup and normalization

the fragment file was processed and signal was binned into bins of a preset size (default mb) across the hg chromosomes to generate a genome-wide read-depth map. only barcodes with a minimum of reads were retained, in order to remove spurious barcodes. this flattened barcode-fragment matrix pileup was cleaned by removing genomic intervals that were uninformative (greater than % zeros) and barcodes with more than a set number of zero intervals. cells passing this first filter were normalized with counts-per-million normalization using cpm in the edger package ( ).

chromosome arm cnv analysis

the normalized barcode-fragment matrix was collapsed to the chromosome arm level, using chromosome arm information from the ucsc (ucsc table: cytoband). centromeres were removed, and signal in each bin was normalized by the number of basepairs in cpg islands in the interval, using the ucsc cpg islands table (ucsc table: cpgislandextunmasked). the signal was then summarized using a quantile-trimmed mean (between the th and th quantiles). only chromosome arms with a minimum trimmed mean signal were kept for analysis. the chromosome arm signal matrix was then mixed with a set proportion of generated pseudodiploid control cells, defined using the mean of chromosome segment medians with a defined standard deviation. this cell-signal matrix was scaled across each chromosome arm and centered on the median signal of all chromosomes. each chromosome arm segment was then analyzed using gaussian decomposition with mclust ( ). the resulting clusters were filtered based on z scores and mixing proportions, and redundant clusters were combined. these z scores were then translated into estimated copy numbers for each segment for each
barcode. the barcode cnv assignments can optionally be used to assign consensus cnvs to clusters generated in other software packages such as loupe or seurat/signac.

detection of amplifications

the normalized barcode-fragment matrix was scaled, and mean-variance changepoint analysis using the changepoint package was performed for each cell and each chromosome to identify areas of abnormally high signal (z score greater than ) ( ). the consensus coordinates of each amplification region were generated across all cells, and only abnormalities affecting a minimum number of cells were kept for analysis.

detection of loss of heterozygosity

the normalized barcode-fragment matrix was scaled as above. as overall coverage levels in these samples are quite sparse, a chromosome-wide coverage profile was generated for the entire sample in bulk, using the % quantile as a cut-off, and then changepoint analysis was used to find inflection points. this was followed by gaussian decomposition of the values using mclust to identify putative areas of loss or gain, thresholded by a minimum difference in signal between the clusters identified by mclust.

scatac trajectory analysis

stream-atac and stream ( ) were used to generate pseudotime trajectories based on motif occupancy profiles generated using chromvar ( ) with the jaspar motif database as reference ( ). dimensionality reduction was performed using the top components and neighbours, and an initial elastic graph was generated on the d umap projection using clusters, using the kmeans method with n_neighbours = . an elastic principal graph was constructed using the parameters epg_alpha = . , epg_mu = . , epg_lambda = . and epg_trimmingradius = . , with branch extension using ‘quantdists’.
trees were rooted using the branch with the highest motif activities for olig and etv motifs as root.

whole genome sequencing

dna was extracted from residual nuclei from the same samples and tissue fragments used for scatac-seq of adult gbm samples, using the qiagen dneasy blood and tissue dna extraction kit (qiagen # ). libraries were prepared using the nebnext ultra ii dna library prep kit (#e ) and sequenced on the novaseq (illumina) at the chgi (university of calgary), in paired-end mode.

whole genome data processing

genome data were aligned to the hg assembly using bwa mem (bwa . . ) ( ). samtools was used to extract high-quality reads (q > ) and picard tools (broad institute) was used to remove duplicates ( ).

whole genome snv and cnv detection

gatk mutect (broad institute) was run on the filtered data to detect snvs with low stringency using the following settings: --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. cnvkit was subsequently used to call copy number variants using the following parameters: --filter cn -m clonal --purity . ( ). adjacent segments were further combined and averaged using bedtools ( ).

data visualization and clustering

data were visualized and umap plots were generated using seurat . . and signac . . ( ) and cell loupe version . . ( ).

statistical analysis

between-group differences in discrete values (e.g. chromosome peaks, branch assignments) were calculated using the chi-squared test. differences in non-parametric distributions (motif accessibility in clusters) were quantified using the kruskal-wallis test.

references

. b. lim, y.
lin, n. navin, advancing cancer research and medicine with single-cell genomics. cancer cell. , – ( ). . a. p. patel, i. tirosh, j. j. trombetta, a. k. shalek, s. m. gillespie, h. wakimoto, d. p. cahill, b. v. nahed, w. t. curry, r. l. martuza, d. n. louis, o. rozenblatt-rosen, m. l. suvà, a. regev, b. e. bernstein, single-cell rna-seq highlights intratumoral heterogeneity in primary glioblastoma. science ( -. ). ( ), doi: . /science. . . s. darmanis, s. a. sloan, d. croote, m. mignardi, s. chernikova, p. samghababi, y. zhang, n. neff, m. kowarsky, c. caneda, g. li, s. d. chang, i. d. connolly, y. li, b. a. barres, m. h. gephart, s. r. quake, single-cell rna-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. cell rep., – ( ). . j. gojo, b. englinger, l. jiang, j. m. hübner, m. l. shaw, o. a. hack, s. madlener, d. kirchhofer, i. liu, j. pyrdol, v. hovestadt, e. mazzola, n. d. mathewson, m. trissal, d. lötsch, c. dorfer, c. haberler, a. halfmann, l. mayr, a. peyrl, r. geyeregger, b. schwalm, m. mauermann, k. w. pajtler, t. milde, m. e. shore, j. e. geduldig, k. pelton, t. czech, o. ashenberg, k. w. wucherpfennig, o. rozenblatt-rosen, s. alexandrescu, k. l. ligon, s. m. pfister, a. regev, i. slavc, w. berger, m. l. suvà, m. kool, m. g. filbin, single-cell rna-seq reveals cellular hierarchies and impaired developmental trajectories in pediatric ependymoma. cancer cell. , – ( ). . c. neftel, j. laffy, m. g. filbin, t. hara, m. e. shore, g. j. rahme, a. r. richman, d. silverbush, m. l. shaw, c. m. hebert, j. dewitt, s. gritsch, e. m. perez, l. n. gonzalez castro, x. lan, n. druck, c. rodman, d. dionne, a. kaplan, m. s. bertalan, j. small, k. pelton, s. becker, d. bonal, q.-d. nguyen, r. l. servis, j. m. fung, r. mylvaganam, l. mayr, j. gojo, c. haberler, r. geyeregger, t. czech, i. slavc, b. v. nahed, w. t. curry, b. s. carter, h. wakimoto, p. k. brastianos, t. t. batchelor, a. stemmer-rachamimov, m. martinez-lage, m. p. 
frosch, i. stamenkovic, n. riggi, e. rheinbay, m. monje, o. rozenblatt-rosen, d. p. cahill, a. p. patel, t. hunter, i. m. verma, k. l. ligon, d. n. louis, a. regev, b. e. bernstein, i. tirosh, m. l. suvà, an integrative model of cellular states, plasticity, and genetics for glioblastoma. cell. , - .e ( ). . m. c. vladoiu, i. el-hamamy, l. k. donovan, h. farooq, b. l. holgado, y. sundaravadanam, v. ramaswamy, l. d. hendrikse, s. kumar, s. c. mack, j. j. y. lee, v. fong, k. juraschka, d. przelicki, a. michealraj, p. skowron, b. luu, h. suzuki, a. s. morrissy, f. m. g. cavalli, l. garzia, c. daniels, x. wu, m. a. qazi, s. k. singh, j. a. chan, m. a. marra, d. malkin, p. dirks, l. heisler, t. pugh, k. ng, f. notta, e. m. thompson, c. l. kleinman, a. l. joyner, n. jabado, l. stein, m. d. taylor, childhood cerebellar tumours mirror conserved fetal transcriptional programs. nature. , – ( ). . i. tirosh, a. s. venteicher, c. hebert, l. e. escalante, a. p. patel, k. yizhak, j. m. fisher, c. rodman, c. mount, m. g. filbin, c. neftel, n. desai, j. nyman, b. izar, c. c. luo, j. m. francis, a. a. patel, m. l. onozato, n. riggi, k. j. livak, d. gennert, r. satija, b. v. nahed, w. t. curry, r. l. martuza, r. mylvaganam, a. j. iafrate, m. p. frosch, t. r. golub, m. n. rivera, g. getz, o. rozenblatt-rosen, d. p. cahill, m. monje, b. e. bernstein, d. n. louis, a. regev, m. l. suvà, single-cell rna-seq supports a developmental hierarchy in human oligodendroglioma. nature. , – ( ). . a. s. venteicher, i. tirosh, c. hebert, k. yizhak, c. neftel, m. g. filbin, v. hovestadt, l. e. escalante, m.
l. shaw, c. rodman, s. m. gillespie, d. dionne, c. c. luo, h. ravichandran, r. mylvaganam, c. mount, m. l. onozato, b. v. nahed, h. wakimoto, w. t. curry, a. j. iafrate, m. n. rivera, m. p. frosch, t. r. golub, p. k. brastianos, g. getz, a. p. patel, m. monje, d. p. cahill, o. rozenblatt-rosen, d. n. louis, b. e. bernstein, a. regev, m. l. suvà, decoupling genetics, lineages, and microenvironment in idh-mutant gliomas by single-cell rna-seq. science ( -. ). ( ), doi: . /science.aai . . s. müller, a. cho, s. j. liu, d. a. lim, a. diaz, conics integrates scrna-seq with dna sequencing to map gene expression to tumor sub-clones. bioinformatics ( ), doi: . /bioinformatics/bty . . j. d. buenrostro, p. g. giresi, l. c. zaba, h. y. chang, w. j. greenleaf, transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, dna-binding proteins and nucleosome position. nat. methods ( ), doi: . /nmeth. . . j. d. buenrostro, b. wu, u. m. litzenburger, d. ruff, m. l. gonzales, m. p. snyder, h. y. chang, w. j. greenleaf, single-cell chromatin accessibility reveals principles of regulatory variation. nature ( ), doi: . /nature . . r. killick, i. a. eckley, changepoint: an r package for changepoint analysis. j. stat. softw. ( ), doi: . /jss.v .i . . m. snuderl, l. fazlollahi, l. p. le, m. nitta, b. h. zhelyazkova, c. j. davidson, s. akhavanfard, d. p. cahill, k. d. aldape, r. a. betensky, d. n. louis, a. j. iafrate, mosaic amplification of multiple receptor tyrosine kinase genes in glioblastoma. cancer cell ( ), doi: . /j.ccr. . . . . h. chen, l. albergante, j. y. hsu, c. a. lareau, g. lo bosco, j. guan, s. zhou, a. n. gorban, d. e. bauer, m. j. aryee, d. m. langenau, a. zinovyev, j. d. buenrostro, g. c. yuan, l. pinello, single-cell trajectories reconstruction, exploration and mapping of omics data with stream. nat. commun. , ( ). . a. n. schep, b. wu, j. d. buenrostro, w. j. 
greenleaf, chromvar: inferring transcription-factor-associated accessibility from single-cell epigenomic data. nat. methods. , pages – ( ). . a. p. patel, i. tirosh, j. j. trombetta, a. k. shalek, s. m. gillespie, h. wakimoto, d. p. cahill, b. v. nahed, w. t. curry, r. l. martuza, d. n. louis, o. rozenblatt-rosen, m. l. suvà, a. regev, b. e. bernstein, single-cell rna-seq highlights intratumoral heterogeneity in primary glioblastoma. science. , – ( ). . m. d. robinson, d. j. mccarthy, g. k. smyth, edger: a bioconductor package for differential expression analysis of digital gene expression data. bioinformatics. , – ( ). . l. scrucca, m. fop, t. b. murphy, a. e. raftery, mclust : clustering, classification and density estimation using gaussian finite mixture models. r j. , – ( ). . r. killick, i. a. eckley, changepoint: an r package for changepoint analysis. j. stat. softw. ( ), doi: . /jss.v .i . . h. chen, l. albergante, j. y. hsu, c. a. lareau, g. lo bosco, j. guan, s. zhou, a. n. gorban, d. e. bauer, m. j. aryee, d. m. langenau, a. zinovyev, j. d. buenrostro, g. c. yuan, l. pinello, single-cell trajectories reconstruction, exploration and mapping of omics data with stream. nat. commun. ( ), doi: . /s - - - . . a. n. schep, b. wu, j. d. buenrostro, w. j. greenleaf, chromvar: inferring transcription-factor-associated accessibility from single-cell epigenomic data. nat. methods ( ), doi: . /nmeth. . . a. khan, o. fornes, a. stigliani, m. gheorghe, j. a. castro-mondragon, r. van der lee, a. bessy, j. chèneby, s. r. kulkarni, g. tan, d. baranasic, d. j. arenillas, a.
sandelin, k. vandepoele, b. lenhard, b. ballester, w. w. wasserman, f. parcy, a. mathelier, jaspar : update of the open-access database of transcription factor binding profiles and its web framework. nucleic acids res. ( ), doi: . /nar/gkx . . h. li, r. durbin, fast and accurate short read alignment with burrows-wheeler transform. bioinformatics. , – ( ). . h. li, b. handsaker, a. wysoker, t. fennell, j. ruan, n. homer, g. marth, g. abecasis, r. durbin, the sequence alignment/map format and samtools. bioinformatics. , – ( ). . e. talevich, a. h. shain, t. botton, b. c. bastian, cnvkit: genome-wide copy number detection and visualization from targeted dna sequencing. plos comput. biol. ( ), doi: . /journal.pcbi. . . a. r. quinlan, i. m. hall, bedtools: a flexible suite of utilities for comparing genomic features. bioinformatics. , – ( ). . t. stuart, a. srivastava, c. lareau, r. satija, biorxiv, in press, doi: . / . . . . . a. butler, p. hoffman, p. smibert, e. papalexi, r. satija, integrating single-cell transcriptomic data across different conditions, technologies, and species. nat. biotechnol. , – ( ).

acknowledgments

funding: a canada research chair in brain cancer epigenomics (tier ) from the government of canada, project grants from the canadian institutes of health research (cihr; pjt- , pjt- ), a discovery grant from the natural sciences and engineering research council (nserc) and an azrieli future leader in canadian brain research grant to mg; a clinician investigator program fellowship from alberta health services and a fellowship from alberta innovates to an; an eyes high scholarship from the university of calgary to ds; a clark smith postdoctoral fellowship and a cihr postdoctoral fellowship to mj; a canada research chair in precision oncology (tier ) and a cihr grant (pjt- ) to sm; an alberta graduate excellence scholarship and alberta innovates scholarship to ag.
this project has been made possible by the brain canada foundation through the canada brain research fund, with the financial support of health canada and the azrieli foundation.

author contributions: conception and experimental design: an, mg. generation of datasets: an, ke, jc, pn, nb. data acquisition and analysis: an, ds, mj, ag, sm, nb, mg. data interpretation and creation of new software: an, mg. manuscript preparation: all co-authors.

competing interests: the authors declare no competing interests.

data and materials availability: the copy-scat package and a sample tutorial are available on github at http://github.com/spcdot/copyscat. all datasets will be made available upon publication in a peer reviewed journal.

supplemental material

table s . summary of samples and cells profiled by copy-scat

sample | unique barcodes after pileup | unique barcodes after filtering | percent passing filters
cgy | | | . %
cgy | | | . %
cgy | | | . %
cgy | | | . %
pcgy | | | . %
pcgy | | | . %
pcgy | | | . %
pcgy | | | . %
pcgy | | | . %
pcgy | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
mm | | | . %
total cells profiled | | | . %
table s . sensitivity and specificity of copy-scat in agbm, pgbm and mm samples

samples | gains sensitivity | gains specificity | losses sensitivity | losses specificity | amplifications sensitivity | amplifications specificity
agbm (n = ) | . | . | . | . | . | .
pgbm (n = ) | . | . | . | . | n/a | .
mm (n = ) | . | . | . | . | n/a | n/a

fig. s . comparison of cnvs inferred by copy-scat and by wgs for adult gbm samples. (a) comparison of chromosome arm level losses detected in three adult gbm samples by single cell atac, wgs, or both methods. (b) comparison of focal amplifications detected in three adult gbm samples by scatac, wgs, or both methods.

fig. s . comparison of cnvs inferred by copy-scat or wgs in pediatric gbm samples. (a) gains detected in three pediatric gbm samples compared to linked-reads wgs.
(b) losses detected in three pediatric gbm samples compared to linked-reads wgs.

fig. s . comparison of cnvs inferred by copy-scat or with the sccnv assay in multiple myeloma samples. (a) comparison of gains seen in additional myeloma samples versus x single-cell cnv sequencing. (b) comparison of chromosome losses seen in additional myeloma samples versus x single-cell cnv sequencing. (c,d) number of gains and losses detected by both methods compared to number of cells in scatac-seq sample. (e-f) number of shared gains or losses detected between the two methods, plotted versus the number of cells in the scatac-seq experiment. (g-h) number of shared gains or losses detected between the two methods, plotted versus the number of reads per cell in the scatac.

fig. s . cnvs are detected in scatac clusters with copy-scat in pediatric gbm samples. (a) overview of cell assignments in two paired patient libraries. (b-d) representative wgs-confirmed alterations detected in pcgy and pcgy .
fig. s . cnvs are identified by copy-scat in specific scatac clusters in multiple myeloma samples. (a) gain of chromosome p restricted to neoplastic cell populations. (b) similar pattern with gain of chromosome q. (c) similar pattern with loss of chr q.

fig. s . additional chromosome copy number analyses for cgy . (a) initial neighbourhood clustering results from signac. (b-f) representative chromosome-level copy number alteration profiles for tumour and normal cells. (g-n) representative motif scores from chromvar for different motifs, including (g) elf , (h) spib, (i) ascl , (j) ikzf , (k) neurod , (l) nfic, (m) nfya, (n) elk .

fig. s . representative copy number information and distribution for agbm sample cgy . (a) neighbourhood clustering results from signac. (b-c) distribution of amplifications in egfr and mdm . (d-i) representative chromosome-level copy number alteration profiles for tumour and normal cells.
(j-l) representative motif scores from chromvar for different motifs, including (j) nfic, (k) spib and (l) foxg .

fig. s . representative copy number information and distribution for agbm sample cgy . (a) neighbourhood clustering results from signac. (b) distribution of amplifications in egfr. (c-j) representative chromosome-level copy number alteration profiles for tumour and normal cells. (g-l) representative motif scores from chromvar for different motifs, including (g) nfic, (h) fos::jun, (i) neurod , (j) elf , (k) spib, and (l) ikzf .

fig. s . effects of removing cnvs on variance in agbm sample cgy . (a) distribution of the top most variable peaks in the tumour cells after filtering out non-neoplastic cells; p value from chi-squared test. (b) distribution of top most variable peaks in non-neoplastic cells after filtering (p value from chi-squared test). chromosomes with cnvs or amplification regions are highlighted in pink. (c) distribution of top most variable peaks in tumour cells after filtering of non-neoplastic cells and removal of regions containing cnvs (p value from chi-squared test).
fig. s . validation of copy-scat and identification of putative proliferative cells in non-neoplastic datasets. (a) chromosome copy number distribution in a x dataset of human pbmcs. (b) seurat clusters for the x dataset of human pbmcs. (c) estimate of cycle status for the x dataset of human pbmcs. (d) chromosome copy number distribution in a x dataset of mouse embryonic brain at e . (e,f) predicted cycle status and cluster assignments in e mouse brain. (g,h) predicted cell cycle status and cluster profile in p mouse brain dataset from x.

fig. s . pseudotime trajectory analysis of agbm sample cgy . distribution of egfr amplification (a) and cell cycle status (b) amongst branches. distribution of chromvar motif scores in branches for proneural motifs ascl and olig (c,d), etv (e), nfix (f), and mesenchymal motifs jun::junb (g) and stat (h).
fig. s . pseudotime trajectory analysis of agbm sample cgy . distribution of pdgfra amplification (a) and cycling status (b) amongst branches. distribution of chromvar motif scores in branches for proneural motifs ascl and olig (c,d), etv (e), nfix (f), and mesenchymal motifs jun::junb (g) and stat (h).

fig. s . pseudotime trajectory analysis of agbm sample cgy . distribution of chromvar motif scores in branches for proneural motifs ascl and olig (a,b), etv (c), nfix (d), and mesenchymal motifs jun::junb (e) and stat (f).

bfimpute: a bayesian factorization method to recover single-cell rna sequencing data

zi-hang wen, jeremy l.
langsam, lu zhang, wenjun shen , ∗ and xin zhou , , ∗

school of optical and electronic information, huazhong university of science and technology, luoyu road, wuhan, , hubei, china, department of biomedical engineering, vanderbilt university, vanderbilt place, , nashville, usa, department of computer science, hong kong baptist university, room r , sir run run shaw building, kowloon tong, hong kong, department of bioinformatics, shantou university medical college, no. xinling road, shantou, , guangdong, china, department of computer science, vanderbilt university, vanderbilt place, , nashville, usa and data science institute, vanderbilt university, sony building, th ave s building, suite , , nashville, usa

∗corresponding authors: maizie.zhou@vanderbilt.edu; wjshen@stu.edu.cn

abstract

single-cell rna-seq (scrna-seq) offers opportunities to study gene expression of tens of thousands of single cells simultaneously, to investigate cell-to-cell variation, and to reconstruct cell-type-specific gene regulatory networks. recovering dropout events in a sparse gene expression matrix for scrna-seq data is a long-standing matrix completion problem. we introduce bfimpute, a bayesian factorization imputation algorithm that reconstructs two latent gene and cell matrices to impute the final gene expression matrix within each cell group, with or without the aid of cell type labels or bulk data. bfimpute achieves better accuracy than six other notable published scrna-seq imputation methods on simulated and real scrna-seq data, as measured by several different evaluation metrics. bfimpute can also flexibly integrate any gene or cell related information that users provide to increase performance.

availability: bfimpute is implemented in r and is freely available at https://github.com/maiziezhoulab/bfimpute.
key words: single cell; rna-seq; imputation; bayesian factorization

introduction

single-cell rna-seq (scrna-seq) has been widely used to study genome-wide transcriptomes at single-cell resolution. the cellular resolution made possible by scrna-seq data distinguishes it from bulk rna-seq and makes it advantageous in investigating cell-to-cell variation [ ]. today, different commercial platforms are available to perform scrna-seq, including fluidigm c , wafergen icell and x genomics chromium. droplet-based methods via x genomics chromium can process tens of thousands of cells; microwell-based and microfluidic-based methods via fluidigm c and wafergen icell process fewer cells but with a higher sequencing depth. for all these platforms, missing values make up a large proportion of scrna-seq data, ranging from % - % in the gene expression count matrix [ , , , , ]. in scrna-seq data, this large percentage of missing events is known as the ‘dropout’ phenomenon [ ]. gene ‘dropout’ means a gene is observed at a moderate expression level in one cell but is not detected in another cell of the same type. analyses of scrna-seq data, including dimensionality reduction, clustering, and differential expression (de) analysis, have shown that effective imputation of dropout events improves downstream analyses and assists biological interpretation [ , , , ]. to date, several notable imputation methods have been proposed: scimpute [ ], drimpute [ ], magic [ ], saver [ ], viper [ ] and scrabble [ ]. scimpute first performs clustering to identify cell subpopulations, then identifies dropout events through a gamma-normal mixture model, and finally imputes dropout events by non-negative least squares regression [ ]. drimpute optimizes the step of identifying cell subpopulations to impute dropout events by averaging the imputation from multiple clustering results [ ]. magic builds a markov affinity-based graph for imputation relying on cell to cell interactions [ ].
SAVER uses a Bayesian model with various prior probabilities and alters all gene expression values [ ]. VIPER imputes dropout events relying on local neighborhood cells via non-negative sparse regression models [ ]. SCRABBLE has recently been introduced to impute dropout events by leveraging bulk RNA-seq data [ ]. Even though considerable effort has gone into analyzing and imputing real dropout events, imputation of dropout events is still a difficult problem because of the high dropout rate and the complex cellular heterogeneity of different scRNA-seq datasets.

bioRxiv preprint doi: https://doi.org/ . / . . . ; this version posted February , . The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND . International license (http://creativecommons.org/licenses/by-nc-nd/ . /).

Fig. . A brief illustration of the architecture of the Bfimpute method. In each group, Bfimpute borrows information from true values and factorizes the expression matrix $E$ into two latent matrices, the gene latent matrix $G$ and the cell latent matrix $C$, using MCMC. After training, Bfimpute imputes dropouts by taking the product of the latent matrices, $\hat{E} = G^{T}C$. The details are shown in the Methods section.

Relying on matrix completion to impute missing values is a long-standing question and has been investigated in the biological sciences, including gene expression prediction, miRNA–disease, protein-protein interaction [ ] etc.
Even though similar mathematical models can be applied to different biological problems, solving the matrix completion problem in scRNA-seq (recovering the dropout events) requires taking the features of scRNA-seq into consideration. Most existing scRNA-seq imputation methods have shown that it is advantageous for imputation to borrow and leverage information from similar cells. In recent years, researchers have also started to integrate additional gene- or cell-related information (e.g. bulk data for SCRABBLE) to assist imputation, which is important in matrix completion problems. In this study, we present Bfimpute, a powerful imputation tool for scRNA-seq data that recovers dropout events by factorizing the count matrix into the product of gene-specific and cell-specific feature matrices [ , ]. Bfimpute uses full Bayesian inference to describe the latent information for genes and cells and carries out a Markov chain Monte Carlo scheme that can easily incorporate any gene- or cell-related information to train the model and perform the imputation [ ] (Figure ). We demonstrate that Bfimpute performs better than the six other notable published imputation methods mentioned above (scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE) on both simulated and real scRNA-seq datasets in improving clustering and differential gene expression analyses and in recovering gene expression temporal dynamics (pseudotime analysis) [ ].

Methods

Cell clustering and dropout detection

Bfimpute first provides an optional normalization step to smooth the gene expression values (counts per million, followed by logarithm base with bias . ). Bfimpute then performs a local imputation within each cell group. We adopt the same approach as scImpute [ ] to detect cell clusters, which applies spectral clustering methods to the result of principal component analysis (PCA) to reduce the impact of dropout events.
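The optional normalization step described above (counts per million followed by a log transform with a small bias) can be sketched in a few lines. This is a minimal NumPy illustration, not Bfimpute's R code: the function name `normalize_counts` and the default `base` and `bias` values are placeholders chosen for the example, because the exact values are not recoverable from the text here.

```python
import numpy as np

def normalize_counts(counts, base=10.0, bias=1.0):
    """CPM-normalize a genes x cells count matrix, then log-transform.

    `base` and `bias` are illustrative defaults (assumptions), not values
    taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=0, keepdims=True)        # total counts per cell
    lib_size = np.where(lib_size == 0, 1.0, lib_size)   # guard against empty cells
    cpm = counts / lib_size * 1e6                       # counts per million
    return np.log(cpm + bias) / np.log(base)            # log_base(cpm + bias)
```

The bias keeps zero counts finite after the log transform, so dropout candidates remain in the matrix for the later detection step.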
We integrate spectral clustering by using the 'Spectrum' function of the Spectrum R package [ ] or the 'specc' function of the kernlab R package [ ]. Bfimpute also adopts the gamma-normal mixture distribution model from scImpute to determine dropout events [ ].

Probabilistic model for scRNA-seq expression matrix imputation

After the above-mentioned steps, we adapted a multivariate priors model from Bayesian probabilistic matrix factorization (BPMF) [ ] to recover dropouts for scRNA-seq datasets. Since every cell group is treated identically, we arbitrarily choose one to demonstrate local imputation in Bfimpute. Suppose we have $N$ genes and $M$ cells in one cell group, and the expression matrix is $E \in \mathbb{R}^{N \times M}$. Each entry $e_{ij}$ represents the expression level of gene $i$ in cell $j$. Bfimpute factorizes $E$ into $G \in \mathbb{R}^{D \times N}$ and $C \in \mathbb{R}^{D \times M}$, which are defined as the gene and cell latent matrices, respectively, where $D$ is the dimension of the latent factor. The column vectors $g_i$ and $c_j$ represent the gene-specific and cell-specific latent vectors, respectively. The imputed matrix to recover $E$ is given by $\hat{E} = G^{T}C$.

We introduce the Gaussian noise model for the gene expression profile $E$ with precision $\alpha$, which was first proposed by probabilistic matrix factorization (PMF) [ ]:

$$p(E \mid G, C, \alpha) = \prod_{i=1}^{N}\prod_{j=1}^{M}\left[\mathcal{N}\left(e_{ij} \mid g_i^{T}c_j, \alpha^{-1}\right)\right]^{I_{ij}} \tag{1}$$

where $I_{ij}$ is the indicator function that is equal to 0 if $e_{ij}$ is a dropout and equal to 1 otherwise.
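To make the role of the indicator in the Gaussian noise model above concrete, the following minimal NumPy sketch evaluates the log of that likelihood using only the entries that are not flagged as dropouts. The function name `masked_log_likelihood` and the boolean `observed` mask (True where the indicator equals 1) are illustrative choices, not part of Bfimpute's R interface.

```python
import numpy as np

def masked_log_likelihood(E, G, C, observed, alpha):
    """Log of the masked Gaussian likelihood p(E | G, C, alpha).

    E        : (N, M) expression matrix (genes x cells)
    G, C     : (D, N) gene and (D, M) cell latent matrices
    observed : (N, M) boolean mask, True where the entry is NOT a dropout
    alpha    : precision of the Gaussian noise
    """
    E = np.asarray(E, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    resid = (E - G.T @ C)[observed]          # residuals on trusted entries only
    n_obs = resid.size
    return (0.5 * n_obs * np.log(alpha / (2.0 * np.pi))
            - 0.5 * alpha * np.sum(resid ** 2))
```

Entries flagged as dropouts contribute nothing to the value, which is why the latent matrices are trained on the trusted entries alone and the dropouts are later filled from the product of the latent matrices.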
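To make use of gene- or cell-related information such as bulk data or other user-provided data, we add entity features $S^{g} \in \mathbb{R}^{F_g \times N}$ and $S^{c} \in \mathbb{R}^{F_c \times M}$ as gene and cell feature matrices, respectively, where $F_g$ and $F_c$ are the dimensionalities of these additional features. The Gaussian prior distributions over the gene and cell latent vectors, adapted from Macau [ ], are given by:

$$\begin{aligned}
p(g_i \mid s_i^{g}, \mu_g, \Lambda_g, \beta_g) &= \mathcal{N}\left(g_i \mid \mu_g + \beta_g^{T}s_i^{g}, \Lambda_g^{-1}\right) \\
p(c_j \mid s_j^{c}, \mu_c, \Lambda_c, \beta_c) &= \mathcal{N}\left(c_j \mid \mu_c + \beta_c^{T}s_j^{c}, \Lambda_c^{-1}\right)
\end{aligned} \tag{2}$$

where $\{\mu_g, \mu_c\}$ and $\{\Lambda_g, \Lambda_c\}$ are the means and precisions, and $\beta_g \in \mathbb{R}^{F_g \times D}$ and $\beta_c \in \mathbb{R}^{F_c \times D}$ are the weight matrices for the entity features. The weights are initialized from a zero-mean normal distribution and are updated iteratively by the Bayesian inference steps (details described later). If no additional information is given, direct imputation of scRNA-seq data can still be applied by filling the feature vectors $s^{g}$ and $s^{c}$ with zeros (where $F_g$ = $F_c$ = ).

To perform Bayesian inference, we introduce priors for $\{\mu_g, \Lambda_g\}$ and $\{\mu_c, \Lambda_c\}$ following BPMF [ ]:

$$\begin{aligned}
p(\mu_g, \Lambda_g \mid \mu_0, \beta_0, \nu_0, W_0) &= \mathcal{N}\left(\mu_g \mid \mu_0, (\beta_0\Lambda_g)^{-1}\right) \times \mathcal{W}(\Lambda_g \mid W_0, \nu_0) \\
p(\mu_c, \Lambda_c \mid \mu_0, \beta_0, \nu_0, W_0) &= \mathcal{N}\left(\mu_c \mid \mu_0, (\beta_0\Lambda_c)^{-1}\right) \times \mathcal{W}(\Lambda_c \mid W_0, \nu_0)
\end{aligned} \tag{3}$$

where $\mathcal{W}$ is the Wishart distribution with $\nu_0$ degrees of freedom and scale matrix $W_0$. We also set a zero-mean normal distribution as the prior of $\beta_g$ and $\beta_c$, and a gamma distribution as the hyperprior of the problem-dependent $\alpha_g$ and $\alpha_c$, adapted from Macau [ ]:

$$\begin{aligned}
p(\beta_g \mid \Lambda_g, \alpha_g) &= \mathcal{N}\left(\mathrm{vec}(\beta_g) \mid 0, \Lambda_g^{-1} \otimes (\alpha_g I)^{-1}\right) \\
p(\beta_c \mid \Lambda_c, \alpha_c) &= \mathcal{N}\left(\mathrm{vec}(\beta_c) \mid 0, \Lambda_c^{-1} \otimes (\alpha_c I)^{-1}\right)
\end{aligned} \tag{4}$$

$$\begin{aligned}
p(\alpha_g \mid k, \theta) &= \mathcal{G}(\alpha_g \mid k/2, 2\theta/k) \\
p(\alpha_c \mid k, \theta) &= \mathcal{G}(\alpha_c \mid k/2, 2\theta/k)
\end{aligned} \tag{5}$$

where $\mathrm{vec}(\beta_x)$ is the vectorization of $\beta_x$, $\otimes$ represents the Kronecker product and $\alpha_x$ is the precision ($x \in \{g, c\}$). $k/2$ and $2\theta/k$ are the shape and scale, respectively. $k$ and $\theta$ are hyperparameters which are set to .

Gibbs sampler to impute dropout events

We use a Markov chain Monte Carlo (MCMC) algorithm to train Bfimpute, which is a sampling-based approach to tackle the Bayesian inference problem.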
Bfimpute constructs a Markov chain from a random initial value; after running the chain for $\tilde{K}$ steps, it is assumed to have converged to its stationary distribution. Bfimpute then uses the average of the remaining $(K - \tilde{K})$ stationary steps to approximate the real distribution of $E$ and obtain the estimated values $\hat{e}_{ij}$ for dropouts:

$$p(\hat{e}_{ij} \mid E, G, C) \approx \frac{1}{K - \tilde{K}} \sum_{k=\tilde{K}+1}^{K} p\left(\hat{e}_{ij} \mid g_i^{(k)}, c_j^{(k)}, \alpha\right) \tag{6}$$

More specifically, Bfimpute uses a Gibbs sampler to perform Bayesian matrix factorization. In every cycle, each parameter is sampled from its conditional distribution, derived from the posterior via Bayes' theorem. Since the probabilistic models of genes and cells are symmetric, the conditional distributions over genes and over cells have the same form. In particular, based on (1) and (2), the conditional probability for $g_i$ is:

$$\begin{aligned}
p(g_i \mid E, C, \alpha, s_i^{g}, \mu_g, \Lambda_g, \beta_g) &= \mathcal{N}\left(g_i \mid \mu_i^{(g)\prime}, \left[\Lambda_i^{(g)\prime}\right]^{-1}\right) \\
&\propto \prod_{j=1}^{M}\left[\mathcal{N}\left(e_{ij} \mid g_i^{T}c_j, \alpha^{-1}\right)\right]^{I_{ij}} \times p(g_i \mid s_i^{g}, \mu_g, \Lambda_g, \beta_g)
\end{aligned} \tag{7}$$

where

$$\begin{aligned}
\Lambda_i^{(g)\prime} &= \Lambda_g + \alpha \sum_{j} I_{ij}\, c_j c_j^{T} \\
\mu_i^{(g)\prime} &= \left[\Lambda_i^{(g)\prime}\right]^{-1}\left[\Lambda_g\left(\mu_g + \beta_g^{T}s_i^{g}\right) + \alpha \sum_{j} I_{ij}\, e_{ij} c_j\right]
\end{aligned}$$

According to (2) and (3), we can derive the conditional probability for $\mu_g$ and $\Lambda_g$:

$$\begin{aligned}
p(\mu_g, \Lambda_g \mid G, S^{g}, \beta_g, \alpha_g, \mu_0, \beta_0, \nu_0, W_0) &= \mathcal{N}\left(\mu_g \mid \mu_0^{\prime}, (\beta_0^{\prime}\Lambda_g)^{-1}\right)\mathcal{W}(\Lambda_g \mid W_0^{\prime}, \nu_0^{\prime}) \\
&\propto \prod_{i=1}^{N} p(g_i \mid s_i^{g}, \mu_g, \Lambda_g, \beta_g) \times p(\mu_g, \Lambda_g \mid \mu_0, \beta_0, \nu_0, W_0)
\end{aligned} \tag{8}$$

where

$$\begin{aligned}
\mu_0^{\prime} &= \frac{\beta_0\mu_0 + N\bar{g}}{\beta_0 + N}, \qquad \beta_0^{\prime} = \beta_0 + N, \qquad \nu_0^{\prime} = \nu_0 + N + F_g \\
W_0^{\prime} &= \left[W_0^{-1} + N\bar{h} + \beta_0\mu_0\mu_0^{T} - \beta_0^{\prime}\mu_0^{\prime}\mu_0^{\prime T} + \alpha_g\beta_g^{T}\beta_g\right]^{-1} \\
\bar{g} &= \frac{1}{N}\sum_{i=1}^{N}\left(g_i - \beta_g^{T}s_i^{g}\right) \\
\bar{h} &= \frac{1}{N}\sum_{i=1}^{N}\left(g_i - \beta_g^{T}s_i^{g}\right)\left(g_i - \beta_g^{T}s_i^{g}\right)^{T}
\end{aligned}$$

Considering (4) and (5), we get the conditional probability for $\alpha_g$:

$$\begin{aligned}
p(\alpha_g \mid \beta_g, \Lambda_g, k, \theta) &= \mathcal{G}(\alpha_g \mid k^{\prime}/2, 2\theta^{\prime}/k^{\prime}) \\
&\propto p(\beta_g \mid \Lambda_g, \alpha_g) \times p(\alpha_g \mid k, \theta)
\end{aligned}$$

where

$$k^{\prime} = \frac{(F_g D + \theta)\,k}{\theta + \theta \cdot \mathrm{tr}\left(\beta_g^{T}\beta_g\Lambda_g\right)}, \qquad \theta^{\prime} = F_g D + \theta$$

From (2) and (4), we obtain the conditional probability for $\beta_g$:

$$\begin{aligned}
p(\beta_g \mid \Lambda_g, \alpha_g, G, S^{g}, \mu_g) &= \mathcal{N}\left(\mathrm{vec}(\beta_g) \mid \mu_{\beta_g}, \Lambda_{\beta_g}^{-1}\right) \\
&\propto p(\beta_g \mid \Lambda_g, \alpha_g) \times \prod_{i} p(g_i \mid s_i^{g}, \mu_g, \Lambda_g, \beta_g)
\end{aligned}$$

Because the precision matrix $\Lambda_{\beta_g}$ is too large to compute directly, we instead use an alternative first proposed by Macau [ ] and calculate:

$$\tilde{\beta}_g = \left(S^{gT}S^{g} + \alpha_g I\right)^{-1}\left(S^{gT}\left(\tilde{G} + E_1\right) + \sqrt{\alpha_g}\,E_2\right)$$

where $\tilde{G} = (G - \mu_g)^{T}$, and each row of $E_1 \in \mathbb{R}^{N \times D}$ and $E_2 \in \mathbb{R}^{F_g \times D}$ is sampled from $\mathcal{N}(0, \Lambda_g^{-1})$.

The Gibbs sampling steps of Bfimpute are shown in Algorithm :

Algorithm . Gibbs sampling in Bfimpute
1. Initialize $\{G^{(0)}, C^{(0)}, \beta_g^{(0)}, \beta_c^{(0)}, \alpha_g^{(0)}, \alpha_c^{(0)}\}$
2. For $k = 1, 2, \ldots, K$:
   a. Sample the means $\{\mu_g, \mu_c\}$ and precisions $\{\Lambda_g, \Lambda_c\}$ of the gene and cell latent matrices:
      $\mu_g^{(k)}, \Lambda_g^{(k)} \sim p(\mu_g, \Lambda_g \mid G^{(k-1)}, S^{g}, \beta_g^{(k-1)}, \alpha_g^{(k-1)})$
      $\mu_c^{(k)}, \Lambda_c^{(k)} \sim p(\mu_c, \Lambda_c \mid C^{(k-1)}, S^{c}, \beta_c^{(k-1)}, \alpha_c^{(k-1)})$
   b. Sample the gene and cell latent matrices $\{G, C\}$:
      • For each $i = 1, \ldots, N$, sample the gene latent vectors in parallel:
        $g_i^{(k)} \sim p(g_i \mid E, C^{(k-1)}, s_i^{g}, \mu_g^{(k)}, \Lambda_g^{(k)}, \beta_g^{(k-1)})$
      • For each $j = 1, \ldots, M$, sample the cell latent vectors in parallel:
        $c_j^{(k)} \sim p(c_j \mid E, G^{(k)}, s_j^{c}, \mu_c^{(k)}, \Lambda_c^{(k)}, \beta_c^{(k-1)})$
   c. Sample the precisions $\{\alpha_g, \alpha_c\}$ of the weight matrices:
      $\alpha_g^{(k)} \sim p(\alpha_g \mid \beta_g^{(k-1)}, \Lambda_g^{(k)})$
      $\alpha_c^{(k)} \sim p(\alpha_c \mid \beta_c^{(k-1)}, \Lambda_c^{(k)})$
   d. Sample the weight matrices $\{\beta_g, \beta_c\}$:
      $\beta_g^{(k)} = \left(S^{gT}S^{g} + \alpha_g^{(k)} I\right)^{-1}\left(S^{gT}\left(\tilde{G}^{(k)} + E_1\right) + \sqrt{\alpha_g^{(k)}}\,E_2\right)$
      $\beta_c^{(k)} = \left(S^{cT}S^{c} + \alpha_c^{(k)} I\right)^{-1}\left(S^{cT}\left(\tilde{C}^{(k)} + E_1\right) + \sqrt{\alpha_c^{(k)}}\,E_2\right)$

Generation of simulated data

We first simulated a single-cell RNA-seq count matrix with genes and cells evenly split into groups using the scater (v . . ) [ ] package and the Splatter (v . . ) [ ] package. The parameter which controls the probability that a gene will be selected as DE was set to . , while the location and scale factors were set to . and . , respectively. We used 'experiment' to add the global dropout for every cell.
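Returning briefly to the Gibbs sampler, the core of the scheme (step b of the algorithm plus the post-burn-in averaging) can be illustrated with a deliberately simplified NumPy sketch. It is not the Bfimpute implementation: it assumes no side information, keeps the prior means and precisions fixed at zero and the identity instead of resampling them (steps a, c and d are omitted), and uses a fixed noise precision. The names `sample_latent` and `gibbs_impute` and all default values are hypothetical.

```python
import numpy as np

def sample_latent(E, observed, other, mu, Lam, alpha, rng):
    """Draw every latent vector from its Gaussian conditional.

    Row r of E pairs with latent column r of the returned matrix;
    `other` is the (D, n_cols) latent matrix of the opposite entity.
    """
    D = other.shape[0]
    out = np.empty((D, E.shape[0]))
    for r in range(E.shape[0]):
        cols = other[:, observed[r]]                  # partners with trusted values
        prec = Lam + alpha * cols @ cols.T            # conditional precision
        cov = np.linalg.inv(prec)
        cov = 0.5 * (cov + cov.T)                     # symmetrize against round-off
        mean = cov @ (Lam @ mu + alpha * cols @ E[r, observed[r]])
        out[:, r] = rng.multivariate_normal(mean, cov)
    return out

def gibbs_impute(E, observed, D=5, n_iter=60, burn_in=30, alpha=10.0, seed=0):
    """Simplified Gibbs sampler: fixed priors (mu=0, Lambda=I), fixed alpha."""
    rng = np.random.default_rng(seed)
    N, M = E.shape
    mu, Lam = np.zeros(D), np.eye(D)
    C = rng.normal(size=(D, M))                       # random initial cell factors
    acc = np.zeros((N, M))
    for k in range(n_iter):
        G = sample_latent(E, observed, C, mu, Lam, alpha, rng)      # genes given cells
        C = sample_latent(E.T, observed.T, G, mu, Lam, alpha, rng)  # cells given genes
        if k >= burn_in:
            acc += G.T @ C                            # accumulate post-burn-in products
    E_hat = acc / (n_iter - burn_in)
    return np.where(observed, E, E_hat)               # only dropouts are replaced
```

Averaging the products of the sampled latent matrices after burn-in is a Monte Carlo estimate of the posterior mean, and trusted entries are passed through unchanged so that only entries flagged as dropouts are filled in.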
In order to show the universal applicability of Bfimpute, we further generated , , groups of cells with , , as total cell numbers and runs for each dataset with different seeds, using the same parameters mentioned above.

Quality control for real datasets

We performed quality control (QC) (https://github.com/gongx /scdatasets) on all real datasets to ensure fairness for all methods before imputation, except for the PBMCs dataset (see details on GitHub). As the PBMCs dataset is based on the x Genomics platform with an extremely high dropout rate, the QC step would remove nearly % of its genes.

Evaluation metrics of clustering results

We used four evaluation metrics, adjusted Rand index [ ], Jaccard index [ ], normalized mutual information (NMI) [ ], and purity score, to analyse the agreement between the true cluster labels and the spectral clustering [ ] results on the first two principal components (PCs) of the imputed matrix. Most of these four measurements vary from to , with indicating a perfect match, except the adjusted Rand index, which can yield negative values when agreement is less than expected by chance. The adjusted Rand index is a chance-adjusted version of Rand's statistic [ ], which is the probability that a randomly selected pair is classified in agreement. The Jaccard index is similar to the Rand index, but disregards the pairs of elements that are in different clusters in both clusterings [ ]. The normalized mutual information measures the information shared between two clusterings, normalized by their entropies, and can be computed without access to the original features or algorithms that determined these clusterings. The purity score is the fraction of the total number of cells that are classified correctly.
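Three of the four metrics above follow directly from counting pairs of cells and cluster majorities, which the following standard-library sketch illustrates (NMI is omitted because several normalizations exist and the one used is not stated here). The function names are illustrative and the pairwise loop is quadratic in the number of cells, so this is meant for explanation rather than large datasets.

```python
from collections import Counter
from itertools import combinations

def pair_counts(truth, pred):
    """Count cell pairs by whether they share a label in truth and/or pred."""
    both = truth_only = pred_only = neither = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        if same_t and same_p:
            both += 1
        elif same_t:
            truth_only += 1
        elif same_p:
            pred_only += 1
        else:
            neither += 1
    return both, truth_only, pred_only, neither

def jaccard_index(truth, pred):
    a, b, c, _ = pair_counts(truth, pred)      # pairs separated in both are ignored
    return a / (a + b + c)                     # undefined if every cell is a singleton in both

def adjusted_rand_index(truth, pred):
    a, b, c, d = pair_counts(truth, pred)
    total = a + b + c + d
    expected = (a + b) * (a + c) / total       # agreement expected by chance
    max_index = 0.5 * ((a + b) + (a + c))
    return (a - expected) / (max_index - expected)  # undefined when max_index == expected

def purity(truth, pred):
    hits = 0
    for cluster in set(pred):
        members = [t for t, p in zip(truth, pred) if p == cluster]
        hits += Counter(members).most_common(1)[0][1]   # size of the majority type
    return hits / len(truth)
```

For example, with true labels `[0, 0, 1, 1]` and predicted clusters `[0, 1, 0, 1]`, the adjusted Rand index is negative, which is the less-than-chance case mentioned above.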
Results

We demonstrated the performance of Bfimpute in gene expression recovery, data visualization, cell subpopulation clustering, pseudotime and DE analysis on five publicly available scRNA-seq datasets (Supplementary Table ), and we compared Bfimpute with six state-of-the-art imputation methods: scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE in the following sections.

Fig. . Bfimpute recovers dropout values and improves cell type identification in the simulated data. a. The scatter plots show the first two dimensions of the t-SNE results calculated from the complete data, the raw data, and the imputed data by Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. b. K represents the number of cell clusters in the simulated data. The adjusted Rand index, Jaccard index, NMI, and purity scores of clustering results are based on the raw and imputed data.
Bfimpute improves both visualization and cell type identification

PCA and t-distributed stochastic neighbor embedding (t-SNE) [ , ] are two popular dimensionality reduction techniques often used to visualize high-dimensional scRNA-seq datasets. Since the dropout values are unknown in real datasets, we first tested the accuracy of all imputation methods using simulated datasets where the ground truth was known. We applied the Splatter method to generate simulated datasets, which reproduce many features observed in scRNA-seq data, including zero-inflation, gene-wise dispersion, and differing sequencing depths between cells. To test the strength and robustness of the different imputation methods, we simulated a wide range of datasets that included , , and different cell types (Methods section). Bfimpute achieved the most compact and well separated clusters on the simulation, followed by scImpute and DrImpute (Figure ). For all simulations with different numbers of cell types, we also evaluated the clustering performance with the evaluation metrics, where Bfimpute achieved the best scores for adjusted Rand index, Jaccard index, normalized mutual information and purity score compared to the raw data and the other five imputation methods (Methods section).

We further used two real datasets for this analysis, and the first two principal components (PCs) from PCA were plotted to compare every dataset across seven different conditions: the raw dataset and six imputed ones obtained with the Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC methods. We first applied all imputation methods to a real scRNA-seq dataset from a human embryonic stem (ES) cell differentiation study [ ] to demonstrate the capacity of Bfimpute for improving data visualization.
The dataset contains single cells from seven cell groups: neuronal progenitor cells (NPCs), definitive endoderm cells (DECs), endothelial cells (ECs) and trophoblast-like cells (TBs) are progenitors differentiated from H human ES cells. H human ES cells and human foreskin fibroblasts (HFFs) were used as control cells. The raw dataset (i.e. without imputation) clearly identified the cluster of HFF cells; however, the five other cell types were clustered very closely. After imputation by Bfimpute, the homogeneous subpopulations of H and H human ES cells were observed to substantially overlap and to be well separated from the rest of the progenitors. The DECs, ECs, HFFs, NPCs and TBs were also compactly clustered and well separated on the PCA plot (Figure a). Compared with the raw dataset, SAVER, VIPER and DrImpute showed no significant improvement in cell group identification. scImpute was the second best and generated compact cell groups similar to those of Bfimpute.

Fig. . Bfimpute improves PCA visualization and cell type identification. a. The first two PCs calculated from the raw data, and the imputed data by Bfimpute, scImpute, VIPER, DrImpute, MAGIC, and SAVER. b. The adjusted Rand index, Jaccard index, NMI, and purity scores of clustering results based on the raw and imputed data. c. Average Pearson correlations between any two cells from the same type and from different types.

We then compared clustering results of the spectral clustering algorithms [ ] on the first two PCs to demonstrate the capability of Bfimpute to improve clustering accuracy in cell type identification. For the true labels, we had seven cell types for this dataset, and we evaluated the clustering results by four different metrics: adjusted Rand index, Jaccard index, normalized mutual information (NMI), and purity (Methods section). All four metrics suggested that Bfimpute achieved the best clustering accuracy compared with the raw data and the other five imputation methods (Figure b). We also compared visualization performance through t-SNE. t-SNE on the raw dataset identifies the seven cell types better than PCA does. Bfimpute, DrImpute and SAVER can further separate different cell groups and improve the visualization; however, the other four imputation methods produced worse t-SNE results than the raw data (Supplementary Figure ).

To illustrate the recovery of dropouts in individual cells by imputation, we calculated the Pearson correlation of log -transformed read counts between every pair of cells of the same type and of different cell types. This result indicated that imputation did recover the zero counts in every cell, and the Pearson correlation increased from . to . for Bfimpute, . for scImpute, . for SAVER, . for VIPER, . for DrImpute, and . for MAGIC (Figure c, blue bars).
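The same-type versus different-type correlation summary described above can be sketched with NumPy as follows. The function name `mean_pairwise_correlation` and the genes-by-cells input layout are assumptions made for this illustration; the input is expected to be already log-transformed.

```python
import numpy as np

def mean_pairwise_correlation(log_expr, labels):
    """Average Pearson correlation between cell pairs of the same / different type.

    log_expr : (genes, cells) log-transformed expression matrix
    labels   : cell type label for each column of log_expr
    Cells with constant expression would yield NaN correlations.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    labels = np.asarray(labels)
    corr = np.corrcoef(log_expr.T)                   # cells x cells correlation matrix
    i, j = np.triu_indices(len(labels), k=1)         # each unordered pair once
    same = labels[i] == labels[j]
    return corr[i, j][same].mean(), corr[i, j][~same].mean()
```

A method that balances the two quantities well raises the first value without raising the second, so the gap between them is the quantity of interest.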
A scatter plot of the correlation between two randomly selected stem cells of the same cell type is shown in Supplementary Figure . As expected, imputation methods usually increased the Pearson correlation between any two cells of the same cell type. However, imputation should not increase the correlation between cells of different cell types, since doing so disregards the biological variation between them. Among all imputation methods, MAGIC achieved the highest correlation within the same cell type, but its correlation between different cell types was also the highest (Figure c, red bars). Bfimpute demonstrated the best balance, maximizing the difference between the correlation within the same cell type and that between different cell types.

We further investigated Bfimpute's performance in visualization and cell type identification on another scRNA-seq dataset, from zebrafish [ ]. This dataset contains single cells from six cell groups; among them, hematopoietic stem and progenitor cells (HSPCs) and HSPCs/thrombocytes come from one defined cell type with expected heterogeneity. After the QC step, the zebrafish dataset was still sparse, with zeros composing over . % of the total counts. The comparison of visualization performance via PCA on the raw and six imputed datasets is shown in Supplementary Figure . The raw dataset only roughly identified the cluster of neutrophil cells, whereas cells from other cell types were mixed and spread widely.
After imputation by Bfimpute, four distinct immune cell subpopulations can be identified for neutrophils, T, natural killer (NK) and B cells, where the cluster members were much more compact compared to those of the raw dataset. Neutrophils, T, NK and B cells were distantly positioned on the PCA plot. HSPCs and HSPCs/thrombocytes were from one defined cell type with expected heterogeneity, so after Bfimpute's imputation they were still spatially closer to each other than to other cells (Supplementary Figure a). The raw data and the data imputed by the other five imputation methods did not correctly identify the four immune cell subpopulations. Clustering accuracy results from the four metrics for Bfimpute were better than those of the other five imputation methods, and Bfimpute achieved a better correlation within the same cell type without losing variation between different cell types (Supplementary Figure b,c).

Bfimpute improves DE and pseudotime analysis

DE analysis is widely used in bulk RNA-seq data. Performing DE analysis on scRNA-seq data to reveal the stochastic nature of gene expression in single cells is challenging, since scRNA-seq data suffers from a high rate of dropout events. However, it has been shown that good imputation methods can lead to better agreement between scRNA-seq and bulk RNA-seq data of the same biological condition on genes known to have little cell-to-cell heterogeneity. We utilized a real dataset by Chu et al [ ] with both bulk and scRNA-seq data available on human embryonic stem cells and definitive endoderm cells (DEC) [ , ] to compare Bfimpute with the raw dataset and the other five imputation methods for DE analysis. This dataset contained six samples of bulk RNA-seq (four in H ES cells and two in DEC) and samples of scRNA-seq ( in H ES cells and in DEC). The percentages of zero entries were . % in bulk data and . % in scRNA-seq data, respectively. We first performed DE analysis on the bulk data and identified the top DE genes by DESeq [ ].
We then plotted these genes' expression profiles in scRNA-seq data for seven conditions: raw dataset, Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. We found that these top genes' expression profiles after Bfimpute's imputation showed better concordance with those in bulk data (Figure a). To further evaluate whether imputation improves DE analysis in scRNA-seq data, we first used DESeq to identify DE genes for the raw scRNA-seq dataset and the scRNA-seq datasets after six different imputations. We then generated different lists of DE genes for the bulk data by applying different thresholds for the false discovery rates of genes. Finally, for every threshold, we compared the DE genes for the bulk data and for the scRNA-seq data of those seven different conditions and calculated the AUC values for each condition. The AUC values suggested that all imputation methods improved DE analysis. Bfimpute generated DE genes most consistent with the bulk data (AUC values raw: . , Bfimpute: . , scImpute: . , SAVER: . , VIPER: . , DrImpute: . and MAGIC: . ).

Bulk data for the same biological condition was provided and could be used as a gold standard against which to compare the average gene expression level in the scRNA-seq data, even though the scRNA-seq data presented more cell-to-cell variation. We expected the average gene expression level in the scRNA-seq data to be highly correlated with the bulk RNA-seq data. To investigate this, we plotted correlations between gene expression in single-cell and bulk data and found that all imputation methods did improve the correlation between bulk and scRNA-seq data, with Bfimpute, MAGIC and scImpute showing the largest improvement (Supplementary Figure ). We further selected several genes (e.g., ANGPT , GDF , BMP , EPB L ) of DECs from different time points to plot their average gene expression levels in both bulk and scRNA-seq data. These genes were annotated with the GO term "endoderm development", and they were likely to be affected by dropout events [ , ].
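The AUC comparison described earlier in this section can be illustrated with a small sketch. How the AUC was computed is not spelled out in the text, so the following is one plausible reading under stated assumptions: genes called DE in the bulk data at a given FDR threshold are treated as positives, and each gene receives a score from the scRNA-seq DE analysis (for example a negative log adjusted p-value, which is an assumption here). The function name `concordance_auc` is hypothetical.

```python
import numpy as np

def concordance_auc(scores, is_bulk_de):
    """AUC of single-cell DE scores against bulk DE calls.

    scores     : per-gene score from the scRNA-seq DE analysis (higher = more DE)
    is_bulk_de : per-gene boolean, True if the gene is DE in bulk at the chosen threshold
    Equals the probability that a bulk-DE gene outranks a non-DE gene (ties count 1/2).
    """
    scores = np.asarray(scores, dtype=float)
    is_bulk_de = np.asarray(is_bulk_de, dtype=bool)
    pos, neg = scores[is_bulk_de], scores[~is_bulk_de]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one DE and one non-DE gene")
    diff = pos[:, None] - neg[None, :]               # all positive/negative pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)
```

Repeating the computation for each bulk FDR threshold and each imputation method gives one AUC per condition, which is the kind of comparison reported above. The explicit pairwise difference matrix is quadratic in memory, so a rank-based formula would be preferable for genome-scale gene lists.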
Read counts imputed by Bfimpute for these genes showed higher gene expression correlation and better consistency with the bulk data (Figure b and Supplementary Figure , ).

In addition to the DE analysis, we also used the time-course scRNA-seq data [ ] from the same Chu et al study to show that Bfimpute improved the recovery of gene expression temporal dynamics through pseudotime analysis. In this dataset, a total of single cells were captured and profiled by scRNA-seq at , , , , , and h of differentiation. We first applied Bfimpute, scImpute and DrImpute to the raw scRNA-seq data with true cell type labels, and then studied how the time-course expression patterns changed in the imputed data. The PCA results showed that read counts imputed by Bfimpute better distinguished cells of different time points and that the cell groups of the six time points were compact (Supplementary Figure a), and the first principal component from PCA indicated that read counts imputed by Bfimpute reflected more accurate transcriptome dynamics along the time course (Figure d). Bfimpute could better differentiate the last two time points ( h and h). In the next section, we discuss imputation with the aid of cell type labels in more detail.

Bfimpute improves performance with the aid of additional experimental information

Imputation methods including Bfimpute, scImpute and DrImpute all first identify similar cells based on clustering, and imputation is then performed by leveraging the expression values from similar cells. Being able to first identify the appropriate cell groups enhances the ability to impute the dropout events. A substantial number of scRNA-seq studies have identified cell types from experimental design or marker genes. We applied Bfimpute, scImpute and DrImpute to the raw scRNA-seq data with true cell type labels in the three real datasets used above and in two additional real datasets.
In this study, SAVER, VIPER, DrImpute, and MAGIC were excluded since they cannot make use of cell labels. We then investigated again the PCA and t-SNE visualizations for the identification of cell subpopulations. Our results showed that Bfimpute outperformed the other two methods and clearly differentiated almost every cell group in the different datasets (Figure and Supplementary Figure , ). For the human embryonic stem cell dataset, Bfimpute additionally assigned three outlier cells to their correct groups compared to the previous imputation without cell labels (see Figure a versus Figure a: one EC (orange point), one DEC (blue point), and one NPC (yellow point) cell were brought back to the corresponding EC, DEC and NPC cell groups, respectively). H cells were also further apart from H cells in the vertical dimension. For the zebrafish dataset, even the most mixed B, NK and T cells (blue, green, and yellow colors) from the raw dataset were separated from each other after Bfimpute's imputation, and HSPCs and HSPCs/thrombocytes cells were spatially close, but split into two cell groups (Figure b and Supplementary Figure b).

Fig. . Bfimpute improves DE and pseudotime analysis. a. The expression profiles of the top DE genes detected in the bulk data by DESeq for seven conditions: raw dataset, Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. b. Time-course expression patterns of the example gene ANGPT that is annotated with the GO term "endoderm development". The small black triangles mark the average bulk data for each time point. c. The first principal component is plotted to show cells of different time points along the differentiation.

To test Bfimpute with another kind of cell-label information, we used a human preimplantation embryonic development dataset (t-SNE and pseudotime analyses are shown in Figure c). The Petropoulos dataset [ ] included single cells from five stages of human preimplantation embryonic development, ranging from developmental day (E) to . The five different stages were clearly distinguished from each other after Bfimpute's imputation. We also applied the three imputation methods to a large x dataset generated by the high-throughput droplet-based system. To generate this dataset, we randomly selected cells from nine immune cell types, so it contained a total of peripheral blood mononuclear cells (PBMCs) [ , ]. In the raw data, . % of read counts are exactly zero. Our PCA and t-SNE results indicated that Bfimpute's imputation identified nine immune cell types from the raw data ( d). In summary, these results suggested that Bfimpute with the aid of labels further improved visualization and identification of cell subpopulations, as well as the downstream analysis.

SCRABBLE is another recent approach that integrates bulk data to impute dropout events in scRNA-seq data. Since Bfimpute can easily adopt bulk data as additional information for the gene latent matrix, we also tested whether bulk data can further improve performance.

Fig. . Bfimpute with labels improves PCA and t-SNE visualizations and cell type identification. a. The first two PCs calculated from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the human embryonic stem cell differentiation study. b. The first two dimensions from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the zebrafish data. c. The first two dimensions from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the human preimplantation embryonic development. d. The first principal component is plotted to show cells of different time points along the embryonic development.

In the scRNA-seq dataset of
human embryonic stem cells with bulk data, we did not observe significant differences between bfimpute and bfimpute with bulk data as additional information (supplementary figure versus figure a). the reason could be that similar gene level information has less effect than similar cell level information for the imputation of dropout events. we also found that in these scrna-seq datasets, scrable’s performance after integrating cell labels information with bulk data, was not better than bfimpute (supplementary figure ). discussion and conclusion scrna-seq has become an indispensable tool in recent years, as it has made it possible to study genome-wide transcriptomes in single cell resolution. due to sequencing technical issues, a large proportion of dropout events exist in scrna-seq data, which limit its usefulness. several approaches have been proposed to solve this problem, with modest results. in this study, we introduced bfimpute to recover dropout events in scrna-seq data. we have shown that bfimpute can improve performance in recovering gene expression detected by bulk rna-seq, as well .cc-by-nc-nd . international licenseperpetuity. it is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in the copyright holder for thisthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / wen et al. as in downstream analyses, including identification of cell sub- populations, differential expressed genes and gene expressions temporal dynamics. bfimpute uses a fully bayesian probabilistic matrix factorization by substituting hyperparameters with hyperpriors and performing gibbs sampling for the approximate inference. 
the advantage of this bayesian model is that it provides a predictive distribution rather than a single point estimate when recovering each dropout event, so the confidence in each prediction can be quantified and taken into account by the model. the use of a fully bayesian model proved to be a considerable advantage that helped bfimpute outperform other imputation methods. bfimpute estimates two latent matrices, one for cells and one for genes, for each cell group through gibbs sampling; once the sampler reaches a stationary state, they are used to generate the final cell-gene expression matrix, in which the dropout events are recovered. another advantage of bfimpute is that it can integrate any gene- or cell-related information about the scrna-seq data into these two latent matrices to impute missing values. information from similar cells, bulk data, or both can be easily integrated into our model. although scimpute and drimpute offer similar functionality in this respect, in that they can impute dropout events with the aid of the number of cell types or cell labels, they did not perform as well as bfimpute on most of the scrna-seq data that we tested. in the future, any cell-level or gene-level resource provided by users could serve as additional information to improve the imputation of dropout events in scrna-seq data.

key points
• imputation to recover dropout events in scrna-seq data is important for determining genome-wide transcriptomes at single-cell resolution.
• bfimpute uses a fully bayesian probabilistic matrix factorization by substituting hyperparameters with hyperpriors and performing gibbs sampling for approximate inference.
• the advantage of this bayesian model is that it provides a predictive distribution rather than a single point estimate when recovering each dropout event, so the confidence in each prediction can be quantified and taken into account by the model.
• bfimpute can integrate any gene- or cell-related information about the scrna-seq data into the two latent gene and cell matrices to impute missing values.
• bfimpute achieves better accuracy than six other widely used scrna-seq imputation methods on simulated and real scrna-seq data, as measured by several different evaluation metrics.

competing interests
there is no competing interest.

author contributions
x.z. conceived and led this work. z.h.w. and x.z. designed the model and implemented the bfimpute software. z.h.w., j.l.l., w.s. and x.z. led the data analysis. z.h.w., w.s. and x.z. wrote the paper with feedback from j.l.l. and l.z.

funding
this work was supported by vanderbilt university development funds (ff ). l.z. is partially supported by research grant council early career scheme (hkbu ).

references
fuchou tang, catalin barbacioru, yangzhou wang, ellen nordman, clarence lee, nanlan xu, xiaohui wang, john bodeau, brian b tuch, asim siddiqui, et al. mrna-seq whole-transcriptome analysis of a single cell. nature methods, ( ): – , .
li-fang chu, ning leng, jue zhang, zhonggang hou, daniel mamott, david t vereide, jeea choi, christina kendziorski, ron stewart, and james a thomson. single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. genome biology, ( ): , .
qin tang, sowmya iyer, riadh lobbardi, john c moore, huidong chen, caleb lareau, christine hebert, mckenzie l shaw, cyril neftel, mario l suva, et al. dissecting hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution using rna sequencing. journal of experimental medicine, ( ): – , .
sophie petropoulos, daniel edsgärd, björn reinius, qiaolin deng, sarita pauliina panula, simone codeluppi, alvaro plaza reyes, sten linnarsson, rickard sandberg, and fredrik lanner. single-cell rna-seq reveals lineage and x chromosome dynamics in human preimplantation embryos. cell, ( ): – , .
grace xy zheng, jessica m terry, phillip belgrader, paul ryvkin, zachary w bent, ryan wilson, solongo b ziraldo, tobias d wheeler, geoff p mcdermott, junjie zhu, et al. massively parallel digital transcriptional profiling of single cells. nature communications, ( ): – , .
peng qiu. embracing the dropouts in single-cell rna-seq analysis. nature communications, ( ): – , .
peter v kharchenko, lev silberstein, and david t scadden. bayesian approach to single-cell differential expression analysis. nature methods, ( ): – , .
ingrid lönnstedt and terry speed. replicated microarray data. statistica sinica, pages – , .
simon anders and wolfgang huber. differential expression analysis for sequence count data. nature precedings, pages – , .
michael i love, wolfgang huber, and simon anders. moderated estimation of fold change and dispersion for rna-seq data with deseq2. genome biology, ( ): , .
oliver stegle, sarah a teichmann, and john c marioni. computational and analytical challenges in single-cell transcriptomics. nature reviews genetics, ( ): – , .
wei vivian li and jingyi jessica li. an accurate and robust imputation method scimpute for single-cell rna-seq data. nature communications, ( ): – , .
wuming gong, il-youp kwak, pruthvi pota, naoko koyano-nakagawa, and daniel j garry. drimpute: imputing dropout events in single cell rna sequencing data. bmc bioinformatics, ( ): – , .
david van dijk, roshan sharma, juozas nainys, kristina yim, pooja kathail, ambrose j carr, cassandra burdziak, kevin r moon, christine l chaffer, diwakar pattabiraman, et al. recovering gene interactions from single-cell data using data diffusion. cell, ( ): – , .
mo huang, jingshu wang, eduardo torre, hannah dueck, sydney shaffer, roberto bonasio, john i murray, arjun raj, mingyao li, and nancy r zhang. saver: gene expression recovery for single-cell rna sequencing. nature methods, ( ): – , .
mengjie chen and xiang zhou. viper: variability-preserving imputation for accurate gene expression recovery in single-cell rna sequencing studies. genome biology, ( ): – , .
tao peng, qin zhu, penghang yin, and kai tan. scrabble: single-cell rna-seq imputation constrained by bulk rna-seq data. genome biology, ( ): , .
jaak simm, adam arany, pooya zakeri, t haber, jörg k wegner, v chupakhin, hugo ceulemans, and yves moreau. macau: scalable bayesian factorization with high-dimensional side information using mcmc. in ieee th international workshop on machine learning for signal processing (mlsp), pages – . ieee, .
andriy mnih and russ r salakhutdinov. probabilistic matrix factorization. in advances in neural information processing systems, pages – , .
ruslan salakhutdinov and andriy mnih. bayesian probabilistic matrix factorization using markov chain monte carlo. in proceedings of the th international conference on machine learning, pages – , .
robrecht cannoodt, wouter saelens, and yvan saeys. computational methods for trajectory inference from single-cell transcriptomics. european journal of immunology, ( ): – , .
christopher r john, david watson, michael r barnes, costantino pitzalis, and myles j lewis. spectrum: fast density-aware spectral clustering for single and multi-omic data. bioinformatics, ( ): – , .
andrew y ng, michael i jordan, yair weiss, et al. on spectral clustering: analysis and an algorithm. advances in neural information processing systems, : – , .
davis j mccarthy, kieran r campbell, aaron tl lun, and quin f wills. scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. bioinformatics, ( ): – , .
luke zappia, belinda phipson, and alicia oshlack. splatter: simulation of single-cell rna sequencing data. genome biology, ( ): – , .
leslie c morey and alan agresti. the measurement of classification agreement: an adjustment to the rand statistic for chance agreement. educational and psychological measurement, ( ): – , .
paul jaccard. the distribution of the flora in the alpine zone. . new phytologist, ( ): – , .
alexander strehl and joydeep ghosh. cluster ensembles - a knowledge reuse framework for combining multiple partitions. journal of machine learning research, (dec): – , .
william m rand. objective criteria for the evaluation of clustering methods. journal of the american statistical association, ( ): – , .
silke wagner and dorothea wagner. comparing clusterings: an overview. universität karlsruhe, fakultät für informatik karlsruhe, .
laurens van der maaten and geoffrey hinton. visualizing data using t-sne. journal of machine learning research, ( ), .
pei wang, ryan t rodriguez, jing wang, amar ghodasara, and seung k kim. targeting sox in human embryonic stem cells creates unique strategies for isolating and analyzing developing endoderm. cell stem cell, ( ): – , .
pei wang, kristen d mcknight, david j wong, ryan t rodriguez, takuya sugiyama, xueying gu, amar ghodasara, kun qu, howard y chang, and seung k kim. a molecular signature for purified definitive endoderm guides differentiation and isolation of endoderm from mouse and human embryonic stem cells. stem cells and development, ( ): – , .
judith a blake, janan t eppig, james a kadin, joel e richardson, cynthia l smith, carol j bult, and mouse genome database group. mouse genome database (mgd)- : community knowledge resource for the laboratory mouse. nucleic acids research, (d ):d –d , .
title
havoc, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for sars-cov-2 sequences.

authors and institutional addresses
phuoc truong nguyen, ilya plyusnin, tarja sironen, olli vapalahti, ravi kant†, teemu smura†
1. department of virology, faculty of medicine, university of helsinki, helsinki, finland
2. institute of biotechnology, university of helsinki, helsinki, finland
3. department of veterinary biosciences, university of helsinki, helsinki, finland
4. department of virology, university of helsinki and helsinki university hospital, helsinki, finland
†correspondence to: ravi.kant@helsinki.fi or teemu.smura@helsinki.fi

abstract
background: sars-cov-2 related research has increased in importance worldwide since december . several new variants of sars-cov-2 have emerged globally, of which the most notable and concerning currently are the uk variant b.1.1.7, the south african variant b.1.351 and the brazilian variant p.1.
detecting and monitoring novel variants is essential in sars-cov-2 surveillance. while there are several tools for assembling virus genomes and performing lineage analyses to investigate sars-cov-2, each is limited to performing one or a few functions separately.
results: because of the lack of publicly available pipelines that can perform fast reference-based assemblies on raw sars-cov-2 sequences and also identify lineages to detect variants of concern, we have developed an open source bioinformatic pipeline called havoc (helsinki university analyzer for variants of concern). havoc can reference assemble raw sequence reads and assign the corresponding lineages to sars-cov-2 sequences.
conclusions: havoc is a pipeline that combines several bioinformatic tools to perform the analyses needed for investigating genetic variance among sars-cov-2 samples. the pipeline is particularly useful for those who need a more accessible and fast tool to detect and monitor the spread of sars-cov-2 variants of concern during local outbreaks. havoc is currently being used in finland for monitoring the spread of sars-cov-2 variants. the havoc user manual and source code are available at https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc, respectively.

keywords
sars-cov-2, variant detection, reference assembly, lineage identification, coronavirus, sequence analysis.

(biorxiv preprint, not certified by peer review; made available under a cc-by-nc international license.)
background
emerging pathogens pose a continuous threat to mankind, as exemplified by the ebola virus epidemic in west africa in [ ], the zika virus pandemic in [ ], and the ongoing coronavirus disease 2019 (covid-19) pandemic. these viruses are zoonotic, i.e. they have crossed species barriers from animals to humans, like the majority of emerging human pathogens [ , ]. the likelihood of this host switching is enhanced by several factors, e.g. global movement of people and animals, environmental changes, increased proximity of humans, wildlife and livestock, and population expansion into new environments [ ].

the mutation and evolution rate of rna viruses is considerably higher than that of their hosts, which is advantageous for viral adaptation. mutations in the viral genome are most of the time silent or, if they affect the phenotype, related to attenuation, although mutations can also lead to more pathogenic strains. a new virus variant may have one or more mutations that separate it from the wild-type virus already circulating among the general population.

coronaviruses (family coronaviridae) are enveloped single-stranded rna viruses, which cause respiratory, enteric, hepatic, and neurological diseases of a broad spectrum of severity among different animals and humans. severe acute respiratory syndrome coronavirus 2 (sars-cov-2), a novel evolutionary divergent virus responsible for the present pandemic, has devastated societies and economies globally. sars-cov-2 has already infected more than million people in countries, causing over . million global deaths as of rd february [ ].
in autumn , a new variant of sars-cov-2 known as 20b/501y.v1 (b.1.1.7) was detected in south-eastern england, wales, and scotland [ ]. this variant has since spread globally to more than countries. the variant has undergone mutations with - nonsynonymous mutations, four amino acid deletions, and six synonymous mutations making the virus more transmissible [ ]. another variant, 20c/501y.v2 (b.1.351), was detected in south africa and is genetically distant from the uk 20b/501y.v1 variant [ ]. this south african variant, which carries two mutations in the receptor-binding motif that mainly forms the interface with the human ace2 receptor, has also spread widely and now circulates globally. some existing vaccines against sars-cov-2 have been reported to be less effective against the 20c/501y.v2 variant [ – ]. a third variant being closely monitored is p.1, first detected in brazil [ ]. interestingly, all three variants have a mutation in the receptor binding domain (rbd) of the spike protein at position 501, where the amino acid asparagine (n) has been replaced with tyrosine (y), which enables specific pcr to detect the n501y mutation [ ].

as more transmissible coronavirus variants are circulating worldwide, the role of researchers and technology specialists in controlling the pandemic has received more emphasis. surveillance of virus variants by sequencing sars-cov-2 genomes would provide a fast way to monitor variants and their spread. however, there are only few publicly available methods for quick reference-based consensus assembly and lineage assignment for sars-cov-2 samples.
for this purpose, we have developed a simple pipeline, called havoc (helsinki university analyzer for variants of concern), for quick reference-based consensus assembly and lineage assignment for sars-cov-2 samples. it gives the end user a quick and accessible method for identifying and monitoring variants. the pipeline was developed to run on unix/linux operating systems, and can therefore also be used on remote servers, e.g. csc – it center for science, finland.

implementation
havoc consists of a single shell script, which performs reference-based consensus assemblies for query sars-cov-2 fastq sequence libraries and assigns lineages to them one after another. the script can be started by typing the following line into a command line terminal:

    sh havoc.sh [fastq directory]

the computation of consensus sequences starts with the tool detecting fastq files generated via paired-end sequencing in the given input directory and checking that each query fastq file has its corresponding counterpart, i.e. mate file. the names of the files are modified to be more concise, e.g. query-seq: _x _y _r _ .fastq.gz to query-seq: _r .fastq.gz. the pipeline accepts fastq files in both gzipped and uncompressed format.

for the analyses, the user can choose which bioinformatic tools to use. this is done by entering the wanted tool for each step (tools_prepro, tools_aligner and tools_sam) in the options section at the beginning of the script file.
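the pairing check and the file renaming described above can be sketched in a few lines of shell. this is only an illustration under stated assumptions, not the code used in havoc.sh: the function names check_mates and shorten_name are hypothetical, mates are assumed to differ only by "_R1" versus "_R2" in the file name, and the renaming assumes an illumina-style name whose sample part contains no underscore.

```shell
#!/bin/sh
# illustrative sketch only; these helpers are hypothetical and not taken from havoc.sh.
# assumption: mates are named <sample>..._R1... and <sample>..._R2...

# report R1 files that lack an R2 mate on stderr; return non-zero if any mate is missing.
check_mates() {
    dir=$1
    missing=0
    for r1 in "$dir"/*_R1*.fastq "$dir"/*_R1*.fastq.gz; do
        [ -e "$r1" ] || continue                 # the glob matched nothing
        base=$(basename "$r1")
        mate=$dir/$(printf '%s' "$base" | sed 's/_R1/_R2/')   # swap first "_R1" for "_R2"
        if [ ! -e "$mate" ]; then
            echo "missing mate for: $base" >&2
            missing=1
        fi
    done
    return "$missing"
}

# shorten an illumina-style name, keeping only the sample name and the read number.
# assumption: the sample name itself contains no underscore.
shorten_name() {
    printf '%s\n' "$1" |
        sed -E 's/^([^_]+)_.*_(R[12])_[0-9]+(\.fastq(\.gz)?)$/\1_\2\3/'
}
```

a possible use would be to call check_mates on the input directory and stop before any alignment starts if it returns non-zero; names that do not match the assumed pattern are passed through shorten_name unchanged.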
to use trimmomatic instead of fastp for pre-processing fastq files, for example, change the line tools_prepro="fastp" in the options section to tools_prepro="trimmomatic". other options include the number of threads, the minimum coverage below which a region is masked (min_coverage), and whether to run pangolin to assign lineages to the consensus genome (run_pangolin). an additional option allows havoc to be run on the csc servers (run_in_csc).

the pre-alignment quality control (removing or trimming low-quality reads and bases, and removing adapter sequences) can be done with either fastp [ ] or trimmomatic [ ]. the reads are then aligned to the reference genome of sars-cov-2 isolate wuhan-hu-1 (genbank accession code: nc_045512.2) with bwa-mem [ ] or bowtie2 [ ]. the resulting sam and bam files are processed (sorting, filling in mate coordinates, marking duplicate alignments, and indexing reads) with sambamba [ ] or samtools [ ], and the low-coverage regions are masked with bedtools [ ]. after masking, variants are called with lofreq [ ] before the consensus sequence is computed with bcftools [ ]. finally, the consensus sequence is analyzed with pangolin [ ] to assign a lineage. the whole process is depicted in figure .

fig. flowchart describing the processes and steps performed by the havoc pipeline. the pipeline constructs consensus sequences from all fastq files in an input directory and then compares the resulting sequences to other established sars-cov-2 genomes to assign them the most likely lineages.
the pipeline requires a fasta file of adapter sequences for fastq pre-processing and a reference genome of sars-cov-2 in a separate fasta file. the adapter file is not required when running the pipeline with the fastp option. input files are highlighted in green and outputs in red.

usage example
here we demonstrate a common use case for havoc with fastq files containing reads for sars-cov-2 sequences, provided by the viral zoonoses research unit at the university of helsinki, finland. the test files within the example_fastqs folder contain paired-end fastq files for the uk variant (uk-variant- ) and the south african variant (s-africa-variant- ). to analyse these example files, run the aforementioned command as follows:

    sh havoc.sh example_fastqs

results
the fastq files are processed and analyzed with the default options, which use the faster bioinformatic tools (fastp, bwa-mem and sambamba), in ca. – minutes, depending on the performance of the platform (local or server). after havoc has finished the analyses, the fastq files of each sample are moved to that sample’s result folder within the fastq directory. each result folder contains a fasta file with the consensus sequence (e.g. uk-variant- _consensus.fa) and a csv file with the lineage information produced by pangolin (e.g. uk-variant- _pangolin_lineage.csv).
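since every result folder holds its own lineage csv, a short helper can gather the assignments of a whole run into one table. the sketch below is an assumption-laden illustration rather than part of havoc: collect_lineages is a hypothetical name, each csv is assumed to have a header row with a column called "lineage", and fields are assumed to contain no quoted commas. the sample name is taken from the file name, not from the csv contents.

```shell
#!/bin/sh
# illustrative sketch only; collect_lineages is a hypothetical helper, not part of havoc.
# assumptions: each csv has a header row with a column named "lineage"
# and no field contains a quoted comma.

# print "sample<TAB>lineage" for every *_pangolin_lineage.csv found below a directory.
collect_lineages() {
    find "$1" -name '*_pangolin_lineage.csv' | sort | while IFS= read -r csv; do
        sample=$(basename "$csv" _pangolin_lineage.csv)
        awk -F',' -v sample="$sample" '
            NR == 1 { for (i = 1; i <= NF; i++) if ($i == "lineage") col = i; next }
            col     { print sample "\t" $col }
        ' "$csv"
    done
}
```

for the example run this would be called as collect_lineages example_fastqs; looking the column up by its header name keeps the helper independent of the column order, which may change between pangolin releases.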
in addition to these main result files, each directory contains the original fastq files, bam files (original, indexed and sorted), variant call files (vcf) with mutation data, the bed file used for masking regions, and fastp report files with the results of fastq processing. the resulting directory and file structure with the example files will look as follows:

    example_fastqs/
        uk-variant- /
            uk-variant- .bam
            uk-variant- _r1.fastq.gz
            uk-variant- _r2.fastq.gz
            uk-variant- _consensus.fa
            uk-variant- _fixmate.bam
            uk-variant- _indel.bam
            uk-variant- _indel.vcf
            uk-variant- _indel_flt.vcf
            uk-variant- _lowcovmask.bed
            uk-variant- _markdup.bam
            uk-variant- _namesort.bam
            uk-variant- _pangolin_lineage.csv
            uk-variant- _sorted.bam
            fastp.html
            fastp.json
        s-africa-variant- /
            s-africa-variant- .bam
            s-africa-variant- _r1.fastq.gz
            s-africa-variant- _r2.fastq.gz
            s-africa-variant- _consensus.fa
            s-africa-variant- _fixmate.bam
            s-africa-variant- _indel.bam
            s-africa-variant- _indel.vcf
            s-africa-variant- _indel_flt.vcf
            s-africa-variant- _lowcovmask.bed
            s-africa-variant- _markdup.bam
            s-africa-variant- _namesort.bam
            s-africa-variant- _pangolin_lineage.csv
            s-africa-variant- _sorted.bam
            fastp.html
            fastp.json

each of the example uk variants should have been categorized as b.1.1.7 and the south african variants as b.1.351 (with pangolearn release - - ). it is important to note, however, that as more sequences are uploaded and the pangolin lineage nomenclature is updated, the assigned lineages may differ from the expected ones described in this paper. regions with low coverage (with the default setting, under ) are marked with the letter n during masking and represent gaps in the final consensus sequences.

havoc is comparable to alternative combinations of tools, e.g. jovian and pangolin, in both speed and accuracy. these tools, however, operate separately, and as of publishing there is no single public tool that can perform both a reference-based consensus assembly and a lineage identification in an easily accessible manner.

conclusions
early detection and understanding of the potential impact of emerging variants of sars-cov-2 is of primary importance and can assist in more efficient surveillance and control of the disease. the emergence of novel sars-cov-2 variants of concern is made more likely and accelerated by the high mutation rates typical of rna viruses and by the growing number of transmissions and infections both locally and globally. with the rising number of variants detected worldwide, many of them associated with increased transmissibility and lower vaccine efficacy, there is an emerging need for fast, efficient and reliable pipelines to help detect, identify and trace sars-cov-2 lineages. these pipelines should also be accessible to researchers who may not be familiar with complex bioinformatic tools or with scripting pipelines.

to address these challenges, we have developed havoc, a simple, reliable and user-friendly pipeline, which can be downloaded from our repository and run without being installed. all its dependencies can be installed via existing package managers, of which we recommend bioconda. havoc could help in the current pandemic situation by detecting variants of concern in sequencing centers and in public health or other organisations that currently track and trace variants of concern worldwide. havoc is currently used in finland for detecting and tracing sars-cov-2 variants of concern, mainly b.1.1.7, b.1.351 and p.1.

availability and requirements
project name: havoc (helsinki university analyzer for variants of concern)
project home page: https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc
operating system(s): linux, mac
programming language: shell script
other requirements: trimmomatic or fastp, bwa-mem or bowtie2, samtools, bedtools, bcftools, lofreq and pangolin.
license: gnu gpl
any restrictions to use by non-academics: license needed

list of abbreviations
sars-cov-2 - severe acute respiratory syndrome coronavirus 2
covid-19 - coronavirus disease 2019
havoc - helsinki university analyzer for variants of concern

references
dixon mg, schafer ij, centers for disease control and prevention (cdc). ebola viral disease outbreak--west africa, . mmwr morb mortal wkly rep. ; : – .
kindhauser mk, allen t, frank v, santhana rs, dye c. zika: the origin and spread of a mosquito-borne virus. bull world health organ. ; : - c. doi: . /blt. . . . taylor lh, latham sm, woolhouse me. risk factors for human disease emergence. philos trans r soc lond b biol sci. ; : – . doi: . /rstb. . . . woolhouse mej, gowtage-sequeria s. host range and emerging and reemerging pathogens. emerging infect dis. ; : – . doi: . /eid . . . morens dm, fauci as. emerging pandemic diseases: how we got to covid- . cell. ; : – . doi: . /j.cell. . . . . worldometer - covid- virus pandemic. https://www.worldometers.info/coronavirus/. accessed feb . . rambaut a, loman n, pybus o, barclay w, barrett j, carabelli a, et al. preliminary genomic characterisation of an emergent sars-cov- lineage in the uk defined by a novel set of spike mutations. virological. . https://virological.org/t/preliminary-genomic-characterisation-of-an- emergent-sars-cov- -lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/ . accessed feb . . leung k, shum mh, leung gm, lam tt, wu jt. early transmissibility assessment of the n y mutant strains of sars-cov- in the united kingdom, october to november . euro surveill. ; . doi: . / - .es. . . . . . tegally h, wilkinson e, giovanetti m, iranzadeh a, fonseca v, giandhari j, et al. emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus (sars- cov- ) lineage with multiple spike mutations in south africa. medrxiv. . .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / doi: . / . . . . . mahase e. covid- : novavax vaccine efficacy is % against uk variant and % against south african variant. 
bmj. ;:n . doi: . /bmj.n . . kupferschmidt k. vaccine . : moderna and other companies plan tweaks that would protect against new coronavirus mutations. science. . doi: . /science.abg . . edwards e. j&j says vaccine effective against covid, though weaker against south africa variant. nbc news. . https://www.nbcnews.com/health/health-news/j-j-vaccine-effective- against-covid-though-weaker-against-south-n . accessed feb . . faria nr, claro im, candido d, franco lam, andrade ps, coletti tm, et al. genomic characterisation of an emergent sars-cov- lineage in manaus: preliminary findings. virological. . https://virological.org/t/genomic-characterisation-of-an-emergent-sars-cov- - lineage-in-manaus-preliminary-findings/ . accessed feb . . centers for disease control and prevention (cdc). emerging sars-cov- variants. https://www.cdc.gov/coronavirus/ -ncov/more/science-and-research/scientific-brief- emerging-variants.html. accessed feb . . chen s, zhou y, chen y, gu j. fastp: an ultra-fast all-in-one fastq preprocessor. bioinformatics. ; :i – . doi: . /bioinformatics/bty . . bolger am, lohse m, usadel b. trimmomatic: a flexible trimmer for illumina sequence data. bioinformatics. ; : – . doi: . /bioinformatics/btu . . li h. aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arxiv. . . langmead b, salzberg sl. fast gapped-read alignment with bowtie . nat methods. ; : – . doi: . /nmeth. . . tarasov a, vilella aj, cuppen e, nijman ij, prins p. sambamba: fast processing of ngs alignment formats. bioinformatics. ; : – . doi: . /bioinformatics/btv . . li h, handsaker b, wysoker a, fennell t, ruan j, homer n, et al. the sequence .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . 
doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / alignment/map format and samtools. bioinformatics. ; : – . doi: . /bioinformatics/btp . . quinlan ar, hall im. bedtools: a flexible suite of utilities for comparing genomic features. bioinformatics. ; : – . doi: . /bioinformatics/btq . . wilm a, aw ppk, bertrand d, yeo ght, ong sh, wong ch, et al. lofreq: a sequence- quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. nucleic acids res. ; : – . doi: . /nar/gks . . pangolin. https://github.com/cov-lineages/pangolin. accessed feb . declarations ethics approval and consent to participate not applicable. consent for publication not applicable. availability of data and materials publicly available at https://bitbucket.org/auto_cov_pipeline/havoc. competing interests the authors declare that they have no competing interests. funding this study was supported by the academy of finland (grant number ), veo - european union’s horizon (grant number ) and the jane and aatos erkko foundation. .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / authors' contributions conceptualization: ptn ip rk ts tsi ov. development: ptn ip rk ts. testing/formal analysis: ptn ip rk ts. funding acquisition: tsi ov. investigation: ptn ip rk ts. methodology: ptn ip rk ts. project administration: rk ts ov. resources: ptn rk ip ts tsi ov. validation: ptn ip rk ts. writing – original draft: ptn rk. writing – review & editing: ip ts tsi ov. acknowledgements none. authors' information none. .cc-by-nc . 
title: searchpv: a novel approach to identify and assemble human papillomavirus-host genomic integration events in cancer
running title: searchpv: detecting viral integrations
authors: lisa m. pinatti b.s. , *, wenjin gu m.s. *, yifan wang phd , ahmed el hossiny , apurva d. bhangale b.s. , collin v. brummel b.a. , thomas e. carey phd , , , ryan e. mills phd , †, j. chad brenner phd , , †
*l.m. pinatti and w. gu should be considered joint first authors
†r.e. mills and j.c. brenner should be considered joint senior authors
cancer biology program, program in the biomedical sciences, rackham graduate school, university of michigan, ann arbor, mi
department of otolaryngology/head and neck surgery, university of michigan, ann arbor, mi
department of computational medicine and bioinformatics, university of michigan, ann arbor, mi
department of human genetics, university of michigan, ann arbor, mi
rogel cancer center, michigan medicine, ann arbor, mi
department of pharmacology, university of michigan, ann arbor, mi
corresponding author: j. chad brenner b msrb , w. medical center drive, ann arbor, mi - - chadbren@umich.edu
funding statement: this study was supported by nih-nci r ca (t.e. carey and j.c. brenner), as well as start-up discretionary funds to j.c. brenner and r.e. mills from the university of michigan. l.m. pinatti was supported by nih-nci r ca .
conflict of interest: the authors declare that there is no conflict of interest.
author contributions: l.m.
pinatti: conceptualization, data curation, formal analysis, investigation, project administration, validation, visualization, writing - original draft, and writing - review and editing.
.cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . /
w. gu: conceptualization, data curation, formal analysis, investigation, project administration, methodology, resources, software, visualization, writing - original draft, and writing - review and editing. y. wang: data curation, formal analysis, investigation, methodology, resources, and software. a.d. bhangale: data curation, formal analysis, investigation, methodology, resources, software, and validation. c.v. brummel: data curation, investigation, project administration, and resources. a. elhossiny: data curation, investigation and software. t.e.
carey: conceptualization, funding acquisition, project administration, resources, supervision, and writing - review and editing. r.e. mills: conceptualization, funding acquisition, methodology, resources, software, supervision, and writing - review and editing. j.c. brenner: conceptualization, funding acquisition, project administration, resources, software, supervision, visualization, and writing - review and editing.
acknowledgments: we would like to thank the university of michigan advanced genomics core for carrying out the targeted capture sequencing and x linked read sequencing. we thank dr. tom wilson for discussions of the data.
precis: to overcome technical challenges of detecting viral integrations in human papillomavirus-related cancers, we optimized a new pipeline called searchpv. using this tool, we found frequent integration near genes and areas of large structural rearrangements in hpv+ models.

abstract:
background: human papillomavirus (hpv) is a well-established driver of malignant transformation in a number of sites including head and neck, cervical, vulvar, anorectal and penile squamous cell carcinomas; however, the impact of hpv integration into the host human genome on this process remains largely unresolved. this is due to the technical challenge of identifying hpv integration sites, which includes limitations of existing informatics approaches to discover viral-host breakpoints from low read coverage sequencing data.
methods: to overcome this limitation, we developed a new hpv detection pipeline called searchpv, designed around targeted capture technology, and applied it to targeted capture data. we performed an integrated analysis of searchpv-defined breakpoints with genome-wide linked read sequencing to identify potential hpv-related structural variations.
results: through analysis of hpv+ models, we show that searchpv detects hpv-host integration sites with a higher sensitivity and specificity than two other commonly used hpv detection callers. searchpv uncovered hpv integration sites adjacent to known cancer-related genes including tp and myc, as well as near regions of large structural variation. we further validated the junction contig assembly feature of searchpv, which helped to accurately identify viral-host junction breakpoint sequences. we found that viral integration occurred through a variety of dna repair mechanisms including non-homologous end joining, alternative end joining and microhomology mediated repair.
conclusions: in summary, we show that searchpv is a new optimized tool for the accurate detection of hpv-human integration sites from targeted capture dna sequencing data.
keywords: genomics, bioinformatics, papillomavirus infections, virus integration, squamous cell carcinoma, dna sequence analysis
introduction:
human papillomavirus (hpv) is a well-established driver of malignant transformation in a number of cancers, including head and neck squamous cell carcinomas (hnscc). although hpv genomic integration is not a normal event in the lifecycle of hpv, it is frequently reported in hpv+ cancers - and it may be a contributor to oncogenesis. in cervical cancer, the incidence of hpv integration increases during progression through cervical intraepithelial neoplasia (cin) i/ii, cin iii and invasive cancer. this process has a variety of impacts on both the hpv and cellular genomes, including disruption of the transcriptional repressor of the hpv oncoproteins e , leading to an increase in genetic instability. hpv integration occurs within/near cellular genes more often than expected by chance and has been reported to be associated with structural variations . recent studies in hnsccs have also suggested that additional oncogenic mechanisms of hpv integration may exist through direct effects on cancer-related gene expression and generation of hybrid viral-host fusion transcripts. a wide array of methods has been previously used for the detection of hpv integration. polymerase chain reaction (pcr)-based methods, such as detection of integrated papillomavirus sequences pcr (dips-pcr) and amplification of papillomavirus oncogene transcripts (apot) , are low-sensitivity assays and are limited in their ability to detect the broad spectrum of genomic changes resulting from this process. next-generation sequencing (ngs) technologies overcome these limitations. previous groups have assessed hpv integration within hnscc tumors in the cancer genome atlas (tcga) and cell lines by whole-genome sequencing (wgs).
, , there are a variety of viral integration detection tools developed for wgs data, such as virusfinder , and virusseq . however, these strategies are designed for a broad range of virus types and require whole genomes to be sequenced at uniform coverage, which can result in a lower sensitivity of detection for specific types of rare viral integration events. to overcome this issue, others have begun to use hpv targeted capture sequencing. , - this strategy allows for better coverage of integration sites than an untargeted approach like wgs but requires sensitive and accurate bioinformatic tools for viral-human fusion detection, which the field has lacked. in our lab, we have found the previously available viral integration callers to have a relatively low validation rate and to provide limited structural information surrounding the fusion sites, which impairs mechanistic studies. therefore, we set out to generate a novel pipeline specifically for targeted capture sequencing data to serve as a new gold standard in the field.

materials and methods:
targeted capture sequencing: dna from um-scc- and pdx- r was submitted to the university of michigan advanced genomics core for targeted capture sequencing. targeted capture was performed using a custom-designed probe panel with high-density coverage of the hpv genome, the hpv / / l /l regions, and over hnscc-related genes, which are detailed in heft neal et al. . following library preparation and capture, the samples were sequenced on an illumina novaseq or hiseq , respectively, with a nt paired-end run.
data were de-multiplexed and fastq files were generated.
novel integration caller (searchpv): the pipeline of searchpv has four main steps, which are detailed below: ( ) alignment; ( ) genome fusion point calling; ( ) assembly; ( ) hpv fusion point calling (figure ). the package is available on github: https://github.com/mills-lab/searchpv.
alignment
the customized reference genome used for alignment was constructed by concatenating the hpv genome (from the papillomavirus episteme (pave) database , ) and the human genome reference ( genomes reference genome sequence, hs d ). we aligned paired-end reads from targeted capture sequencing against the customized reference genome using the bwa mem aligner. then we performed indel realignment with picard tools and gatk . duplicates were marked with the picard markduplicates tool for filtering in downstream steps.
genome fusion points calling
to identify the fusion points, we extracted reads with regions matched to hpv and filtered those reads to meet these criteria: ( ) not a secondary alignment; ( ) mapping quality greater than or equal to ; ( ) not a duplicate. genome fusion points were called from split reads (reads spanning both the human and hpv genomes) and paired-end reads (reads with one end matched to hpv and the other matched to the human genome) in the surrounding region (+/- bp) (figure a). the cut-off criteria for identifying the fusion points were based on empirical practice.
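as an illustration of the three read filtering criteria above, the following python sketch applies them to simplified alignment records. it is not taken from the searchpv code base: the `Read` class, the `passes_filters` and `filter_reads` functions and the `min_mapq` parameter are hypothetical names, and the mapping quality cut-off is left as a caller-supplied argument rather than assumed. the two flag bits are those defined in the sam format specification.

```python
from dataclasses import dataclass
from typing import Iterable, List

# flag bits as defined in the SAM format specification
FLAG_SECONDARY = 0x100  # not the primary alignment of the read
FLAG_DUPLICATE = 0x400  # marked as a PCR or optical duplicate


@dataclass(frozen=True)
class Read:
    """Minimal stand-in for an alignment record (hypothetical type)."""
    name: str
    flag: int
    mapq: int


def passes_filters(read: Read, min_mapq: int) -> bool:
    """Return True if the read is primary, not a duplicate, and well mapped."""
    if read.flag & FLAG_SECONDARY:
        return False
    if read.flag & FLAG_DUPLICATE:
        return False
    return read.mapq >= min_mapq


def filter_reads(reads: Iterable[Read], min_mapq: int) -> List[Read]:
    """Keep only the reads that satisfy all three criteria."""
    return [r for r in reads if passes_filters(r, min_mapq)]
```

the duplicate check relies on duplicates having been flagged beforehand, as done in the alignment step; a read that fails any single criterion is discarded.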
we then clustered the integration sites within bp to avoid duplicated counting of integration events due to the stochastic nature of read mapping and structural variations.
assembly
to construct longer sequence contigs from individual reads, we extracted supporting split reads and paired-end reads for local assembly from each integration event. due to the library preparation methods we implemented for the targeted capture approach, some reads exhibited an insert size less than x read length, resulting in overlapping read segments. for such events, we first merged these reads using pear and then combined them with other individual reads to perform a local assembly with cap (figure ).
hpv fusion point calling
for each integration event, the assembly algorithm was able to report multiple contigs. we developed a procedure to evaluate and select contigs for each integration event to call the hpv fusion point more precisely. first, we aligned the contigs against the human genome and hpv genome separately with bwa mem. if the contig met the following criteria, we marked it as high confidence: ( ) has at least supportive reads; ( ) $ \% < \dfrac{\text{matched length of the contig to hpv}}{\text{length of contig}} < \% $. then we separated the contigs we assembled into two classes: from left side (contig a in fig b) and from right side (contig b in fig b).
for each class, if there were high confidence contigs in the class, we selected the contig with maximum length among them; otherwise we selected the contig with the most supportive reads. for each insertion event, we reported one contig if it only had contigs from one side and two contigs if it had contigs from both sides (figure c). finally, we identified the fusion points within hpv based on the alignment results of the selected contigs against the hpv genome. the bam/sam file processing in this pipeline was done with samtools and the analysis was performed with r . . and python.

results:
searchpv pipeline: to overcome the limited ability of wgs to detect rare viral integration events, we performed hpv targeted capture sequencing, which allows for deeper investigation of these events. currently available bioinformatics pipelines are not designed for this type of data, so we developed a novel hpv integration detection tool for targeted capture sequencing data, which we termed “searchpv”. two hpv + hnscc models, um-scc- and pdx- r, were subjected to targeted-capture based illumina sequencing using a custom panel of probes spanning the entire hpv genome. the paired-end reads then went through the four analysis steps of searchpv: alignment to the custom reference genome, genome fusion point calling, local assembly and precise fusion point calling (figure ). analysis of the integration sites in the models using searchpv showed a high frequency of hpv integration, with a total of six events in um-scc- and ninety-eight in pdx- r (figure , table s -s ).
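the contig selection rule from the hpv fusion point calling step of the methods can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the searchpv implementation: the `Contig` class and all function names are hypothetical, and the thresholds `min_reads`, `min_frac` and `max_frac` are parameters supplied by the caller, standing in for the cut-offs given in the methods.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Contig:
    """Hypothetical record describing one assembled contig."""
    name: str
    side: str                # "left" or "right" of the integration event
    length: int              # contig length in bp, assumed > 0
    supporting_reads: int    # number of reads supporting the contig
    hpv_matched_length: int  # bp of the contig aligned to the HPV genome


def is_high_confidence(c: Contig, min_reads: int,
                       min_frac: float, max_frac: float) -> bool:
    """Enough supporting reads and an HPV-matched fraction strictly inside the bounds."""
    frac = c.hpv_matched_length / c.length
    return c.supporting_reads >= min_reads and min_frac < frac < max_frac


def select_for_side(contigs: Iterable[Contig], min_reads: int,
                    min_frac: float, max_frac: float) -> Optional[Contig]:
    """Longest high-confidence contig, else the contig with the most supporting reads."""
    pool = list(contigs)
    if not pool:
        return None
    high = [c for c in pool if is_high_confidence(c, min_reads, min_frac, max_frac)]
    if high:
        return max(high, key=lambda c: c.length)
    return max(pool, key=lambda c: c.supporting_reads)


def select_contigs(contigs: Iterable[Contig], min_reads: int,
                   min_frac: float, max_frac: float) -> List[Contig]:
    """Report one contig per side that has any contigs (one or two in total)."""
    pool = list(contigs)
    selected = []
    for side in ("left", "right"):
        best = select_for_side((c for c in pool if c.side == side),
                               min_reads, min_frac, max_frac)
        if best is not None:
            selected.append(best)
    return selected
```

requiring the hpv-matched fraction to lie strictly between two bounds excludes contigs that are almost entirely viral or almost entirely human, neither of which would span a junction.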
comparison to other integration callers and confirmation of integration sites: in addition to using searchpv, we used two previously developed integration callers, virusfinder and virusseq , to independently call integration events in both um-scc- and pdx- r (figure , tables s - ). we found that searchpv called hpv integration events at a much higher rate than either previous caller. there were a large number of sites that were only identified by searchpv (n= ). in order to assess the accuracy of each caller, we performed pcr on source genomic dna followed by sanger sequencing with primers spanning the hpv-human junction sites predicted by the callers (figure c. s , table s ). we tested all integration sites with sufficient sequence complexity for primer design (n= ), twenty-five of which were unique to searchpv and five of which were unique to virusseq. virusfinder does not allow for local assembly of the integration junctions, which rendered us unable to test these sites. sites unique to searchpv had a confirmation rate of / ( %). the confirmation rate of high confidence searchpv sites was higher than that for low confidence sites ( / ( %) versus / ( %)). in contrast, only / ( %) sites unique to virusseq could be confirmed.

localization of integration sites: we next examined the integration sites detected by searchpv. the six integration sites discovered in um-scc- were clustered on chromosome q within/near the cellular gene tp and involved the hpv genes e , e or l . the integration sites fell within intron , intron and exon . one additional integration site was .
kb downstream of the tp coding region. within pdx- r, hpv integration sites were identified across different chromosomes, occurring most frequently on chromosome . for the integration events of pdx- r, we identified breakpoints in the hpv genome. the most frequently involved hpv genes were e ( / ( %)) and l ( / ( %)). most of the integration sites mapped to within/near (< kb) a known cellular gene ( / ( %)). of the sites that fell within a gene, the majority of integrations took place within an intronic region ( / ( %)). although the integration sites were scattered throughout the human genome, we saw examples of closely clustered sites around cancer-relevant genes, including znf and snx on chromosome q . , myc on chromosome q . and foxn on chromosome p . .

association of integration sites and large-scale duplications
we predicted that the complex integration sites we discovered in um-scc- and pdx- r would be associated with large-scale structural alterations of the genome, such as rearrangements, deletions and duplications. to identify these alterations, we subjected um-scc- and pdx- r to x linked-read sequencing. we generated over billion reads for each sample (table s ), with phase blocks (contiguous blocks of dna from the same allele) of up to . m and . m bases in length for um-scc- and pdx- r, respectively (figure s ). this led to the identification of high confidence large structural events in um-scc- and events in the pdx- r model. we then performed integrated analysis with our searchpv results.
there was a kb duplication surrounding the integration events in tp in um-scc- (figure a). in pdx- r, / ( %) integration sites were within a region that contained a large-scale duplication, while the other integration events fell outside regions of large structural variation. this suggested that in this pdx model, / ( %) large structural events were potentially induced during hpv integration. for example, the clusters of integration events surrounding znf and snx , myc, as well as foxn were also associated with large genomic duplications (figure b-c).

microhomology at junction sites: finally, to evaluate possible mechanisms of dna repair-mediated integration, we examined the degree of sequence overlap between the genomes at each junction site that was covered by contigs. we saw three types of junction points: those with a gap of unmapped sequence between the human and hpv genomes, those that had a clean breakpoint between the genomes, and those with sequence that could be mapped to both genomes (figure a). the majority of junction sites in both samples had at least some degree of microhomology ( %) (figure b-c). integration sites with clean breaks ( bp overlap) and bp of overlap were the most frequently seen junctions in pdx- r, but a wide range of overlap lengths was observed. there was also a large number of junctions with gaps between the human and hpv genomes ranging from - bp long.
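the three junction types described above can be distinguished from the coordinates of the human-aligned and hpv-aligned segments on a contig. the python sketch below is an assumption-laden illustration rather than the procedure used in this study: it assumes 0-based, half-open coordinates on the contig, with the human-aligned segment preceding the hpv-aligned segment, and the function names `classify_junction` and `tally_junctions` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_junction(human_end: int, hpv_start: int) -> Tuple[str, int]:
    """Classify one junction from contig coordinates.

    human_end: end (exclusive) of the segment aligned to the human genome.
    hpv_start: start (inclusive) of the segment aligned to the HPV genome.
    Returns the junction type and the number of bases involved.
    """
    overlap = human_end - hpv_start
    if overlap > 0:
        # bases that align to both genomes
        return ("microhomology", overlap)
    if overlap == 0:
        # the two alignments abut exactly
        return ("clean", 0)
    # bases between the two alignments that map to neither genome
    return ("gap", -overlap)


def tally_junctions(junctions: Iterable[Tuple[int, int]]) -> Counter:
    """Count junction types over (human_end, hpv_start) pairs."""
    return Counter(classify_junction(h, v)[0] for h, v in junctions)
```

for a contig whose orientation is reversed (hpv segment first), the same logic applies with the roles of the two segments swapped.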
discussion
we developed a novel bioinformatics pipeline that we termed “searchpv” and showed that it operates more accurately and efficiently than existing pipelines on targeted capture sequencing data. the software also has the advantage of performing local contig assembly around the junction sites, which simplifies downstream confirmation experiments. we used our new caller to interrogate the integration sites found in two hnscc models in order to compare the accuracy of our caller to the existing pipelines. we then evaluated the genomic effects of these integrations on a larger scale by x linked-reads sequencing to identify the role of hpv integration in driving structural variation in the tumor genome. using searchpv, we were able to investigate the hpv-human integration events present in um-scc- and pdx- r. importantly, um-scc- has been previously assessed for hpv integration by a variety of methods , - , which we leveraged as ground truth knowledge to validate our integration caller. all previous studies were in agreement that hpv is integrated within the cellular gene tp , although the exact number of sites and locations within the gene varied by study. in this study, searchpv also called hpv integration sites within tp . we found integrations of e , e and l within tp intron , l within intron and e within tp exon . these integration sites were also detected using dips-pcr and/or wgs with the exception of e into intron , which was unique to our caller and confirmed by direct pcr. it is possible that the integration sites detected in this sample represent multiple fragments of one larger
integration site. there were additional sites called by other wgs studies that we did not detect (intron and exon ), although it is possible that alternate clonal populations grew out due to different selective pressures in different laboratories. nonetheless, the analysis clearly demonstrated that searchpv was able to detect a well-established hpv insertion site. in contrast to um-scc- , to our knowledge pdx- r has not been previously analyzed for viral-host integration sites and therefore represented a true discovery case. we identified widespread hpv integration sites throughout the host genome and also observed that % of integration sites were found within or near genes. this aligns with previous reports that integrations are detected in host genes more frequently than expected by chance. , , , one particularly interesting cluster of integration events surrounded the cellular proto-oncogene myc. importantly, myc has been identified as a potential hotspot for hpv integration , and the junctions we detected in/near this gene had - bp of microhomology, potentially driving this observation. accordingly, an hpv-integration related promoter duplication event, which may be expected to drive expression, would be consistent with a novel genetic mechanism to drive expression of this oncogene. tp has also been reported to be a hotspot for hpv integration, as it has been recorded in multiple samples besides um-scc- . , , , there is a high degree of microhomology between hpv and this gene. given the high frequency of molecular alterations in the epidermal differentiation pathway (e.g. notch / , tp and znf ) in hpv+ hnsccs, these data support hpv integration as a pivotal mechanism of viral-driven oncogenesis in this model.
hpv integration sites have been associated with structural variations in the human genome, which supports an additional genetic mechanism as to why hpv integration sites may often be detected adjacent to host cancer-related genes. these structural variation events are thought to be due to the rolling circle amplification that takes place at the integration breakpoint, leading to the formation of amplified segments of genomic sequence flanked by hpv segments. our data are consistent with these previous reports in that approximately half of the integration events we discovered were associated with a large-scale amplification. it is unclear why only some integration sites were associated with structural variants, but it is possible that an alternative mechanism of integration occurred.

importantly, this observation that hpv integration events tended to be enriched in cellular genes could result from multiple different mechanisms. integration could occur preferentially in regions of open chromatin during cell replication and keratinocyte differentiation. other potential mechanisms are: 1) that hpv integration is directed to specific host genes by homology, or 2) that hpv integration is random, but events that are advantageous for oncogenesis are clonally selected and expanded, implicating non-homology based dna repair mechanisms. therefore, to help resolve differences in the mechanism of integration, we assessed microhomology at the hpv-human junction points. the majority of breakpoints had some level of microhomology.
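the three junction types distinguished in this analysis (gap, clean break, microhomology) can be illustrated with a minimal sketch. it assumes that each assembled contig has already been aligned to both genomes and that the contig coordinates where the left-hand alignment ends and the right-hand alignment begins are known. the function names and the example coordinates are hypothetical and are not taken from searchpv; a gap is reported as a negative overlap, following the convention of the microhomology figure.

```python
from collections import Counter


def junction_overlap(left_end: int, right_start: int) -> int:
    """Signed overlap (bp) between the two aligned segments of one contig.

    left_end    -- contig position (exclusive) where the left alignment stops
    right_start -- contig position (inclusive) where the right alignment starts
    """
    return left_end - right_start


def classify_junction(overlap: int) -> str:
    """Name the junction type implied by a signed overlap."""
    if overlap > 0:
        return "microhomology"  # bases shared by both alignments
    if overlap == 0:
        return "clean break"    # the two alignments abut exactly
    return "gap"                # unaligned inserted bases in between


# Hypothetical (left_end, right_start) pairs, for illustration only.
junctions = [(120, 116), (98, 98), (150, 155), (80, 77)]
overlaps = [junction_overlap(left, right) for left, right in junctions]
summary = Counter(classify_junction(o) for o in overlaps)
print(overlaps)
print(dict(summary))
```

tabulating the signed overlaps in this way gives the kind of per-sample distribution that the microhomology figure summarises.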
the most frequent levels of overlap were and bp, which potentially implicates non-homologous end joining (nhej) in repair at these sites, since this pathway most frequently results in - bp of overlap. there were also a number of junction sites that demonstrated a gap of inserted sequence between the hpv and human genomes. it has been described that during polymerase theta-mediated end joining (tmej), stretches of - bp are frequently inserted at the site of repair, possibly accounting for these sites. however, given the relatively small number of events we examined, we expect that future analysis with our pipeline will help resolve the specific role of each dna repair pathway in hpv-human fusion breakpoints.

overall, our new hpv detection pipeline searchpv overcomes a gap in the field of viral-host integration analysis. while the performance of searchpv has only been examined on two models, in the future, we expect that the application of this pipeline in large hpv+ cancer tissue cohorts will help advance our understanding of the potential oncogenic mechanisms associated with viral integration. with the emerging set of tools such as searchpv, we believe the field is now primed to make major advances in the understanding of hpv-driven pathogenesis, some of which may lead to the development of novel biomarkers and/or treatment paradigms.

figure legends:

figure : workflow of searchpv. (a) paired-end reads from targeted capture sequencing were aligned to a catenated human-hpv reference genome. after duplicate removal and filtering, fusion points were identified by split reads and paired-end reads. informative reads were extracted for local assembly. read pairs that have overlaps were merged first before assembly. assembled contigs were aligned to the hpv genome to identify the breakpoints on hpv. (b) contigs were divided into two classes. blue solid triangle demonstrates the matched region of the contig. grey dashed triangle demonstrates the clipped region of the contig. contig a would be assigned to left group and contig b would be assigned to right group. contig c would be randomly assigned to left or right group. (c) workflow for the contig selection procedures for fusion point with multiple candidate contigs. for each fusion point, we report at least one contig and at most two contigs representing two directions.

figure : distribution of breakpoints in the human and hpv genomes called by searchpv. (a) distribution of integration sites in the human genome for pdx- r. each bar denotes the count of breakpoints within the region. (b) links of breakpoints in the human and hpv genomes for pdx- r. (c) links of breakpoints in the human and hpv genomes for um-scc- . (d) quantification of breakpoint calls in human genes for pdx- r.
(e) quantification of breakpoint calls in the hpv genes for pdx- r. (f) quantification of breakpoint calls in the hpv genes for um-scc- .

figure : comparison of integration sites called by searchpv, virusseq and virusfinder in both models. (a) each bar denotes an integration site. the colormap shows the count of the integration sites. (b) number of integration sites called by each program. (c) pcr confirmation rate of sites called by each program.

figure : genomic duplications associated with hpv integration in um-scc- (a) and pdx- r (b-d). red arrows indicate integration site. each plot shows the number of overlapping barcodes observed in sequencing reads of that region.

figure : microhomology at junction points. (a) the three types of junction points (panel labels: gap, clean break, microhomology). (b) level of microhomology (in bp) in um-scc- . (c) level of microhomology (in bp) in pdx- r. junctions with a gap are shown as negative numbers.

figure s : pcr validation gel electrophoresis. top band of each row shows gapdh ( bp), bottom bands represent predicted hpv-human junctions (ranging from - bp). red boxes demonstrate bands that appeared at the correct molecular weight and were validated by sanger sequencing.

figure s : linked read snp phase plots for um-scc- (a) and pdx- r (b) genomes. alternating colors represent different phase blocks, which are contiguous blocks of dna from the same allele based on differential snp phasing performed by longranger software.

streamlining differential exon and ’ utr usage with diffutr

stefan gerber, gerhard schratt & pierre-luc germain*

group of computational neurogenomics, d-hest institute for neurosciences, eth zürich
lab of systems neuroscience, d-hest institute for neurosciences, eth zürich
lab of statistical bioinformatics, dmls, university of zürich
sib swiss institute of bioinformatics
*correspondence to pierre-luc germain (pierre-luc.germain@hest.ethz.ch)

abstract

background: despite the importance of alternative poly-adenylation and ’ utr length for a variety of biological phenomena, there are limited means of detecting utr changes from standard transcriptomic data.
results: we present the diffutr bioconductor package which streamlines and improves upon differential exon usage (deu) analyses, and leverages existing deu tools and alternative poly-adenylation site databases to enable differential ’ utr usage analysis. we demonstrate the diffutr features and show that it is more flexible and more accurate than state-of-the-art alternatives, both in simulations and in real data.

conclusions: diffutr enables differential ’ utr analysis and more generally facilitates deu and the exploration of their results.

.cc-by-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nd/ . /

background

coding sequences in eukaryotic mrnas are generally flanked by transcribed but untranslated regions (utrs) which can impact rna stability, translation, and localization [ ]. in particular, the length of ’ utrs often varies even within a given gene due to the use of different poly-adenylation (polya) sites [ ], leading especially to the inclusion or not of regulatory elements such as binding sites for micrornas (mirnas) or rna-binding proteins [ ]. alternative poly-adenylation (apa) is highly prevalent in mammals [ ] and has been shown to be important to a variety of biological phenomena [ , , , ]. a number of methods for ’ end sequencing have been developed with the goal to map apa sites [ , , , , , , ], leading to the development of atlases such as polyasite [ ] or polya db [ ]. as such methods are only marginally used, however, it would be beneficial to leverage the widespread availability of traditional rna-seq for the purpose of identifying changes in ’ utr usage.
a chief difficulty here is that most utr variants are not catalogued in standard transcript annotations, limiting the utility of standard transcript-level quantification based on reference transcripts, such as salmon [ ]. nevertheless, a number of methods have been developed for this purpose. methods like dapars [ ] and apatrap [ ] try to infer new polya sites from read coverage changes from rna-seq experiments; however, the depletion of rnaseq coverage at the ’ end of transcripts makes the precise inference of polya sites challenging [ ]. other tools like qapa [ ] and apalyzer [ ] use already available polya site databases but only compare the usage of the most proximal polya sites to distal ones in a pairwise fashion and fail to grasp the full complexity of dynamic apa when there are three or more polya sites, which is the case for approximately half of mammalian transcripts [ ]. furthermore they do not make use of the already proven statistical frameworks to analyse differential exon usage (deu) from count data [ , , , ]. these frameworks take into account the inherent properties of read count distributions and are arguably more appropriate to analyse differences in relative polya site usage, which is conceptually highly similar to deu. we therefore developed diffutr, which streamlines and improves upon well established deu tools, and leverages them, along with polya site databases, to infer alternative ’ utr usage across conditions.

results

streamlining differential bin/exon usage analysis

popular bin-based deu methods are provided by the limma [ , ], edger [ ] and dexseq [ ] packages. however, their usage is not straightforward for non-experienced users, and their results are often difficult to interpret. we therefore developed a simple workflow (figure a), usable with any of the three methods but standardizing inputs and outputs. in particular, bin annotation and quantification, as well as differential usage results, are all stored in a rangedsummarizedexperiment [ ], which facilitates data storage and exploration, and enables advanced plotting functions irrespective of the underlying method. diffutr is flexible in its application, and supports the use of strand information if available.

figure : overview. a: diffutr workflow. bins are prepared from various types of gene annotations as well as, optionally, additional apa-driven segmentation and extension, then read counts within bins as well as bin information are stored in a standardized rangedsummarizedexperiment, which can then be used as an input for any of the three deu methods, producing again a standardized output that can be used with the package’s plotting functions. b: schematic of bin preparation. apa sites are used to further segment and extend disjoined gene bins. (panel labels: transcript annotation (granges / ensdb / .gtf), polya sites (granges / .bed), preparebins, countfeatures, bam files, (rsubread), deu wrappers (dexseq, diffsplice, edger), bins (granges), rangedsummarizedexperiment, plotting functions.)

improvement to diffsplice

diffutr also implements an improved version of limma’s diffsplice method which does not assume constant residual variance across bins of the same gene (see diffsplice ). to test the effect of these modifications in a standard deu setting, we ran both versions (as well as the other two deu methods) on simulated data from a previous deu benchmark [ ]. the precision and recall results (figure a) confirmed the previously observed superiority of dexseq and, more generally, the imperfect false discovery rate (fdr) control. importantly, it also confirmed that our improved diffsplice method outperforms the original, at no additional computing cost.

figure : fdr and recall (tpr) on simulated data. a: in the classical deu context. b: in the differential utr usage context. the dashed line indicates a real false discovery rate (fdr) of %, and the dots indicate nominal fdrs of , and %. diffutr methods far outperform qapa and dapars. in both contexts, our modifications to diffsplice significantly improve its performance. (panel titles: differential exon usage, differential utr usage; axes: fdr, tpr; methods shown: apalyzer (two variants), dapars, dexseq, diffsplice (two variants), edger, qapa.dpau, qapa.pval.)

application to differential utr usage and benchmark on a simulation

we next sought to evaluate the methods when applied for differential utr analysis. for this purpose, apa sites are used to further segment and extend utr bins, as illustrated in figure b (see methods for the details). given the absence of rnaseq data with a differential utr usage ground truth, we simulated reads with known utr differences from real data (see simulated data).
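the segmentation and extension of bins by apa sites can be sketched as follows. this is a plain-python illustration, not the diffutr implementation (which is an r/bioconductor package working on granges objects): it assumes a single plus-strand gene with sorted, non-overlapping half-open bins, and the `max_extension` cutoff is a hypothetical parameter introduced here only to bound how far downstream a polya site may extend the last bin.

```python
def segment_and_extend(bins, apa_sites, max_extension=5000):
    """Split bins at internal polyA sites and extend past the last bin.

    bins          -- sorted, non-overlapping half-open (start, end) intervals
    apa_sites     -- polyA site positions on the same (plus) strand
    max_extension -- hypothetical cap on downstream extension, in bp
    """
    sites = sorted(set(apa_sites))
    out = []
    for start, end in bins:
        # Cut the bin at every site that falls strictly inside it.
        cuts = [s for s in sites if start < s < end]
        edges = [start] + cuts + [end]
        out.extend(zip(edges[:-1], edges[1:]))
    # Add one new bin per downstream site within the allowed distance.
    last_end = bins[-1][1]
    previous = last_end
    for s in sites:
        if last_end < s <= last_end + max_extension:
            out.append((previous, s))
            previous = s
    return out


# Hypothetical coordinates, for illustration only.
example = segment_and_extend(
    bins=[(100, 200), (300, 500)],
    apa_sites=[350, 450, 600, 900, 10000],
    max_extension=1000,
)
print(example)
```

each resulting bin can then be counted and tested like an exon bin, which is what allows the deu frameworks to be reused for utr usage.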
we then ran the different diffutr methods (as well as the unmodified diffsplice variant), and compared them to alternative methods. while dapars and apalyzer provide gene-level significance testing, qapa does not, and our attempts to use its equivalence classes with standard transcript usage methods (see methods) gave very poor results. therefore, for the purpose of comparison we tried two alternatives: simply ranking genes according to qapa’s main output, i.e. the absolute difference in polya site usage between conditions (|∆pau|), labeled in figure b as qapa.dpau, or running t-tests on the log-transformed pau values, labeled as qapa.pval. since apalyzer produces different analyses for genes’ ’ end and intronic apa usage, we used both the ’ end results and a combination of the two (the latter shown as apalyzer ). as figure b shows, all diffutr methods outperformed alternatives by far. on this test, our improved diffsplice had comparable performance to dexseq, at a fraction of the computing costs.

differential utr usage in real data

we next sought to test diffutr in real data. first, since ’ utrs are known to generally lengthen during neuronal differentiation [ , ], we expected to observe a skew towards positive fold changes of ’ utr bins when comparing rnaseq experiments from embryonic stem cells (esc) and esc-derived neurons. we therefore re-analyzed data from [ ] and clearly observed the expected skew among statistically significant genes, especially for bins with a higher expression (figure a).
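one simple way to quantify such a skew, given only the log fold changes of the significant bins, is an exact two-sided sign test. the sketch below is an illustration under that assumption and is not the analysis reported here; the function name and the input values are invented.

```python
from math import comb


def sign_skew(log_fold_changes):
    """Fraction of positive values and exact two-sided sign-test p-value.

    Zeros carry no direction and are ignored.
    """
    signs = [x > 0 for x in log_fold_changes if x != 0]
    n = len(signs)
    if n == 0:
        raise ValueError("need at least one non-zero fold change")
    k = sum(signs)
    # Under the null, k ~ Binomial(n, 0.5); the distribution is symmetric,
    # so the two-sided p-value is twice the smaller tail, capped at 1.
    smaller = min(k, n - k)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return k / n, min(1.0, 2 * tail)


# Invented log fold changes of significant UTR bins, for illustration only.
fraction_positive, p_value = sign_skew([1.2, 0.8, 0.5, 2.0, -0.3, 0.9, 1.1, 0.4])
print(fraction_positive, p_value)
```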
we next found both ’ sequencing and standard rnaseq data from samples of mouse hippocampal slices undergoing forskolin-induced long-term potentiation [ ], which enabled us to use the ’ sequencing data as a truth for analysis performed on the standard rnaseq data (figure b and supplementary figure ). in this case we represent the results through receiver-operator characteristic (roc) curves since the precision-recall curves make the differences less visible due to the lower general power. although power to detect utr changes is necessarily low with respect to ’ sequencing, we again observed that diffutr methods clearly outperformed all alternative methods.

exploring differential exon/utr usage results

diffutr provides three main plot types to explore differential bin usage analyses, each with a number of variations. figure showcases them in the context of long-term potentiation of mouse hippocampal neurons [ ].

plottopgenes (figure a) provides gene-level statistic plots (similar to a ‘volcano’ plot), which come in two variations. for standard deu analysis, absolute bin-level coefficients are weighted by significance and averaged to produce gene-level estimates of effect sizes. for differential ’ utr usage, where bins are expected to have consistent directions (i.e. lengthening or shortening of the utr) and where their size is expected to have a strong impact on biological function, the signed bin-level coefficients are weighted both by size and significance to produce gene-level estimates of effect sizes. by default, the size of the points reflects the relative expression of the genes, and the color the relative expression of the significant bins with respect to the gene.

figure : differential utr analysis on real data. a: ’ utr lengthening during neuronal differentiation. plotted are the utr bins found statistically significant (bin- and gene-level fdr both < . ) by diffutr (diffsplice ) when comparing in vitro differentiated neurons to mouse embryonic stem cells. the color indicates the point density. note the clear skew towards a positive bin-level foldchange (indicative, in most cases, of a utr lengthening), especially for bins with a higher mean count (cpm=counts per million reads sequenced). b: receiver-operator characteristic (roc) curves of differential utr usage analysis on the ltp dataset, using ’ sequencing to establish the ground truth. the axes are square-root-transformed to improve visibility, and only a subset of method variations are shown (see supplementary figure for all variants). (axis labels: bin log (foldchange), bin mean log (cpm); fpr, tpr; methods shown: apalyzer, dapars, dexseq, diffsplice, qapa.pval.)

deubinplot (figure b) provides bin-level statistic plots for a given gene, similar to those produced by dexseq and limma, but offering more flexibility. they can be plotted as overall bin statistics, per condition, or per sample, and can display various types of values. importantly, since all data and annotation are contained in the object, these can easily be included in the plots. figure b shows a lengthening of the jund ’ utr in the ltp group.

finally, genebinheatmap (figure c) provides a compact, bin-per-sample heatmap representation of a gene, allowing the simultaneous visualization of various information. we found these representations particularly useful to prioritize candidates from differential bin usage analyses.
For example, many genes show differential usage of bins which are generally not included in most transcripts of that gene (low count density), and are therefore less likely to be relevant.

[Figure: A, weighted absolute coefficient vs. −log(q-value), with points annotated by gene mean density and density ratio; B, sqrt-scaled genomic location with bin type (UTR, CDS) and condition (CTRL, LTP); C, heatmap of scaled logCPM annotated with bin type, log width, mean log density, log p-value and condition.]

Figure : Plotting functions. A: plotTopGenes provides significance and effect size statistics aggregated at the gene level. B: deuBinPlot provides a more flexible version of the bin-level gene plots generated by common DEU packages. Shown here is the upregulation of the Jund 3′ UTR upon LTP. C: geneBinHeatmap provides a compact, bin-per-sample heatmap representation of a gene.

Further variations tested

During implementation, we tested other changes to the method which were ultimately discarded as they did not improve performance, but which we briefly report here. First, differential UTR analysis differs from typical differential exon usage analysis in that the vast majority of UTR bins are consecutively transcribed, meaning that changes in the usage of a
bin should also be visible in downstream bins. We therefore reasoned that this property could be used to improve the statistical analysis: connected bins with significant fold changes in the same direction could be unified and their p-values aggregated. We tested a rudimentary implementation using Fisher's aggregation; however, this decreased accuracy and led to worse FDR control (Supplementary Figure ).

Second, most methods compare bin-level fold changes to gene-level ones to identify bins behaving differently from the others. We reasoned that, especially for genes with more UTR bins than CDS bins, including 3′ UTR counts when calculating overall gene expression could underestimate the gene expression and possibly mistake the UTR fold change for the gene fold change. We therefore tried a modification of diffSplice that calculates the gene fold change from coding sequence (CDS) bins only and then compares it to the individual bins. Again, this approach proved unsuccessful (Supplementary Figure ).

Discussion

diffUTR streamlines DEU analysis and outperforms alternative methods in inferring UTR changes, which demonstrates the utility of harnessing powerful, well-established frameworks for new ends. It must be noted that the way in which the simulation was performed, i.e. elongating transcripts to the next poly(A) site(s), is similar to the way diffUTR disjoins the annotation into bins, which could bias the benchmark towards this method (as well as QAPA and APAlyzer, which also make use of alternative poly(A) sites).
However, this is unlikely to be the reason for the observed superiority of diffUTR-based methods, given the considerable extent by which they outperformed alternatives and the observation of similar results in real data.

Similar to DEU tools [ ], diffUTR fails to control the FDR correctly, and our attempts so far to improve this have remained unsuccessful. We therefore recommend prudence with results close to the significance threshold. In addition, and in contrast to DEU, where exons are subject to splicing in a potentially independent fashion, 3′ UTRs typically do not undergo splicing and therefore only differ in length between conditions. This means that the behavior of a UTR bin is dependent on that of upstream bins, a property which could be exploited to improve accuracy at the gene level. However, our simple attempt to do so by combining p-values of consecutive bins did not have the desired outcome, pointing to the need for more research in this direction.

Further, the bin-based approach has the drawback of not pinpointing the exact UTR locations: it is limited to the bin resolution, and the bins themselves are limited by incomplete transcript and APA annotations. Additionally, because there is a significant drop-off in read coverage at the end of transcripts, we have observed that it is often bins upstream of the actual UTR lengthening/shortening event which give a statistically significant signal, rather than the one truly affected. This is why we have provided tools to enable the further inspection of events in a given gene.
Finally, the results of bin-based analyses are limited by the overlaps of transcripts from different genes, an issue on which differential transcript usage analysis approaches appear superior (e.g. [ ]). However, transcript usage analysis tools are dependent on the completeness of the transcript annotation, while bin-based approaches are more open to the discovery of unannotated transcript variants, which is especially relevant for differential UTR usage. Here, we made the choice of including ambiguous bins but flagging them as such, enabling users to interpret them with caution.

While DEXSeq remains the tool of choice for relative bin usage analyses, it scales very badly to larger sample sizes, and alternatives might be needed in some contexts. Our changes to limma's original diffSplice method consistently result in more accurate predictions, making this new method the best compromise for bin-based approaches when DEXSeq is not applicable. More generally, it also shows that even with well-established approaches, there is still room for incremental but non-negligible improvement.

Methods

Data and code availability

The data objects and code used to produce the figures are available through the https://github.com/plger/diffutr_paper repository. The diffUTR source code is available at https://github.com/ethz-ins/diffutr.

RNAseq data processing

For the evaluation of diffSplice in a standard DEU case, we used bin count data obtained from the authors of the original DEU benchmark [ ]. For other datasets, reads were downloaded from the SRA and aligned to the GRCm .p genome using STAR . . a with default parameters and the GENCODE M annotation as guide. The same gene annotation was used as input for bin creation.
diffUTR

diffUTR is implemented as a Bioconductor package making use of the extensive libraries available, especially the GenomicRanges package [ ] and the different DEU methods (see Differential analysis).

Preparing bins

Exons are extracted from the genome annotation and flattened into non-overlapping bins (Figure B). In other words, the exon annotation is fragmented into the widest ranges where the set of overlapping features is the same. Bins that do not overlap with coding sequences (CDS) and belong to a protein-coding transcript are labeled as UTR, and the rest as CDS. When APA sites are also provided as input (for the purpose of this article, PolyASite v . sites were used), bins are further segmented and/or extended. For this, the closest upstream CDS or UTR is found for every poly(A) site, and the UTR is defined from this boundary to the poly(A) site and assigned to the corresponding gene and transcript (Figure B). If a newly defined UTR exceeds a predefined length specified by maxUTRbinSize (default is bp), it is ignored as unlikely to be a real UTR. Moreover, if the start of a gene is the closest upstream sequence before any UTR or CDS, the newly defined UTR is ignored to avoid assignment problems. In order to later differentiate between regions that are 5′ or 3′ UTRs, regions that are downstream of the last CDS of a given transcript are labeled as 3′ UTR. The label 'non-coding' is assigned to all bins that are not overlapped by any protein-coding transcript.
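The flattening step described above can be illustrated with a small, self-contained Python sketch. It is not the package's implementation (diffUTR is written in R on top of GenomicRanges); the function names `flatten_exons` and `label_bin`, the half-open coordinate convention and the simplified UTR/CDS rule are assumptions made for illustration only.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: (start, end, transcript_id), half-open [start, end).
Exon = Tuple[int, int, str]
Bin = Tuple[int, int, FrozenSet[str]]


def flatten_exons(exons: List[Exon]) -> List[Bin]:
    """Split overlapping exons into non-overlapping bins.

    Each bin is a maximal range between two consecutive exon boundaries,
    so the set of transcripts covering it is constant within the bin.
    """
    # Every exon start or end is a potential bin boundary.
    breakpoints = sorted({pos for start, end, _ in exons for pos in (start, end)})
    bins: List[Bin] = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        covering = frozenset(
            tx for start, end, tx in exons if start <= left and right <= end
        )
        if covering:  # skip gaps that no exon covers
            bins.append((left, right, covering))
    return bins


def label_bin(bin_start: int, bin_end: int, cds: List[Tuple[int, int]]) -> str:
    """Simplified labeling: 'CDS' if the bin overlaps any CDS range, else 'UTR'.

    Assumes the bin belongs to a protein-coding transcript; the 'non-coding'
    label and the 5'/3' distinction from the text are not modeled here.
    """
    overlaps_cds = any(s < bin_end and bin_start < e for s, e in cds)
    return "CDS" if overlaps_cds else "UTR"


if __name__ == "__main__":
    toy_exons = [(100, 200, "tx1"), (150, 300, "tx2")]
    for start, end, txs in flatten_exons(toy_exons):
        print(start, end, sorted(txs), label_bin(start, end, cds=[(120, 180)]))
```

In this toy case the two overlapping exons yield three bins, and only the bin lying entirely downstream of the CDS range is labeled as UTR.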
If a bin originates from regions belonging to different genes, the bin is duplicated and assigned once to each gene, so that each gene contains the same fragment once. Alternatively, the genewise argument can be used so that only exons belonging to the same gene are considered when flattening.

Quantification

For quantification, countFeatures() uses the featureCounts() function from the Rsubread package [ ] to count previously mapped reads overlapping each bin. By default, every read is assigned once to every bin it overlaps and can therefore be counted multiple times, which is needed because many bins are shorter than the read length. Alternative counting methods, such as summarizeOverlaps() from the GenomicAlignments package [ ], performed considerably worse in the simulation. The function returns a RangedSummarizedExperiment object [ ], containing the read counts as well as the bin annotation.

Differential analysis

Three wrappers implement corresponding DEU methods on the RangedSummarizedExperiment object previously generated, returning results as further standardized annotation within the object. For differential UTR analysis, gene-level results are obtained by filtering the bin-level results for those assigned to the type UTR and/or 3′ UTR, and setting all other p-values to 1 before aggregation.

diffSpliceDGE.wrapper(): This is a wrapper around edgeR's DEU method, based on fitting a negative binomial generalized linear model [ ].
In a first step, the bins are filtered to decide which have a large enough read count to be kept for the statistical analysis (filterByExpr()), the library sizes are normalized (calcNormFactors()) and the dispersion is estimated (estimateDisp()). After this, the model is fitted (glmFit()). If the option QLF = TRUE (default), an extended model is fitted, using quasi-likelihood methods to account for gene-specific variability (glmQLFit()). In the last step, bin fold changes are tested for differences from overall gene fold changes, using a likelihood ratio test or a quasi-likelihood F-test depending on the QLF option chosen (diffSpliceDGE()). The gene-level p-values are obtained by Simes' method [ ].

DEXSeq.wrapper(): This method implements the standard DEXSeq differential exon usage pipeline [ ]. Like edgeR, it is based on fitting a negative binomial model, but instead of comparing fold change differences between bins and genes, DEXSeq compares a full model containing a term corresponding to the change in exon usage between conditions to a reduced model without this term. The two fits are compared using a χ² likelihood-ratio test. The libraries are normalized (estimateSizeFactors()), the dispersion is estimated (estimateDispersions()) and the models are fitted (testForDEU()). In a last step, the fold changes between the bins are estimated (estimateExonFoldChanges()). To obtain gene-level results, the function perGeneQValue() is used, which is based on the Šidák method [ ].

diffSplice.wrapper() and diffSplice2: This method implements the differential exon usage pipeline of limma for RNA-seq data [ ]. The pre-processing is identical to diffSpliceDGE.wrapper(); the precision weights are then estimated with limma::voom() and the linear models are fitted
(limma::lmFit()). In the last step, bin fold changes are tested for differences from overall gene fold changes, using a moderated t-test (diffSplice() or, by default, diffSplice2(); see below). The gene-level p-values are obtained by Simes' method [ ].

The diffUTR::diffSplice2 function provides an improved version of limma's original diffSplice method. diffSplice works on the bin-wise coefficients of the linear model, which correspond to the log fold changes between conditions. It compares the log2(fold change) $\hat{\beta}_{k,g}$ of a bin $k$ belonging to gene $g$ to a weighted average $\hat{B}_{k,g}$ of the log2(fold change) of all the other bins of the same gene combined (the subscript $g$ will henceforth be omitted for ease of reading). The weighted average of all the other bins in the same gene is calculated by

$$\hat{B}_k = \frac{\sum_{i,\, i \neq k}^{n} w_i \hat{\beta}_i}{\sum_{i,\, i \neq k}^{n} w_i}$$

where $w_i = 1/u_i^2$ and $u_i$ is the unscaled standard deviation of bin $i$, obtained from the diagonal elements of the unscaled covariance matrix $(X^T V X)^{-1}$. $X$ is the design matrix and $V$ corresponds to the weight matrix estimated by voom. The difference of log fold changes, which is also the coefficient returned by diffSplice(), is then calculated as $\hat{C}_k = \hat{\beta}_k - \hat{B}_k$. Instead of calculating the t-statistic directly from $\hat{C}_k$, the original code scales this value again:

$$\hat{D}_k = \hat{C}_k \sqrt{1 - \frac{w_k}{\sum_{i}^{n} w_i}}$$

and the t-statistic is calculated as

$$t_k = \frac{\hat{D}_k}{u_k s_g}$$

Here $s_g^2$ refers to the posterior residual variance of gene $g$, which is calculated by averaging the sample values of the residual variances of all the bins in the gene, and then squeezing these residual variances of all genes using the empirical Bayes method. This assumes that the residual variance is constant across all bins of the same gene.

In diffSplice2(), we applied three changes to the above method.
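For reference while reading the changes that follow, the original statistic defined above can be checked numerically. The Python sketch below is illustrative only: the function names and toy inputs are hypothetical, and it takes the weights as $w_i = 1/u_i^2$ as stated in the text. The second function is an algebraically equivalent form based on the weighted mean over all bins of the gene (including bin $k$), included purely as a consistency check on the formulas rather than as a description of any library's code.

```python
import math
from typing import List


def diffsplice_t(beta: List[float], u: List[float], s_g: float) -> List[float]:
    """t-statistics for the bins of one gene, following the equations in the text.

    beta : bin-level log fold changes (beta_k)
    u    : unscaled standard deviations (u_k)
    s_g  : posterior residual standard deviation of the gene
    """
    w = [1.0 / ui ** 2 for ui in u]  # w_i = 1 / u_i^2
    total_w = sum(w)
    t_values = []
    for k in range(len(beta)):
        # B_k: weighted mean log fold change of all OTHER bins of the gene.
        b_k = sum(w[i] * beta[i] for i in range(len(beta)) if i != k) / (total_w - w[k])
        c_k = beta[k] - b_k                            # C_k
        d_k = c_k * math.sqrt(1.0 - w[k] / total_w)    # D_k
        t_values.append(d_k / (u[k] * s_g))            # t_k
    return t_values


def diffsplice_t_all_bins(beta: List[float], u: List[float], s_g: float) -> List[float]:
    """Equivalent form: deviation from the mean over ALL bins, leverage-corrected."""
    w = [1.0 / ui ** 2 for ui in u]
    total_w = sum(w)
    mean_all = sum(wi * bi for wi, bi in zip(w, beta)) / total_w
    return [
        (beta[k] - mean_all) / (u[k] * s_g) / math.sqrt(1.0 - w[k] / total_w)
        for k in range(len(beta))
    ]


if __name__ == "__main__":
    # Toy values for a gene with three bins (illustrative, not real data).
    beta, u, s_g = [0.1, 0.2, 1.5], [0.5, 0.4, 0.6], 0.3
    print(diffsplice_t(beta, u, s_g))
    print(diffsplice_t_all_bins(beta, u, s_g))
```

Both functions require at least two bins per gene, since a single bin has no other bins to be compared against.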
First, the residual variances are not assumed to be constant across all bins of the same gene. Instead, the sample values of the residual variances of every bin are squeezed using the empirical Bayes method, resulting in posterior variances $s_i^2$ for every individual bin $i$. Second, the weights $w_i$, used to calculate $\hat{B}_k$, now incorporate the individual variances: $w_i = 1/(s_i^2 u_i^2)$. Third, the $\hat{C}_k$ value is directly used to calculate the t-statistic, which after all these changes now corresponds to

$$t_k = \frac{\hat{C}_k}{u_k s_k}$$

Simulated data

The simulation was done using the polyester R package [ ] with parameters obtained from the control samples of mouse hippocampus RNAseq [ ]. Using salmon [ ] with a decoy-aware transcriptome index for the mm genome from [ ], the abundances of each transcript were first estimated to learn parameters for the simulation. Transcripts from different genes were randomly chosen. The last exon of all these transcripts was lengthened to the next, second next or third next downstream APA site annotated in the PolyASite database [ ]. Duplicates of these transcripts were generated, which had less or no lengthening of their last exon, generating pairs of transcripts with different UTR lengths. For each transcript pair, one transcript was up-regulated and the other down-regulated by the same sampled fold change between . and . To make it more realistic, fold changes were also assigned to genes from the set with differential UTR, and genes that did not have differences in UTR usage.
Reads were then generated for two conditions with three replicates each using the simulate_experiment() function with the options paired = FALSE, error_model = "illumina ", bias = "cdnaf" and strand_specific = TRUE. The simulated reads are available on figshare at https://dx.doi.org/ . /m .figshare. .

3′-seq analysis

To establish a set of true relative differences in UTR usage from the 3′ sequencing data [ ], we downloaded the authors' counts per cluster from the Gene Expression Omnibus (file gse reads count table.txt.gz). We used the h treatment because we observed it to have the strongest signal, and excluded one sample (a ) that appeared to be a strong outlier based on PCA and MDS plots. We kept only clusters with at least reads in at least samples, and used DEXSeq to fit a negative binomial model on each gene and estimate the significance of the cluster:condition term. We considered as true positives genes with a gene-level and bin-level q-value ≤ . , and as true negatives genes with a gene-level q-value ≥ . . Genes for which all tested methods produced a p-value of 1 or NA (i.e. genes filtered out as too lowly expressed in the standard RNAseq) were excluded from the benchmark.

Comparisons with alternatives

For the comparison of methods, all functions were used with their default parameters and run according to their manual. As QAPA and DaPars do not provide means to aggregate the results to the gene level, this was implemented separately.
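Gene-level aggregation in these Methods relies mostly on Simes' method, which combines n p-values into one as the minimum over ranks i of n·p(i)/i, where p(i) is the i-th smallest p-value. The Python sketch below is a minimal illustration under stated assumptions: the function names and the bin type labels ("UTR", "3UTR", "CDS") are hypothetical, and the masking step mirrors the description in the Differential analysis section, where p-values of non-UTR bins are set to 1 before aggregation.

```python
from typing import Iterable, List, Tuple


def simes(pvalues: Iterable[float]) -> float:
    """Simes' combined p-value: min over ranks i of n * p_(i) / i."""
    ordered = sorted(pvalues)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one p-value is required")
    combined = min(n * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, combined)


def gene_level_utr_pvalue(bins: List[Tuple[str, float]]) -> float:
    """Aggregate bin-level p-values of one gene, keeping only UTR-type bins.

    bins: (bin_type, p_value) pairs. The type labels are hypothetical.
    Bins of other types keep their place in the count n but get p = 1,
    so they cannot drive the gene-level result.
    """
    utr_types = {"UTR", "3UTR"}
    masked = [p if bin_type in utr_types else 1.0 for bin_type, p in bins]
    return simes(masked)


if __name__ == "__main__":
    toy_gene = [("CDS", 0.001), ("UTR", 0.02), ("3UTR", 0.5)]
    print(gene_level_utr_pvalue(toy_gene))
```

Note that masking rather than dropping the non-UTR bins keeps n unchanged, which makes the aggregated p-value more conservative than aggregating the UTR bins alone.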
For DaPars, the p-values were aggregated to the gene level using Simes' method [ ] for comparability with diffUTR. Aggregation by taking the minimum p-value of all the transcripts in a gene produced extremely similar results. For QAPA, |ΔPAU| was calculated and aggregated to the gene level by taking the maximum across all transcripts of a gene, and the genes were ranked by this value. Alternatively, we also tested applying a t-test on the log-transformed PAU values (log-transforming had a negligible effect), followed by Simes' gene-level aggregation. Attempts to complement QAPA with p-values estimated from established statistical tests working with its equivalence classes, such as BANDITS [ ], did not improve the results and were therefore discarded so as not to distort the original method. Finally, for APAlyzer we combined the 3′ UTR and intronic APA analyses by using the minimum of the two p-values. See the https://github.com/plger/diffutr_paper repository for details.

We used the following software versions for comparisons: polyester . . , DEXSeq . . , edgeR . . , limma . . , DaPars . . , APAlyzer . . . For QAPA, we used salmon . . with validateMappings.

Competing interests

The authors declare no competing interests beside being the developers of the described package.

Authors' contributions

SG developed the bin preparation and the diffSplice modification, and ran most of the analyses. PLG and SG wrote the package and paper. PLG and GS supervised the project.

Acknowledgements

SG performed this research as part of his bachelor thesis in the Interdisciplinary Sciences program at ETH. PLG's position is co-funded by Prof. Mark Robinson (Institute of Molecular Life Sciences, University of Zurich) and Professors Gerhard Schratt, Johannes Bohacek and Isabelle Mansuy (Institute of Neuroscience, ETH Zurich). GS is supported by grants from the SNF (SNF ,
SNF ) and the ETH (ETH- - (neurosno)). We thank the Robinson group (UZH) for feedback.

References

Lewis, J. D., Gunderson, S. I. & Mattaj, I. W. The influence of 5′ and 3′ end structures on pre-mRNA metabolism. Journal of Cell Science.
Tian, B. & Manley, J. L. Alternative polyadenylation of mRNA precursors. Nature Reviews Molecular Cell Biology.
Fabian, M. R., Sonenberg, N. & Filipowicz, W. Regulation of mRNA translation and stability by microRNAs. Annual Review of Biochemistry.
Derti, A. et al. A quantitative atlas of polyadenylation in five mammals. Genome Research.
Sandberg, R., Neilson, J. R., Sarma, A., Sharp, P. A. & Burge, C. B. Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science.
Mayr, C. & Bartel, D. P. Widespread shortening of 3′UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell.
Miura, P., Shenker, S., Andreu-Agullo, C., Westholm, J. O. & Lai, E. C. Widespread and extensive lengthening of 3′ UTRs in the mammalian brain. Genome Research.
Ha, K. C., Blencowe, B. J. & Morris, Q. QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biology.
Fox-Walsh, K., Davis-Turak, J., Zhou, Y., Li, H. & Fu, X. D. A multiplex RNA-seq strategy to profile poly(A+) RNA: application to analysis of transcription response and 3′ end formation. Genomics.
Fu, Y. et al.
Differential genome-wide profiling of tandem 3′ UTRs among human breast cancer and normal cells by high-throughput sequencing. Genome Research.
Zheng, D., Liu, X. & Tian, B. 3′READS+, a sensitive and accurate method for 3′ end sequencing of polyadenylated RNA. RNA.
Jan, C. H., Friedman, R. C., Ruby, J. G. & Bartel, D. P. Formation, regulation and evolution of Caenorhabditis elegans 3′UTRs. Nature.
Shepard, P. J. et al. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA.
Hwang, H. W. et al. cTag-PAPERCLIP reveals alternative polyadenylation promotes cell-type specific protein diversity and shifts Araf isoforms with microglia activation. Neuron.
Herrmann, C. J. et al. PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3′ end sequencing. Nucleic Acids Research.
Wang, R., Nambiar, R., Zheng, D. & Tian, B. PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes. Nucleic Acids Research. https://doi.org/ . /nar/gkx
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. https://www.nature.com/articles/nmeth.
Xia, Z. et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types. Nature Communications.
Ye, C., Long, Y., Ji, G., Li, Q. Q. & Wu, X.
APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics.
Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics.
Wang, R. & Tian, B. APAlyzer: a bioinformatics package for analysis of alternative polyadenylation isoforms. Bioinformatics.
Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Research.
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics.
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology.
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research.
Morgan, M., Obenchain, V., Hester, J. & Pagès, H. SummarizedExperiment: SummarizedExperiment container. R package version . . .
Soneson, C., Matthes, K. L., Nowicka, M., Law, C. W. & Robinson, M. D. Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biology.
Blair, J. D., Hockemeyer, D., Doudna, J. A., Bateup, H. S. & Floor, S. N. Widespread translational remodeling during human neuronal differentiation. Cell Reports.
Whipple, A. J. et al. Imprinted maternally expressed microRNAs antagonize paternally driven gene programs in neurons. Molecular Cell. https://www.cell.com/molecular-cell/abstract/s - ( ) -
Fontes, M. M. et al. Activity-dependent regulation of alternative cleavage and polyadenylation during hippocampal long-term potentiation. Scientific Reports.
Tiberi, S. & Robinson, M. D.
BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty. Genome Biology.
Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Computational Biology.
Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics.
Simes, R. J. An improved Bonferroni procedure for multiple tests of significance. Biometrika.
Šidák, Z. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association.
Frazee, A. C., Jaffe, A. E., Langmead, B. & Leek, J. T. Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics.
Stolarczyk, M., Reuter, V. P., Smith, J. P., Magee, N. E. & Sheffield, N. C. Refgenie: a reference genome resource manager. GigaScience.
ParticleChromo3D: A Particle Swarm Optimization Algorithm for Chromosome and Genome 3D Structure Prediction from Hi-C Data

David Vadnais, Michael Middleton, and Oluwatosin Oluwadare*
Department of Computer Science, University of Colorado, Colorado Springs, CO, USA.
* Corresponding author. Email: ooluwada@uccs.edu (OO)

Abstract

The three-dimensional (3D) structure of chromatin has a massive effect on its function. Because of this, it is desirable to understand the 3D structural organization of chromatin. To gain greater insight into the spatial organization of chromosomes and genomes and the functions they perform, chromosome conformation capture techniques, particularly Hi-C, have been developed. The Hi-C technology is widely used and well known because of its ability to profile interactions for all read pairs in an entire genome. The advent of Hi-C has greatly expanded our understanding of the 3D genome, genome folding and gene regulation, and has enabled the development of many 3D chromosome structure reconstruction methods. Here, we propose a novel approach for 3D chromosome and genome structure reconstruction from Hi-C data using particle swarm optimization, called ParticleChromo3D. This algorithm begins with a grouping of candidate solution locations for each chromosome bin, according to the particle swarm algorithm, and then iterates its position towards a global best candidate solution. While moving towards the optimal global solution, each candidate solution or particle uses its own local best information and a randomizer to choose its path.
Using several metrics to validate our results, we show that ParticleChromo3D produces a robust and rigorous representation of the 3D structure for input Hi-C data. We evaluated our algorithm on simulated and real Hi-C data in this work. Our results show that ParticleChromo3D is more accurate than most of the existing algorithms for 3D structure reconstruction. Our results also show that constructed ParticleChromo3D structures are very consistent, hence indicating that it will always arrive at the global solution at every iteration. The source code for ParticleChromo3D, the simulated and real Hi-C datasets, and the models generated for these datasets are available here: https://github.com/oluwadarelab/particlechromo d

Introduction

Chromosome conformation capture (3C) and its subsequent derivative technologies are invaluable for describing chromatin's three-dimensional (3D) structure [ ]. 3C's biochemical approach to studying DNA's topography within chromatin has outperformed traditional microscopy approaches like fluorescence in situ hybridization (FISH) due to 3C's systematic nature [ ]. As a side note, microscopy is still used in conjunction with 3C for verifying the actual 3D structure of chromatin against the predicted outcome [ ]. 3C was first described by Dekker et al. ( ) [ ]. Since then, more technologies have been developed [ ], such as chromosome conformation capture-on-chip (4C) [ ], chromosome conformation capture carbon copy (5C) [ ], Hi-C [ ], TCC [ ], and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) [ , ]. These derivative technologies were designed to augment 3C in the following areas: measuring spatial data within chromatin, increasing measuring throughput, and analyzing proteins and RNA within chromatin instead of just DNA. Lieberman-Aiden et al. [ ] designed Hi-C as a minimally biased "all vs. all" approach. Hi-C works by injecting biotin-labeled nucleotides during the ligation step [ ].
hi-c provides a method for obtaining genome-wide chromatin interaction frequency (if) data in the form of a contact matrix [ ]. hi-c analyses have brought great benefit to 3d genome research: they help explain phenomena such as genome folding, gene regulation, genome stability, and the relationship between regulatory elements and structural features in the cell nucleus [ , , ]. importantly, it is possible to glean insight into chromatin's 3d structure using hi-c data. however, to use hi-c data for 3d structure modeling, some pre-processing is necessary to extract the interaction frequencies between the interacting loci of the chromosome or genome [ ]. this process involves quality control and mapping of the data [ ]. once these steps are completed, an if matrix, also called a contact matrix or contact map, is generated. an if matrix is a symmetric matrix that records a one-to-one interaction frequency for all the intersecting loci [ , ].

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted february , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license.

the if matrix is represented as either a square contact matrix or a three-column sparse matrix. each cell in these matrices corresponds to a pair of genomic bins whose length equals the resolution of the data [ ]. hence, the higher the resolution ( kb), the larger the contact matrix, and the lower the resolution ( mb), the smaller the contact matrix. next, the hi-c data is normalized to remove biases that next-generation sequencing can create [ , ]. an example of this type of bias is copy number variation [ ].
other systematic biases are introduced during the hi-c experiment by external factors, such as dna shearing and cutting [ ]. today, several computational algorithms have been developed to remove these biases from the hi-c if data [ - ]. once the hi-c if matrix is normalized, it is suitable for 3d chromosome or genome modeling. some tools have been developed to automate these hi-c pre-processing steps; they include genomeflow [ ], hi-cpipe [ ], juicer [ ], hic-pro [ ], and hicup [ ].

to create 3d chromosome and genome structures from if data, many techniques can be used. oluwadare et al. ( ) [ ] grouped the various analysis techniques into three categories: distance-based, contact-based, and probability-based methods. a distance-based method maps if data to distance data and then uses an optimizer to solve for the 3d coordinates [ ]. the final output of this type of analysis is a set of (x, y, z) coordinates [ ]. however, the difficulty lies in choosing how to convert the if data and which optimization algorithm to use [ ]. the distance between two genomic bins is often represented as $Distance_{i,j} = 1 / IF_{i,j}^{\alpha}$ [ , ]. in this approach, $IF_{i,j}$ is the number of times two genomic bins were in contact and $\alpha$ is a modeling parameter called the conversion factor. these distances are then optimized jointly across all pairs of genomic bins to create a 3d model. methods in this category [ ] include chromsde [ ], autochrom3d [ ], chromosome3d [ ], 3dmax [ ], shrec3d [ ], lordg [ ], infmod3dgen [ ], hsa [ ], and shneigh [ ].

the second classification of algorithms for modeling 3d genome structure from if data is contact-based methods. this technique uses the if data directly instead of first converting it to distances [ ]. one way to model this data is with a gradient descent/ascent algorithm [ ].
this approach was explored by trieu and cheng in the algorithm titled mogen [ ]. mogen works by optimizing a scoring function that measures how well the chromosomal contact rules have been satisfied [ ]. another contact-based method uses the interaction frequency to derive spatial restraints [ ]. gen3d [ ], chrom3d [ ], and gem [ ] are other examples in this category.

the third classification is probability-based. the advantages of probability-based approaches are that they easily account for uncertainties in experimental data and can perform statistical calculations of noise sources or specific structural properties [ ]. unfortunately, probability-based techniques can be very time-consuming compared to contact-based and distance-based methods. rousseau et al. created the first model in this category, called mcmc5c, using a markov chain monte carlo approach [ ]. markov chain monte carlo was used because it is well suited to estimating the distribution of properties [ ]. varoquaux et al. [ ] extended this probability-based approach to modeling the 3d structure of dna. they used a poisson model and maximized a log-likelihood function [ ]. many other statistical models can still be explored.

this paper presents particlechromo3d, a new distance-based algorithm for chromosome 3d structure reconstruction from hi-c data. particlechromo3d uses particle swarm optimization (pso) to generate 3d structures of chromosomes from hi-c data. here, we show that particlechromo3d can generate candidate structures for chromosomes from hi-c data. additionally, we analyze the effects of parameters such as the confidence coefficients and swarm size on the structural accuracy of our algorithm. finally, we compare particlechromo3d to a set of commonly used chromosome 3d reconstruction methods and find that it performs better than most of them. we show that particlechromo3d effectively generates 3d structures from hi-c data and is highly consistent in its modeling performance.
materials and methods

the particle swarm optimization algorithm

kennedy and eberhart ( ) [ ] developed particle swarm optimization (pso) as an algorithm that attempts to solve optimization problems by mimicking the behavior of a flock of birds. pso has been used in the following fields: antennas, biomedical, city design/civil engineering, communication networks, combinatorial optimization, control, intrusion detection/cybersecurity, distribution networks, electronics and electromagnetics, engines and motors, entertainment, diagnosis of faults, the financial industry, fuzzy logic, computer graphics/visualization, metallurgy, neural networks, prediction and forecasting, power plants, robotics, scheduling, security and military, sensor networks, and signal processing [ - ]. since pso has been used in so many disparate fields, it appears to be robust and flexible, which lends credence to the idea that it could be applied to this bioinformatics use case and many others [ ].

pso falls into the optimization taxonomy of swarm intelligence [ ]. pso works by creating a set of particles, or actors, that explore a topology and look for its global minimum [ ]. at each iteration, the swarm stores the minimum found by each particle as well as the minimum found by the swarm as a whole. the particles explore the space with both a position and a velocity, and they change their velocity based on three parameters: the current velocity, the distance to the personal best, and the distance to the global best [ ].
position changes are made based on the velocity calculated during each iteration. the velocity function is as follows [ ]:

$$V_{n+1} = w V_n + c_1 R_1 (P_n - X_n) + c_2 R_2 (G_n - X_n)$$

the position is then updated as follows:

$$X_{n+1} = X_n + V_{n+1}$$

where:

• $V_n$ is the current velocity at iteration $n$.
• $c_1$ and $c_2$ are two real numbers that stand for the local and global weights applied to the personal best of the specific particle and the global best, respectively [ ].
• $R_1$ and $R_2$ are randomized values used to increase the explored terrain [ ].
• $w$ is the inertia weight parameter, which determines the rate of contribution of the previous velocity [ ].
• $G_n$ represents the best position of the swarm at iteration $n$.
• $P_n$ represents the best position found so far by an individual particle at iteration $n$.
• $X_n$ is the current position of an individual particle at iteration $n$.

why pso

this project's rationale is that pso could be a very efficient method for optimizing hi-c data because its particles inherently hold local minima. this property allows sub-structures to be analyzed for optimality independently of the entire structure. in fig , particle one is at the best minimum found so far. however, particle two has a better structure in its top half, which is potentially independent of the bottom half. because particle one has the better solution so far, particle two will move towards the structure in particle one in iteration $n + 1$. while particle two is moving, it will follow a path that maintains its superior 3d model sections. thus, it has a higher chance of finding the absolute minimum distance value. the more particles there are, the greater the time complexity of pso and the higher the chance of finding the absolute minimum. this inherent breaking up of the problem could lend itself to powerful 3d structure creation results.
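the velocity and position updates defined earlier in this section can be sketched in a few lines of python. this is a minimal illustration and not the particlechromo3d source: the function name `pso_step`, the flat-list particle layout, and the values of `w`, `c1`, and `c2` are assumptions chosen for the example, and drawing one $R_1$ and one $R_2$ per particle from a uniform distribution on [0, 1) follows the common pso convention rather than anything stated here.

```python
import random


def pso_step(positions, velocities, personal_best, global_best,
             w=0.7, c1=1.5, c2=1.5, rng=random):
    """Apply one PSO velocity/position update to every particle.

    Each particle is a flat list of floats. The values of w, c1 and c2
    are illustrative placeholders, not the tuned defaults of the paper.
    """
    new_positions, new_velocities = [], []
    for x, v, p in zip(positions, velocities, personal_best):
        # Assumption: one random draw per particle and per term (R1, R2).
        r1, r2 = rng.random(), rng.random()
        # V_{n+1} = w*V_n + c1*R1*(P_n - X_n) + c2*R2*(G_n - X_n)
        v_next = [w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
                  for xi, vi, pi, gi in zip(x, v, p, global_best)]
        # X_{n+1} = X_n + V_{n+1}
        x_next = [xi + vi for xi, vi in zip(x, v_next)]
        new_velocities.append(v_next)
        new_positions.append(x_next)
    return new_positions, new_velocities
```

a particle that already sits on both its personal best and the global best with zero velocity does not move, which is a quick sanity check on the update rule.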
more abstractly relative to hi-c data, but in the traditional pso sense, the same problem might look as follows (fig ) when presented as a topological map.

fig . pso potential advantage for structure holding. the figure summarizes the performance expectation of the pso algorithm on the 3d genome structure reconstruction problem.

from fig , in the $n$th iteration, particle one found a local minimum. since this is the lowest point among all the particles, particle two will search towards particle one with a random amount added to its velocity [ , , ]. the random component keeps particle two from going straight to the current best solution [ , ]. in this case, particle two found the absolute minimum, and from here on, all the particles will begin to migrate towards particle two. in summary, we believe the particle-based structure of pso may lend itself well to the problem of converting hi-c if data into 3d models. we will test this hypothesis by analyzing the output with the evaluation metrics defined in the “results” section and by comparing our results to the existing modeling methods.

fig . pso particle iteration description. this figure explains the pso algorithm's search mechanism for determining the best 3d structure following the modified velocity and position of the individual particles in the swarm.

fig . pso for chromosome and genome 3d structure prediction. we present a step-by-step illustration of the significant steps taken by particlechromo3d for 3d chromosome and genome structure reconstruction from an input normalized if matrix.
pso for 3d structure reconstruction from hi-c data

here we describe how we implemented the pso algorithm as a distance-based approach for 3d genome reconstruction from hi-c data. this algorithm is called particlechromo3d. in this context, the input if data is converted to its distance equivalent using the conversion factor, $\alpha$, for 3d structure reconstruction. first, we randomly initialize the particles' 3d (x, y, z) coordinates for each genomic bin or region in the range [- , ]. we use the sum of squared errors as the loss function to compute chromosome structures from a contact map. finally, we use pso to iteratively improve the loss until it has converged on either an absolute or a local minimum. the full particlechromo3d algorithm is presented in fig .

some parameters are needed to use the pso algorithm for 3d structure reconstruction. this work provides the parameter values that produced our algorithm's best results. users can also provide their own settings to fit their data where necessary. the series of tests and validations performed to determine the default parameters is described in the "parameters estimation" section of the results.

model representation

a particle is a candidate solution. each particle is represented by a list of xyz coordinates. the length of a candidate solution is the number of regions in the input hi-c data. each point in a particle is the individual xyz coordinate of a bead. a swarm consists of n candidate solutions; n is also called the swarm size, which the user provides as program input.
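the three ingredients described in this section (if-to-distance conversion, random initialization of a particle, and the sum of squared errors loss) can be sketched as follows. this is a hedged illustration, not the authors' code: the helper names `if_to_distance`, `random_particle`, and `sse_loss` are hypothetical, the initialization range of [-1, 1] is an assumed placeholder, and skipping zero-frequency pairs is one possible convention for missing contacts.

```python
import math
import random


def if_to_distance(if_matrix, alpha):
    """Convert interaction frequencies to distances: distance = 1 / IF**alpha.

    Diagonal and zero-frequency entries are set to None (no restraint);
    this handling of missing contacts is an assumption of the sketch.
    """
    n = len(if_matrix)
    return [[1.0 / if_matrix[i][j] ** alpha
             if i != j and if_matrix[i][j] > 0 else None
             for j in range(n)] for i in range(n)]


def random_particle(n_bins, low=-1.0, high=1.0, rng=random):
    """One candidate solution: a list of n_bins random [x, y, z] beads."""
    return [[rng.uniform(low, high) for _ in range(3)] for _ in range(n_bins)]


def sse_loss(coords, target):
    """Sum of squared errors between structure distances and target distances."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if target[i][j] is None:
                continue  # no restraint for this pair
            total += (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
    return total
```

a swarm in this representation is simply a list of such particles, and the pso loop would keep, for each particle, the coordinates with the lowest `sse_loss` seen so far.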
we provide more explanation in the "parameters estimation" section below for how to determine the swarm size.

data

our study used the yeast synthetic (simulated) dataset from adhikari et al. [ ] to perform parameter tuning and validation. the simulated dataset was created from a yeast structure for chromosome at kb resolution [ ]. the number of genome loci in the synthetic dataset is . to analyze a real dataset, we used the gm cell hi-c dataset, geo accession number gse [ ]. the normalized contact matrix was downloaded from the gsdb database with gsdb id: oo sf [ ].

results

metrics used for evaluation

to evaluate the structure’s consistency with the input hi-c matrix, we used the following metrics:

pearson correlation coefficient (pcc)

the pearson correlation coefficient is defined as follows [ ]:

$$PCC = \frac{\sum_i (d_i - \bar{d})(D_i - \bar{D})}{\sqrt{\sum_i (d_i - \bar{d})^2 \, \sum_i (D_i - \bar{D})^2}}$$

where:

• $D_i$ and $d_i$ are instances of a distance value between two bins.
• $\bar{D}$ and $\bar{d}$ are the means of the distances within the data set.
• it measures the relationship between variables. values range from -1 to +1.
• a higher value is better.

spearman correlation coefficient (scc)

spearman’s correlation coefficient is defined below [ ]:

$$SCC = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$$

where:

• $x_i$ and $y_i$ are the ranks of the distances $D_i$ and $d_i$ defined in the pcc equation above.
• $\bar{x}$ and $\bar{y}$ are the sample mean ranks of x and y, respectively.
• values range from -1 to +1. a higher value is better.
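the two correlation metrics above can be computed directly from their definitions. the sketch below uses only the python standard library; the function names are illustrative, and assigning tied values their average rank is an assumption of this example (it is the usual convention, but the text does not specify how ties are handled).

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))
```

because scc depends only on ranks, any monotonically increasing transformation of the distances leaves it unchanged, whereas pcc is sensitive to the shape of the relationship.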
root mean squared error (rmse)

root mean squared error follows the equation below [ ]:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - D_i)^2}$$

where:

• $d_i$ and $D_i$ are instances of distance values from the data and another data source.
• $n$ is the size of the data set.

tm-score

tm-score is defined as follows [ ][ ]:

$$TM\text{-}score = \max \left[ \frac{1}{L_{Target}} \sum_{i}^{L_{ali}} \frac{1}{1 + \left( \frac{d_i}{d_0(L_{Target})} \right)^2} \right]$$

where:

• $L_{Target}$ is the length of the chromosome.
• $d_i$ is an instance of a distance value between two bins.
• $L_{ali}$ represents the count of all aligned residues.
• $d_0$ is a normalizing parameter.

the tm-score is a metric to measure the structural similarity of two proteins or models [ , ]. a tm-score value lies in (0, 1], where 1 indicates two identical structures [ ]. a score of . indicates pure randomness, and a score above . indicates the two structures have mostly the same folds [ ]. hence, the higher, the better.

parameters estimation

we used the yeast synthetic dataset to decide on particlechromo3d's best parameters. we used this data set to investigate the mechanism for choosing the best alpha conversion factor for input hi-c data, and to determine the optimal swarm size, the best threshold value for the algorithm, the inertia value ($w$), and the best coefficients for our pso velocity ($c_1$ and $c_2$). we evaluated our reconstructed structures by comparing them with the true distance structure of the synthetic dataset provided by adhikari et al. [ ], using the pcc, scc, rmse, and tm-score metrics. based on the results of this evaluation, the default values for the particlechromo3d parameters are set as presented below.

conversion factor test ($\alpha$)

the synthetic interaction frequency data set was generated from a yeast structure for chromosome at kb [ ] with an $\alpha$ value of using the formula $IF = 1/D^{\alpha}$. hence, the purpose of using this test data is to check whether our algorithm can predict the alpha value used to produce the synthetic dataset.
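the idea behind the conversion factor test can be illustrated with a simplified scan over candidate $\alpha$ values. note the simplification: the paper scores the reconstructed 3d structure at each $\alpha$, whereas this sketch skips the reconstruction and compares the converted distances $1/IF^{\alpha}$ to the true distances directly using rmse. the function names, the candidate grid, and the true $\alpha$ of 1.0 in the usage example are all hypothetical.

```python
import math


def rmse(d, D):
    """Root mean squared error between two equal-length distance lists."""
    n = len(d)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d, D)) / n)


def best_alpha(if_values, true_distances, candidates):
    """Return the candidate alpha whose converted distances 1/IF**alpha
    have the lowest RMSE against the true distances.

    Simplified stand-in for the conversion factor test: no 3D structure
    is reconstructed here.
    """
    def score(alpha):
        converted = [1.0 / f ** alpha for f in if_values]
        return rmse(converted, true_distances)

    return min(candidates, key=score)


# Usage: synthetic IF values generated with a hypothetical alpha of 1.0.
true_d = [1.0, 2.0, 3.0, 4.0]
synthetic_if = [1.0 / d ** 1.0 for d in true_d]
recovered = best_alpha(synthetic_if, true_d, [0.5, 1.0, 1.5])
```

in this toy setting the scan recovers the generating value because only the matching exponent inverts $IF = 1/D^{\alpha}$ exactly; with real, noisy data the optimum is less sharply defined.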
for both pcc and scc, our algorithm performed best at a conversion factor (alpha) of . (fig ). by default, our algorithm searches for the best alpha value in the range [ . , . ]. a side-by-side comparison of the true simulated data (yeast) structure and the structure reconstructed by particlechromo3d shows that they are highly similar (fig ).

fig . a plot of the evaluation metric versus the conversion factors. (a) a plot of scc vs. conversion factor. (b) a plot of pcc vs. conversion factor. here, we show the performance of particlechromo3d on the scc and pcc metrics for the simulated dataset at $\alpha$ values in the range . to . . the best result is recorded at $\alpha$ = . the scc and pcc values were obtained by comparing the output structure of the particlechromo3d algorithm at each $\alpha$ value with the true structure. in fig a and b, the y-axis denotes the scc and pcc scores, respectively, in the range [- , ], and the x-axis denotes the conversion factor values. a higher scc and pcc value is better.

fig . a comparison of the simulated data true structure and the structure reconstructed by particlechromo3d. (a) true structure from duan et al. [ ]. (b) reconstructed structure for the simulated data using particlechromo3d.
swarm size

the swarm size defines the number of particles in the pso algorithm. we evaluated the performance of particlechromo3d with changes in swarm size (fig a, fig b, fig c). we also evaluated the effect of an increase in swarm size on computation time (fig d). our results show that computational time increases with swarm size. given the computational cost and the algorithm’s performance at various swarm sizes, we defined a swarm size of as our default value for this parameter. according to our experiments, the swarm size is most suitable if the user’s priority is saving computational time, and swarm size is suitable when the user prefers algorithm performance over running time. hence, setting the default swarm size gives us the best of both worlds. the structures generated by particlechromo3d also show that the results at swarm size (fig c) and (fig d) are most similar to the simulated data true structure represented in fig a (fig ).

fig . a plot of the evaluation metric versus the swarm size parameter. (a) a plot of the scc vs. the swarm size. (b) a plot of pcc vs. the swarm size. (c) a plot of rmse vs. the swarm size. (d) a plot of the runtime, in seconds, vs. the swarm size. the scc, pcc, and rmse values were obtained by comparing the output structure of the particlechromo3d algorithm with the simulated data true structure. in fig a and fig b, the y-axis denotes the scc and pcc scores in the range [- , ], and the x-axis denotes the swarm size values considered. a higher scc and pcc value is better. in fig c, the y-axis denotes the rmse score, and the x-axis denotes the swarm size values. a lower rmse value is better. in fig d, the y-axis denotes the running time in seconds, and the x-axis denotes the swarm size values.
fig . structures generated by particlechromo3d at different swarm size values. here, we show the structure generated at swarm size = (a) (represented with blue color), (b) (represented with magenta color), (c) (represented with red color), and (d) (represented with green color). as shown, the structure generated at swarm size is not smooth; it has a couple of rough edges (fig a). this is consistent with the scc, pcc, and rmse recorded at this swarm size, where performance is the lowest. next, at swarm size (fig b), we observe a smoother representation but still with some rough edges. the results were very similar at swarm size and (fig c, fig d).

threshold

the threshold parameter is designed to serve as an early stopping criterion if the algorithm converges before the maximum number of iterations is reached. hence, we evaluated the effect of varying threshold levels using the evaluation metrics (fig ). the output structures generated at each threshold also allow a visual examination of the threshold value (fig ). we observed that the lower the threshold, the more accurate the structure (fig ) and the more similar it is to the true simulated data structure in fig a (fig f). it is worth noting that this has a running time implication: reducing the threshold led to a longer running time. however, since this was a trade-off between a superior result with a longer running time and a fairly good result with a short running time, we chose the former for particlechromo3d. the default threshold for our algorithm is . .
fig . a plot of the evaluation metric versus the threshold parameter. (a) a plot of the scc versus different threshold levels. (b) a plot of pcc versus the different threshold levels. (c) a plot of rmse versus different threshold levels. the results show the performance of our algorithm at threshold values . , . , . , . , . , . . the scc, pcc, and rmse values reported were obtained by comparing the output structure of the particlechromo3d algorithm to the true structure of the simulated dataset. in fig a and b, the y-axis denotes the scc and pcc scores in the range [- , ], and the x-axis denotes the threshold values. a higher scc and pcc value is better. in fig c, the y-axis denotes the rmse score, and the x-axis denotes the threshold values. a lower rmse value is better.

fig . structures at a threshold of . , . , . , . , . , . respectively. (a) represents the structure produced using a threshold of . . (b) represents the structure produced using a threshold of . . (c) represents the structure produced using a threshold of . . (d) represents the structure produced using a threshold of . . (e) represents the structure produced using a threshold of . . (f) represents the structure produced using a threshold of . .
the results showed that the threshold value of . , fig f, produced the best result.

confidence coefficient ($c_1$ and $c_2$)

the $c_1$ and $c_2$ parameters represent the local (particle) confidence coefficient and the global (swarm) confidence coefficient, respectively. kennedy and eberhart [ ] proposed that $c_1 = c_2 = 2$. we tested how changes in these values affected our algorithm's accuracy for local confidence coefficient ($c_1$) values from . to . and global confidence coefficient ($c_2$) values from . to . (s and s fig). from our results, we found that a local confidence coefficient ($c_1$) of . with a global confidence coefficient ($c_2$) of . performed best (fig ). hence, these values were set as the confidence coefficient values of particlechromo3d. the accuracy results generated for all the local confidence coefficients ($c_1$) at varying global confidence values are compiled in fig .

fig . confidence coefficient test. (a) a plot of the scc by global confidence at local confidence . . (b) a plot of pcc by global confidence at local confidence . . the plots show the results for local confidence coefficient ($c_1$) = . against varying global confidence coefficient ($c_2$) values from . to . . the results show that the best result was obtained at $c_2$ = . . the scc and pcc values reported were obtained by comparing the output structure of the particlechromo3d algorithm with the true structure of the simulated dataset. in fig a and b, the y-axis denotes the scc and pcc scores in the range [- , ], the x-axis denotes the global confidence values, and the colored plot denotes the local confidence values. a higher scc and pcc value is better.

fig . a combined plot of different local confidences versus global confidences. the combined plot was obtained by comparing the output structure of the particlechromo3d algorithm with the true structure of the simulated dataset for local confidence values of . to . and global confidence values of .
to . . this plot shows the scc accuracy of the structures generated. the y-axis denotes the scc score in the range [- , ], the x-axis denotes the global confidence values, and the colored plot denotes the local confidence values. a higher scc value is better.

random numbers ($R_1$ and $R_2$)

$R_1$ and $R_2$ are uniform random numbers between and [ ].

assessment on simulated data

we evaluated how noise levels affect the ability of particlechromo3d to predict chromosome 3d structures, using the yeast synthetic dataset from adhikari et al. [ ]. the data were simulated with varying noise levels. adhikari et al. introduced noise into the yeast if matrix to make additional datasets with noise levels of %, %, %, %, %, %, %, %, %, %, %, and %. as reported by the authors, converting these ifs to their distance equivalents produced distorted distances that did not match the true distances, thereby simulating the inconsistent constraints that can sometimes be observed in un-normalized hi-c data. as shown, our algorithm performed best with no noise in the data (fig ). furthermore, comparing the output structures of the particlechromo3d algorithm from the noisy input datasets with the true structure of the simulated dataset shows that it can achieve a competitive result when dealing with un-normalized or noisy hi-c datasets (fig ). the result shows that our algorithm can achieve the results obtainable at a reduced noise level even at increased noise, as indicated by noise % (fig b) and % (fig c), respectively (fig ).
also, the difference in performance between the best structure and the worst structure is ~ . . hence, our algorithm is not strongly affected by the presence of noise in the input hi-c data.

fig . assessment of the structures generated by particlechromo3d for the simulated dataset at varying noise levels. (a) a plot of the scc versus noise level. (b) a plot of pcc versus the noise level. this plot shows the scc and pcc accuracy of the structures generated by particlechromo3d at the different noise levels introduced. in fig a and fig b, the y-axis denotes the metric score in the range [- , ]. the x-axis denotes the noise level. a higher scc and pcc value is better.

fig . structures generated by particlechromo3d at different noise levels. here, we show the structure generated by particlechromo3d at noise level = (a) , that is no noise, (b) % ( ), (c) % ( ), and (d) % ( ).

assessment on real hi-c data

for evaluation on real hi-c data, we used the gm b-lymphoblastoid cell line from rao et al. [ ]. the normalized mb and kb resolution interaction frequency matrices of the gm cell line were downloaded from the gsdb repository under the gsdb id oo sf [ ]. the datasets were normalized using the knight-ruiz normalization technique [ ]. the performance of particlechromo3d was determined by computing the scc value between the distance matrix of the normalized frequency input matrix and the euclidean distances calculated from the predicted 3d structures. fig shows the assessment of particlechromo3d on the gm cell line dataset.
the structure reconstructed by particlechromo3d is compared against the expected distances from the input if using the pcc, scc, and rmse metrics for the mb and kb resolution hi-c data. when the performance of particlechromo3d is evaluated using both the mb and kb resolution hi-c data of the gm cell, we observe some consistency in the algorithm’s performance for both datasets. chromosome had the lowest scc value of . and . at mb and kb resolutions, respectively, while chromosome had the highest scc value of . and . at mb and kb resolutions, respectively.

fig . performance evaluation of particlechromo3d using scc values for mb and kb resolution gm cell hi-c data. (a) a plot of particlechromo3d scc performance on mb gm cell hi-c data for chromosome to . (b) a plot of particlechromo3d scc performance on kb gm cell hi-c data for chromosome to .

model consistency

next, we assessed the consistency of our generated structures. we created structures for the chromosomes and then evaluated the similarity of the structures using the scc, pcc, rmse, and tm-score (fig ). we assessed the consistency for both the mb and kb resolution hi-c data of the gm cell. as noted above for the tm-score, a score of . indicates pure randomness, and a score above . indicates the two structures have mostly the same folds. hence, the higher, the better. our results for the selected chromosomes show that the structures generated by particlechromo3d are highly consistent for both the mb (fig ) and kb (fig ) datasets.
as shown in fig for the mb hi-c datasets, the average scc and pcc values recorded between the models for the selected chromosomes are >= . and >= . , respectively, indicating that the chromosomal models generated by particlechromo d are highly similar. it also indicates that the algorithm arrives at a consistent d model solution on each run (fig c and fig d). similarly, as shown in fig for the kb hi-c datasets, the average scc and pcc values recorded between the models for the selected chromosomes are >= . .

fig . the model consistency check for mb resolution structures generated by particlechromo d using different evaluation metrics. (a) the average scc between structures per chromosome at mb resolution for the gm datasets. (b) the average pcc between structures per chromosome at mb resolution for the gm datasets. (c) the average tm-score between structures per chromosome at mb resolution for the gm datasets. (d) the boxplot shows the distribution of the structures' tm-score by chromosome for the gm datasets. the y-axis denotes the scc and pcc metric score in the range [- , ], and the tm-score in the range [- , ]. the x-axis denotes the chromosome. higher scc, pcc, and tm-score values are better.

fig .
the model consistency check for kb resolution structures generated by particlechromo d using different evaluation metrics. (a) the average scc between structures per chromosome at kb resolution for the gm datasets. (b) the average pcc between structures per chromosome at kb resolution for the gm datasets. (c) the average tm-score between structures per chromosome at kb resolution for the gm datasets. (d) the boxplot shows the distribution of the structures' tm-score by chromosome for the gm datasets. the y-axis denotes the scc and pcc metric score in the range [- , ], and the tm-score in the range [- , ]. the x-axis denotes the chromosome. higher scc, pcc, and tm-score values are better.

comparison with existing chromosome d structure reconstruction methods

here, we compared the performance of particlechromo d side by side with nine existing high-performing chromosome d structure reconstruction algorithms on the gm dataset at both the mb and kb resolutions. the reconstruction algorithms are chromsde [ ], chromosome d [ ], dmax [ ], shrec d [ ], lordg [ ], gem [ ], hsa [ ], mogen [ ] and pastis [ ] (fig ). according to the scc values reported, particlechromo d outperformed most of the existing methods on many of the chromosomes evaluated at mb and kb resolution. at a minimum, particlechromo d ranked among the top two overall across the ten algorithms compared. these results indicate that the pso algorithm is robust and well suited to the d chromosome and genome structure reconstruction problem.
fig . a comparison of the accuracy of nine existing methods and particlechromo d for d structure reconstruction on the mb and kb real hi-c datasets. (a) an scc comparison of d structure reconstruction methods on the gm hi-c dataset at mb resolution for chromosomes to . (b) an scc comparison of d structure reconstruction methods on the gm hi-c dataset at kb resolution for chromosomes to . the y-axis denotes the scc metric score in the range [- , ], and the x-axis denotes the chromosome. a higher scc value is better.

discussion

we discussed the relevance of the swarm size value in the parameters estimation section. we showed on the synthetic dataset that a swarm size (ss) value of did not produce satisfactory performance, although it was the fastest of the swarm sizes considered. at ss = , the performance was significantly better than at ss = , but at the cost of increased computation time. ss values and similarly achieved better performance, again at the cost of an increase in the program running time. however, we settled on ss = because it achieved one of the best performances, and its computational cost can be considered manageable.
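to make the role of the swarm size concrete, the sketch below shows a textbook particle swarm optimization loop applied to a toy distance-matching objective. it is not the particlechromo d implementation: the function names, the inertia weight `w`, the coefficients `c1` and `c2`, and the initialisation range are all illustrative assumptions. each extra particle adds one objective evaluation per iteration, which is why larger swarms cost more time.

```python
import math
import random


def structure_loss(flat, target):
    """Sum of squared errors between structure distances and target distances.

    `flat` is a flattened list [x0, y0, z0, x1, y1, z1, ...];
    `target` maps bead index pairs (i, j) to the desired distance.
    """
    points = [flat[k:k + 3] for k in range(0, len(flat), 3)]
    return sum((math.dist(points[i], points[j]) - d) ** 2
               for (i, j), d in target.items())


def pso(target, n_points, swarm_size=10, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global-best PSO; hyperparameter defaults are illustrative only."""
    rng = random.Random(seed)
    dim = 3 * n_points
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_loss = [structure_loss(p, target) for p in pos]
    g = min(range(swarm_size), key=lambda k: pbest_loss[k])
    gbest, gbest_loss = pbest[g][:], pbest_loss[g]
    history = [gbest_loss]
    for _ in range(iters):
        for s in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                vel[s][d] = (w * vel[s][d]
                             + c1 * r1 * (pbest[s][d] - pos[s][d])
                             + c2 * r2 * (gbest[d] - pos[s][d]))
                pos[s][d] += vel[s][d]
            current = structure_loss(pos[s], target)
            if current < pbest_loss[s]:
                pbest[s], pbest_loss[s] = pos[s][:], current
                if current < gbest_loss:
                    gbest, gbest_loss = pos[s][:], current
        history.append(gbest_loss)
    return gbest, gbest_loss, history
```

running `pso` with different `swarm_size` values on the same `target` and comparing the final loss against wall-clock time gives a small-scale view of the accuracy versus speed trade-off discussed here.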
to investigate the implications of our choice, we carried out the two tests discussed below.

particlechromo d performance on different swarm size values

first, we evaluated the performance of the particlechromo d algorithm on the gm dataset at both the mb and kb resolutions at swarm sizes , , and to check that the performance at ss = that we observed on the synthetic dataset carries over to the real dataset (fig ). the mb and kb dataset results show that ss = achieved the best scc value across most of the chromosomes (fig ). however, we observed that the results generated at ss = were also competitive and a few times equalled the performance at ss = . this shows that choosing ss = does not necessarily reduce the performance of particlechromo d, and using this value has the additional gain of saving computational time.

fig . particlechromo d scc performance on swarm size values , and for mb and kb gm cell hi-c data. (a) comparing the performance of particlechromo d on the mb gm cell hi-c data at swarm size values , , and . (b) comparing the performance of particlechromo d on the kb gm cell hi-c data at swarm size values , , and . the y-axis denotes the scc metric score in the range [- , ], and the x-axis denotes the chromosome. a higher scc value is better.

computational time

second, we evaluated the time it took our algorithm to perform the d reconstruction for select chromosomes of the mb and kb gm cell hi-c dataset. the modeling of the structures generated by particlechromo d for the synthetic and real datasets was done on an amd ryzen x -core processor, .
ghz with installed ram . gb. particlechromo d is multithreaded. it uses each core present on the user's computer to run a specific task, speeding up the modeling process and significantly reducing computational time. accordingly, the more processor cores a user has, the faster particlechromo d will generate the output d structures. as mentioned earlier in the parameter estimation section, one of the default settings of particlechromo d is to automatically determine the conversion factor that best fits the data in the range [ . , . ]. even though this is one of the strengths of particlechromo d, it increases the algorithm's computational time. based on the real hi-c dataset analysis, our results show that swarm size consistently has a lower computational time than ss = , as expected, for the kb and mb hi-c datasets (fig ). these results highlight an additional strength of particlechromo d: it can achieve a competitive result in less time (fig ) without sacrificing performance (fig ). we recommend that users set the swarm size according to their objective. in this manuscript, we favored high accuracy over speed, and we compensated for this by making our algorithm multi-threaded, which reduces the running time significantly.

fig . particlechromo d computational time at swarm size (ss) and for mb and kb gm cell hi-c data.
(a) a comparison of the running time of particlechromo d for select chromosomes for mb gm cell hi-c data. (b) a comparison of the running time of particlechromo d for select chromosomes for kb gm cell hi-c data. the y-axis denotes the running time of particlechromo d in minutes, and the x-axis denotes the chromosome.

availability of data and materials

the models generated, all the datasets used for the analyses performed, and the source code for particlechromo d are available at https://github.com/oluwadarelab/particlechromo d.

conclusions

we developed a new algorithm for d genome reconstruction called particlechromo d. particlechromo d uses the particle swarm optimization algorithm as the foundation of its solution approach for d chromosome reconstruction from hi-c data. the results of particlechromo d on simulated data show that, with well-tuned parameters, it can achieve high accuracy in the presence of noise. we compared the accuracy of particlechromo d with nine ( ) existing high-performing methods for chromosome d structure reconstruction on the real dataset. the results show that particlechromo d is effective: it achieved more accurate results than the other methods on many chromosomes and ranked among the top two overall in our comparative analysis. our experiments also show that particlechromo d can achieve a faster computational run time without losing accuracy significantly. particlechromo d automatically searches for the best conversion factor (𝛼) and uses optimized pso hyperparameters for any given input hi-c dataset. the algorithm was implemented in python and can be run as an executable or as a jupyter notebook found at https://github.com/oluwadarelab/particlechromo d.

acknowledgments

not applicable.

references

. sati s, cavalli g.
chromosome conformation capture technologies and their impact in understanding genome function. chromosoma. feb; ( ): - . . de wit e, de laat w. a decade of c technologies: insights into nuclear organization. genes & development. jan ; ( ): - . . dekker j, rippe k, dekker m, kleckner n. capturing chromosome conformation. science. feb ; ( ): - . . han j, zhang z, wang k. c and c-based techniques: the powerful tools for spatial genome organization deciphering. molecular cytogenetics. dec; ( ): - . . simonis, m., klous, p., splinter, e., moshkin, y., willemsen, r., de wit, e., van steensel, b. and de laat, w., . nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture–on-chip ( c). nature genetics, ( ), pp. - . . dostie j, richmond ta, arnaout ra, selzer rr, lee wl, honan ta, rubio ed, krumm a, lamb j, nusbaum c, green rd. chromosome conformation capture carbon copy ( c): a massively parallel solution for mapping interactions between genomic elements. genome research. oct ; ( ): - . . lieberman-aiden e, van berkum nl, williams l, imakaev m, ragoczy t, telling a, amit i, lajoie br, sabo pj, dorschner mo, sandstrom r. comprehensive mapping of long-range interactions reveals folding principles of the human genome. science. oct ; ( ): - . . kalhor r, tjong h, jayathilaka n, alber f, chen l. genome architectures revealed by tethered chromosome conformation capture and population-based modeling. nature biotechnology. jan; ( ): - . .
li g, fullwood mj, xu h, mulawadi fh, velkov s, vega v, ariyaratne pn, mohamed yb, ooi hs, tennakoon c, wei cl. chia-pet tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. genome biology. feb; ( ): - . . oluwadare o, highsmith m, cheng j. an overview of methods for reconstructing -d chromosome and genome structures from hi-c data. biological procedures online. dec; ( ): - . . pal k, forcato m, ferrari f. hi-c analysis: from data generation to integration. biophysical reviews. feb; ( ): - . . mackay k, kusalik a. computational methods for predicting d genomic organization from high- resolution chromosome conformation capture data. briefings in functional genomics. jul; ( ): - . . cournac a, marie-nelly h, marbouty m, koszul r, mozziconacci j. normalization of a chromosomal contact map. bmc genomics. dec; ( ): - . . servant n, varoquaux n, heard e, barillot e, vert jp. effective normalization for copy number variation in hi-c data. bmc bioinformatics. dec; ( ): - . . imakaev m, fudenberg g, mccord rp, naumova n, goloborodko a, lajoie br, dekker j, mirny la. iterative correction of hi-c data reveals hallmarks of chromosome organization. nature methods. oct; ( ): - . . knight pa, ruiz d. a fast algorithm for matrix balancing. ima journal of numerical analysis. jul ; ( ): - . . yaffe e, tanay a. probabilistic modeling of hi-c contact maps eliminates systematic biases to characterize global chromosomal architecture. nature genetics. nov; ( ): . . imakaev m, fudenberg g, mccord rp, naumova n, goloborodko a, lajoie br, dekker j, mirny la. iterative correction of hi-c data reveals hallmarks of chromosome organization. nature methods. oct; ( ): - . . hu m, deng k, selvaraj s, qin z, ren b, liu js. hicnorm: removing biases in hi-c data via poisson regression. bioinformatics. dec ; ( ): - . . lyu h, liu e, wu z. comparison of normalization methods for hi-c data. biotechniques. feb; ( ): - . . trieu t, oluwadare o, wopata j, cheng j. 
genomeflow: a comprehensive graphical tool for modeling and analyzing d genome structure. bioinformatics. apr ; ( ): - . . castellano g, le dily f, beato m, roma g. hi-cpipe: a pipeline for high-throughput chromosome capture. . durand nc, shamim ms, machol i, rao ss, huntley mh, lander es, aiden el. juicer provides a one- click system for analyzing loop-resolution hi-c experiments. cell systems. jul ; ( ): - . . servant n, varoquaux n, lajoie br, viara e, chen cj, vert jp, heard e, dekker j, barillot e. hic-pro: an optimized and flexible pipeline for hi-c data processing. genome biology. dec; ( ): - . . wingett s, ewels p, furlan-magaril m, nagano t, schoenfelder s, fraser p, andrews s. hicup: pipeline for mapping and processing hi-c data. f research. ; . . zhang z, li g, toh kc, sung wk. inference of spatial organizations of chromosomes using semi-definite embedding approach and hi-c data. inannual international conference on research in computational molecular biology apr (pp. - ). springer, berlin, heidelberg. . peng c, fu ly, dong pf, deng zl, li jx, wang xt, zhang hy. the sequencing bias relaxed characteristics of hi-c derived data and implications for chromatin d modeling. nucleic acids research. oct ; ( ):e -. . adhikari b, trieu t, cheng j. chromosome d: reconstructing three-dimensional chromosomal structures from hi-c interaction frequency data using distance geometry simulated annealing. bmc genomics. dec; ( ): - . . oluwadare o, zhang y, cheng j. a maximum likelihood algorithm for reconstructing d structures of human chromosomes from chromosomal contact data. bmc genomics. dec; ( ): - .
. lesne a, riposo j, roger p, cournac a, mozziconacci j. d genome reconstruction from chromosomal contacts. nature methods. nov; ( ): . . trieu t, cheng j. d genome structure modeling by lorentzian objective function. nucleic acids research. feb ; ( ): - . . wang s, xu j, zeng j. inferential modeling of d chromatin structure. nucleic acids research. apr ; ( ):e -. . zou c, zhang y, ouyang z. hsa: integrating multi-track hi-c data for genome-scale reconstruction of d chromatin structure. genome biology. dec; ( ): - . . li fz, liu ze, li xy, bu lm, bu hx, liu h, zhang cm. chromatin d structure reconstruction with consideration of adjacency relationship among genomic loci. bmc bioinformatics. dec; ( ): - . . trieu t, cheng j. mogen: a tool for reconstructing d models of genomes from chromosomal conformation capturing data. bioinformatics. may ; ( ): - . . kalhor r, tjong h, jayathilaka n, alber f, chen l. solid-phase chromosome conformation capture for structural characterization of genome architectures. nature biotechnology. ; ( ): . . nowotny j, ahmed s, xu l, oluwadare o, chen h, hensley n, trieu t, cao r, cheng j. iterative reconstruction of three-dimensional models of human chromosomes from chromosomal contact data. bmc bioinformatics. dec; ( ): - . . paulsen j, sekelja m, oldenburg ar, barateau a, briand n, delbarre e, shah a, sørensen al, vigouroux c, buendia b, collas p. chrom d: three-dimensional genome modeling from hi-c and nuclear lamin- genome contacts. genome biology. dec; ( ): - . . zhu g, deng w, hu h, ma r, zhang s, yang j, peng j, kaplan t, zeng j. reconstructing spatial organizations of chromosomes through manifold learning. nucleic acids research. may ; ( ):e - . . rousseau m, fraser j, ferraiuolo ma, dostie j, blanchette m. three-dimensional modeling of chromatin structure from interaction frequency data using markov chain monte carlo sampling. bmc bioinformatics. dec; ( ): - . .
varoquaux n, ay f, noble ws, vert jp. a statistical approach for inferring the d structure of the genome. bioinformatics. jun ; ( ):i - . . kennedy j, eberhart r. particle swarm optimization. inproceedings of icnn' -international conference on neural networks nov (vol. , pp. - ). ieee. . garcia-gonzalo e, fernandez-martinez jl. a brief historical review of particle swarm optimization (pso). journal of bioinformatics and intelligent control. jun ; ( ): - . . li mw, hong wc, kang hg. urban traffic flow forecasting using gauss–svr with cat mapping, cloud model and pso hybrid algorithm. neurocomputing. jan ; : - . . wang j, hong x, ren rr, li th. a real-time intrusion detection system based on pso-svm. inproceedings. the international workshop on information security and application (iwisa ) (p. ). academy publisher. . mohamed ma, eltamaly am, alolah ai. pso-based smart grid application for sizing and optimization of hybrid renewable energy systems. plos one. aug ; ( ):e . . zhang y, wang s, ji g. a comprehensive survey on particle swarm optimization algorithm and its applications. mathematical problems in engineering. feb; . . mansour n, kanj f, khachfe h. particle swarm optimization approach for protein structure prediction in the d hp model. interdisciplinary sciences: computational life sciences. sep; ( ): - . . mohapatra r, saha s, dhavala ss. adaswarm: a novel pso optimization method for the mathematical equivalence of error gradients. arxiv preprint arxiv: . . may . . bonyadi mr, michalewicz z. particle swarm optimization for single objective continuous space problems: a review. evolutionary computation. mar; ( ): - . . wang g, guo j, chen y, li y, xu q. a pso and bfo-based learning strategy applied to faster r-cnn for object detection in autonomous driving. ieee access. feb ; : - . . tu c, chuang l, chang j, and yang c, feature selection using pso-svm international journal of computer science.
. mohapatra r, saha s, dhavala ss. adaswarm: a novel pso optimization method for the mathematical equivalence of error gradients. arxiv preprint arxiv: . . may . . duan z, andronescu m, schutz k, mcilwain s, kim yj, lee c, shendure j, fields s, blau ca, noble ws. a three-dimensional model of the yeast genome. nature. may; ( ): - . . rao ss, huntley mh, durand nc, stamenova ek, bochkov id, robinson jt, sanborn al, machol i, omer ad, lander es, aiden el. a d map of the human genome at kilobase resolution reveals principles of chromatin looping. cell. dec ; ( ): - . . oluwadare o, highsmith m, turner d, lieberman-aiden e, cheng j. gsdb: a database of d chromosome and genome structures reconstructed from hi-c data. bmc molecular and cell biology. dec; ( ): - . . zhang y, skolnick j. scoring function for automated assessment of protein structure template quality. proteins: structure, function, and bioinformatics. dec ; ( ): - . . xu j, zhang y. how significant is a protein structure similarity with tm-score= . ?. bioinformatics. apr ; ( ): - . . wilke dn. analysis of the particle swarm optimization algorithm (doctoral dissertation, university of pretoria).

supporting information

s fig.
a plot of local confidence values . and . versus global confidence. (a) scc by global confidence at local confidence . (b) pcc by global confidence at local confidence . . (c) scc by global confidence at local confidence . (d) pcc by global confidence at local confidence . . each of the plots shows the scc and pcc results obtained by comparing the output structure of the particlechromo d algorithm with the true structure of the simulated dataset for local confidence values . to . and global confidence values . to . . the y-axis denotes the scc or pcc score, as labelled in the title, in the range [- , ]; the x-axis denotes the global confidence values; and the plot colors denote the local confidence values. higher scc and pcc values are better.

s fig. a plot of local confidence values . and . versus global confidence. (a) scc by global confidence at local confidence . . (b) pcc by global confidence at local confidence . . (c) scc by global confidence at local confidence . . (d) pcc by global confidence at local confidence . . each of the plots shows the scc and pcc results obtained by comparing the output structure of the particlechromo d algorithm with the true structure of the simulated dataset for local confidence values . to . and global confidence values . to . . the y-axis denotes the scc or pcc score, as labelled in the title, in the range [- , ]; the x-axis denotes the global confidence values; and the plot colors denote the local confidence values. higher scc and pcc values are better.
a fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing

jacob househam, barts cancer institute, queen mary university of london, uk william ch cross, ucl cancer institute, university college london, uk (★) giulio caravagna, department of mathematics and geosciences, university of trieste, italy (★) joint last authors. (★) corresponding: (gc) gcaravagna@units.it.

abstract. cancer is a global health issue that places enormous demands on healthcare systems. basic research, the development of targeted treatments, and the utility of dna sequencing in clinical settings, have been significantly improved with the introduction of whole genome sequencing. however, the broad applications of this technology come with complications. to date there has been very little standardisation in how data quality is assessed, leading to inconsistencies in analyses and disparate conclusions.
manual checking and complex consensus calling strategies often do not scale to large sample numbers, which leads to procedural bottlenecks. to address this issue, we present a quality control method that integrates point mutations, copy numbers, and other metrics into a single quantitative score. we demonstrate its power on , whole-genomes from a large-scale pan-cancer cohort, and on multi-region data of two colorectal cancer patients. we highlight how our approach significantly improves the generation of cancer mutation data, providing visualisations for cross-referencing with other analyses. our approach is fully automated, designed to work downstream of any bioinformatic pipeline, and can automate tool parameterisation, paving the way for fast computational assessment of data quality in the era of whole genome sequencing.

introduction

cancer remains an unsolved problem, and a key factor is that tumours develop as heterogeneous cellular populations (greaves and maley ; mcgranahan and swanton , ). cancer genomes can harbour multiple types of mutations compared to healthy cells (macintyre et al. ; martincorena et al. , ; nik-zainal et al. ), and many of these events contribute to the pathogenesis of the disease and to therapeutic resistance. a popular design of studies intending to
understand tumour development involves collecting tumour and matched-normal biopsies, and generating so-called “bulk” dna sequencing data for both (barnell et al. ). by using bioinformatic tools to cross-reference the normal genome against the aberrant one, the mutations found in the tumour sample, and their heterogeneity, can be derived and used in other analyses. these analyses include, but are not limited to, driver mutation identification (bailey et al. ; gonzalez-perez et al. ), which aims to discern the key aberrations that cause a tumour to grow; patient clustering, which aims to identify treatment groups with similar biological characteristics; and evolutionary inference (gerstung et al. ; nik-zainal et al. ; caravagna et al. ), which informs us how a particular tumour developed from normal cells.

there are several types of mutations that we can retrieve from dna sequencing data (campbell et al. ). broadly, these can be categorized as single nucleotide variants (snvs), copy number alterations (cnas) and other more complex changes such as structural variants (li et al. ). all types of mutations can drive tumour progression, and are therefore important entities to study (kent and green - ; levine, jenkins, and copeland ). the steady drop in sequencing costs is fueling the creation of large amounts of data, which are becoming increasingly available for researchers to access through public databases. notably, we are entering the era of high-resolution whole-genome sequencing (wgs), a technology that can read out the majority of a tumour genome, providing major improvements over whole-exome counterparts.

generating some of these data, however, poses challenges. while snvs are the simplest type of mutations to detect using bioinformatic analysis and perhaps have the most well-established supporting tools (li et al.
), cnas are particularly difficult to call since the baseline ploidy of the tumour (i.e., the number of chromosome copies) is usually unknown and has to be inferred from the data. cnas are important types of cancer mutations; large-scale gain and loss of chromosome arms or sections of arms can confer large-scale phenotypic changes on tumour cells, and are often important clinical targets (gerstung et al. ; watkins et al. ).

snvs and cnas are intertwined mutation groups. they can overlap within a tumour cell’s genome, meaning the number of copies of an snv can be amplified or reduced by cnas. this depends on the ploidy of the genome regions overlapping with the variants. for instance, for a clonal (meaning present in every cell of the tumour sample) heterozygous snv in a diploid tumour genome, the expected variant allele frequency (vaf) is % (i.e., half of the reads from tumour cells will harbour the snv). alternatively, if each chromosome is present in three copies (triploid), the expected vaf is % if the snv occurred after the amplification, or % if the snv is on the amplified chromosome and occurred before the amplification. the theoretical
frequencies are observed with a binomial noise model that depends on the depth of sequencing and the actual vaf (nik-zainal et al. ; caravagna et al. ). we note that these vafs hold for pure bulk tumour samples (100% tumour cells). realistically, most bulk samples contain normal cells, the percentage of which shifts these theoretical frequencies towards lower values. these ideas are leveraged by methods that seek to compute the cancer cell fractions (ccfs) of the tumour, i.e., a normalisation of the observed tumour vaf for the cna, the number of copies of a mutation (mutation multiplicity) and tumour purity (nik-zainal et al. ).

many bioinformatics pipelines are designed to start from a bam-formatted input file and, following variant calling, extract the vaf of mutations while calling cnas in parallel (boeva et al. ; cmero et al. ; zaccaria and raphael ; van loo et al. ). these analyses are nearly always decoupled, and can return inconsistent variant calls, i.e., cnas and purity that mismatch the empirical vaf from the bams. since cnas and purity are inferred through various measurements that are subject to noise (mutation allele ratios, tumour-normal depth ratios and b-allele frequencies are prime examples), they are the most likely cause of error. while in some cases these errors can be spotted and fixed by manual intervention, this process is also subject to inconsistencies in the absence of a proper statistical framework, and does not scale in studies seeking to generate datasets with millions of data points (campbell et al. ; priestley et al. ; turnbull et al. ).
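as an illustration of the allele-frequency reasoning above, the following minimal python sketch counts mutated and total alleles in a mixture of tumour and normal cells, and then adds binomial read-sampling noise. the function names, the seed and the simulation settings are assumptions made for this example only; this is not code from cnaqc.

```python
import math
import random


def expected_vaf(multiplicity, tumour_ploidy, purity):
    """Expected VAF of a clonal mutation in an impure bulk sample.

    Allele counting at the mutated locus: each tumour cell carries
    `tumour_ploidy` copies, of which `multiplicity` are mutated; each
    normal cell carries 2 copies, none of them mutated.
    """
    mutated = purity * multiplicity
    total = purity * tumour_ploidy + (1.0 - purity) * 2
    return mutated / total


def binomial_sd(vaf, depth):
    """Standard deviation of the observed VAF under binomial read sampling."""
    return math.sqrt(vaf * (1.0 - vaf) / depth)


def simulate_observed_vafs(vaf, depth, n_mutations, seed=0):
    """Draw observed VAFs for `n_mutations`, each covered by `depth` reads."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n_mutations):
        # Each read independently reports the variant with probability `vaf`.
        alt_reads = sum(rng.random() < vaf for _ in range(depth))
        observed.append(alt_reads / depth)
    return observed


if __name__ == "__main__":
    # Heterozygous clonal SNV in a diploid genome, pure versus 50% pure sample.
    print(expected_vaf(1, 2, 1.0), expected_vaf(1, 2, 0.5))
    # Triploid genome: SNV acquired after or before the amplification.
    print(expected_vaf(1, 3, 1.0), expected_vaf(2, 3, 1.0))
```

in this sketch, lowering `purity` moves every expected frequency towards zero, and `binomial_sd` shows why the peaks become broader as the sequencing depth decreases.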
the intrinsic performance of a variant caller and sequencing noise therefore strongly affect cna calling and purity inference, propagating errors into downstream analyses that can eventually lead to incorrect biological conclusions; this is becoming a crucial computational bottleneck in the era of high-resolution whole-genome sequencing.

to solve these problems we developed cnaqc (data availability), a computational framework with a de novo statistical model to assess the conformance of expected snvs, cnas, and purity estimates. we strove to make the tool as simple to implement as possible, maximising compatibility across differing pipelines. cnaqc computes a quantitative quality check (qc) score for the overall agreement of the calls, which can be used to tune the parameters of callers (e.g., decrease purity or increase ploidy), or to select among multiple cna profiles (e.g., tetraploid versus diploid tumours) until a fit is achieved. in cnaqc we also integrate these measures to determine ccf values (dentro, wedge, and van loo ). cnaqc is implemented as a highly optimised r package that can be used downstream of any cancer mutation calling pipeline. it can be run on wgs data, and can automatically compute a qc score in a matter of seconds, which is an extremely useful
feature for large-scale genomics consortia that analyse many samples per day. to demonstrate the tool we analysed bulk wgs datasets from two multi-region colorectal cancers, and high-quality whole genomes from the pan cancer analysis of whole genomes (pcawg) cohort (campbell et al. ).

results

the cnaqc framework

cnaqc can perform different types of operations on cnas and somatic mutation calls obtained from bulk wgs. in what follows, we will refer explicitly to snvs as the main type of mutation used, but in principle the same reasoning applies to other small mutations such as insertions or deletions. the package supports the most common cna copy types found in cancers: heterozygous normal states (1:1 chromosome complement), loss of heterozygosity (loh) in monosomy (1:0) and copy-neutral (2:0) form, trisomy (2:1) or tetrasomy (2:2) gains. the tool also works with exome data, but the reduced mutational burden can, in general, lower the reliability of the qc score (supplementary figure s ).

many metrics output by cnaqc are derived from the link between copy-state profiles (i.e., the copies of the major and minor alleles, which sum up to the ploidy of a segment) and allele frequencies that are explicit from read counts. combinatorial equations and frequency spectrum analysis can quantitatively determine if cnas and purity are consistent with the vaf distribution (online methods). the resulting score also suggests “corrections” to automatically fine-tune and repeat cna calling runs. this works for tools that use either bayesian priors or point estimates of the parameters.

the key equations for a somatic mutation link its vaf $v$ and ccf $c$ to sample purity $\pi$, tumour ploidy $p$, and $m \in \{1, 2\}$, the number of copies of a mutation (figure a). effectively, for complex 2:0, 2:1 and 2:2 copy states, $m$ phases mutations that were acquired before or after the copy number event (figure b).
we remark that we observe $v$, and infer $\pi$, $p$ and $m$, finally deriving $c$, which is difficult to estimate (figure c). in cnaqc we use the following formulas for vaf (figure d) and ccf:

$$ v = \frac{m \pi c}{2(1-\pi) + \pi p}, \qquad c = \frac{v\,[(p-2)\pi + 2]}{m \pi}. $$

these formulas lead to other interesting quantities (online methods). for instance, if we know tumour purity and the ploidy of a cna segment, then the vaf of mutations mapped to the segment must peak at a known location $x$. the value for $x$ follows from combinatorial arguments relating all other variables (nik-zainal et al., ). from a qc perspective, the euclidean distance between the theoretical expectation $x$ and the peaks observed from data is an error score that approaches 0 for perfect calls, and grows otherwise. cnaqc can visualise the input segments (figure a) and read counts (figure b-d).
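the relations between vaf, ccf, purity, ploidy and multiplicity described above can be sketched in a few lines of python. this is only an illustration under stated assumptions: the function names are hypothetical, and matching each expected peak to its nearest observed peak is a simplification chosen here, not necessarily the matching used inside cnaqc.

```python
import math


def expected_vaf(m, ploidy, purity, ccf=1.0):
    """VAF of a mutation with multiplicity m in a segment of given ploidy."""
    return (m * purity * ccf) / (2.0 * (1.0 - purity) + purity * ploidy)


def ccf_from_vaf(v, m, ploidy, purity):
    """Invert the VAF relation to obtain the cancer cell fraction."""
    return v * ((ploidy - 2.0) * purity + 2.0) / (m * purity)


def peak_error(expected_peaks, observed_peaks):
    """Euclidean distance between expected peaks and their nearest observed peaks.

    The nearest-peak matching is an assumption of this sketch.
    """
    squared = 0.0
    for x in expected_peaks:
        nearest = min(observed_peaks, key=lambda obs: abs(obs - x))
        squared += (nearest - x) ** 2
    return math.sqrt(squared)


if __name__ == "__main__":
    # Trisomy (2:1, ploidy 3) at 80% purity: one expected peak per multiplicity.
    peaks = [expected_vaf(m, 3, 0.8) for m in (1, 2)]
    print(peaks)
    # Error score for a set of hypothetical peaks detected in the data.
    print(peak_error(peaks, [0.29, 0.56]))
```

the error returned by `peak_error` is zero when the detected peaks coincide with the expected ones, and grows as purity or ploidy estimates drift away from the values supported by the vaf data.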
other analyses such as ccf computation and genome fragmentation analysis are also available, and have their own visualisations (figure e). the scores of cnaqc can be used to determine a qc pass or fail status for every copy state within a tumour genome, weighting different evidence from the data. one score is for the quality of cna segmentation and tumour purity, and one for ccf values. the former is based on a density-based analysis of the vaf distribution, and uses both a non-parametric kernel density and a univariate binomial mixture to match peaks in the vaf data (figure a-d). the latter is based on the entropy of the latent variables in a binomial mixture model, whose components are peaked at the expected vaf. from this model we identify vaf ranges for which it is hard to estimate the mutation multiplicity, and therefore the ccf of the mutation (figure e-h). to the best of our understanding, this is the only framework providing quantitative metrics for all the most widespread types of tumour mutations.

multi-region colorectal cancer data

we have run cnaqc on previously published wgs multi-region data (cross et al. ; caravagna et al. ), which was collected from multiple regions of primary colorectal adenocarcinomas across two distinct patients. for all these samples we have high quality somatic mutation calls (cross et al. ) that were obtained using clonehd (fischer et al. ). we have re-called cnas with the sequenza cna caller (favero et al. ), and sought to check the inferred copy states and tumour purity with cnaqc, along with snvs generated by mutect (benjamin et al. ).
sequenza was run using distinct parameterizations. we began with the default range proposals for purity and ploidy, which we then improved in a final run following cnaqc analysis. we also forced a sequenza fit with a constrained tetraploid genome (ploidy equal to 4), and one with low purity. all these steps could easily have been automated in a procedure that runs the caller, obtains score metrics for the solution from cnaqc, and re-runs the fit with adjusted parameters if required.

the results for one sample of patient set - cancer in the original manuscript (cross et al. ) - are in figure ; the other samples for patient set are in supplementary figures s -s . all samples for patient set are in supplementary figures s -s . the peak detection scores produced by cnaqc invariably fail both the tetraploid and low-purity solutions, passing the others; the small adjustment suggested to the default parameters slightly improves the purity estimate, but the overall quality is high even with just default parameters (figure b). the whole-genome cna profile for this sample shows some degree of aneuploidy (figure c), and it is easy with cnaqc to assess miscalled cna segments against the vaf data (figure d).
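the automated procedure mentioned above (run the caller, score the solution, re-run with adjusted parameters) can be sketched as a small loop. everything in this python example is hypothetical: `toy_caller` stands in for a real cna caller such as sequenza, the qc only looks at clonal heterozygous snvs in diploid 1:1 segments, and the tolerance is an arbitrary choice. it is a sketch of the idea, not the cnaqc implementation.

```python
def expected_peak(purity):
    """Expected VAF peak of clonal heterozygous SNVs in diploid 1:1 segments."""
    return purity / 2.0


def qc(purity, observed_peak, tolerance=0.02):
    """Return (passed, offset) comparing the observed and the expected peak."""
    offset = observed_peak - expected_peak(purity)
    return abs(offset) <= tolerance, offset


def tune_purity(run_caller, observed_peak, purity, max_rounds=5, tolerance=0.02):
    """Re-run the caller with a corrected purity until the QC passes."""
    history = []
    for _ in range(max_rounds):
        fit = run_caller(purity)
        passed, offset = qc(fit["purity"], observed_peak, tolerance)
        history.append((fit["purity"], passed))
        if passed:
            return fit, history
        # For a 1:1 segment v = purity / 2, so a VAF offset maps to 2 * offset in purity.
        purity = min(1.0, max(0.01, fit["purity"] + 2.0 * offset))
    return None, history


def toy_caller(purity):
    """Hypothetical stand-in for a CNA caller constrained to a given purity."""
    return {"purity": purity, "ploidy": 2}


if __name__ == "__main__":
    fit, history = tune_purity(toy_caller, observed_peak=0.4, purity=0.6)
    print(fit, history)
```

in a real pipeline the correction would be derived per copy state and the caller would be re-run on the sequencing data; the loop structure, however, stays the same.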
the analysis of all the samples available for set shows an overall cna profile with many diploid regions and mild aneuploidy (figure e), consistent with a microsatellite stable colorectal cancer (cross et al. ). (footnote: technically, the default sequenza values for ploidy reach a maximum at ; this being unrealistic for our cases, we limited the maximum ploidy to .)

large-scale pan cancer pcawg calls

we have run cnaqc on a subset of the full pcawg cohort, which contains thousands of samples from multiple tumour types (campbell et al. ). the median coverage of this cohort is x, with purity ~ % (caravagna et al. ), a much lower resolution than the data available for the multi-region samples discussed in the previous section. because of this, peak detection from the vaf distribution across some of the samples would be challenged by signal quality; in practice, for genomes with complex aneuploidy and massive drops in purity and coverage the vaf distribution is unsuitable for peak detection, leading to false positives in the qc process. to avoid this and work with suitable samples, we identified n = cases that satisfy the following conditions: (i) the tumour type contains > samples, (ii) the tumour genome used for qc contains > % of the overall snvs in the tumour, so a substantial part of the overall mutational burden, and (iii) the purity of the sample is > %, so the signal is suitable for peak detection. on a standard cluster cnaqc ran in less than hour for these samples; notably the
completion time (per sample) on a laptop is less than minute, meaning that preliminary analysis can be carried out very quickly and without large computing infrastructures.

the calls in pcawg were obtained by consensus with multiple bioinformatics tools, and for this reason we expected them to be reliable. manual inspection of some patient data indeed showed many high-quality calls, but also highlighted a variety of interesting cases. for instance, tumours with extremely low mutational burden but high quality calls still yielded a useful report, suggesting that cnaqc can also work with the mutational burden of whole-exome sequencing (supplementary figure s ). for other tumours, we found high purity levels > %, which are probably overestimated (supplementary figure s ) compared to others where purity is genuinely very high (supplementary figure s ). overall, the scores from peak detection are reliable for the majority of the analysed samples (figure a) - the diploid % purity tumour in figures and is taken from this list - with only a few cases requiring further checks (figure b). the peak detection by cnaqc therefore confirms the reliability of the calls in terms of breakpoints, segment ploidy and tumour purity.

ccf computations showed a higher rate of failures with cnaqc analysis (figure a). this is inevitably due to the lack of signal separability stemming from low coverage of these samples, even for high-quality genomes.
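the loss of separability at low coverage can be illustrated with a simplified two-component binomial mixture, in the spirit of the entropy-based ccf score described in the framework section. the python sketch below is an assumption-laden toy: it uses equal mixture weights, the two peaks of a pure triploid segment (1/3 and 2/3), and an arbitrary entropy threshold of 0.5 bits. it does not reproduce the actual cnaqc computation.

```python
import math


def binom_pmf(k, n, p):
    """Probability of k variant reads out of n under a binomial model."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def assignment_entropy(k, n, vaf_m1, vaf_m2):
    """Entropy (bits) of the multiplicity assignment for k variant reads out of n."""
    like1 = binom_pmf(k, n, vaf_m1)
    like2 = binom_pmf(k, n, vaf_m2)
    total = like1 + like2
    entropy = 0.0
    for like in (like1, like2):
        q = like / total  # posterior of this component with equal weights
        if q > 0.0:
            entropy -= q * math.log2(q)
    return entropy


def uncertain_fraction(depth, vaf_m1=1 / 3, vaf_m2=2 / 3, max_entropy=0.5):
    """Expected fraction of mutations whose multiplicity is hard to assign."""
    fraction = 0.0
    for k in range(depth + 1):
        if assignment_entropy(k, depth, vaf_m1, vaf_m2) > max_entropy:
            # Weight the ambiguous read count by how often the mixture produces it.
            fraction += 0.5 * (
                binom_pmf(k, depth, vaf_m1) + binom_pmf(k, depth, vaf_m2)
            )
    return fraction


if __name__ == "__main__":
    for depth in (30, 60, 120):
        print(depth, uncertain_fraction(depth))
```

under these assumptions the ambiguous fraction shrinks quickly as depth grows, which is consistent with peaks being detectable at coverages where multiplicities, and hence ccfs, remain uncertain for many mutations.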
therefore, while peaks could be determined for these data, mutation multiplicity assessment would have required higher coverage than was available. in summary, these analyses revealed that the problem of validating cna calls, compared to determining ccf estimates, can be approached with lower coverage and purity values using cnaqc.

discussion

wgs is a powerful approach to detect extensive mutations that drive human cancers. many large-scale initiatives such as pcawg (campbell et al. ), the hartwig medical foundation (priestley et al. ) and genomics england (turnbull et al. ) have already generated wgs data for thousands of cancer patients, with many cancer institutes converging towards these efforts. calling mutations from wgs data requires complex bioinformatics pipelines (barnell et al. ; cmero et al. ; li et al. ) and any downstream analysis relies upon these calls, putting the quality of the generated data under the spotlight.

cnaqc offers the first principled framework to control the quality of tumour mutation calls. the tool can analyse snvs and more general types of nucleotide substitutions; snvs are more reliable and depend less on alignment quality than other mutations, and therefore should be checked first.
cnaqc uses a peak-detection analysis to validate cna segments and purity, exploiting a combinatorial model for cancer alleles. within the same framework, cnaqc also computes ccf values, highlighting mutations for which such values are uncertain. cnaqc features can be used to clean up data, automating parameter choice for virtually any caller, prioritizing good calls and selecting information for downstream analyses.

the cnaqc framework leverages the relationship between tumour vaf and ploidy. the quality of the control process itself depends on the ability to process the vaf spectrum and detect peaks. therefore, if the vaf quality is very low because, e.g., the sample has low purity or coverage, the overall quality of the check decreases, making it more difficult to completely automate quality checking. however, for the large majority of samples, cnaqc provides a very effective and fast way to integrate quality metrics in standard pipelines.

generating high quality calls is just a prelude to more complex analyses that interpret cancer genotypes and their history, with and without therapy (ding et al. ; landau et al. ; caravagna et al. , ; jamal-hanjani et al. ; turajlic et al. ; caravagna et al. ). cnaqc can pass a sample at an early stage, leaving open the possibility of assessing, at a later stage, whether the quality of the data is high enough to approach specific research questions. with the ongoing implementation of large-scale sequencing efforts, cnaqc provides a good solution for modular pipelines that self-tune parameters based on quality scores. to our knowledge, this is the first stand-alone tool that leverages the power of combining the most common types of cancer mutations (snvs and cnas) to automatically control the quality of wgs assays. we believe cnaqc can help reduce the burden of manual quality checking and parameter tuning.
references

bailey, matthew h., collin tokheim, eduard porta-pardo, sohini sengupta, denis bertrand, amila weerasinghe, antonio colaprico, et al. . “comprehensive characterization of cancer driver genes and mutations.” cell ( ): – .e . https://doi.org/ . /j.cell. . . .
barnell, erica k., peter ronning, katie m. campbell, kilannin krysiak, benjamin j. ainscough, lana m. sheta, shahil p. pema, et al. . “standard operating procedure for somatic variant refinement of sequencing data with paired tumor and normal samples.” genetics in medicine: official journal of the american college of medical genetics ( ): – . https://doi.org/ . /s - - -z.
benjamin, david, takuto sato, kristian cibulskis, gad getz, chip stewart, and lee lichtenstein. . “calling somatic snvs and indels with mutect .” biorxiv, december, . https://doi.org/ . / .
boeva, valentina, andrei zinovyev, kevin bleakley, jean-philippe vert, isabelle janoueix-lerosey, olivier delattre, and emmanuel barillot. . “control-free calling of copy number alterations in deep-sequencing data using gc-content normalization.” bioinformatics ( ): – . https://doi.org/ . /bioinformatics/btq .
campbell, peter j., gad getz, jan o. korbel, joshua m. stuart, jennifer l. jennings, lincoln d. stein, marc d. perry, et al. . “pan-cancer analysis of whole genomes.” nature ( ): – . https://doi.org/ . /s - - - .
caravagna, giulio, ylenia giarratano, daniele ramazzotti, ian tomlinson, trevor a. graham, guido sanguinetti, and andrea sottoriva. . “detecting repeated cancer evolution from multi-region tumor sequencing data.” nature methods ( ): – . https://doi.org/ . /s - - -x.
caravagna, giulio, alex graudenzi, daniele ramazzotti, rebeca sanz-pamplona, luca de sano, giancarlo mauri, victor moreno, marco antoniotti, and bud mishra. , . “algorithmic methods to infer the evolutionary trajectories in cancer progression.” proceedings of the national academy of sciences of the united states of america ( ): e – . https://doi.org/ . /pnas. .
caravagna, giulio, timon heide, marc j. williams, luis zapata, daniel nichol, ketevan chkhaidze, william cross, et al. . “subclonal reconstruction of tumors by using machine learning and population genetics.” nature genetics ( ): – . https://doi.org/ . /s - - - .
cmero, marek, ke yuan, cheng soon ong, jan schröder, niall m. corcoran, tony papenfuss, christopher m. hovens, florian markowetz, and geoff macintyre. . “inferring structural variant cancer cell fraction.” nature communications ( ): . https://doi.org/ . /s - - - .
cortés-ciriano, isidro, jake june-koo lee, ruibin xi, dhawal jain, youngsook l. jung, lixing yang, dmitry gordenin, et al. . “comprehensive analysis of chromothripsis in , human cancers using whole-genome sequencing.” nature genetics ( ): – . https://doi.org/ . /s - - - .
cross, william, michal kovac, ville mustonen, daniel temko, hayley davis, ann-marie baker, sujata biswas, et al. . “the evolutionary landscape of colorectal tumorigenesis.” nature ecology & evolution ( ): – . https://doi.org/ . /s - - -z.
dentro, stefan c., david c. wedge, and peter van loo. . “principles of reconstructing the subclonal architecture of cancers.” cold spring harbor perspectives in medicine ( ). https://doi.org/ . /cshperspect.a .
ding, li, timothy j. ley, david e. larson, christopher a. miller, daniel c. koboldt, john s. welch, julie k. ritchey, et al. . “clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing.” nature ( ): – . https://doi.org/ . /nature .
favero, f., t. joshi, a. m. marquard, n. j. birkbak, m. krzystanek, q. li, z. szallasi, and a. c. eklund. . “sequenza: allele-specific copy number and mutation profiles from tumor sequencing data.” annals of oncology: official journal of the european society for medical oncology / esmo ( ): – . https://doi.org/ . /annonc/mdu .
fischer, andrej, ignacio vázquez-garcía, christopher j. r. illingworth, and ville mustonen. . “high-definition reconstruction of clonal composition in cancer.” cell reports ( ): – . https://doi.org/ . /j.celrep. . . .
gerstung, moritz, clemency jolly, ignaty leshchiner, stefan c. dentro, santiago gonzalez, daniel rosebrock, thomas j. mitchell, et al. . “the evolutionary history of , cancers.” nature ( ): – . https://doi.org/ . /s - - - .
gonzalez-perez, abel, christian perez-llamas, jordi deu-pons, david tamborero, michael p. schroeder, alba jene-sanz, alberto santos, and nuria lopez-bigas. . “intogen-mutations identifies cancer drivers across tumor types.” nature methods ( ): – . https://doi.org/ . /nmeth. .
greaves, mel, and carlo c. maley. . “clonal evolution in cancer.” nature ( ): – . https://doi.org/ . /nature .
jamal-hanjani, mariam, gareth a. wilson, nicholas mcgranahan, nicolai j. birkbak, thomas b. k. watkins, selvaraju veeriah, seema shafi, et al. . “tracking the evolution of non-small-cell lung cancer.” the new england journal of medicine ( ): – . https://doi.org/ . /nejmoa .
kent, david g., and anthony r. green. - . “order matters: the order of somatic mutations influences cancer evolution.” cold spring harbor perspectives in medicine ( ). https://doi.org/ . /cshperspect.a .
landau, dan a., scott l. carter, petar stojanov, aaron mckenna, kristen stevenson, michael s. lawrence, carrie sougnez, et al. . “evolution and impact of subclonal mutations in chronic lymphocytic leukemia.” cell ( ): – . https://doi.org/ . /j.cell. . . .
levine, arnold j., nancy a. jenkins, and neal g. copeland. . “the roles of initiating truncal mutations in human cancers: the order of mutations and tumor cell type matters.” cancer cell ( ): – . https://doi.org/ . /j.ccell. . . .
li, yilong, nicola d. roberts, jeremiah a. wala, ofer shapira, steven e. schumacher, kiran kumar, ekta khurana, et al. . “patterns of somatic structural variation in human cancer genomes.” nature ( ): – . https://doi.org/ . /s - - - .
macintyre, geoff, teodora e. goranova, dilrini de silva, darren ennis, anna m. piskorz, matthew eldridge, daoud sie, et al. . “copy number signatures and mutational processes in ovarian carcinoma.” nature genetics ( ): – . https://doi.org/ . /s - - - .
martincorena, iñigo, joanna c. fowler, agnieszka wabik, andrew r. j. lawson, federico abascal, michael w. j. hall, alex cagan, et al. . “somatic mutant clones colonize the human esophagus with age.” science ( ): – . https://doi.org/ . /science.aau .
martincorena, iñigo, amit roshan, moritz gerstung, peter ellis, peter van loo, stuart mclaren, david c. wedge, et al. . “high burden and pervasive positive selection of somatic mutations in normal human skin.” science ( ): – . https://doi.org/ . /science.aaa .
mcgranahan, nicholas, and charles swanton. . “biological and therapeutic impact of intratumor heterogeneity in cancer evolution.” cancer cell ( ): – . https://doi.org/ . /j.ccell. . . .
mcgranahan, nicholas, and charles swanton. . “clonal heterogeneity and tumor evolution: past, present, and the future.” cell ( ): – . https://doi.org/ . /j.cell. . . .
nik-zainal, serena, peter van loo, david c. wedge, ludmil b. alexandrov, christopher d. greenman, king wai lau, keiran raine, et al. . “the life history of breast cancers.” cell ( ): – . https://doi.org/ . /j.cell. . . .
/nature http://paperpile.com/b/rqvmzs/pf t http://paperpile.com/b/rqvmzs/cimd http://paperpile.com/b/rqvmzs/cimd http://paperpile.com/b/rqvmzs/cimd http://paperpile.com/b/rqvmzs/cimd http://paperpile.com/b/rqvmzs/cimd http://paperpile.com/b/rqvmzs/cimd http://dx.doi.org/ . /nejmoa http://paperpile.com/b/rqvmzs/cimd http://paperpile.com/b/rqvmzs/df v http://paperpile.com/b/rqvmzs/df v http://paperpile.com/b/rqvmzs/df v http://paperpile.com/b/rqvmzs/df v http://paperpile.com/b/rqvmzs/df v http://dx.doi.org/ . /cshperspect.a http://paperpile.com/b/rqvmzs/df v http://paperpile.com/b/rqvmzs/tqet http://paperpile.com/b/rqvmzs/tqet http://paperpile.com/b/rqvmzs/tqet http://paperpile.com/b/rqvmzs/tqet http://paperpile.com/b/rqvmzs/tqet http://paperpile.com/b/rqvmzs/tqet http://dx.doi.org/ . /j.cell. . . http://paperpile.com/b/rqvmzs/tqet http://paperpile.com/b/rqvmzs/sxxl http://paperpile.com/b/rqvmzs/sxxl http://paperpile.com/b/rqvmzs/sxxl http://paperpile.com/b/rqvmzs/sxxl http://paperpile.com/b/rqvmzs/sxxl http://dx.doi.org/ . /j.ccell. . . http://paperpile.com/b/rqvmzs/sxxl http://paperpile.com/b/rqvmzs/tmou http://paperpile.com/b/rqvmzs/tmou http://paperpile.com/b/rqvmzs/tmou http://paperpile.com/b/rqvmzs/tmou http://paperpile.com/b/rqvmzs/tmou http://dx.doi.org/ . /s - - - http://paperpile.com/b/rqvmzs/tmou http://paperpile.com/b/rqvmzs/p yv http://paperpile.com/b/rqvmzs/p yv http://paperpile.com/b/rqvmzs/p yv http://paperpile.com/b/rqvmzs/p yv http://paperpile.com/b/rqvmzs/p yv http://paperpile.com/b/rqvmzs/p yv http://dx.doi.org/ . /s - - - http://paperpile.com/b/rqvmzs/p yv http://paperpile.com/b/rqvmzs/ug x http://paperpile.com/b/rqvmzs/ug x http://paperpile.com/b/rqvmzs/ug x http://paperpile.com/b/rqvmzs/ug x http://paperpile.com/b/rqvmzs/ug x http://paperpile.com/b/rqvmzs/ug x http://dx.doi.org/ . 
/science.aau http://paperpile.com/b/rqvmzs/ug x http://paperpile.com/b/rqvmzs/ mqr http://paperpile.com/b/rqvmzs/ mqr http://paperpile.com/b/rqvmzs/ mqr http://paperpile.com/b/rqvmzs/ mqr http://paperpile.com/b/rqvmzs/ mqr http://paperpile.com/b/rqvmzs/ mqr http://dx.doi.org/ . /science.aaa http://paperpile.com/b/rqvmzs/ mqr http://paperpile.com/b/rqvmzs/zohm http://paperpile.com/b/rqvmzs/zohm http://paperpile.com/b/rqvmzs/zohm http://paperpile.com/b/rqvmzs/zohm http://paperpile.com/b/rqvmzs/zohm http://dx.doi.org/ . /j.ccell. . . http://paperpile.com/b/rqvmzs/zohm http://paperpile.com/b/rqvmzs/ lh http://paperpile.com/b/rqvmzs/ lh http://paperpile.com/b/rqvmzs/ lh http://paperpile.com/b/rqvmzs/ lh http://dx.doi.org/ . /j.cell. . . http://paperpile.com/b/rqvmzs/ lh http://paperpile.com/b/rqvmzs/bhgv http://paperpile.com/b/rqvmzs/bhgv http://paperpile.com/b/rqvmzs/bhgv http://paperpile.com/b/rqvmzs/bhgv http://paperpile.com/b/rqvmzs/bhgv http://dx.doi.org/ . /j.cell. . . http://paperpile.com/b/rqvmzs/bhgv https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / househam et al. a fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. priestley, peter, jonathan baber, martijn p. lolkema, neeltje steeghs, ewart de bruijn, charles shale, korneel duyvesteyn, et al. . “pan-cancer whole-genome analyses of metastatic solid tumours.” nature ( ): – . https://doi.org/ . /s - - -y. turajlic, samra, hang xu, kevin litchfield, andrew rowan, stuart horswell, tim chambers, tim o’brien, et al. . “deterministic evolutionary trajectories influence primary tumor growth: tracerx renal.” cell ( ): – .e . https://doi.org/ . /j.cell. . . . turnbull, clare, richard h. scott, ellen thomas, louise jones, nirupa murugaesu, freya boardman pretty, dina halai, et al. . “the genomes project: bringing whole genome sequencing to the nhs.” bmj (april): k . https://doi.org/ . /bmj.k . van loo, peter, silje h. 
nordgard, ole christian lingjærde, hege g. russnes, inga h. rye, wei sun, victor j. weigman, et al. . “allele-specific copy number analysis of tumors.” proceedings of the national academy of sciences of the united states of america ( ): – . https://doi.org/ . /pnas. . watkins, thomas b. k., emilia l. lim, marina petkovic, sergi elizalde, nicolai j. birkbak, gareth a. wilson, david a. moore, et al. . “pervasive chromosomal instability and karyotype order in tumour evolution.” nature ( ): – . https://doi.org/ . /s - - - . zaccaria, simone, and benjamin j. raphael. . “accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data.” nature communications ( ): . https://doi.org/ . /s - - -y. data availability multiregion colorectal cancer data is deposited in ega under accession number egas . pcawg calls are publicly available at ( https://dcc.icgc.org/), the icgc data portal. cnaqc is implemented as an open source r package that is hosted at the github space of the caravagna lab https://caravagnalab.github.io/cnaqc/. the tool webpage contains rmarkdown tutorial vignettes to run cnaqc analysis of a generic dataset, as well as documents that explain visualisation and parameterizations of the execution. all analyses in this paper can be replicated following the vignettes. authors contribution .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint http://paperpile.com/b/rqvmzs/ up http://paperpile.com/b/rqvmzs/ up http://paperpile.com/b/rqvmzs/ up http://paperpile.com/b/rqvmzs/ up http://paperpile.com/b/rqvmzs/ up http://paperpile.com/b/rqvmzs/ up http://dx.doi.org/ . 
/s - - -y http://paperpile.com/b/rqvmzs/ up http://paperpile.com/b/rqvmzs/ji a http://paperpile.com/b/rqvmzs/ji a http://paperpile.com/b/rqvmzs/ji a http://paperpile.com/b/rqvmzs/ji a http://paperpile.com/b/rqvmzs/ji a http://paperpile.com/b/rqvmzs/ji a http://dx.doi.org/ . /j.cell. . . http://paperpile.com/b/rqvmzs/ji a http://paperpile.com/b/rqvmzs/mwfz http://paperpile.com/b/rqvmzs/mwfz http://paperpile.com/b/rqvmzs/mwfz http://paperpile.com/b/rqvmzs/mwfz http://paperpile.com/b/rqvmzs/mwfz http://paperpile.com/b/rqvmzs/mwfz http://dx.doi.org/ . /bmj.k http://paperpile.com/b/rqvmzs/mwfz http://paperpile.com/b/rqvmzs/yagn http://paperpile.com/b/rqvmzs/yagn http://paperpile.com/b/rqvmzs/yagn http://paperpile.com/b/rqvmzs/yagn http://paperpile.com/b/rqvmzs/yagn http://dx.doi.org/ . /pnas. http://paperpile.com/b/rqvmzs/yagn http://paperpile.com/b/rqvmzs/ncpj http://paperpile.com/b/rqvmzs/ncpj http://paperpile.com/b/rqvmzs/ncpj http://paperpile.com/b/rqvmzs/ncpj http://paperpile.com/b/rqvmzs/ncpj http://paperpile.com/b/rqvmzs/ncpj http://dx.doi.org/ . /s - - - http://paperpile.com/b/rqvmzs/ncpj http://paperpile.com/b/rqvmzs/rmmc http://paperpile.com/b/rqvmzs/rmmc http://paperpile.com/b/rqvmzs/rmmc http://paperpile.com/b/rqvmzs/rmmc http://dx.doi.org/ . /s - - -y http://paperpile.com/b/rqvmzs/rmmc https://dcc.icgc.org/ https://caravagnalab.github.io/cnaqc/ https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / househam et al. a fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. all authors conceived the method, which gc formalised and implemented. all authors analysed the data and wrote the manuscript. competing interests. the authors declare no competing interests. 
online methods

cnaqc supports two human genome references (grch and hg ), and the most common cna profiles found in cancers:

● heterozygous diploid states (1:1);
● loss of heterozygosity (loh) in monosomy (1:0) and copy-neutral (2:0) states;
● triploid (aab or 2:1) or tetraploid (aabb or 2:2) states.

(the notation 1:1 is sometimes analogously expressed as genotype ab, 1:0 as a, 2:1 as aab and 2:2 as aabb.)

we make a simplifying assumption, whereby cnas have been acquired in one step, starting from a simple heterozygous diploid state (the germline). for this reason, for tetraploid segments we only consider copy state 2:2, instead of 3:1 or 4:0. this keeps the computations simple: in practice, we avoid working with copy states for which the computation of ccfs is very difficult, and that are quite unlikely to be observed in real data.

also, we consider only clonal cna segments. while subclonal cna segments are certainly important for cancer genomics, the calls that we seek to quality check are clonal cna events only; being the ones present in the majority of cancer cells, they have to be prioritised, and subclonal cnas are only reliable for tumours with good clonal cna calls.

cnaqc works primarily with whole-genome sequencing (wgs) data. for exome data, the reduced exonic mutation burden can make it more difficult to work with the spectrum of the vaf distribution. in general, the key determinant for detecting peaks in the vaf is the number of mutations per copy state. for tumours with strong exogenous mutagenic factors (e.g., smoking) or very high mutation rate (e.g., microsatellite unstable tumours), the number of exonic mutations could be high enough to use cnaqc.

peak-detection qc

we consider a somatic mutation present in m copies of the tumour genome, when the sample purity is π and the segment ploidy is p. note that p can be computed by summing the number of copies of the minor and major alleles at the mutation locus (figure ). the key equations for the expected vaf of a clonal mutation and its ccf are presented in the main text. here we discuss how peaks can be used to qc both tumour purity and cna segments and, consequently, overall tumour ploidy. from a qc perspective, if we fix m and π in those equations, we get the expected vaf v of a clonal mutation as

$$ v = \frac{m \pi}{2 (1 - \pi) + p \pi} $$

which means that if we know tumour purity and cna, we expect a peak at vaf v, for a given value of m, in the data distribution (figure a and b). for instance, for a 1:1 segment (p = 2), the expected vaf for a heterozygous clonal (m = 1) mutation is % for a %-purity tumour, and % for a %-purity tumour. similarly, for a 2:2 genome (p = 4) of a tumour with % purity, the expected vaf for clonal mutations accrued before genome doubling and therefore visible in two copies (m = 2) is ~ %, while for those accrued after genome doubling, and therefore present in single copy (m = 1), we expect a ~ % vaf (dentro, wedge, and van loo ).

cnaqc checks the data for peaks at these vafs, with a tolerance ε > 0. from the distance between the theoretical expectation and the estimator derived from data, we obtain an error metric for the calls.

cnaqc first performs peak detection from the input vaf with two separate methods:

1. via a kernel density estimation with fixed bandwidth, which is used to determine a smooth density profile. peaks are then estimated from the discretized smooth density, using specialised r packages for peak detection and removing peaks with density below a parameterized cutoff.
2. via a binomial mixture from the bmix package (caravagna et al. ) (https://caravagn.github.io/bmix/), where a peak is associated with the binomial probability of each mixture component.

peaks are matched to the expected theoretical values based on their euclidean distance. a theoretical peak can be matched to the closest peak in the data, or to the rightmost peak of the frequency spectrum. this latter strategy works only if there are no miscalled cnas. the first strategy (closest match) is the default cnaqc choice.

for every peak a qc value (pass or fail) is determined based on some tolerance ε > 0. the overall qc status of a copy state with multiple peaks is the qc of the peak with most mutations underneath.
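the peak check described so far can be illustrated with a minimal python sketch. this is not the cnaqc implementation (cnaqc is an r package): the function names, the gaussian kernel, the bandwidth of 0.02, the grid size and the density cutoff are all assumptions made here for illustration. the sketch computes the expected vaf of a clonal mutation from m, π and p, finds the local maxima of a fixed-bandwidth kernel density estimate, and matches an expected peak to the closest data peak within a tolerance ε.

```python
import math


def expected_vaf(m, purity, ploidy):
    """Expected VAF of a clonal mutation present in m copies of a segment."""
    return m * purity / (2 * (1 - purity) + ploidy * purity)


def kde_peaks(vafs, bandwidth=0.02, grid_size=201, min_density=0.0):
    """Local maxima of a fixed-bandwidth Gaussian KDE on a grid over [0, 1].

    bandwidth, grid_size and min_density are illustrative values,
    not CNAqc defaults.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    norm = len(vafs) * bandwidth * math.sqrt(2 * math.pi)
    density = [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in vafs) / norm
        for x in grid
    ]
    peaks = []
    for i in range(1, grid_size - 1):
        is_local_max = density[i] > density[i - 1] and density[i] >= density[i + 1]
        # Drop peaks whose density falls below the cutoff.
        if is_local_max and density[i] >= min_density:
            peaks.append(grid[i])
    return peaks


def match_closest(expected, data_peaks, tolerance):
    """Match an expected peak to the closest data peak (closest-match strategy).

    Returns the matched peak, the absolute error, and True (QC pass)
    when the error is within the tolerance.
    """
    matched = min(data_peaks, key=lambda peak: abs(peak - expected))
    error = abs(matched - expected)
    return matched, error, error <= tolerance
```

under these assumptions, the absolute distance returned by match_closest plays the role of the error metric, and a copy state with several expected peaks would take the pass or fail status of the peak with most mutations underneath.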
the overall qc status for a sample with many copy states is determined by combining the qc status of the individual copy states, weighted by the number of mutations associated with each (majority rule).

ccf estimation

cnaqc can compute ccfs in two ways. one uses the idea of the mixture highlighted in figure c; the other is simpler and works better when data resolution is low, and the entropy of the mixture model would leave too many mutations unassigned.

for the mixture approach, we build a two-component binomial mixture from the theoretical expectations and the data. this implicitly assumes that peaks have been qced first. we constrain the success parameters θ to match the expected vafs, and use the proportions of mutations that appear underneath each peak as mixing proportions π (here π denotes the mixture weights, not the purity). then, from the latent variables z of the model we compute the probability of assigning a mutation n with vaf x_n to cluster c,

$$ p(z_{n,k} = c \mid \theta, \pi) $$

from this information we obtain the entropy h(z), which is low for values that are assignable to only one cluster. recall in this respect that the maximum entropy distribution is the uniform one, which is when a mutation can be equally likely in one or two copies, based on its vaf. we use a simple peak detection heuristic to find points of change in h(z); in between those values we cannot reliably assess m, i.e. assess if the mutation is in single or double copy. for these mutations cnaqc leaves the ccf value as na.

the alternative approach uses a simpler idea, still working on the expected theoretical vafs. here, instead of fitting a mixture, we determine the midpoint o between the two expected theoretical vaf peaks. the midpoint is computed by weighting each of the two peaks proportionally to the number of mutations that appear underneath each peak. the midpoint is a cut: values below o are in single copy, values above in two. this procedure requires data with good sequencing coverage, and a good general quality.
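both ways of assigning multiplicities can be sketched in a few lines of python. again this is an illustration rather than the cnaqc code: the function names are hypothetical, read counts (nv variant reads out of dp total reads) are assumed as input for the binomial components, and the weighted midpoint is one literal reading of the description above. the last function inverts the expected-vaf relation, under the assumption that this matches the ccf formula of the main text.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def assignment_probabilities(nv, dp, peak_vafs, mixing):
    """Posterior probability that a mutation with nv variant reads out of dp
    belongs to each component, with success parameters fixed to the
    expected VAF peaks and the given mixing proportions."""
    weighted = [w * binomial_pmf(nv, dp, v) for v, w in zip(peak_vafs, mixing)]
    total = sum(weighted)
    return [value / total for value in weighted]


def entropy(probabilities):
    """Shannon entropy (natural log); maximal when components are equally likely."""
    return -sum(q * math.log(q) for q in probabilities if q > 0)


def weighted_midpoint(peak_vafs, counts):
    """Cut between the two expected peaks, each weighted by its mutation count."""
    return sum(v * c for v, c in zip(peak_vafs, counts)) / sum(counts)


def multiplicity_from_cut(vaf, cut):
    """Single copy below the cut, two copies at or above it."""
    return 1 if vaf < cut else 2


def ccf(vaf, m, purity, ploidy):
    """CCF obtained by inverting the expected-VAF relation for multiplicity m."""
    return vaf * ((ploidy - 2) * purity + 2) / (m * purity)
```

in this sketch a mutation sitting on one of the expected peaks gets an entropy close to zero, while a mutation near the crossing of the two components gets an entropy close to log 2; the latter is the kind of mutation whose ccf would be left as na.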
when mutation multiplicities have been determined, ccf computation is trivial, and follows the formula presented in the main text. a qc pass status is assigned to the ccf values for a copy state if less than % (or any custom threshold) of them are unassigned. the overall sample is given a qc status based on a majority policy.

genome fragmentation

some recently identified patterns of somatic cna changes can be attributed to the presence of highly fragmented tumour genomes, termed chromothripsis and chromoplexy, or localised hypermutation patterns, termed kataegis (cortés-ciriano et al. ). while these can be identified using dedicated bioinformatics tools, cnaqc offers a simple statistical test to detect the presence of over-fragmentation in a chromosome arm, a prerequisite that could point to the presence of such patterns.

the test works at the level of each chromosome arm (1p, 1q, 2p, 2q, etc.), and uses the length of each input cna segment to assign a “long segment” or “short segment” status. this is determined by a cut parameter μ that is set, by default, to % (i.e., μ = ). then, a null hypothesis is used to compute a p-value. this is defined using a binomial test based on k, the number of trials given by the total segment counts in the arm, and s, the observed number of short segments.
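the counting step just defined, together with a one-tailed binomial tail probability, can be sketched in python as follows. this is a hedged illustration and not the cnaqc code: the function names are hypothetical, a segment is assumed to be short when its length is below μ times the arm length, and each segment is assumed to be short with probability μ under the null. μ is left as an explicit argument because no default value is assumed here.

```python
import math


def binomial_tail(s, k, mu):
    """P(X >= s) for X ~ Binomial(k, mu): the one-tailed p-value."""
    return sum(
        math.comb(k, i) * mu ** i * (1 - mu) ** (k - i) for i in range(s, k + 1)
    )


def arm_fragmentation_test(segment_lengths, arm_length, mu):
    """Count trials (k) and short segments (s) in one arm and return the p-value.

    Assumption: a segment is 'short' when it is shorter than mu * arm_length.
    """
    k = len(segment_lengths)
    s = sum(1 for length in segment_lengths if length < mu * arm_length)
    return k, s, binomial_tail(s, k, mu)
```

for example, an arm of length 100 split into four segments of length 1 and one of length 96, tested with an illustrative μ of 0.2, gives k = 5 and s = 4.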
the binomial distribution for the null hypothesis h0 is defined by μ, and the null is the probability of observing at least s short segments, a one-tailed test for whether the observations are biased towards shorter segments. the p-value is adjusted for family-wise error rate by bonferroni, dividing the desired α-value by the number of tests.

this test is applied to a subset of chromosome arms with a minimum number of segments, and that “jump” in ploidy by a minimum amount (empirical default values estimated from trial data). the arm-level jump is determined as the sum of the differences between the ploidy of each pair of consecutive dna segments. these covariates are similar to those used to infer cna signatures from single-cell low-pass wgs (macintyre et al. ).

other features

cnaqc contains multiple functions to subset the data (e.g., select mutations that map only to certain copy states, subset cnas with a given total ploidy, etc.), visualise the data (e.g., plot mutational burden by tumour genome) or smooth the input cna segments. smoothing is an operation that can be carried out before testing for over-fragmentation. in cnaqc, smoothing merges two contiguous segments if they have exactly the same ploidy profile (i.e. same numbers for the major and minor alleles), and if they are at most a maximum distance apart (e.g. megabase). this operation does not affect the ploidy profile of the calls, but reduces the number of breakpoints that would otherwise inflate the significance of the binomial over-fragmentation test.

main text figures

figure . a. theoretical vaf histogram for diploid 1:1 mutations in a tumour. a clonal heterozygous mutation has % vaf; all mutations are observed with some binomial sequencing noise. the clonal mutations form a peak at % ccf, plus other features that characterise the tumour clonal composition (e.g., the tail). the expected theoretical vaf decreases if sample purity reduces. b. the case of a : tumour genome, where we expect peaks in the vaf originating from mutations present in one (orange) or two copies (purple). the multiplicity of a mutation can phase whether it happened before or after the cna. for : we expect peaks at % and % vaf, both clonal mutations ( % ccf). c. computing ccfs requires caution for mutations with different multiplicities; we support : , : and : copy states in cnaqc, and offer two methods to compute ccfs. the one depicted is based on the entropy of a binomial mixture. from the expected vaf peaks we construct a mixture density and use the entropy of its latent variables to capture uncertainty in the multiplicities.
at the crossing of the components we cannot easily assign multiplicities, and therefore ccfs; the entropy peaks at the top of the uncertainty by definition. d. heatmap expressing the relationship between copy states, mutation multiplicity and sample purity. the color reflects the expected vaf for the corresponding mutations, and can be used to qc both cnas and purity estimates.

figure . a. genome-wide total clonal copy number segments for a pcawg cancer sample with overall ploidy , and sample purity ~ %. the panel is composed of three illustrations. the bottom plot reports the copies of the major and minor alleles in each segment, and some genome areas are shaded. the central plot shows genome-wide somatic mutations with their depth of sequencing, and the top plot shows the total number of mappable mutations binned every megabase. b. variant allele frequencies (vafs) for the mutations that map to the input segments (note that these are all snvs). c. depth of sequencing (dp) for every snv. d. number of reads (nv) with the variant allele for every snv. e. cancer cell fractions (ccf) estimation for this sample, obtained from cnaqc.

figure . a-d. peak detection analysis assessing the quality of cna segments (split by copy state), and tumour purity. the shaded gray areas are the input mutations, and the thin black profile is their kernel density estimate (kde). the black circles represent the peaks detected from the kde, and the vertical dashed lines are the expected peaks, given the tumour purity. if the data peaks fall within the shaded area surrounding the vertical line, the estimates are consistent and the plot is therefore green (qc pass). for copy states with total copy number > , multiple peaks are checked independently. in that case the overall qc status for the copy state is a linear combination of the results, weighted by the number of mutations assignable to each peak. e-h. cancer cell fractions (ccf) estimation for each tumour genome, using the entropy method. each plot shows both the ccf and the vaf from which mutation multiplicities are computed. in the rightmost panel we overlay the entropy profile computed by a -dimensional binomial mixture. areas within the red vertical dashed lines are those for which cnaqc cannot assign a confident ccf value. for copy states : and : the mutation multiplicity is fixed to by definition.
figure . a. circos plot for four possible whole-genome cna segmentations determined by sequenza with wgs data (~ x median coverage, purity %). the input sample is set _ , one of four multi-region biopsies for colorectal cancer patient set . the first run is with default sequenza parameters. with cnaqc, we slightly adjust purity estimation and obtain a final run of the tool. we also show one run forcing overall tumour ploidy to (tetraploid), and one with maximum tumour purity %. b. purity and ploidy estimation for the four sequenza runs. arrows show the adjustment proposed by cnaqc; the default and final runs are the only ones to pass qc. c. final run with perfect results for set _ : copy number segments, depth of coverage per mutation and mutation density per megabase. d. miscalled copy-neutral loh segment, obtained by forcing a tetraploid solution in sequenza. for a : segment with the estimated sequenza purity we expected peaks at ~ % and ~ % vaf, which cannot be matched. e. cna calling with cnaqc and sequenza for wgs biopsies of the primary colorectal cancer set .

figure . a. summary cnaqc pass or fail barplot for top-quality pcawg samples n = across distinct tumour types. failures for peaks are with a % error tolerance, and ccfs with % of snvs not assignable, per copy state. b. zoom peak analysis with a scatter showing, for every tumour type, the total cases per tumour against the proportion of pass or fails; each dot size is proportional to the error measure from mismatched peaks.

supplementary figures

supplementary figure s . pcawg sample with low mutational burden.
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . sample set _ (multi-region).
supplementary figure s . pcawg sample with overestimated % purity.
supplementary figure s . pcawg sample with true % purity.

supplementary figure s . example pcawg medulloblastoma sample with low mutational burden, which passes data qc with cnaqc. a. data for the sample (genome-wide cna segments, ccf and read counts distribution). note that this sample has only snvs in diploid tumour regions, like we observe in whole-exome assays. b,c. peak analysis and ccf computation for diploid snvs.

supplementary figure s . colorectal multi-region sample set _ for patient set (see also main text figure ). a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set (see also main text figure ). a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set (see also main text figure ). a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set . a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set . a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set . a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set . a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set . a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.

supplementary figure s . colorectal multi-region sample set _ for patient set . a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b,c. peak analysis and ccf computation for the sample.
international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / househam et al. a fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. supplementary figure . example pcawg sample with purity of %. a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b. this sample has % of its snvs in diploid tumour regions, where a small peak is detectable at the expected purity. the vaf clearly peaks at ~ %, possibly suggesting a purity of % or lower, rather than %. further doubts about the current purity come from non-diploid regions, where all peaks are mismatched; for this sample cnas called with a low-purity solution should be compared to the % purity solution. c. ccf computation for the sample. notice that in triploid and tetraploid tumour genomes we do not find mutations present in copies. was this true then the tumour did not acquire any snv right before the cna. also, here we are not cross-checking qc results from peak .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / househam et al. a fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. 
detection; for instance we could decide to use only mutations that map to pass states ( : , : ), and reject all others.
supplementary figure . example pcawg pancreatic adenocarcinoma with % purity (and possible driver snvs, of them involving tumour suppressor genes in loh regions). a. data for the sample (genome-wide cna segments, ccf and read counts distribution). b. this sample has % of its snvs in diploid tumour regions, and the others in a variety of distinct cna segments. from a peak analysis point of view, all the calls are validated. c. ccf values for this sample are also good.

simultaneous estimation of per cell division mutation rate and turnover rate from bulk tumour sequence data
gergely tibély , , dominik schrempf , imre derényi , , gergely j. szöllősi , ,
mta-elte “lendület” evolutionary genomics research group, pázmány p. stny. a, h- budapest, hungary
department of biological physics, eötvös university, pázmány p.
stny. a, h- budapest, hungary
mta-elte statistical and biological physics research group, pázmány p. stny. a, h- budapest, hungary
institute of evolution, centre for ecological research, konkoly-thege m. út - . h- budapest, hungary
february ,

abstract
tumors often harbor orders of magnitude more mutations than healthy tissues. the increased number of mutations may be due to an elevated mutation rate, to frequent cell death and correspondingly rapid cell turnover (which increases the number of cell divisions and hence mutations), or to some combination of both mechanisms. it is difficult to disentangle the two based on widely available bulk sequencing data, where mutations from individual cells are intermixed. as a result, the cell lineage tree of the tumor cannot be resolved. here we present a method that can simultaneously estimate the cell turnover rate and the rate of mutations from bulk sequencing data by averaging over ensembles of cell lineage trees parameterized by cell turnover rate. our method works by simulating tumor growth and matching the observed data to these simulations by choosing the best fitting set of parameters according to an explicit likelihood-based model. applying it to a real tumor sample, we find that both the mutation rate and the intensity of death are high.

author summary
tumors frequently harbor an elevated number of mutations compared to healthy tissue. these extra mutations may be generated either by an increased mutation rate or by the presence of cell death, which results in increased cellular turnover and additional cell divisions for tumor growth. separating the effects of these two factors is a nontrivial problem. here we present a method which can simultaneously estimate cell turnover rate and genomic mutation rate from bulk sequencing data. our method is based on the maximum likelihood estimation of the parameters of a generative model of tumor growth and mutations. applying our method to a human hepatocellular carcinoma sample reveals an elevated per cell division mutation rate and high cell turnover.

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted february , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).

introduction
cancer is an evolutionary phenomenon within a host organism that unfolds on the timescale of years or more. new mutations can appear with each cell division, while cells can also die for reasons such as lack of nutrients or immune reactions. bulk sequencing only assays mutation frequencies for a population of cells from each tumor sample and does not resolve the genotypes of individual cells. due to this limitation, basic evolutionary parameters, in particular the cell turnover rate and the per cell division mutation rate, remain unknown, with estimated values spanning several orders of magnitude [ ]. while tumors can contain a large number of mutations, it is not clear whether this is due to an elevated mutation rate or to frequent cell death, as frequent cell death results in more cell divisions, which, in turn, give rise to more mutations. there are arguments for both cases [ , , , , ], but distinguishing between these two alternatives is difficult because we cannot resolve the tumor’s cell lineage tree from bulk sequencing data. in previous work [ , ], an elevated number of mutations was observed, but only the combined effect of the mutation rate and the death rate could be estimated. williams et al. [ ] targeted the problem of separating these two quantities by separately sequencing in bulk multiple samples from the same tumor, thus resolving a coarse-grained cell lineage tree.
however, it is not clear whether this approach resolves the cell lineage tree in sufficient detail to identify the regime of frequent cell death, where the number of mutations is orders of magnitude larger than under growth without cell death.

here, we describe a method to simultaneously estimate the per cell division mutation rate and the turnover rate (the ratio of death and birth rates) of a tumor from bulk sequencing data. the estimation is based on a maximum likelihood fit of the parameters of a birth-death model to the measured mutant and wild-type read counts. while requiring only a single tumor-normal sample pair, the fitting procedure can differentiate between death rates that are extremely close to the critical value where the birth rate equals the death rate, resulting in accurate estimation of the mutation rate across orders of magnitude.

the rest of the paper is structured as follows. after introducing our model and the fitting procedure in sec. methods, we assess model accuracy on simulated data in sec. results on synthetic data. results on empirical data are described in sec. results on empirical data, and conclusions are given in sec. discussion.

methods
model
we describe the evolution of tumor cells with the cell lineage tree, i.e. the bifurcating tree traced out by cell divisions. as cells that have died cannot be observed by sequencing, we consider the tree spanned by surviving cells. the leaves of this tree correspond to extant cells and internal nodes to observed cell divisions.
to model the descendance of the extant cells, we employ the conditioned birth-death process with birth rate $\alpha$ and death rate $\beta$ and a fixed number of sampled cells $n$ [ ]. we measure branch lengths in numbers of expected birth events (the product of birth rate and time). consequently, the birth rate $\alpha$ acts as a scaling constant that sets the unit of time, and we set it equal to 1 without loss of generality. as a result, the death rate determines the turnover rate: $t = \beta/\alpha = \beta$.

mutations occur with a rate $\mu$ per site per cell division. mutations are considered neutral, and we neglect the probability that a site is hit by a mutation more than once, in accordance with the infinite sites hypothesis.

the data available from bulk sequencing are the mutant and wild-type read counts of sites. therefore, we will use the site frequency spectrum, which can be estimated from read count data, to separate the effects of the mutation rate and cell death intensity. the site frequency spectrum reflects the branch length distribution of the tumor's cell lineage tree. the tree's leaves are the sequenced cells, and its root is the most recent common ancestor of these cells. changing the turnover rate modifies the shape of the cell lineage tree by changing the relative lengths of branches closer to the root compared to terminal ones, and as a result modifies the site frequency spectrum. changing the mutation rate, on the other hand, simply results in more mutations, thus leaving the shape of the tree, and by extension the overall shape of the site frequency spectrum, unchanged (see fig. ). it should be noted, however, that the information we will use is more detailed than the site frequency spectrum, namely, the read count pairs of mutated sites, which contain more information than just one rational number.
e.g., mutant reads out of reads and out of both contribute mutation to the frequency / , while the uncertainty of the first case is significantly lower than for the second case. it also makes it quite straightforward to include nucleotide-dependent transition probabilities, or trinucleotide context-based effects.

site frequency spectra derived from tumors also contain the effects of the ploidy of the sites and the contamination of the sample by normal cells. the corresponding spectrum is termed the variant allele frequency (vaf) spectrum. vafs are also affected by the finite sequencing depth, which gives rise to stochastic variation in the observed allele frequencies.

throughout the paper, the following notation is used: branches of the cell lineage tree are denoted by the index $k$, the length of branch $k$ is denoted by $l_k$, and $L = \sum_k l_k$ denotes the sum of all branch lengths in the tree. the numbers of mutations per site from cell divisions along branch $k$ are poisson distributed and their sum is also poisson distributed. the number of expected cell divisions along branch $k$ is $l_k$; therefore, the number of mutations per site on branch $k$ follows a poisson distribution with parameter $l_k\mu$. similarly, the total number of mutations is poisson distributed with parameter $L\mu n_{\mathrm{sites}}$.

inference
to compare different combinations of mutation and turnover rates describing the observed empirical data we employ a maximum likelihood approach. first, we derive the likelihood of the observed data, $\mathcal{L}(D|\mu,t)$, as a function of the mutation rate and the turnover rate. as described below, we maximize this likelihood function averaged over a random sample of cell lineage trees with fixed turnover rate $t$ in order to estimate the parameters $t$ and $\mu$ that are most likely to have generated the observed data.

figure : two possible scenarios for the generation of mutations along cell lineage trees. a): different turnover rates lead to different lineage tree shapes. bifurcations are cell divisions, leaves are cells comprising the bulk sequencing sample. note that the (surviving) tree topologies are the same, only branch lengths differ. b): mutations, symbolized by purple stars, accumulate at cell divisions. high turnover rate and low mutation rate can lead to the same number of observed mutations as low turnover rate and high mutation rate; however, the mutation spectra of the trees are different. c): for simulated trees of leaves, the differences in the branch length distribution are clearly visible. d): vafs of the mutation spectra. fractions of mutant cells are binned (note the logarithmic scale). ploidy is set to two, contamination is zero. simulated sequencing depth is .

first we derive $\mathcal{L}(D|\mu,\mathcal{T})$, the likelihood of the observed data for a fixed cell lineage tree $\mathcal{T}$. it is assumed that sites collect mutations independently of each other; consequently, $\mathcal{L}(D|\mu,\mathcal{T})$ takes the form of a product over sites:

$$\mathcal{L}(D|\mu,\mathcal{T}) = \prod_{i \,\in\, \mathrm{sites}} P(m_i|\mu,\mathcal{T},r_i) \tag{1}$$

where $m_i$ is the number of reads exhibiting a mutation at site $i$, and $r_i$ is the total number of reads covering site $i$.
to calculate the probability of observing $m_i$ mutant reads out of a total of $r_i$ reads we consider the following alternatives: i) if $m_i = 0$, either a mutation occurred, with probability $p_{\mathrm{mut}}(\mu,L) = 1 - \exp(-\mu L)$ (see also sec. methods), but no mutant read was observed out of $r_i$ reads, with probability $f(0,r_i,\mathcal{T})$, or no mutation occurred, with probability $1 - p_{\mathrm{mut}}(\mu,L)$; or ii) a mutation occurred with probability $p_{\mathrm{mut}}(\mu,L)$ and $m_i$ mutant reads were observed out of $r_i$, with probability $f(m_i,r_i,\mathcal{T})$:

$$P(m_i|\mu,\mathcal{T},r_i) = \begin{cases} p_{\mathrm{mut}}(\mu,L)\cdot f(0,r_i,\mathcal{T}) + \big(1-p_{\mathrm{mut}}(\mu,L)\big), & m_i = 0 \\ p_{\mathrm{mut}}(\mu,L)\cdot f(m_i,r_i,\mathcal{T}), & m_i > 0 \end{cases} \tag{2}$$

to compute the probability $f(m,r,\mathcal{T})$ of observing $m$ mutant reads out of $r$ total reads given the cell lineage tree $\mathcal{T}$, we assume that the mutant reads descend from a single mutation that occurred at some point along branch $k$, which has a length $l_k$ and from which a fraction $f_k$ of sequenced cells descend, and take the sum over all branches:

$$f(m,r,\mathcal{T}) = \sum_k \frac{l_k}{L}\cdot \mathrm{binom}(m,r,f_k) = \sum_k \frac{l_k}{L}\cdot \binom{r}{m}\, f_k^{\,m}\,(1-f_k)^{\,r-m}, \tag{3}$$

where $L = \sum_k l_k$ and $\mathrm{binom}(m,r,f_k)$ is the probability mass function of the binomial distribution, i.e. the probability of getting exactly $m$ successes in $r$ independent bernoulli trials with a probability of success $f_k$. we consider multiple mutations at the same site as a single mutation, and neglect all subsequent mutations after the first one. in all our applications we verified that $\mu L \ll 1$ is fulfilled.

sequencing errors can lead to an excess of false mutant reads. to account for them, we introduce a parameter $\varepsilon$ denoting the probability of a sequencing error at each position of each read. for $\varepsilon > 0$ new possibilities arise: mutant reads can be real mutants or false mutants due to sequencing errors. it is also possible that all mutant reads at a site are false mutants, and there may or may not be a real mutation. we neglect the case when two or more mutations happen at the same site.
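as an illustration of the error-free site probability defined above, the following python sketch evaluates the branch-weighted binomial mixture $f(m,r,\mathcal{T})$ and the two-case probability of observing $m$ mutant reads, then sums log-probabilities over sites. the tree is represented only by a list of (l_k, f_k) pairs; that representation, the function names, and the toy numbers are assumptions made for this example and are not taken from the authors' implementation.

```python
import math


def binom_pmf(m, r, f):
    """Probability of exactly m successes in r Bernoulli trials with success probability f."""
    return math.comb(r, m) * f**m * (1.0 - f) ** (r - m)


def f_reads(m, r, branches):
    """Branch-length-weighted mixture of binomials.

    branches is a list of (l_k, f_k) pairs: branch length and the fraction
    of sequenced cells descending from that branch (hypothetical layout).
    """
    total_length = sum(length for length, _ in branches)
    return sum(length / total_length * binom_pmf(m, r, frac)
               for length, frac in branches)


def site_probability(m, r, mu, branches):
    """Probability of m mutant reads out of r at one site, without sequencing errors."""
    total_length = sum(length for length, _ in branches)
    p_mut = 1.0 - math.exp(-mu * total_length)
    p = p_mut * f_reads(m, r, branches)
    if m == 0:
        # No mutant read is also what we see when no mutation occurred at all.
        p += 1.0 - p_mut
    return p


def log_likelihood(read_counts, mu, branches):
    """Sum of per-site log-probabilities; read_counts is a list of (m_i, r_i) pairs."""
    return sum(math.log(site_probability(m, r, mu, branches))
               for m, r in read_counts)


if __name__ == "__main__":
    # Toy tree with four branches; values are illustrative only.
    toy_branches = [(1.0, 0.5), (1.0, 0.5), (0.5, 0.25), (0.5, 0.25)]
    toy_reads = [(0, 20), (5, 20), (11, 20)]
    print(log_likelihood(toy_reads, mu=1e-3, branches=toy_branches))
```

because the binomial mixture sums to one over $m$, the site probabilities for $m = 0, \dots, r$ also sum to one, which is a convenient sanity check for any implementation of this model.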
each position in each read can now be either wild type, false wild type, real mutant or false mutant:

$$P(m_i|\mu,\mathcal{T},r_i,\varepsilon) \approx p_{\mathrm{mut}}(\mu,L) \sum_k \frac{l_k}{L}\, \frac{r_i!}{(r_i-m_i)!\,m_i!}\, \big(f_k(1-\varepsilon) + (1-f_k)\varepsilon\big)^{m_i} \big((1-f_k)(1-\varepsilon) + f_k\varepsilon\big)^{r_i-m_i} + \big(1-p_{\mathrm{mut}}(\mu,L)\big) \binom{r_i}{m_i}\, \varepsilon^{m_i} (1-\varepsilon)^{r_i-m_i} \tag{4}$$

note that eq. (4) contains the $m_i = 0$ case.

finally, we introduce the ability to differentiate mutant types, to conform to the case of real dna, which has 3 possible mutant types. so far, it was assumed that each site can have 2 states, wild type or mutant, corresponding to a dna consisting of only two types of nucleotides instead of four. therefore, instead of the mutant read count $m$, we introduce three mutant read counts, corresponding to the three possible mutant types, $m^{(1)}, m^{(2)}, m^{(3)}$. consequently, the input data now consist of triplets of mutant read counts, instead of one scalar mutant read count. this leads to the use of a multinomial distribution with four states: wild type and 3 mutant types. the possibility of more than one real mutant type at the same site is still neglected, being very rare, technically a second-order process in the mutation probability of one site. we also neglect the probability of more than one error hitting the same site.
the likelihood function at a single site is then

$$P\big(m_i^{(1)},m_i^{(2)},m_i^{(3)}\,\big|\,\mu,\mathcal{T},r_i,\varepsilon\big) \approx p_{\mathrm{mut}}(\mu,L) \sum_k \frac{l_k}{L}\, \frac{1}{3} \sum_{j=1}^{3} \mathrm{mult}\Big( \big(m_i^{(j)}, m_i^{(j+1)}, m_i^{(j+2)}, r_i - \textstyle\sum_{j'} m_i^{(j')}\big);\ r_i;\ \big(p_m^{(j)}(f_k,\varepsilon), p_m^{(j+1)}(f_k,\varepsilon), p_m^{(j+2)}(f_k,\varepsilon), p_w(f_k,\varepsilon)\big) \Big) + \big(1-p_{\mathrm{mut}}(\mu,L)\big) \cdot \mathrm{mult}\Big( \big(m_i^{(1)}, m_i^{(2)}, m_i^{(3)}, r_i - \textstyle\sum_{j} m_i^{(j)}\big);\ r_i;\ \big(p_m^{(j)}(0,\varepsilon), p_m^{(j+1)}(0,\varepsilon), p_m^{(j+2)}(0,\varepsilon), p_w(0,\varepsilon)\big) \Big) \tag{5}$$

and

$$p_m^{(j)}(f_k,\varepsilon) = f_k(1-\varepsilon) + (1-f_k)\,\varepsilon/3 \tag{6}$$
$$p_m^{(j+1)}(f_k,\varepsilon) = f_k\,\varepsilon/3 + (1-f_k)\,\varepsilon/3 \tag{7}$$
$$p_m^{(j+2)}(f_k,\varepsilon) = f_k\,\varepsilon/3 + (1-f_k)\,\varepsilon/3 \tag{8}$$
$$p_w(f_k,\varepsilon) = f_k\,\varepsilon/3 + (1-f_k)(1-\varepsilon) \tag{9}$$

where $(j+1)$ and $(j+2)$ denote the other two possible mutant types with cyclic notation $(j) = (j+3)$, and mult is a multinomial distribution, the arguments of which denote (random variables), $n_{\mathrm{draws}}$, (event probabilities). the factor of $1/3$ is due to the assumption that only one true mutation can be present at one site, and each of the 3 possible mutated forms has the probability $1/3$. each $\mathrm{mult}(m^{(j)} \dots)$ term is a conditional probability conditioned on the true mutant type being $(j)$.

the straightforward approach for treating the unknown cell lineage tree as a nuisance parameter would be to average over all trees $\mathcal{T}$:

$$\mathcal{L}(D|\mu,t) = \sum_{\mathcal{T}} \mathcal{L}(D|\mu,\mathcal{T}) \cdot P_{\mathrm{bd}}(\mathcal{T}|t), \tag{10}$$

where $P_{\mathrm{bd}}(\mathcal{T}|t)$ is the probability of the cell lineage tree $\mathcal{T}$ given the conditioned birth-death process with turnover rate $t$. due to the very large number of possible trees, however, the above average is intractable and we must resort to sampling a finite number of trees drawn from the conditioned birth-death process with fixed $t$.
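the four-state site likelihood defined above can be sketched in python as follows. the sketch assumes, following the text, that the true mutant type is one of three forms with equal weight 1/3 and that an erroneous read lands on each of the three other bases with probability ε/3; the (l_k, f_k) branch list, the function names, and the toy parameter values are assumptions made for this example rather than the authors' code.

```python
import math


def multinomial_pmf(counts, probs):
    """Multinomial probability of the given counts under the given event probabilities."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)  # stays an exact integer at every step
    p = float(coeff)
    for c, q in zip(counts, probs):
        p *= q**c
    return p


def read_probs(f, eps):
    """Per-read probabilities when a fraction f of cells carries the true mutant type.

    Returns (true mutant type, each other mutant type, wild type).
    """
    p_true = f * (1.0 - eps) + (1.0 - f) * eps / 3.0
    p_other = f * eps / 3.0 + (1.0 - f) * eps / 3.0
    p_wild = f * eps / 3.0 + (1.0 - f) * (1.0 - eps)
    return p_true, p_other, p_wild


def site_probability_four_state(mutant_counts, r, mu, eps, branches):
    """Probability of a triplet of mutant read counts out of r reads at one site."""
    counts = tuple(mutant_counts) + (r - sum(mutant_counts),)
    total_length = sum(length for length, _ in branches)
    p_mut = 1.0 - math.exp(-mu * total_length)

    with_mutation = 0.0
    for length, frac in branches:
        p_true, p_other, p_wild = read_probs(frac, eps)
        for j in range(3):  # condition on the true mutant type being j
            probs = [p_other, p_other, p_other, p_wild]
            probs[j] = p_true
            with_mutation += (length / total_length) / 3.0 * multinomial_pmf(counts, probs)

    # No real mutation: every mutant read is a sequencing error.
    no_mutation = multinomial_pmf(counts, (eps / 3.0, eps / 3.0, eps / 3.0, 1.0 - eps))
    return p_mut * with_mutation + (1.0 - p_mut) * no_mutation


if __name__ == "__main__":
    toy_branches = [(1.0, 0.5), (0.5, 0.25)]  # illustrative values only
    print(site_probability_four_state((3, 0, 1), r=30, mu=1e-2, eps=1e-3,
                                      branches=toy_branches))
```

the four per-read probabilities sum to one for any $f_k$ and $\varepsilon$, so summing the site probability over all admissible count triplets for a fixed $r$ should also give one.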
based on empirical experience we found that using the geometric mean of $\mathcal{L}(D|\mu,\mathcal{T})$ for a finite sample of trees sampled according to $P_{\mathrm{bd}}(\mathcal{T}|t)$ results in more robust inference. the geometric mean approximates the average probability of inference [ ] or equivalently the average surprisal [ ] of cell lineage trees given the turnover rate $t$, which we denote

$$\bar{\mathcal{L}}(D|\mu,t) = \prod_{\mathcal{T}} \mathcal{L}(D|\mu,\mathcal{T})^{P_{\mathrm{bd}}(\mathcal{T}|t)} = \exp\Big(\sum_{\mathcal{T}} P_{\mathrm{bd}}(\mathcal{T}|t)\, \ln \mathcal{L}(D|\mu,\mathcal{T})\Big). \tag{11}$$

in practice, during inference of the turnover rate $t$ and mutation rate $\mu$, the log-average over a finite number of trees drawn from the conditioned birth-death process with turnover rate $t$ is maximized:

$$\ln \bar{\mathcal{L}}(D|\mu,t) = \frac{1}{N_{\mathrm{trees}}} \sum_{\mathcal{T}} \ln \mathcal{L}(D|\mu,\mathcal{T}) \tag{12}$$

generating trees
to generate cell division trees from the conditioned birth-death process, we use the elynx software suite [ ], which allows freely adjustable birth rates, death rates, and tree sizes.

generating synthetic samples
for generating synthetic samples of read counts of mutated dna sites, trees simulated by elynx are used as genealogical trees of hypothetical tumors. for each site, first we determine the total read count at that site. then, we draw random numbers to check whether any of the branches contributes a mutation, according to the poisson process described in sec. inference. if there is a mutation, the true mutant read count is drawn from a hypergeometric distribution. the number of successes of the hypergeometric distribution is the number of leaves of the selected branch. the number of failures is the total number of leaves multiplied by ploidy and divided by the hypothetical purity of the sample, minus the number of successes. the number of trials is the total read count. finally, errors are introduced by drawing a quadruplet of read counts (“wildtype errors”) from a multinomial distribution with probabilities $(\varepsilon/3, \varepsilon/3, \varepsilon/3, 1-\varepsilon)$, where the number of trials is the wild-type read count.
“mutant errors” are drawn from another multinomial distribution, with probabilities $(1-\varepsilon, \varepsilon/3, \varepsilon/3, \varepsilon/3)$, where the number of trials is the mutant read count. the final read counts are given by the sum of the two drawn quadruplets. the mutation rates for different turnover rates are chosen so that the total numbers of observed real mutations remain close to each other, i.e., the estimation algorithm should have a similar amount of input data.

calculating the likelihood
the goal is to find the maximum of the likelihood as a function of the mutation rate, turnover rate, and error rate, to be able to use it for estimating the mutation and turnover rates by eq. . the input is the read counts of the dna sites. we use pre-generated division trees from the elynx suite at pre-determined turnover rate values. between these pre-determined turnover rate values, the likelihood function is interpolated using cubic splines. in the case of synthetic input datasets, the tree used to generate the test dataset is never included in the likelihood calculation. the maximum of the likelihood function for each fixed turnover rate value is obtained by optimizing the error rate using brent’s method, implemented in julia’s optim package, and estimating the mutation rate from the number of input mutations and the branch lengths of the currently fitted tree. only mutations having read counts high enough to exclude sequencing errors are taken into account in the mutation rate estimation. the estimated mutation rate is averaged over trees, using uniform weights.
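to make the tree-averaging step concrete, the sketch below computes the log-average of per-tree likelihoods (the logarithm of their geometric mean) on a grid of turnover rates and picks the grid value with the largest score. it is a simplified illustration: the per-tree log-likelihood values are made-up placeholders, the function names are hypothetical, and the cubic-spline interpolation between grid points and the optimization of the error rate described in the text are deliberately left out.

```python
import math


def log_average_likelihood(tree_log_likelihoods):
    """Mean of per-tree log-likelihoods, i.e. the log of their geometric mean."""
    return sum(tree_log_likelihoods) / len(tree_log_likelihoods)


def log_arithmetic_mean_likelihood(tree_log_likelihoods):
    """Log of the arithmetic mean of the likelihoods, computed via log-sum-exp."""
    peak = max(tree_log_likelihoods)
    total = sum(math.exp(x - peak) for x in tree_log_likelihoods)
    return peak + math.log(total / len(tree_log_likelihoods))


def best_turnover(log_likelihoods_by_t):
    """Return the grid value of t whose trees give the largest log-average likelihood."""
    return max(log_likelihoods_by_t,
               key=lambda t: log_average_likelihood(log_likelihoods_by_t[t]))


if __name__ == "__main__":
    # Placeholder per-tree log-likelihoods for three grid values of t.
    grid = {
        0.5: [-1520.3, -1518.9, -1523.4],
        0.9: [-1498.2, -1501.7, -1499.5],
        0.99: [-1503.8, -1510.1, -1506.0],
    }
    for t, values in grid.items():
        print(t, log_average_likelihood(values), log_arithmetic_mean_likelihood(values))
    print("best grid value of t:", best_turnover(grid))
```

the arithmetic mean is dominated by the single best-fitting tree, whereas the geometric mean weighs all sampled trees equally on the log scale; by jensen's inequality the log-average never exceeds the log of the arithmetic mean.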
therefore, the likelihood function is optimized for $\varepsilon$ at different $t$ values, which come from a pre-defined set, for which the trees can be generated in advance, avoiding the need for new trees at each step of the optimization process.

results on synthetic data
no sequencing errors
fig. shows the estimated turnover rates and mutation rates as functions of the true turnover rates and mutation rates. the method can reasonably differentiate between datasets with different true turnover rates-mutation rates and estimate their values. fig. shows the joint estimation of mutation rate-turnover rate parameter pairs. the data points are arranged into lines, corresponding to constant numbers of observed mutations, obeying

$$N^{\mathrm{obs}}_{\mathrm{mut}}(\mu,t) = \mu \cdot \mathbb{E}\Big[\sum_k l_k \cdot \mathrm{ccdf}\big(\mathrm{binom}(n_{\mathrm{seq}}, f_k),\ \ \big)\Big]_{\mathrm{trees}(t)} \tag{13}$$

where ccdf(. . . ) is the complementary cumulative distribution function of a binomial distribution, evaluated at . the parameters of the binomial distribution are the average sequencing depth $n_{\mathrm{seq}}$ and the fraction of leaves under branch $k$, $f_k$. the expected value is taken over the trees generated with turnover rate $t$. the lines defined by eq. (13) can be numerically approximated for any $\mu, t$ points, averaging over a number of trees.

we checked the dependency of the results on the sizes of the trees. according to fig. , estimation of large true turnover rates becomes increasingly harder as the tree sizes decrease. this can be attributed to the fact that differences between the effects of very high turnover rates are observable on branches having a very small relative number of leaves; therefore, large trees are required to distinguish between high turnover rates.

figure : estimated turnover rates (left) and mutation rates (right) for different true values. synthetic datasets for each true value. the trees used for fitting have leaves, trees were used for each t value of the loglikelihood(t) measurements (see fig. ). the continuous line is a guide for the eye, corresponding to y = x. points are slightly dispersed horizontally for clarity. horizontal ordering of the data points is the same for both subplots, e.g., the rightmost point in each group of points corresponds to dataset no. in both plots.

figure : joint estimations of mutation rate-turnover rate parameter pairs. synthetic datasets for each true parameter pair, each of which is denoted by one color. true parameter values are indicated by large full circles. solid lines show the numerical approximation of µ( − t), for nobs mut = · , · , · .

figure : the effect of tree sizes on the estimations. estimated turnover rates for different fitted tree sizes: (top left), (top right), (bottom left), (bottom right).

the accuracies of the estimates for different datasets are not equal, besides the effect of the size of the trees in the case of high turnover rates.
differences between estimates for different datasets can be due to three possible factors: the trees used in the fitting process, the generated input data, or the tree used for generating the data. to check these factors, we chose a dataset which resulted in a turnover rate estimate deviating from the true value (dataset no. for − t = on fig. , − t = . ). we calculated the estimated turnover rate values using independent sets of trees. the estimates ranged from −t = . to . . therefore, the deviation of the estimate from . cannot be attributed to the sample of fitting trees. then, we generated more datasets using the same tree as for the original dataset. the estimated turnover rate values were in the range −t ∈ [ . , . ], even more closely matching the original estimate. consequently, the effect does not depend on the generated data but on the tree used to generate the data. it seems that the deviation of the estimates from the true turnover rates is due to the fluctuation of the shapes of the trees used for sample generation.

effects of sequencing errors

to estimate the effect of sequencing errors, we estimated the turnover rates while applying different amounts of errors to the same data (exactly the same mutant and wild type readcounts for each mutation). in this case, the error rate of the data was also estimated by the fitting procedure, along with the mutation and turnover rates. the influence of sequencing errors on the estimation of the turnover rates is shown in fig. .
for an error rate of − , which is frequently cited as the error rate of the illumina sequencing technology [ ], the estimated turnover rates can deviate significantly from the true values. for lower error rates, the estimates approach the true values; however, outliers remain even for ε = − . we note that the loglikelihoods of the false estimates were better than those of the true values. we tried estimating the parameters while leaving out the least frequent mutations, to reduce the effect of errors, but the estimated parameters deviated significantly more from their true values.

results on empirical data

to estimate the turnover and mutation rates of real tumors, a real human tumor sample is required. due to the fitting method's sensitivity to high sequencing error rates, we need a sample sequenced with a very low error rate technology. such samples are much less common than those sequenced with the standard technology, and are usually restricted to very short genome segments, mostly nonhuman. nevertheless, we found a sample of a human hepatocellular carcinoma (hcc) [ ], which was sequenced using the o n sequencing technology [ ], providing error rates between − - − , which is significantly lower than the − rate of the standard illumina process. besides the low error rate, the number of sequenced positions is enough to cover the targeted region x [ ], which is also much better than for standard quality datasets (typical sequencing depth is around ). high sequencing depth results in more identified mutations and more precise mutation frequencies. sequencing was targeted at a k basepair wide region of the genome, which is much narrower than the whole human genome. this is a typical shortcoming of low error rate sequencing methods. still, as our fitting procedure is much more sensitive to the sequencing error rate than to the number of input mutations (compare the deviations in figs.
and ), the dataset provides significantly better input than typical sequencing data.

the raw sequencing data was preprocessed according to [ ], using the code provided by the authors. the dna contents of cells were sequenced [ ], along with a sample of neighboring normal tissue. mutations were called using varscan [ ], which is flexible and easy to adapt to the requirements of the fitting procedure. sites remained after preprocessing, with sequencing depth being at least (default for varscan ) in both tumor and normal samples. the distribution of sequencing depths is wide, ranging from to , with a mean of . for mutation calling by varscan , the minimum number of mutant reads was set to and the strand filter was switched off. although the number of false positives increases with these parameter choices, the resulting called mutations correspond better to the error model of our fitting procedure than an error rate which changes sharply with threshold frequency or readcount values.

figure : the effect of varying the error rate. sequencing error rates are ε = − (top left), − (top right), − (middle left), − (middle right), − (bottom left), − (bottom right). trees, tree size = . x coordinates are slightly dispersed for clarity. open circles are results obtained with the error rate in the fit fixed to its true value; crosses correspond to error rates estimated by the parameter fit. vertical lines show the ranges between the first and last -quantiles, based on beta value estimations. each open circle-cross pair corresponding to the same dataset is vertically aligned.
the minimum variant frequency was set to − to include even the least frequent mutation. purity was set to . , in accordance with [ ]. we also checked that the default somatic p-value threshold does not exclude any candidate somatic mutations. other parameter settings were unaltered from their default values. mutation frequencies were corrected for copy number variation (cnv), using varscan with default parameters, and for ploidy of the sex chromosomes. cnv detection for targeted sequencing data is a more difficult task than for whole genome data, and varscan was found to be a stable performer [ ]. sites having multiple variant types (i.e., number of reads of wildtype plus most frequent mutant type being lower than the sequencing depth) were checked manually. readcounts of all possible genotypes were identified for all variant sites.

after all these steps, mutations were identified. the variant allele frequency spectrum is shown in fig. .

figure : variant allele frequency spectrum (count versus mutation frequency) of a human hepatocellular carcinoma sample [ ], obtained by o n sequencing [ ].

to estimate the sequencing error rate, the fitting procedure was applied to the mutation data with various fixed error rates in the range − - − . the maximum of the loglikelihood corresponded to ε = − . this is a plausible value, as [ ] estimated the error rate to be between − - − . based on estimating best error rates, the error rate should be between . · − and . · − .
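The scan over fixed error rates described above amounts to evaluating the loglikelihood on a grid and keeping the maximum. The sketch below shows only that scanning step. The likelihood used here is a deliberately simplified stand-in (error reads at each site drawn from a binomial with rate ε), not the tree-based likelihood of the fitting procedure, and the site data and grid values are hypothetical.

```python
from math import lgamma, log


def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1.0 - p)
    )


def toy_loglikelihood(eps, sites):
    """Stand-in likelihood: error reads per site ~ Binomial(depth, eps)."""
    return sum(binom_logpmf(m, d, eps) for d, m in sites)


def scan_error_rates(loglik, grid):
    """Evaluate loglik at every fixed error rate and return the best one."""
    scores = [(loglik(eps), eps) for eps in grid]
    best_score, best_eps = max(scores)
    return best_eps, scores


# Hypothetical (depth, mutant reads) pairs and a log-spaced grid of error rates.
toy_sites = [(100000, 1), (120000, 0), (90000, 2), (110000, 1)]
grid = [10.0**e for e in (-7, -6, -5, -4, -3)]
best_eps, scores = scan_error_rates(lambda e: toy_loglikelihood(e, toy_sites), grid)
```

In the actual procedure, `toy_loglikelihood` would be replaced by the full fit with the error rate held fixed; the range of plausible error rates can then be read from the grid points whose loglikelihood is close to the maximum.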
having determined the error rate, we estimated the mutation and turnover rates, with the error rate fixed, using trees. fig. shows the estimated turnover rate, −t = . · − , and the mutation rate, µ = . · − per site per cell division. corresponding to the range of the error rate, the turnover rate ranges within . · − - . · − . the mutation rate ranges between . · − - . · − . neglecting mutations over frequency . does not alter the results. for illustration, fig. shows the vaf of a synthetic sample, generated using the tree that best fits the empirical data.

figure : loglikelihood-turnover rate and loglikelihood-mutation rate curves of the hcc data. interpolation between data points is by cubic splines.

figure : vaf (count versus mutation frequency) of a synthetic sample, generated using the tree which fits the empirical data the best. the grey outline shows the empirical vaf.

the estimated mutation rate is rather high, compared to estimates of the order of − - − per site per cell division for healthy human somatic tissues [ ]. in comparison with mutation rates of tumors, it is not an outstanding value [ ]. meanwhile, the turnover rate is also high, i.e., the death rate is very close to the birth rate. possible causes include the effect of the immune system, the deleterious nature of driver mutations or competition for resources among tumor cells. in conclusion, for this tumor sample, the high number of mutations is due to a combination of an elevated mutation rate and a high turnover rate.

the results allow estimating the number of cell division rounds from the founding cell to the biopsied tumor. the average height of simulated trees with the estimated parameters is cell divisions. it should be noted that a naive estimation of tree height using log ( . · ) successive branches of average length /( . · − ) is wrong, due to the very different shapes of surviving trees compared to all trees, most of which go extinct before reaching /( . · − ) size.

it is also possible to estimate the lifetime of the hcc sample and the cell division rate of the hcc tumor. the diameter of the tumor is mm, while the length of an hcc cell is µm [ ]. this gives a total number of . · cells in the whole tumor. the median hcc tumor volume doubling time is days [ ]. based on these figures, the lifetime of the analysed sample is around years, and the cell division rate is estimated to be around / /hour.

discussion

in summary, we described a method to simultaneously estimate the mutation rate and the turnover rate, making it possible to answer the question of which of them is responsible for the elevated number of mutations in tumors. in particular, the mutated sites' read counts, which are closely related to the shape of the site frequency spectrum, contain useful information about the turnover rate (death rate, relative to the birth rate), even in the presence of a moderate sequencing error rate. the sequencing error rate can also be estimated. it is also quite straightforward to elaborate the model by including nucleotide-dependent transition probabilities, or trinucleotide context-based effects.
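As a worked illustration of the lifetime and division-rate estimate reported for the hcc sample at the end of the empirical results, the arithmetic can be sketched as follows. This is one plausible reading of the calculation, not a formula stated in the text: it assumes the cell count scales as the cube of the diameter ratio, a constant volume doubling time, and a division rate equal to the simulated tree height divided by the lifetime. All numeric inputs below are hypothetical placeholders, since the original values are not recoverable here.

```python
from math import log2


def tumor_cell_count(tumor_diameter_mm, cell_length_um):
    """Approximate cell number as the cube of the tumor-to-cell length ratio."""
    ratio = tumor_diameter_mm * 1000.0 / cell_length_um
    return ratio**3


def tumor_lifetime_days(n_cells, doubling_time_days):
    """Time to grow from one cell to n_cells at a constant doubling time."""
    return log2(n_cells) * doubling_time_days


def divisions_per_hour(tree_height_divisions, lifetime_days):
    """Average division rate along a lineage over the tumor's lifetime."""
    return tree_height_divisions / (lifetime_days * 24.0)


# Hypothetical inputs, for illustration only.
n_cells = tumor_cell_count(tumor_diameter_mm=20.0, cell_length_um=20.0)
lifetime = tumor_lifetime_days(n_cells, doubling_time_days=100.0)
rate = divisions_per_hour(tree_height_divisions=10000.0, lifetime_days=lifetime)
```

The constant doubling time is the weakest assumption in this sketch: if growth slows as the tumor enlarges, early doublings were faster and the true lifetime would be shorter than this estimate.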
the accuracy of the estimation is influenced by four factors. first, the sharpness of the peak of the loglikelihood function, which tends to be narrow for the expected amount of input data. second, the finite number of trees used in the fit, causing only a slight dispersion for trees. third, the shape of the true lineage tree, which, for death rates extremely close to the birth rate, can distort the estimation by one order of magnitude. finally, the assumptions behind the model (birth-death process with constant rates, neutral mutations from a poisson process) also contribute to the uncertainty of the estimates.

according to the results, the estimation method works sufficiently well to distinguish cases of a small difference between the birth and death rates (α − β ≪ α) from cases of the death rate being much lower than the birth rate. without such capability, the answer to the question “is it the mutation rate or the death rate?” would always be “mutation rate”.

although the method is presented in the context of human tumors, it can handle healthy tissues and samples from other species, too. in theory, any population descending from one ancestor and possessing genetic material can be analyzed; however, longer genomes, which give rise to more mutations, are easier to analyze, due to the increase in the input data for the estimation.

on the more practical side, we also discovered that averaging the loglikelihoods over trees, instead of the likelihoods, gives a significant improvement in the robustness of the results.

concerning sequencing errors, the noise level in the standard illumina technology makes applying the method to typical samples impractical. one solution is to use a sequencing technology with much lower error rates, e.g., [ , ], or even below the − error rate of the pcr process, [ , ]. it should be noted, however, that these technologies have been applied to short dna segments only, resulting in a reduced number of mutations as input data.
another possibility is to apply noise filtering to standard sequencing data, e.g., deepsnv [ ], and modify the error model of the fitting process accordingly. furthermore, when there is no estimate of the order of magnitude of the sequencing error rate, variance in the accuracy of the method can be quite large in individual cases, despite the much better behavior of the averaged results.

despite the shortcomings, it is clear that the signal does exist in the site frequency spectrum: the mean of the estimated turnover rates changes monotonically with the test datasets' true turnover rates and is clearly not independent of them.

besides successes on synthetic data, we were also able to analyze an empirical sample of a hepatocellular carcinoma. we simultaneously estimated the mutation rate and the turnover rate. both quantities were estimated to be much higher than for healthy tissues, the mutation rate being . · − per site per cell division, and the turnover rate t = . . in other words, the high number of mutations in this tumor is caused by a combination of both high mutation and turnover rates. using the turnover rate, we also estimated the number of cell division rounds in the tumor's lifetime, and its cell division rate. the results suggest that tumor cells are constantly dividing, but the growth of the tumor is limited by other factors changing on a much longer timescale, e.g., securing sufficient blood supply, which cause most new tumor cells to die and slow down tumor growth.
with such a high turnover rate, the ability of limitless replication is essential for tumor growth.

it is interesting to note that high turnover rates are able to reproduce subclonal peaks in the vaf, using a purely neutral birth-death process. in other words, subclonal peaks are not necessarily the consequence of selection; neutral processes can also produce them, in which case they indicate strong cell death. on this basis, it is possible to define subclones as branches in the lineage tree that are close to the root and long enough to appear as peaks in the vaf spectrum. using this definition, there is no need to explain different parts of the vaf spectrum with different models [ ].

in this work, the turnover rate was held constant during the evolutionary process. there are signs that it is more realistic to assume a turnover rate which changes during tumor growth [ ]. in our case, the estimated strong cell death suggests that the tumor reached a slowly growing phase, in line with a gompertzian model of tumor growth [ , ], which is corroborated by the large sizes of observed tumors (diameter ≥ cm) used in the doubling time estimation [ ]. it is possible that in the earlier stages of tumor development, cell death was less frequent and doubling time was shorter. it might be the case that the rate of cell division is constant during tumor growth, and doubling time is set by the turnover rate, which is, in turn, limited by external factors. extending the model by allowing the turnover rate to be time-dependent is an interesting direction for future work.

the combined effect of the estimated mutation and turnover rates is a very high effective mutation rate between cell divisions where both daughter branches survive, µeff being in the order of − - − per site per surviving cell division. while this value looks surprisingly large, it is logical that the combination of a slowly growing tumor and fast dividing tumor cells leads to a very large number of mutations.
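The gompertzian picture invoked above, in which doubling time lengthens as the tumor grows, can be made concrete with the standard Gompertz curve. The sketch below is illustrative only: the text does not fit such a model, and the carrying capacity `k`, rate constant `a`, and initial size `n0` are hypothetical parameters.

```python
from math import exp, log


def gompertz_size(t, n0, k, a):
    """Gompertz growth curve: N(t) = k * exp(ln(n0 / k) * exp(-a * t))."""
    return k * exp(log(n0 / k) * exp(-a * t))


def instantaneous_doubling_time(n, k, a):
    """ln(2) divided by the specific growth rate a * ln(k / n), valid for n < k."""
    return log(2.0) / (a * log(k / n))


# Hypothetical parameters: growth from one cell toward a plateau of 1e10 cells.
k, a = 1e10, 0.01
early = instantaneous_doubling_time(1e3, k, a)
late = instantaneous_doubling_time(1e9, k, a)
```

With these placeholder values the doubling time at 1e9 cells is several times longer than at 1e3 cells, which is the qualitative behavior the text appeals to when it suggests that doubling time was shorter in earlier stages of tumor development.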
currently, the method uses a simple birth-death model for tumor growth. in the future, a more realistic growth model, including, e.g., spatial effects [ , ], would enhance the applicability of the method. another possibility for improvement is to model spatial sampling of tissues, in which the measured mutation frequencies intertwine the correlated ancestry of sampled cells with the prevalence of the mutations.

author contributions

(according to https://journals.plos.org/ploscompbiol/s/authorship#loc-author-contributions.) gjsz conceptualized the research project. gt, id and gjsz performed the formal analysis. funding was acquired by id and gjsz. gt carried out the investigation. gt, id and gjsz contributed to the methodology. gt and ds developed the necessary software. gt provided the visualization. gt wrote the draft. ds, id and gjsz reviewed and commented on the manuscript.

acknowledgements

gt and gjsz received funding from the european research council under the european union's horizon research and innovation programme under grant agreement no. . gjsz was also supported by the grant ginop- . . .– – – .

references

[ ] williams mj, werner b, barnes cp, graham ta, sottoriva a. identification of neutral tumor evolution across cancer types. nat genet. ; : – .
[ ] bozic i, gerold jm, nowak ma. quantifying clonal and subclonal passenger mutations in cancer evolution. plos comput biol. ; ( ):e .
[ ] tomlinson i, sasieni p, bodmer w. how many mutations in a cancer? am j pathol. ; : - .
[ ] araten dj, golde dw, zhang rh, thaler ht, gargiulo l, notaro r, et al.
a quantitative measurement of the human somatic mutation rate. cancer research ; : .
[ ] loeb la, bielas jh, beckman ra. cancers exhibit a mutator phenotype: clinical implications. cancer research ; : .
[ ] williams mj, werner b, heide t, curtis c, barnes cp, sottoriva a, et al. quantification of subclonal selection in cancer from bulk sequencing data. nat genet. ; : - .
[ ] werner b, case j, williams mj, chkhaidze k, temko d, fernández-mateos j, et al. measuring single cell divisions in human tissues from multi-region sequencing data. nat comm. ; : .
[ ] gernhard t. the conditioned reconstructed process. j theor biol. ; : .
[ ] maruvka ye, kessler da, shnerb nm. the birth-death-mutation process: a new paradigm for fat tailed distributions. plos one ; ( ):e .
[ ] kessler da, levine h. scaling solution in the large population limit of the general asymmetric stochastic luria-delbrück evolution process. j stat phys. ; : – .
[ ] schrempf d. the elynx suite; [cited sept ] repository: github [internet]. available from: https://github.com/dschrempf/elynx
[ ] höhna s, may mr, moore br. tess: an r package for efficiently simulating phylogenetic trees and performing bayesian inference of lineage diversification rates. bioinformatics ; ( ): - .
[ ] kennedy sr, schmitt mw, fox ej, kohrn bf, salk jj, ahn eh, et al. detecting ultralow-frequency mutations by duplex sequencing. nat protoc. ; : .
[ ] ling s, hu z, yang z, yang f, li y, lin p, et al. extremely high genetic diversity in a single tumor points to prevalence of non-darwinian cell evolution. pnas ; :e -e .
[ ] wang k, lai s, yang x, zhu t, lu x, wu c, et al. ultrasensitive and high-efficiency screen of de novo low-frequency mutations by o n-seq. nat comm. ; : .
[ ] koboldt dc, zhang q, larson de, shen d, mclellan md, lin l, et al. varscan : somatic mutation and copy number alteration discovery in cancer by exome sequencing. genome res. ; : - .
[ ] zare f, dow m, monteleone n, hosny a, nabavi s. an evaluation of copy number variation detection tools for cancer using whole exome sequencing data. bmc bioinformatics ; : .
[ ] lynch m. evolution of the mutation rate. trends genet. ; : .
[ ] an c, chou ya, choi d, paik yh, ahn sh, kim m-j, et al. growth rate of early-stage hepatocellular carcinoma in patients with chronic liver disease. clin mol hepatol. ; : .
[ ] stadler t. on incomplete sampling under birth-death models and connections to the sampling-based coalescent. j theor biol. ; : - .
[ ] lou di, hussmann ja, mcbee rm, acevedo a, andino r, press wh, et al. high-throughput dna sequencing errors are reduced by orders of magnitude using circle sequencing. pnas ; : - .
[ ] kinde i, wu j, papadopoulos n, kinzler kw, vogelstein b. detection and quantification of rare mutations with massively parallel sequencing. pnas ; : - .
[ ] gerstung m, beisel c, rechsteiner m, wild p, schraml p, moch h, et al. reliable detection of subclonal single-nucleotide variants in tumour cell populations. nat comm. ; : .
[ ] caravagna g, heide t, williams mj, zapata l, nichol d, chkhaidze k, et al. subclonal reconstruction of tumors by using machine learning and population genetics. nat genet. ; : – .
[ ] laird ak. dynamics of tumour growth: comparison of growth rates and extrapolation of growth curve to one cell. br j cancer ; : – .
[ ] lo cf. a modified stochastic gompertz model for tumour cell growth. comp math methods medicine ; : - .
[ ] antal t, krapivsky pl, nowak ma. spatial evolution of tumors with successive driver mutations. phys rev e ; : .
[ ] noble r, burri d, kather jn, beerenwinkel n. spatial structure governs the mode of tumour evolution. biorxiv [preprint]. biorxiv [posted mar ; revised apr , cited sept ]: [ p.]. available from: https://doi.org/ . /
[ ] nelson kp. assessing probabilistic inference by comparing the generalized mean of the model and source probabilities. entropy ; : .
[ ] nelson kp. inference assessment on a probability scale. st annual conference on information sciences and systems (ciss) ; - . . /ciss. . .

mutations in bdca and vals correlate with quinolone resistance in wastewater escherichia coli
malekian et al.
research
negin malekian , ali al-fatlawi , thomas u. berendonk and michael schroeder *

abstract

single mutations can confer resistance to antibiotics. identifying such mutations can help to develop and improve drugs. here, we systematically screen for candidate quinolone resistance-conferring mutations. we sequenced highly diverse wastewater e.
coli and performed a genome-wide association study (gwas) correlating over , mutations against quinolone resistance phenotypes. we uncovered statistically significant mutations including one located at the active site of the biofilm dispersal gene bdca and six silent mutations in the aminoacyl-trna synthetase vals. the study also recovered the known mutations in the topoisomerases gyra and parc. in summary, we demonstrate that gwas effectively and comprehensively identifies resistance mutations without a priori knowledge of targets and mode of action. the results suggest that bdca and vals may be novel resistance genes with biofilm dispersal and translation as novel resistance mechanisms.

keywords: e. coli; quinolone; antibiotic resistance; genome-wide association study (gwas)

background

in the sixties, an impurity during the synthesis of the anti-malarial chloroquine led to the discovery of nalidixic acid [ , ]. two years after its introduction to the market, resistance was observed, but it took another ten years before the drug's target and mechanism of action were understood [ ]. subsequently, improved derivatives of nalidixic acid were found, such as norfloxacin and ciprofloxacin and then levofloxacin. today, there are over fluoroquinolones on the market. generally, fluoroquinolones act by converting their targets, gyrase (gyra) and topoisomerase iv (parc), into toxic enzymes that fragment the bacterial chromosome [ ]. with the wide use of quinolones, however, bacteria developed resistance through several routes such as increased expression of efflux pumps, which transport drugs outside the bacterial cell, or horizontal gene transfer of resistance genes, whose gene products bind to the quinolone targets [ ]. however, the most direct route to resistance is mutations in the drug targets gyra and parc. specifically, changes in the amino
specifically, changes in the amino * correspondence: michael.schroeder@tu-dresden.de biotechnology center (biotec), technische universität dresden, tatzberg - , dresden, germany full list of author information is available at the end of the article acids ser and asp of gyra and ser of parc con- fer resistance [ , ] to quinolones. the discovery of these mutations was driven by a deep understanding of the mechanism of action of quinolones. already over years ago, crumplin et al. suggested that “a comparative study of [...] mu- tants and otherwise isogenic bacteria should facilitate identification of the hitherto unknown [...] target” [ ], which was at the time not possible on a genome-wide scale. this changed with the advent of deep sequencing technology. thus, we want to complement the original hypothesis-driven approach to understand resistance [ ] with a hypothesis-free, high-throughput approach, in which we systematically evaluate the mutational landscape of resistant and susceptible bacteria. instead of investigating the quinolone targets in depth for resistance-conferring mutations, we screen entire bacterial genomes of many isolates and corre- late them to patterns of the isolates’ susceptibility and resistance. this approach termed genome-wide associ- ation study, gwas, rose with the advent of deep se- quencing and was initially applied to human genomes and disease phenotypes [ ]. recently, the success of hu- man gwas sparked interest in microbial gwas [ , ]. genome-wide associations in bacteria are challenging, as clonal reproduction in bacteria leads to population .cc-by-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint mailto:michael.schroeder@tu-dresden.de https://doi.org/ . / . . . 
http://creativecommons.org/licenses/by-nd/ . / malekian et al. page of stratification and a non-random association of alleles at different loci (linkage disequilibrium or ld) [ , ]. e. coli’s population structure is predominantly clonal, allowing the delineation of major phylogenetic groups, the largest being a ( %), b ( %), and b and d (both %) [ ]. therefore, any model of a genome-wide association study in e. coli should ac- commodate these groups. interestingly, the groups also relate to pathogenicity: commensal e. coli, as e.g. found in human intestines, are more likely to belong to a and b and pathogenic to b and d. generally, e. coli genomes vary in size between to genes, of which only half are shared by all e. coli [ ]. these genes, which are common to all e. coli, define the core-genome. it can be approximated as the intersection of genes present in a set of genomes. in contrast to the core-genome, the pan-genome is defined as the union of genes in a population. the e. coli pan- genome exceeds genes and has possibly no limit due to their ability to absorb genetic material [ ]. parallel to the core and pan-genome, we coin the core and pan-variome. the former is defined as the intersec- tion and the latter as the union of all mutations across all genomes. mutations correlating with resistance will - by definition - not be part of the core-variome. hence, it is important for a genome-wide association study that there is a significant gap in size between core and pan-variome. a second major challenge besides population strat- ification is the dependencies of loci (linkage disequi- librium). the mutations in gyra and parc correlate with each other, as they belong to the same resistance mechanism. however, following terminology from can- cer biology, all of them are driver mutations, which cause clonal expansion in contrast to passenger mu- tations, which do not influence the fitness of a clone [ ]. 
driver mutations may impact clonal expansion directly by changing the amino acid sequence (non-synonymous mutations) and thus protein structure or function. as an example, the gyra and parc mutations are located at the drug's binding site and therefore influence binding. driver mutations may also act indirectly as synonymous mutations without changes to the amino acid sequence. synonymous mutations may have an effect on splicing, rna stability, rna folding, translation, or co-translational protein folding [ ]. as an example, kimchi et al. showed that a synonymous mutation in the multi-drug resistance gene mdr altered drug and inhibitor interactions. the authors argue that the reason may be a changed timing of co-translational folding and insertion into the membrane [ ]. thus, a genome-wide association study aiming to uncover novel resistance mechanisms should consider both non-synonymous and synonymous mutations, which are independent of already known mechanisms.

to date, it is not fully understood how antibiotic resistance develops. it is ancient and inherent to bacteria [ ] and can therefore be found in the natural environment. but with the wide use of antibiotics, major sources of resistant bacteria are clinics and wastewater [ ]. in particular, the latter plays an important role, since treatment plants act as melting pots for bacteria of human, clinical, animal, and environmental origin [ ]. the high genetic diversity of a clinical e. coli population was substantially exceeded by a wastewater population [ ], which makes wastewater e. coli a suitable source for a gwas analysis.

in summary, we aim to show that a bacterial genome-wide association study can effectively and comprehensively identify targets relevant to antibiotic resistance. we aim to recover the known mutations in gyra and parc together with novel candidate mutations. to maximise genomic diversity, we investigate wastewater e. coli.
we employ a computational approach and implement variant calling on these genomes and then correlate the identified mutations against resistance levels of four quinolones covering first to third generation (nalidixic acid, norfloxacin, ciprofloxacin, and levofloxacin). we apply stringent filtering and cater for missing and rare data, population effects, and dependencies among mutations. building on gyra and parc mutations as controls, we expect to characterise the quantity and quality of the mutational resistance landscape. we will answer the question of whether there are resistance mutations beyond gyra and parc and whether they may open new avenues for future drug discovery.

methods

sequencing and phenotyping. mahfouz et al. collected e. coli isolates from the inflow and outflow of the municipal wastewater treatment plant in dresden, germany. based on representative resistance phenotypes, the authors selected isolates for sequencing with illumina miseq, of which are available from ncbi’s assembly database (prjna : https://www.ncbi.nlm.nih.gov/assembly/?term=prjna ) and the rest from the authors. phage and virus sequences were removed [ ]. the unbiased sampling and selection of representative phenotypes were important for the subsequent gwas, which requires both resistant and susceptible isolates. the isolates were phenotyped using the agar diffusion method, measuring the diameters of the inhibition zones for commonly prescribed antibiotics, including the four quinolones nalidixic acid, norfloxacin, ciprofloxacin, and levofloxacin [ ].
variant calling, quality control, and functional annotation. reads were mapped onto e. coli k mg with the burrows-wheeler aligner (bwa) v . . and sorted with picard v . . variants were called using the genome analysis toolkit gatk . . . [ ] with e. coli k mg as reference. we combined them into a single vcf file and re-genotyped them. next, we filtered variants following standard protocols [ ] and settings according to the gatk . . . website (for snps qd < . , qual < . , or fs > . and for indels qd < . , qual < . , or fs > . ). variants with low genotype quality (gq < ) and variants with > % of missing data were removed. after normalisation with bcftools . [ ], rare variants with minor allele frequency (maf) < % were excluded with pyseer . . . finally, variants were functionally annotated using snpeff . t [ ].

genome-wide association study (gwas). we performed a gwas with pyseer . . [ ], using a generalized linear model for each variant. we built a phylogenetic tree from the vcf file with vcf-kit . . [ ]. using multidimensional scaling (mds) on the distances in the phylogenetic tree, we identified and removed four outlier isolates. for the remaining isolates, we drew a scree plot of the eigenvalues of the mds model and picked four components, which we used as covariates in the regression model to control for population structure. finally, we calculated a bonferroni-corrected significance threshold for our gwas with pyseer.

meta-analysis. we visualized gwas results with quantile-quantile (qq) and manhattan plots using the r package qqman. roc curves and the area under the curve (auc) were calculated using the matplotlib and scikit-learn python packages. we calculated the linkage disequilibrium (ld) between the loci of significant variants using plink v . b . [ ]. the r package ldheatmap [ ] was used to visualize ld results.
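as a rough illustration of the auc step named in this paragraph, the sketch below scores a single binary variant (present or absent) against a binary resistance label, using the rank-based (mann-whitney) formulation of the auc. it relies only on the python standard library rather than on scikit-learn, and the function name and toy data are hypothetical and not taken from the study.

```python
def binary_auc(labels, scores):
    """rank-based auc: probability that a resistant isolate scores higher
    than a susceptible one, counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one resistant and one susceptible isolate")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# hypothetical toy data: 1 = resistant / variant present, 0 = susceptible / absent
resistant = [1, 1, 1, 0, 0, 0]
carries_variant = [1, 1, 0, 0, 0, 1]
toy_auc = binary_auc(resistant, carries_variant)
```

with a binary predictor the roc curve has a single interior point, so the auc mainly summarises how cleanly carrying the variant separates resistant from susceptible isolates.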
we applied and visualized mds on the phylogenetic distances between the samples using the cmdscale and scatter d functions from the stats and plot d r packages, respectively. we drew a heatmap with dendrogram on the binary matrix of presence/absence of variants for different samples using the heatmap function from the r package stats.

d structures. the d structure of bdca was retrieved from the protein data bank (pdb, pcv). the d structure of vals was retrieved from swiss-model (based on pdb structure pdbid ivs). the d structures were visualized using pymol . . .

conservation across other bacterial genomes. we retrieved the multiple sequence alignment enog rq s for bdca across all gammaproteobacteria from eggnog . [ ]. residue in the ungapped bdca sequence was shifted to position in the gapped multiple sequence alignment.

conservation across other e. coli genomes. to check the frequency of bdca g s in other e. coli genomes, we downloaded e. coli genomes from ncbi (https://www.ncbi.nlm.nih.gov/) (accessed on th of october ) and identified the locus in each genome by searching for an exact match of the ten-nucleotide sequence attcacggag, which follows the locus of the bdca mutation and which is conserved across all the retrieved genomes.

results

we aimed to identify mutations that correlate with quinolone resistance. after extracting raw variants from wastewater e. coli genomes, we proceeded in two steps: first, we reduced raw to high-quality variants and then high-quality to highly significant variants.

from raw to high-quality variants. from the genomes, we extracted , raw variants, which we subjected to five quality control steps resulting in , high-quality variants. the removal of rare variants, which appear in less than % of isolates, led to the greatest reduction in mutations of nearly % (table ).

the pan- and core-variome. for a genome-wide association study, it is vital that the mutations spread across the isolates.
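one way to picture this spread is to track how the union of variants (pan-variome) and the intersection of variants (core-variome) change as genomes are added one at a time. the following is a minimal sketch under simple assumptions: each genome is given as a set of variant labels, genomes are added in the order supplied, and the function name and toy labels are hypothetical.

```python
def variome_curves(variant_sets):
    """return (pan_sizes, core_sizes) after adding each genome in turn.

    pan-variome  = union of all variants seen so far
    core-variome = intersection of all variant sets seen so far
    """
    pan = set()
    core = None
    pan_sizes, core_sizes = [], []
    for variants in variant_sets:
        current = set(variants)
        pan |= current
        core = current if core is None else core & current
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes


# hypothetical toy data: three genomes with made-up variant labels
toy_genomes = [
    {"v1", "v2", "v3"},
    {"v1", "v3", "v4"},
    {"v1", "v5"},
]
pan_sizes, core_sizes = variome_curves(toy_genomes)
```

in this toy case the pan-variome grows with every genome while the core-variome shrinks, which is the qualitative pattern a gwas needs: variants that differ between isolates rather than variants shared by all of them.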
to characterise the distribution and diversity of the high-quality mutations, we computed the core- and the pan-variome (see figure ). the core-variome reflects the number of variants shared by a given number of genomes. in contrast, the pan-variome consists of the union of all variants, thus reflecting the total diversity of variants present in all genomes. as expected, the pan-variome grows fast and the core-variome tails off fast. for genomes, the pan-variome already consists of some , variants, while the core-variome is reduced to some variants. this means that only very few variants are shared across many or even all of the genomes. similarly, the graph for the pan-variome continually grows: each added genome contributes new variants until the pan-variome reaches , variants ( , high-quality plus , rare variants) in total. overall, the distribution of variants is thus suitable for gwas.

from high-quality to highly significant variants. next, we carried out a gwas correlating the high-quality variants against resistance levels of the four quinolones investigated (nalidixic acid, norfloxacin, ciprofloxacin, and levofloxacin). two aspects were important: we wanted to control for the population structure and ensure the independence of the novel mutations from the known resistance-conferring mutations.

to assess how well the study controls for population structure, we plotted p-values expected under randomness against observed p-values (see qq plots in figure ).
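the coordinates of such a qq plot can be sketched as follows. this is an illustration only and not the qqman implementation used in the study: it assumes that, under the null hypothesis, the i-th smallest of n p-values is expected at (i - 0.5) / n, and it uses a base-10 logarithm. the function name is hypothetical.

```python
import math


def qq_points(p_values):
    """return (expected, observed) pairs of -log10 p-values for a qq plot.

    assumption: plotting positions (i - 0.5) / n for the i-th smallest p-value.
    """
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for i, p in enumerate(observed, start=1):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        expected = (i - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# p-values that match the null expectation fall on the identity line
null_like = qq_points([0.875, 0.125, 0.625, 0.375])
```

points that lie on the identity line are consistent with the null hypothesis, whereas points rising above it only at the high end of the axis correspond to a small number of strongly associated variants.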
the plots confirm that the correction for population structure was satisfactory, as a deviation from the null hypothesis (the identity line) is only evident at the tail of the plots.

next, we visualized the results of the gwas using manhattan plots, which reveal that some highly significant variants pass the rigorous bonferroni-corrected p-value threshold (the horizontal line). to confirm the level of significance, we evaluated how well these variants predict resistance. to this end, we plotted a receiver operating characteristic (roc) curve and calculated the area under the curve (auc) as a measure of predictive performance. the auc for most of the significant variants was above % (see figure ), reflecting that the identified variants predict resistance very accurately.

summary statistics of the gwas. in total, we obtained highly significant variants, three in gyra and parc and ten novel candidate variants in the five genes bdca, vals, lptg, lptf, and ivy. the variant in bdca leads to an amino acid change, while the remaining nine do not. across all four quinolones, the mutations in gyra and parc ranked highest, thus confirming the validity of the approach taken (table ). as shown in the table, the frequency and effect sizes of the novel candidate variants are on a par with the positive controls. this means that both the existence of an effect (p-value) and the size of the effect (beta) are established. while all variants pass the bonferroni-corrected p-value threshold ( . e- ), the positive controls exceed it very substantially (table ).

novel candidate variants are independent of controls. to check the independence of the significant variants from one another, we measured the linkage disequilibrium (ld) for the loci of these variants (see figure ). the known quinolone resistance-conferring variants, gyra s l, gyra d n, and parc s i, are in ld.
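as a minimal sketch of what a pairwise ld value measures, the snippet below computes the squared correlation r² between two biallelic loci from presence/absence vectors across haploid isolates. it is not the plink implementation used in the study, and the function name and toy vectors are hypothetical.

```python
def ld_r2(locus_a, locus_b):
    """squared correlation r^2 between two binary (0/1) variant vectors.

    d = p_ab - p_a * p_b, and r^2 = d^2 / (p_a (1 - p_a) p_b (1 - p_b)).
    returns nan if either locus is monomorphic.
    """
    if len(locus_a) != len(locus_b) or not locus_a:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(locus_a)
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    p_ab = sum(1 for x, y in zip(locus_a, locus_b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b
    return d * d / denom


# hypothetical toy vectors over four isolates
linked = ld_r2([1, 1, 0, 0], [1, 1, 0, 0])       # variants always co-occur
independent = ld_r2([1, 1, 0, 0], [1, 0, 1, 0])  # no association
```

values near one indicate loci that are almost always inherited together, and values near zero indicate loci that vary independently of each other.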
these three variants are located at the drugs’ binding sites to gyra and parc and ensure the correct function of the gene products despite treatment. the known resistance-conferring variants are not in ld with the ten novel loci, which suggests that the novel loci confer resistance by a different mechanism from gyra and parc. among the novel loci, there are dependencies. in particular, the non-synonymous variant in bdca is in ld with synonymous mutations in vals. this may mean that these novel variants act in a shared mechanism, which raises the question of whether the biological functions of the novel loci can be linked to antibiotic resistance.

biological function of bdca. the bdca gene plays a role in biofilm dispersal [ , ] and, generally, biofilm formation increases antimicrobial resistance [ , ]. it could be hypothesised that a variant in this gene disrupts biofilm dispersal and leads to biofilm formation and resistance. however, while this may happen in nature, it is unclear whether this effect is also present in the disk diffusion assay underlying the present data. this gene is present in nearly all isolates ( - % in our data and ncbi data), which means that it is close to being a core gene, but that it is not essential for survival.

biological function of vals. the vals gene product is an aminoacyl-trna synthetase (aars), which charges the trna for valine with the amino acid valine. the aars enzymes are promising targets for antimicrobial development [ , ], as targeting them can inhibit the translation process, cell growth, and finally cell viability. although aars enzymes are not known as direct quinolone targets, there is evidence that non-synonymous mutations in aars enzymes increase ciprofloxacin resistance by upregulating the expression of efflux pumps [ ]. in our data, we found synonymous vals mutations for ciprofloxacin just below the p-value cut-off. for levofloxacin and norfloxacin, they were above the cut-off.
vals provides a very basic function and is a core gene present in all isolates.

biological function of ivy. the gene product of ivy is a strong inhibitor of lysozyme c. expression of ivy protects porous cell-wall e. coli mutants from the lytic effect of lysozyme, suggesting that it is a response against the permeabilizing effects of the innate vertebrate immune system. as such, ivy acts as a virulence factor for a number of gram-negative bacteria that infect vertebrates [ ].

biological function of lptg and lptf. the gene products of lptg and lptf are part of the abc transporter complex lptbfg involved in the translocation of lipopolysaccharide from the inner membrane to the outer membrane. thus, there is no direct connection to antibiotic resistance; however, the link to transport is in line with other resistance mechanisms such as increased expression of efflux pumps [ ].

structural analysis of bdca and vals. to shed more light on the possible causality of the gwas candidate variants, we explored their protein structures (figure ). the variant gly ser in bdca is in the vicinity of the active site residues ser and tyr [ ]. serine is bigger than glycine and it may influence a loop formed by the residues - and thus regulate the active site, which may influence biofilm dispersal. in vals, the identified variants are synonymous and thus have no direct impact on the structure of the protein. however, for some loci, there were non-synonymous variants such as d e.
therefore, we wanted to understand where the vals mutations are located in the d structure. figure shows the structure of a model for vals in e. coli, which was generated by swiss-model based on a template from thermus thermophilus. the model reveals that the vals mutations are on the surface of the protein.

variant bdca g s with respect to other antibiotics, other e. coli, and other bacterial sequences. for the non-synonymous variant bdca g s, we wanted to understand whether its role in antibiotic resistance is limited to quinolones or not. for other antibiotics [ ], there are variants that correlate significantly with resistance (data not shown). for all antibiotics but tobramycin, the bdca mutation is not significant. this suggests that bdca g s may act independently of fluoroquinolones, which would be consistent with biofilm formation being a general mechanism independent of fluoroquinolones.

next, we wanted to know whether the prevalence of bdca g s in our data is representative of other e. coli genomes. in complete e. coli genomes available from the ncbi, we could find the bdca gene in genomes and bdca g s in . thus, about % of genomes carry this mutation, which is slightly less than, but comparable to, the % present in our data.

bdca is present in other bacteria. we investigated gammaproteobacteria, which comprise pseudomonadaceae besides enterobacteria. we analysed bdca sequences retrieved from eggnog . and found alanine most frequently ( %) and glycine less frequently ( %). serine appeared in % of the species, which may mean that the resistance mechanism is not limited to e. coli.

phylogenetic groups. a key ingredient of the gwas model is the population structure. we applied dimension reduction and hierarchical clustering to isolates represented as high-dimensional binary vectors, where each dimension corresponds to one of the , mutations. we identified four clusters (figure ), which broadly correspond to phylogenetic groups a, b , b , and d.
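to make the clustering of binary presence/absence vectors concrete, here is a small sketch of agglomerative hierarchical clustering. the choices in it are assumptions made for illustration and are not confirmed by the text: hamming distance between isolates, complete linkage, and stopping at a requested number of clusters. the study itself used r functions, and the names and toy vectors below are hypothetical.

```python
def hamming(u, v):
    """number of variant positions at which two isolates differ."""
    return sum(1 for x, y in zip(u, v) if x != y)


def cluster_isolates(vectors, k):
    """naive complete-linkage agglomerative clustering down to k clusters.

    returns a list of clusters, each a sorted list of isolate indices.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: distance between the two farthest members
                d = max(hamming(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# hypothetical toy data: four isolates, four variant positions
toy_isolates = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
toy_clusters = cluster_isolates(toy_isolates, k=2)
```

the pairwise search is cubic in the number of isolates or worse, which is acceptable for a sketch but would be replaced by a library routine for real data.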
thus, our gwas model correctly caters for the main e. coli lineages.

discussion and conclusion

it took over a decade to move from the discovery of nalidixic acid to the discovery of its target and mechanism of action. here, we have shown that sequencing and phenotyping data of a small number of genomes from a single site are sufficient for a gwas model to reveal the quinolone targets with very high statistical significance. furthermore, the gwas model revealed ten new mutations, which correlate significantly with quinolone resistance. a key to the success of the gwas model was an unbiased sampling of isolates, which contained resistant and susceptible isolates.

the most promising mutation is g s in the biofilm dispersal gene bdca, which is present in nearly all isolates, but which is not essential for e. coli survival [ ]. mapping the bdca mutation onto a protein structure of bdca revealed its location on the surface of the protein and close to the active site. this suggests an impact on enzymatic activity, which may influence biofilm dispersal and hence indirectly relate to antibiotic resistance. in fact, ma et al. showed that e. coli bdca controls biofilm dispersal in pseudomonas aeruginosa [ ], which was the most abundant gammaproteobacterium containing bdca in our analysis. this indicates that mutations in e. coli bdca may act indirectly on antibiotic resistance. if bdca consequently emerges as a novel drug target, then the next steps in drug development could target the active site with residues s and y , which are in direct proximity to the mutation bdca g s. importantly, bdca g s is a novel candidate resistance mutation, as it is not in ld with the known mutations in gyra and parc. we found bdca g s in % of the analysed genomes, which appears in line with a prevalence of % in other e. coli genomes obtained from the ncbi.
we also checked the presence of these mutations in other gammaproteobacteria and found that bdca is present and well conserved, but that the mutation appears specific to e. coli. furthermore, we checked whether bdca g s correlates with resistance to non-quinolone antibiotics. this was the case for tobramycin, an aminoglycoside, but not for any of the other examined antibiotics. isolates with the bdca g s mutation belonged to the phylogenetic group a, which is less likely to contain pathogenic isolates. phylogroup a is equally abundant in human faeces and wastewater [ ], which may point to an origin of the mutation in a human rather than a natural environment.

besides bdca g s, we found nine synonymous mutations, whose mechanism of action is likely to be indirect. most interesting are the abundant mutations in the aminoacyl-trna synthetase vals, which has an essential role in protein synthesis and which is part of the core-genome and is therefore present in all isolates. furthermore, it is classified as an essential gene [ ]. aminoacyl-trna synthetases may be suitable drug targets [ ] due to the evolutionary divergence between prokaryotic and eukaryotic enzymes, high conservation across different bacterial pathogens, as well as solubility, stability, and ease of purification. however, since the mutations in vals were synonymous, they will not exert a direct structural or functional effect on the gene product but may act indirectly.
in summary, bdca g s and the discovered silent mutations correlate significantly with quinolone resistance in wastewater e. coli. they appear to be mostly specific to e. coli and to quinolones and independent of known resistance-conferring mutations. further research is needed to corroborate the correlation between these mutations and quinolone resistance and to shed light on the molecular mechanism leading to resistance.

[figure: workflow from wastewater e. coli via resistance phenotyping, sequencing, and variant calling to gwas, yielding bdca mutation(s) and vals mutation(s)]

figure : wastewater e. coli were phenotyped and sequenced. variants were called and correlated to quinolone resistance in a gwas, resulting in novel candidate resistance mutations.

table : quality control (qc): reduction of some . raw variants to . high-quality variants. the filter on rare variants (maf) is the main filter.
step change mutations
. variant calling ,
. hard filters - % ,
. gq filter and missingness - % ,
. normalisation by allele + % ,
. minor allele frequency (maf) - % ,

[figure panels: a) pan-variome, b) core-variome; x-axis: number of genomes; y-axis: number of variants]

figure : pan-variome (union of variants) and core-variome (intersection of variants) of , high-quality and , rare variants ( , in total). most variants appear only in a few of the isolates.
[figure panels: a) levofloxacin, b) norfloxacin, c) ciprofloxacin, d) nalidixic acid. each panel shows a qq plot (x-axis: expected -log (p-value), y-axis: observed -log (p-value)), a manhattan plot (x-axis: position, y-axis: observed -log (p-value)), and roc curves (x-axis: false positive rate, y-axis: true positive rate). labelled variants: a) gyra d n, parc s i, gyra s l, vals r , vals n , vals e , bdca g s, lptg v , lptf q , other vals mutations; b) as in a) plus ivy t ; c) gyra d n, parc s i, gyra s l; d) gyra s l]

figure : gwas analysis. left: qq plots of observed vs. expected p-values show a few highly significant p-values. middle: manhattan plots of chromosomal position vs. p-value show mutations passing the bonferroni-corrected threshold as dots above the red line. right: areas under the roc curves show that the significant mutations predict resistance well (most auc > %).
table : mutations significantly correlating with quinolone resistance. freq. is the relative frequency among isolates and beta the effect size. effect size is similar for all, p-values differ.
quinolone position allele gene effect freq. beta se call rate p-value auc
levofloxacin a parc s i . - . . % . e- %
t gyra d n . - . . % . e- %
a gyra s l . - . . % . e- %
t bdca g s . - . . % . e- %
a vals r . - . . % . e- %
a vals n . - . . % . e- %
t vals e . - . . % . e- %
a vals d . - . . % . e- %
a vals v . - . . % . e- %
t vals l . - . . % . e- %
a lptf q . - . . % . e- %
a lptg v . - . . % . e- %
norfloxacin a parc s i . - . . % . e- %
t gyra d n . - . . % . e- %
a gyra s l . - . . % . e- %
t bdca g s . - . . % . e- %
a vals r . - . . % . e- %
t vals e . - . . % . e- %
a vals n . - . . % . e- %
a vals d . - . . % . e- %
a vals v . - . . % . e- %
t vals l . - . . % . e- %
a lptf q . - . . % . e- %
a lptg v . - . . % . e- %
t ivy t . - . . % . e- %
ciprofloxacin a parc s i . - . . % . e- %
t gyra d n . - . . % . e- %
a gyra s l . - . . % . e- %
nalidixic acid a gyra s l . - . . % . e- %

table : ranking of mutations significantly correlating with quinolone resistance.
position allele gene effect levofloxacin rank/p-value norfloxacin rank/p-value ciprofloxacin rank/p-value nalidixic acid rank/p-value
a parc s i / . e- / . e- / . e- / . e-
t gyra d n / . e- / . e- / . e- / . e-
a gyra s l / . e- / . e- / . e- / . e-
t bdca g s / . e- / . e- / . e- / . e-
a vals r / . e- / . e- / . e- / . e-
a vals n / . e- / . e- / . e- / . e-
t vals e / . e- / . e- / . e- / . e-
a vals d / . e- / . e- / . e- / . e-
a vals v / . e- / . e- / . e- / . e-
t vals l / . e- / . e- / . e- / . e-
a lptf q / . e- / . e- / . e- / . e-
a lptg v / . e- / . e- / . e- / . e-
t ivy t / . e- / . e- / . e- / . e-
[figure: ld heatmap of the loci gyra d n, gyra s l, parc s i, bdca g s, vals d , lptf q , lptg v , vals e , vals n , vals r , vals l , vals v , and ivy t ]

figure : linkage disequilibrium. high values (red) indicate a dependence of the loci. as expected, the loci in gyra and parc are in linkage disequilibrium. importantly, they are not in ld with the remaining novel candidate loci. interestingly, there is some dependence within the novel loci; in particular, bdca is in ld with vals.

[figure panels: a) bdca, b) vals]

figure : d structures of bdca and vals. significant mutations (red) are at the surface and bdca g s is near the active site (green).

[figure panels: a) mds plot, b) hierarchical clustering; legend: phylogroups a, b , b , d, none]

figure : a) dimension reduction of isolates represented as high-dimensional vectors of all mutations. four clusters are found, which reflect the population structure in the gwas model and which broadly coincide with phylogroups a, b , b , and d. b) same as a) but hierarchical clustering. here, the presence of a mutation is shown by black and its absence by gray.
competing interests

the authors declare that they have no competing interests.

authors’ contributions

nm, tb, and ms conceived the idea; tb contributed data; nm, aa, and ms analysed data; nm and ms wrote the article.

acknowledgements

we would like to thank norhan mahfouz, eric achatz, and serena caucci for an initial analysis of the data and valuable input and magali de la cruz barron, uli klümper, amay ajaykumar agrawal, aldo acevedo, claudio duran, and mahmood nazari for feedback. funding of the acras-r project is kindly acknowledged.

author details

biotechnology center (biotec), technische universität dresden, tatzberg - , dresden, germany. institute of hydrobiology, technische universität dresden, germany.

references

. emmerson, a., jones, a.: the quinolones: decades of development and use. journal of antimicrobial chemotherapy (suppl ), – ( ) . bisacchi, g.s.: origins of the quinolone class of antibacterials: an expanded “discovery story” miniperspective. journal of medicinal chemistry ( ), – ( ) . crumplin, g., smith, j.: nalidixic acid and bacterial chromosome replication. nature ( ), – ( ) . aldred, k.j., kerns, r.j., osheroff, n.: mechanism of quinolone action and resistance. biochemistry ( ), – ( ) . conrad, s., saunders, j.r., oethinger, m., kaifel, k., klotz, g., marre, r., kern, w.: gyra mutations in high-level fluoroquinolone-resistant clinical isolates of escherichia coli. journal of antimicrobial chemotherapy ( ), – ( ) . hirschhorn, j.n., daly, m.j.: genome-wide association studies for common diseases and complex traits.
software

triku: a feature selection method based on nearest neighbors for single-cell data

alex m. ascensión , †, olga ibañez-solé , †, inaki inza , ander izeta and marcos j. araúzo-bravo , *

*correspondence: mararabra@yahoo.co.uk
biodonostia health research institute, computational biology and systems biomedicine group, paseo dr. begiristain, s/n, , donostia-san sebastian, spain
full list of author information is available at the end of the article
†equal contributor

abstract
feature selection is a relevant step in the analysis of single-cell rna sequencing datasets. triku is a feature selection method that favours genes defining the main cell populations. it does so by selecting genes expressed by groups of cells that are close in the nearest neighbor graph. triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on mutual information and silhouette coefficient measurements. additionally, gene sets selected by triku are more likely to be related to relevant gene ontology terms, and contain fewer ribosomal and mitochondrial genes. triku is available at https://gitlab.com/alexmascension/triku.
keywords: scrnaseq; feature selection; bioinformatics; python

background
single-cell rna sequencing (scrna-seq) is a powerful technology to study the biological heterogeneity of tissues at the individual cell level, allowing the characterization of new cell populations and cell states (i.e. cell types responding to different environmental stimuli) previously undetected due to their low frequency within the tissue and the lack of individual resolution of bulk methods [ , ].

scrna-seq datasets are multidimensional, i.e. the expression profile per cell consists of multiple genes. two common characteristics of multidimensional datasets are their high dimensionality and their sparsity, which are worsened in single-cell datasets due to the high proportion of zeros from low signal recovery [ ]. this sparsity affects downstream methods such as cell type detection or differential gene expression [ ].

a common task when working with multidimensional datasets is feature selection (fs). fs, alongside feature extraction (fe), responds to the need to obtain a reduced dataset with a smaller dimensionality [ ]. while fe methods like principal component analysis (pca) extract new features based on combinations of the original features, fs methods aim to select a subset of the features that best explains the original dataset. there are three main types of fs methods: filter, wrapper and embedded methods [ ]. current fs methods in scrna-seq analysis are filter methods because common downstream analysis steps do not embed the fs within the pipeline [ ].

fs methods represent a key step in processing pipelines of bioinformatic datasets and provide several advantages [ ]: they reduce model overfitting risk, improve clustering quality, and favour a deeper insight into the underlying processes that generated the data (features, i.e. genes, that contain random noise do not contribute to the biology of
the dataset and are removed). specifically, in scrna-seq, removing non-informative features can improve results in downstream analyses such as differential gene expression.

early methods for fs in scrna-seq data were based on the idea that genes whose expression shows a greater dispersion across the dataset are the ones that best capture the biological structure of the dataset. conversely, genes that are evenly expressed across cells are unlikely to define cell types or cell functions in a heterogeneous dataset. the most straightforward way of selecting genes that are not evenly expressed is to look at a measure of dispersion of the counts of each gene and to select those genes that have a dispersion over a threshold. however, the correlation between mean expression and dispersion introduces a bias whereby genes with higher expression are more likely to be selected by fs methods, whereas biological gene markers that define minor cell types are usually expressed in a medium to small subset of cells. therefore, new fs methods based on dispersion are designed to correct for this dispersion/expression correlation to select genes with a broader expression spectrum. brennecke et al. [ ] developed a fs method that introduces a correction over the dispersion that accounts for differences in the mean expression of genes. it does so by setting a threshold to the correlation between the average gene expression and its coefficient of variation across cells.
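as a rough illustration of dispersion-based selection, the following python sketch scores each gene either by its variance or by its squared coefficient of variation and keeps the genes above a threshold. the function names, the toy counts and the threshold are hypothetical, and this is not the curve-fitting procedure of brennecke et al.; it only shows how the choice of dispersion measure changes which genes pass.

```python
from statistics import mean, pvariance


def dispersion(counts, measure="variance"):
    """Return a dispersion score for one gene's counts across cells."""
    mu = mean(counts)
    var = pvariance(counts, mu)
    if measure == "variance":
        return var
    # Squared coefficient of variation: variance divided by squared mean.
    return var / (mu * mu) if mu > 0 else 0.0


def select_genes(expression, threshold, measure="variance"):
    """Keep genes whose dispersion score is above the threshold."""
    return [gene for gene, counts in expression.items()
            if dispersion(counts, measure) > threshold]


# Toy data (hypothetical): gene name -> counts in four cells.
expression = {
    "housekeeping": [10, 11, 9, 10],  # high, even expression
    "marker": [0, 0, 8, 9],           # expressed in a subset of cells
    "sparse": [1, 0, 1, 0],           # low expression
}

by_variance = select_genes(expression, 0.4, "variance")
by_cv2 = select_genes(expression, 0.4, "cv2")
```

in this toy example the evenly but highly expressed gene passes the variance threshold, while dividing by the squared mean removes it; this is the kind of mean/dispersion dependence that corrected methods try to address.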
newer fs methods have arisen after different corrections, like the one originally described by satija et al. [ ] implemented in seurat, later adapted to scanpy [ ], or the one implemented in scry [ ].

a new generation of fs methods emerged when svensson discovered that the proportion of zeros in droplet-based scrna-seq data, originally assumed to be dropouts, was tightly related to the mean expression of genes, following a negative binomial (nb) curve [ ]. genes with an expected lower percentage of zeros tend to have an even expression across the entire set of cells. conversely, genes with a higher than expected percentage of zeros might possess biological relevance because they are expressed in fewer cells than expected, and these cells might be associated with a specific cell type or state.

this finding opened the path for new fs methods that would rely on genes that showed a greater than expected proportion of zeros, according to their mean expression. these methods are based on a null distribution of some property of the dataset, and genes whose behavior differs from the expected one are selected. the fs method nbumi, a negative binomial method based on m drop [ ], works under this premise. nbumi fits the nb zero-count probability distribution to the dataset, and selects genes of interest by calculating p-values of observed dropout rates. m drop works similarly by fitting a michaelis-menten model instead of the nb from nbumi.

in summary, existing fs methods assume that an unexpected distribution of counts for a particular gene in a dataset is explained by cells belonging to different cell types.
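the zero-count idea can be illustrated with a minimal python sketch that compares the observed fraction of zeros of a gene with the fraction expected under a negative binomial with the same mean. the function names and the fixed size (dispersion) parameter are assumptions made for illustration; nbumi and m drop fit their models to the data and compute p-values, which is not reproduced here.

```python
def nb_zero_probability(mean_expr, size):
    """P(count == 0) for a negative binomial with the given mean and size."""
    return (size / (size + mean_expr)) ** size


def zero_excess(counts, size):
    """Observed fraction of zeros minus the fraction expected under the NB."""
    mean_expr = sum(counts) / len(counts)
    observed = sum(1 for c in counts if c == 0) / len(counts)
    return observed - nb_zero_probability(mean_expr, size)


SIZE = 1.0  # assumed size parameter, fixed here instead of being fitted

even_gene = [2, 1, 3, 2]    # same mean, no zeros
marker_gene = [0, 0, 0, 8]  # same mean, expressed in a single cell

even_excess = zero_excess(even_gene, SIZE)      # negative: fewer zeros than expected
marker_excess = zero_excess(marker_gene, SIZE)  # positive: more zeros than expected
```

both toy genes have the same mean, so only the gene concentrated in one cell shows more zeros than the null predicts and would be a candidate for selection under this premise.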
however, we observe that there are three main patterns of expression according to the distribution of zeros of a particular gene and overall transcriptional similarity (expression of all genes), as explained in detail in figure : a) a gene evenly expressed across cells, or a gene expressed by a subset of cells, which can be b ) transcriptionally separate or b ) transcriptionally similar. thus, in some cases a particular gene shows an unexpected distribution of counts because a subset of cells is expressing it, but those cells might not be transcriptionally similar.

here we present triku, a fs method that selects genes that show an unexpected distribution of zero counts and whose expression is localized in cells that are transcriptomically similar. figure summarizes the feature selection process. triku identifies genes that are locally overexpressed in groups of neighboring cells by inferring the distribution of counts in the vicinity of a cell and computing the expected distribution of counts. then, the wasserstein distance between the observed and the expected distributions is computed and genes are ranked according to that distance. higher distances imply that the gene is locally expressed in a subset of transcriptionally similar cells. finally, a subset of relevant features is selected using a cutoff value for the distance.

triku outperforms other feature selection methods on benchmarking and artificial datasets, using unbiased evaluation metrics such as normalized mutual information (nmi) or silhouette.
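the ranking step described above can be sketched in a few lines of python. this is a simplified illustration and not the triku implementation: the neighborhoods are given by hand, and the expected distribution is assumed to be the gene's overall count distribution convolved k times, i.e. what the summed counts of k cells would look like if expression were independent of the neighbor graph. all function names are hypothetical.

```python
from collections import Counter


def pmf(values):
    """Empirical probability mass function of non-negative integer values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def convolve(p, q):
    """Distribution of the sum of two independent draws from p and q."""
    out = {}
    for a, pa in p.items():
        for b, pb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


def wasserstein(p, q):
    """1-Wasserstein distance between two pmfs on the non-negative integers."""
    upper = max(max(p), max(q))
    cdf_p = cdf_q = distance = 0.0
    for x in range(upper + 1):
        cdf_p += p.get(x, 0.0)
        cdf_q += q.get(x, 0.0)
        distance += abs(cdf_p - cdf_q)  # unit-width step of |CDF difference|
    return distance


def locality_score(counts, neighborhoods):
    """Distance between observed neighborhood sums and the assumed null."""
    k = len(neighborhoods[0])  # assumes all neighborhoods have the same size
    observed = pmf([sum(counts[j] for j in nb) for nb in neighborhoods])
    single = pmf(counts)
    expected = single
    for _ in range(k - 1):
        expected = convolve(expected, single)
    return wasserstein(observed, expected)


# Eight cells; each neighborhood holds the cell and its closest neighbor.
neighborhoods = [[0, 1], [0, 1], [2, 3], [2, 3], [4, 5], [4, 5], [6, 7], [6, 7]]
local_gene = [2, 2, 0, 0, 0, 0, 0, 0]      # expressed by two neighboring cells
scattered_gene = [2, 0, 0, 0, 2, 0, 0, 0]  # same counts in unrelated cells

local_score = locality_score(local_gene, neighborhoods)
scattered_score = locality_score(scattered_gene, neighborhoods)
```

both toy genes have identical count distributions, so a purely zero-count-based criterion cannot tell them apart, whereas the neighborhood-aware score is larger for the gene expressed in neighboring cells.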
of note, features selected by triku are more biologically meaningful.

results
the objective of fs methods is to select the features that are the most relevant in order to understand and explain the structure of the dataset. in the context of single-cell data, this means finding the subset of genes that, when given as input to a clustering method, will yield a clustering solution where each cluster can be annotated as a putative cell type.

initially, we generated artificial datasets with the splatter package [ ], so that cells belonging to the same cluster have a similar gene expression. all datasets contained the same number of genes, cells and populations, but differed in the de.prob parameter value. this parameter was set so that higher values indicate a higher probability of genes being differentially expressed, resulting in more resolved populations. a range of de.prob values, from . to . , was used (see methods).

in addition, we tested triku on two biological benchmarking datasets by ding et al. [ ] and mereu et al. [ ] that have been expert-labeled using a semi-supervised procedure. both benchmarking datasets are composed of individual subsets of data with different library preparation methods ( x, smart-seq , etc.) in human peripheral blood mononuclear cells (pbmcs) (mereu and ding) and mouse colon (mereu) and cortex (ding) cells.

we have evaluated the relevance of the features selected by triku by comparing them to the ones selected using other feature selection methods. the relevance of the features was first measured using metrics associated with the efficacy of clustering, and then using metrics to evaluate the quality of the genes selected.
we made six types of comparisons between the subsets of genes selected by each feature selection method: 1) the ability to recover basic dataset structure (main cell types) in artificial and biological datasets, 2) the ability to obtain transcriptomically distinct cell clusters, 3) the overlap of features between different fs methods, 4) the localized pattern of expression of the features selected, 5) the ability to avoid the overrepresentation of mitochondrial and ribosomal genes and 6) the biological relevance of the genes by studying the composition and quality of the gene ontology (go) terms obtained.

triku efficiently recovers cell populations present in sc-rnaseq datasets
the first set of metrics evaluates the ability to recover the original cell types based on the nmi index, and the cluster separation and cohesion using the silhouette coefficient.

nmi
nmi measures the correspondence between a labelling considered as the ground truth and the clustering solution that we obtained using the genes selected by triku and other fs methods (scanpy, std, scry, brennecke, m drop, nbumi).

first, we evaluated how well the clustering using the genes selected by the fs methods was able to recover the same populations that were defined when generating the artificial datasets. figure shows that triku is among the best three feature selection methods for a wide range of de.prob values. for low values of de.prob –below .
–, where the selection of genes that lead to a correct recovery of cell populations is more challenging, triku notably outperforms the rest of the fs methods. nmi values obtained with triku are . to . higher than the second and third best fs methods. in addition, the results obtained when using the first selected genes were comparable to those obtained when selecting genes.

we also studied how well the genes selected led to a clustering solution that was similar to the manually-assigned cell labels in the biological benchmarking datasets, as shown in figure . for each dataset, the variability between nmi scores was quite low, meaning that features selected with the different methods yielded clustering solutions that were quite similar to the manually-labeled cell types, although there are some exceptions to this rule, e.g. brennecke in ding datasets, which showed notably reduced nmi values. in some datasets, for instance, x human, quartzseq human and smartseq human from mereu’s benchmarking set, features selected by fs methods did not lead to increased nmi values as compared with randomly selected genes.

despite the differences in nmi between methods being small for each particular dataset, post-hoc analysis revealed that triku is significantly the best ranked method across all datasets. to do the post-hoc analysis, we ranked for each dataset the nmi of each fs method. figure (left) shows the mean rank of each fs method across datasets. triku is the best-ranked fs method in both mereu’s and ding’s benchmarking datasets, with a mean rank of . and . , respectively. m drop is the second best-ranked fs method and triku is in both cases statistically significantly better (quade test, p < . ).

silhouette coefficient
another important aspect of the genes selected by fs methods in scrna-seq data analysis is their ability to cluster data into well-separated groups that are transcriptomically similar.
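for reference, a minimal python sketch of the standard silhouette coefficient is given below: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). the function name and the toy coordinates are hypothetical, euclidean distance is an assumption, and the sketch expects at least two clusters and non-identical points.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy one-dimensional "cells" (hypothetical coordinates).
cells = [(0.0,), (1.0,), (10.0,), (11.0,)]
well_separated = silhouette(cells, [0, 0, 1, 1])
mixed = silhouette(cells, [0, 1, 0, 1])
```

a labelling that matches the two groups of points gives a score close to 1, whereas a labelling that mixes them gives a negative score.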
we used the silhouette coefficient to measure the compactness and degree of separation of cell communities obtained with a clustering method. when the same clustering algorithm is used on a dataset but using two different fs methods, the differences in the resulting silhouette coefficients can be entirely attributed to the features selected by those methods. we assume that fs methods that increase the separation between clusters and the compactness within clusters are better at recovering the cell types present in the dataset.

figure shows the silhouette coefficients obtained with the different fs methods. for the mereu and ding datasets, we observed that triku was the best-ranked method (mean rank of . and . ), and the second best-ranked methods were m drop and scanpy with a mean rank of . and . , respectively. in both cases, the difference between triku and the second-ranked method was statistically significant (quade test, p < . ).

we performed an additional analysis using the labels obtained with leiden clustering instead of the manually curated cell types (figure s ). again, triku outperformed the rest of the fs methods showing a statistically significant best mean-rank.

genes selected by different fs methods show limited overlap
next, we studied the characteristics of the genes selected by triku and compared them to the genes selected by other methods. initially, we studied the level of consistency between the results obtained using different fs methods by studying their degree of overlap, as shown in figure .
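one simple way to quantify such an overlap, assuming each method provides a ranked gene list that is truncated to the length of a reference selection, is sketched below in python. the function name and gene identifiers are hypothetical, and the exact overlap measure used for the figure may differ.

```python
def overlap_percentage(reference_genes, ranked_genes):
    """Percentage of the reference selection found among the top-ranked genes.

    The ranked list is truncated to the size of the reference selection so
    that both gene lists have the same length.
    """
    cutoff = len(reference_genes)
    if cutoff == 0:
        raise ValueError("reference selection is empty")
    top = set(ranked_genes[:cutoff])
    return 100.0 * len(top & set(reference_genes)) / cutoff


# Hypothetical gene identifiers, for illustration only.
reference_selection = ["g1", "g2", "g3", "g4"]
other_ranking = ["g2", "g9", "g1", "g7", "g3"]  # best-ranked gene first

shared = overlap_percentage(reference_selection, other_ranking)
```

here only the first four genes of the ranking are considered, so g3, ranked fifth, does not count towards the overlap.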
in order to compare between equally sized gene lists, we ranked the genes based on p-values or scoring value from each fs method and set the number of genes selected by triku as a cutoff to select the first genes. although the genes selected by the different methods yielded clustering solutions that are highly consistent, as shown in the previous section, we did not see any clear gene overlap pattern between pairs of fs methods. in fact, there is no correlation between the degree of overlap between the genes selected by the different methods and the clustering solutions that are obtained when using those genes as input. for instance, we found an overlap of % between the genes selected by scanpy and std for the x mouse dataset, yet the nmi between the clustering solutions obtained with each of them and the expert-labeled cell types was . . on the other hand, the overlap between scanpy and brennecke is one of the highest across datasets (ranging from to %), yet the differences between their corresponding nmi scores are . .

triku selects genes that are biologically relevant
based on these results, we studied the biological relevance of the genes selected by different fs methods in three alternative ways. genes whose expression, or lack thereof, is limited to a single population are more likely to be cell-type specific and thus might be better candidates as positive or negative cell population markers. therefore, we studied which are the best fs methods to select genes showing a localized expression pattern. mitochondrial and ribosomal genes are usually highly expressed and many fs methods tend to overselect them, even though they are not particularly relevant in most single-cell studies and are commonly excluded from downstream analysis [ , , ].
assuming that in these benchmarking datasets ribosomal and mitochondrial genes are not as relevant to the biology of the dataset, we measured the percentage of these genes in the subset of genes selected by triku and compared it to other fs methods. lastly, we analyzed the biological pertinence of the selected genes by performing gene ontology enrichment analysis (goea) on a dataset of immune cell populations whose underlying biology is well understood, as a robust indicator of fs quality.

selection of locally-expressed genes
we first studied the expression pattern of genes selected by triku and other methods, as shown in figure s . we observed that out of the populations of the artificial dataset, when a gene is selected by triku (exclusively or together with other fs methods), one of the populations had a markedly higher or lower expression compared to the rest. on the other hand, when a gene is selected by other fs methods and not by triku, we do not observe any population-specific expression pattern. for instance, genes exclusively selected by scanpy had a wide expression variation across clusters, but they were not exclusive to one or two clusters. features selected by std and scry showed some variation, but it was overshadowed by the high expression of the gene, and therefore not relevant under the previous premise.

to evaluate the cluster expression of selected genes in benchmarking datasets, for each gene we scaled its expression to the - range, and sorted the clusters so that the first one had the greatest expression.
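the scaling and sorting step can be sketched as follows in python. the sketch assumes min-max scaling of per-cluster mean expression and summarizes the result as the cumulative share of expression held by the top clusters; both choices, as well as the function names and toy values, are assumptions made for illustration.

```python
def sorted_scaled_expression(cluster_means):
    """Min-max scale per-cluster expression and sort it in decreasing order."""
    lo, hi = min(cluster_means), max(cluster_means)
    if hi == lo:
        return [0.0] * len(cluster_means)  # flat gene: nothing to rank
    scaled = [(x - lo) / (hi - lo) for x in cluster_means]
    return sorted(scaled, reverse=True)


def cumulative_share(sorted_values):
    """Fraction of the total scaled expression held by the first n clusters."""
    total = sum(sorted_values)
    if total == 0:
        return [0.0] * len(sorted_values)
    shares, running = [], 0.0
    for value in sorted_values:
        running += value
        shares.append(running / total)
    return shares


# Hypothetical mean expression of one gene in four clusters.
cluster_means = [0.0, 10.0, 2.0, 0.0]
profile = sorted_scaled_expression(cluster_means)
shares = cumulative_share(profile)
```

a gene whose cumulative share approaches 1 after the first one or two clusters has a localized expression pattern, whereas a nearly linear increase indicates expression spread across clusters.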
figure s shows the expression patterns for several benchmarking datasets. we see that, in most datasets, triku showed more biased expression patterns, that is, genes selected by triku were expressed, on average, in fewer clusters than the genes selected by other fs methods. the second and third best methods were scanpy and brennecke, with similar or slightly less biased expression patterns as compared to triku. with these methods, up to % of the expression of the gene was usually restricted to the to clusters that most express it. m drop and nbumi performed similarly, and showed an expression distribution across clusters similar to a random selection of genes, which was slightly biased towards to clusters accumulating up to % of the expression of the gene. lastly, std and scry methods were the least biased, and showed almost a linear decrease of expression percentage across clusters, with to clusters accumulating up to % of the expression of the gene.

avoidance of mitochondrial and ribosomal genes
table shows the percentage of genes that code for ribosomal and mitochondrial proteins within the genes selected by different fs methods in the two sets of benchmarking datasets. we observed that std and scry were the only methods that tended to overselect mitochondrial and ribosomal genes. among the remaining methods, triku showed percentages that were comparable to the others, and slightly lower for the ding datasets.

selection of genes based on gene ontologies
we assessed the quality of the go output by studying its term composition. we selected two pbmc datasets from the ding datasets: the x human and the dropseq human. we used pbmc datasets for this analysis because their cell-to-cell variability has been extensively studied using single-cell technologies such as fluorescence activated cell sorting (facs) and scrna-seq [ , , , , ].
using these datasets, we measured the proportion of go terms obtained in the output that were tightly related to the biological system under study.

figures and s show the first go terms obtained with the genes selected by each fs method on the two pbmc datasets, where the terms tightly related to immune processes (chosen by three independent assessors) have been highlighted. we observed that triku was the fs method that yielded the most terms directly related to immune processes, with / and / related terms in the ding dropseq and x datasets, respectively. examples of terms that we considered to be tightly related to immune processes included b cell receptor signalling pathway, neutrophil degranulation and t cell proliferation. the next methods were scanpy and m drop, whose performances were comparable to that of triku for the x dataset ( / ) but less robust for the dropseq dataset ( / and / related terms). the rest of the fs methods mainly selected genes that were related to general cell functions such as rna processing, protein processing and cell-cycle regulation.

discussion
fs methods are a key step in any scrna-seq analysis pipeline as they help us obtain a dimensionally reduced version of the dataset that captures the most relevant information and eases the interpretation and understanding of its underlying biology. however, every fs method relies on a set of assumptions regarding what characteristics make a gene relevant.
fs methods that sort genes according to their dispersion assume that gene expression variability is indicative of its biological relevance. fs methods like nbumi and m drop assume that genes showing a proportion of zero-counts that is greater than expected (according to a null distribution) are more likely to be informative. triku assumes that genes that have a localized expression in a subset of cells that share an overall transcriptomic similarity are more likely to define cell types. a general trend in fs method design has been to refine the requirements that a gene must meet in order for it to be selected, from the more general dispersion-based to more sophisticated formulations. it is noteworthy that the requirements in triku are consistent with the previous dispersion-based and zero-count-based formulations, but involve a new aspect that we consider essential for an accurate gene selection: a localized expression in neighboring cells. another important advantage of triku over fs methods that consider the zero-count distribution is that, unlike m drop and nbumi, triku does not assume gene counts to follow any particular distribution, since it estimates the null distribution from the dataset, thus extending the range of single-cell technologies that it can use beyond droplet-based technologies.

we verified the locality of the genes selected by triku in different artificial and real scrna-seq datasets and concluded that, on average, the expression of triku-selected genes is restricted to fewer, well-defined clusters. in addition, the clusters obtained when using triku-selected genes as input for unsupervised clustering in both artificially generated and biological datasets have a better resolved pattern structure, as shown by their increased silhouette coefficients. in the case of artificial datasets, where the degree of mixture between clusters can be predefined, triku proved to be able to recover the originally-defined cell populations.
in fact, we found that the higher the degree of mixture between clusters, the more obvious the advantage of triku over the rest of the fs methods tested.

an important difficulty in the interpretation of single-cell data is that we must consider that cell-to-cell variability has both technical and biological components. that is, it is difficult to know whether a set of genes is differentially expressed between cell clusters due to technical reasons (differences in the efficiency of mrna capture, amplification and sequencing) or if it constitutes a biological signal. moreover, there is a wide range of sources of biological variability within a dataset, some of which might not be of interest depending on the experimental context. for instance, fluctuations in genes that regulate the cell cycle constitute a source of biological variability that is often disregarded. this has been extensively studied and addressed in a number of ways: normalization, regression of unwanted sources of variation, etc. [ , , , ]. genes whose variability is associated with technical reasons tend to have a high dispersion, but their expression is usually not restricted to a few clusters. a good example of these genes are the ribosomal and mitochondrial genes, which are expressed across all cell types at different levels.
our results show that these genes are in fact selected by the majority of compared fs methods due to their high expression and cell-to-cell variability, but are less likely to be selected by triku, since they do not usually meet the locality requirement. additionally, when performing goea, we observed that the list of genes obtained with triku was more enriched for terms that are specifically related to a biological process of the system under study.

in our work, we have observed that the genes selected by different fs methods might show little overlap between them. this phenomenon has been described elsewhere [ ]. in fact, gene covariation and redundancy is a well characterized effect that has been observed in omics studies. the effect of redundancy arises from the fact that different cell types must have a common large set of pathways to be active. the difference between cell type and cell state is that two cell types might have large sets of pathways that are different between each other, and two cell states will only differ in a few pathways. since pathways are composed of many genes, only choosing a reduced set of genes from a set of pathways from cell type a and b might be enough to differentiate them, and we might not need to select all genes from all pathways.

this “paradigm” explains several effects. qiu et al. described that scrna-seq datasets could preserve basic structure after gene expression binarization [ ] or by conducting very shallow sequencing experiments [ ]. this can be explained by the fact that only a few genes are necessary to describe the main cell populations in a single-cell dataset, and the presence/absence of a certain marker is often more informative than its expression level. this is related to the notion that despite the high dimensionality of omics studies, most biological systems can be explained in a reduced number of dimensions.
moreover, some authors have claimed this low dimensionality to be a natural and fundamental property of gene expression data [ ]. this highlights the importance of designing accurate fs methods that extract the fundamental information from single-cell datasets.
the triku python package is available at https://gitlab.com/alexmascension/triku and can be downloaded using pypi. triku has been designed to be compatible with scanpy syntax, so that scanpy users can easily include triku into their pipelines.
methods
the triku workflow is further described in supplementary methods.
artificial and benchmarking datasets
in order to perform the evaluation of the fs methods we used a set of artificial and biological benchmarking datasets. artificial datasets were constructed using the splatter r package (v . . ). each dataset contains , cells and , genes, and consists of populations with abundances in the dataset of { %, %, %, %, %, %, . %, %, . %} of the cells. each dataset contains a parameter, de.prob, that controls the probability that a gene is differentially expressed. lower de.prob values (< . ) imply that different populations have fewer differentially expressed genes between them and, therefore, are more difficult to differentiate. selected values of de.prob are { . , . , . , . , . , . , . , . }. populations in datasets with de.prob values above . are completely separated in the low-dimensionality representation with umap, even without feature selection (figure s ).
regarding biological datasets, two benchmarking datasets have been recently published by mereu et al. [ ] and ding et al. [ ]. the aim of these two works is to analyze the diversity of library preparation methods, e.g. x, smart-seq , cel-seq , single nucleus or indrop. mereu et al. use mouse colon cells and human pbmcs to perform the benchmarking, whereas ding et al. use mouse cortex and human pbmcs. there are a total of datasets in mereu et al. and in ding et al. an additional characteristic of these datasets is that they have been manually annotated, and this annotation is useful as a semi ground truth. ding dataset files were downloaded from single cell portal (accession numbers scp and scp ), and cell type metadata is located within the downloaded files. mereu datasets were downloaded from geo database (accession gse ), and cell type metadata was obtained upon personal request.
fs methods
triku is compared to the following fs methods:
• standard deviation (std). computed directly using numpy (v . . ).
• brennecke [ ]: fits a curve based on the square of the coefficient of variation (cv ) versus the mean expression (µ) of each gene and selects the features with higher cv and µ. the features are selected with the brenneckegetvariablegenes function from m drop r package (v . . ).
• scry [ ]: computes a deviance statistic for counts based on a multinomial model that assumes each feature has a constant rate. the features are selected with the deviancefeatureselection function from scry r package (v . . ).
• scanpy [ ]: selects features based on a z-scored deviation, adapted from seurat’s method. the features are selected with the sc.pp.highly_variable_genes function from scanpy (v . . ).
• m drop [ ]: fits a michaelis-menten equation to the percentage of zeros versus µ, and selects features with higher percentages of zeros than expected. the features are selected with the m dropfeatureselection function from m drop r package.
• nbumi: it acts in the same manner as m drop, but fitting a negative binomial equation instead of a michaelis-menten equation. the features are selected with the nbumifeatureselectioncombineddrop function.
fs and dataset preprocessing
to make the comparison between fs methods, each feature is ranked based on the score provided by each fs method. calculating the ranking instead of just selecting the features allows us to select different numbers of features when needed. by default, the number of features is the one automatically selected by triku. additionally, in some contexts, analyses are performed with all features or with a random selection of features.
after the ranking of genes is computed, dataset processing is performed equally for all methods, in artificial and benchmarking datasets. datasets are first log transformed (if required by the method), and pca with components is calculated. then, the k-nearest neighbors (knn) matrix is computed setting k as √ncells. uniform manifold approximation and projection (umap) (v . . ) is then applied to reduce the dimensionality for plotting. if community detection is required, leiden (v . . ) is applied selecting the resolution that matches the number of cell types manually annotated in the dataset. this procedure is repeated with different seeds. the seed affects the output of triku, random fs, pca projection, neighbor graph, leiden community detection, and umap.
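as a rough illustration of the preprocessing steps above, the following python sketch ranks features by an arbitrary score, keeps the top-ranked ones, applies an optional log transform, computes a pca and builds a knn index with k = √ncells. it is a minimal sketch under stated assumptions, not the pipeline used for the benchmarks: the function names, the `n_pcs` default and the use of scikit-learn are hypothetical choices made for the example, and the umap and leiden steps are omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors


def select_top_features(scores, n_features):
    """Return the indices of the n_features highest-scoring genes, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # full ranking, reusable for any cutoff
    return order[:n_features]


def preprocess(counts, scores, n_features, n_pcs=10, log_transform=True, seed=0):
    """Subset to the selected genes, run PCA and build a kNN index.

    counts: (n_cells, n_genes) matrix; scores: one FS score per gene.
    n_pcs is a hypothetical default, not a value taken from the text.
    """
    counts = np.asarray(counts, dtype=float)
    selected = select_top_features(scores, n_features)
    x = counts[:, selected]
    if log_transform:  # only if the FS method requires it
        x = np.log1p(x)

    n_cells = x.shape[0]
    n_pcs = min(n_pcs, n_cells - 1, x.shape[1])
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(x)

    k = max(1, int(round(np.sqrt(n_cells))))  # k = sqrt(ncells)
    nn = NearestNeighbors(n_neighbors=k).fit(pcs)
    # Calling kneighbors() without arguments excludes each cell from its own neighbors.
    _, knn_idx = nn.kneighbors()
    return selected, pcs, knn_idx
```

ranking all features once and slicing the ranking afterwards is what allows the same scores to be reused for different numbers of selected features.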
nmi calculation in artificial and benchmarking datasets
in order to compare the leiden community detection results with the ground-truth labels from artificial and biological datasets, we used the normalized mutual information (nmi) score [ ]. if $T$ and $L$ are the labels of the cell types (true populations) and leiden communities respectively, the nmi between $T$ and $L$ is:
$$\mathrm{NMI}(T, L) = \frac{2\, I(T; L)}{H(T) + H(L)}$$
where $H(X)$ is the entropy of the labels $X$, and $I(T; L)$ is the mutual information between the two sets of labels. this value is further described in [ ]. we used scikit-learn (v . . ) implementation of nmi, sklearn.metrics.adjusted_mutual_info_score.
one of the advantages of nmi over other mutual information measures is that it performs better with label sets with class imbalance, which are common in single-cell datasets, where there are differences in the abundance of cell types.
on artificial datasets, leiden was applied using the first and selected features, and the resulting community labels were compared with the population labels from the dataset. on benchmarking datasets, the leiden community labels were compared with the manually curated cell types.
silhouette coefficient in benchmarking datasets
in order to assess the clustering performance of the communities obtained with benchmarking datasets we used the silhouette coefficient. the silhouette coefficient compares the distances of the cells within each cluster (intra-cluster) and between clusters (inter-cluster) within a measurable space. the distance between two cells is the cosine distance between their gene expression vectors, considering only the genes selected by each fs method. the greater the distance between cells that belong to different clusters and the smaller the distance between cells from the same cluster, the greater the silhouette score.
in order to calculate the silhouette coefficient for a cell $c$ within cluster $C_i$ (out of $N$ clusters), the mean distance between the cell and the rest of the cells within the cluster is computed using the gene expression:
$$a(c) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq c} d(c, j)$$
then, the minimum mean distance between that cell and the cells from each of the other clusters is computed:
$$b(c) = \min_{C_k \neq C_i} \left\{ \frac{1}{|C_k|} \sum_{j \in C_k} d(c, j) \right\}, \quad k \in 1, \cdots, N$$
then the silhouette coefficient is computed as
$$s(c) = \frac{b(c) - a(c)}{\max\{a(c),\, b(c)\}}$$
higher silhouette scores imply a better separation between clusters and, therefore, a better performance of the fs method. we used scikit-learn implementation of silhouette, sklearn.metrics.silhouette_score.
overlap between gene lists
in order to calculate the overlap between selected features for each fs method, we applied the jaccard index [ ]:
$$\mathrm{Jaccard}(I, J) = \frac{|I \cap J|}{|I \cup J|}$$
where $I$, $J$ are the sets of genes selected by the two fs methods.
performance of gene selection and locality measures
in order to assess the performance of different fs methods selecting genes that are relevant for the dataset, we applied two different strategies for artificial and biological datasets.
for artificial datasets, we selected representative genes of each of the combinations of genes shown in figure s . then we calculated the mean expression of each of these genes in each population, and we represent this information in the barplots.
for benchmarking datasets, in order to produce figure s , for each dataset and fs method we used the following procedure: for each gene, the expression was scaled to sum across all cells. then, leiden clustering was run with resolution parameter value . .
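the per-cluster proportions used in this locality measure can be sketched as follows. this is a minimal numpy sketch, assuming that each gene is scaled so that its expression sums to one across cells and that cluster labels (for instance from leiden) are already available; the function names are hypothetical and are not part of triku.

```python
import numpy as np


def cluster_expression_proportions(expr, labels):
    """Share of each gene's total expression found in each cluster, sorted descending.

    expr: (n_cells, n_genes) non-negative matrix; labels: (n_cells,) cluster ids.
    Returns an (n_expressed_genes, n_clusters) array whose first column is the
    cluster that concentrates the largest share of that gene's expression.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)

    totals = expr.sum(axis=0)
    keep = totals > 0  # genes without any expression cannot be scaled
    scaled = expr[:, keep] / totals[keep]  # assumption: each gene sums to one

    clusters = np.unique(labels)
    props = np.stack(
        [scaled[labels == c].sum(axis=0) for c in clusters], axis=1
    )
    return -np.sort(-props, axis=1)  # order clusters by decreasing share, per gene


def mean_proportion_profile(expr, labels):
    """Average the ordered proportions over all (selected) genes."""
    return cluster_expression_proportions(expr, labels).mean(axis=0)
```

a gene expressed in a single cluster yields a profile concentrated in the first position, whereas an evenly expressed gene yields a flat profile, so the averaged profile summarizes how locally expressed a set of selected genes is.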
for each cluster, the proportion of the expression was calculated, and the clusters were ordered so that the first cluster is the one that concentrates the majority of the expression. to create figure s , the average value of the proportion of expression is calculated.
proportion of ribosomal and mitochondrial genes
when calculating the proportion of mitochondrial and ribosomal genes, the list of existing ribosomal and mitochondrial proteins was obtained by extracting the genes starting with rps, rpl or mt-. the proportion of mitochondrial or ribosomal genes is the quotient between the number of genes of the previous list that appear selected by that fs method, and the number of genes in the list.
go enrichment analysis
in order to calculate the sets of gene ontologies enriched for the selected features of each fs method, we used the gseapy.enrichr function of the python gseapy module (v . . ) with the list of the first selected features against the go_biological_process_ ontology. from the list of enriched ontologies, the with the smallest adjusted p-value were selected.
ranking and cd
during calculation of nmi and silhouette coefficients, to evaluate the overall performance of the fs methods across different datasets, the fs methods are ranked (where is the best rank). the methodology proposed by demšar [ ] is used to test for significant differences among fs methods in the datasets: the friedman rank test is applied to test whether the mean rank values for all fs methods are similar (null hypothesis). if the friedman rank test rejects the null hypothesis (α < . ), this implies a statistically significant difference among at least two fs methods. if the null hypothesis is rejected we apply the quade post-hoc test between all pairs of fs methods to check which pairs of fs methods are significantly different (α < . ). these results are then plotted in a critical difference diagram.
abbreviations
single-cell rna sequencing: scrna-seq; feature selection: fs; feature extraction: fe; principal component analysis: pca; negative binomial: nb; normalized mutual information: nmi; fluorescence activated cell sorting: facs; gene ontology: go; gene ontology enrichment analysis: goea; peripheral blood mononuclear cells: pbmc; uniform manifold approximation and projection: umap; k-nearest neighbors: knn.
declarations
ethics approval and consent to participate
not applicable.
consent for publication
not applicable.
availability of data and software
ding dataset files were downloaded from single cell portal (accession numbers scp and scp ), and cell type metadata is located within the downloaded files. mereu datasets were downloaded from geo database (accession gse ), and cell type metadata was obtained upon personal request. triku software and analysis notebooks are available at https://www.gitlab.com/alexmascension/triku.
competing interests
the authors declare that they have no competing interests.
funding
this work was supported by grants from instituto de salud carlos iii (ac / and pi / ), cofunded by the european union (european regional development fund/ european science foundation, investing in your future) and the d-healing project (era-net program eracosysmed, jtc- ); diputación foral de gipuzkoa, and the department of economic development and infrastructures of the basque government (kk- / , kk- / ). ama was supported by a basque government postgraduate diploma fellowship (pre_ _ _ ), and ois was supported by a postgraduate diploma fellowship from la caixa foundation (identification document ; code lcf/bq/in / ).
author’s contributions
conceptualization: ama; funding acquisition: mja-b, ama, oi-s; investigation: ama, oi-s, mja-b, ai; methodology: ama, oi-s, ii; project administration: ai, mja-b; resources: mja-b; software: ama, oi-s; supervision: ii, ai, mja-b; visualization: ama, oi-s; writing - original draft preparation: ama, oi-s; writing - review and editing: ama, oi-s, ii, mja-b, ai.
acknowledgements
we would like to thank amaia elícegui, ainhoa irastorza and paula vázquez for the assessment of the immune gene ontology terms.
author details
biodonostia health research institute, computational biology and systems biomedicine group, paseo dr. begiristain, s/n, , donostia-san sebastian, spain.
tissue engineering group, biodonostia health research institute, paseo dr. begiristain, s/n, , donostia-san sebastian, spain.
intelligent systems group, computer science faculty, university of the basque country, donostia-san sebastian, spain.
max planck institute for molecular biomedicine, roentgenstr. , , muenster, germany.
references
trapnell, c.: defining cell types and states with single-cell genomics. genome research ( ), – ( ). doi: . /gr. .
maclean, a.l., hong, t., nie, q.: exploring intermediate cell states through the lens of single cells. current opinion in systems biology , – ( ). doi: . /j.coisb. . .
bzdok, d., altman, n., krzywinski, m.: statistics versus machine learning. nature methods ( ), – ( ). doi: . /nmeth.
heimberg, g., bhatnagar, r., el-samad, h., thomson, m.: low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. cell systems ( ) ( ). doi: . /j.cels. . .
saeys, y., inza, i., larrañaga, p.: a review of feature selection techniques in bioinformatics. bioinformatics ( ), – ( ). doi: . /bioinformatics/btm
luecken, m.d., theis, f.j.: current best practices in single-cell rna-seq analysis: a tutorial. molecular systems biology ( ) ( ). doi: . /msb.
brennecke, p., anders, s., kim, j.k., kołodziejczyk, a.a., zhang, x., proserpio, v., baying, b., benes, v., teichmann, s.a., marioni, j.c., et al.: accounting for technical noise in single-cell rna-seq experiments. nature methods ( ), – ( ). doi: . /nmeth.
stuart, t., butler, a., hoffman, p., hafemeister, c., papalexi, e., mauck, w.m., hao, y., stoeckius, m., smibert, p., satija, r., et al.: comprehensive integration of single-cell data. cell ( ) ( ). doi: . /j.cell. . .
wolf, f.a., angerer, p., theis, f.j.: scanpy: large-scale single-cell gene expression data analysis. genome biology ( ) ( ). doi: . /s - - -
townes, f.w., hicks, s.c., aryee, m.j., irizarry, r.a.: feature selection and dimension reduction for single-cell rna-seq based on a multinomial model. genome biology ( ) ( ). doi: . /s - - -
svensson, v.: droplet scrna-seq is not zero-inflated. nature biotechnology ( ), – ( ). doi: . /s - - -
andrews, t.s., hemberg, m.: m drop: dropout-based feature selection for scrnaseq. bioinformatics ( ), – ( ). doi: . /bioinformatics/bty
zappia, l., phipson, b., oshlack, a.: splatter: simulation of single-cell rna sequencing data. genome biology ( ) ( ). doi: . /s - - -
ding, j., adiconis, x., simmons, s.k., kowalczyk, m.s., hession, c.c., marjanovic, n.d., hughes, t.k., wadsworth, m.h., burks, t., nguyen, l.t., et al.: systematic comparison of single-cell and single-nucleus rna-sequencing methods. nature biotechnology ( ). doi: . /s - - -
mereu, e., lafzi, a., moutinho, c., ziegenhain, c., mccarthy, d.j., Álvarez-varela, a., batlle, e., sagar, grün, d., lau, j.k., et al.: benchmarking single-cell rna-sequencing protocols for cell atlas projects. nature biotechnology ( ). doi: . /s - - -
freytag, s., tian, l., lönnstedt, i., ng, m., bahlo, m.: comparison of clustering tools in r for medium-sized x genomics single-cell rna-sequencing data. f research ( ) ( ). doi: . /f research. .
lun, a.t.l., mccarthy, d.j., marioni, j.c.: a step-by-step workflow for low-level analysis of single-cell rna-seq data with bioconductor. f research ( ) ( ). doi: . /f research. .
senabouth, a., lukowski, s.w., hernandez, j.a., andersen, s.b., mei, x., nguyen, q.h., powell, j.e.: ascend: r package for analysis of single-cell rna-seq data. gigascience ( ) ( ). doi: . /gigascience/giz
chen, j., cheung, f., shi, r., zhou, h., lu, w.: pbmc fixation and processing for chromium single-cell rna sequencing. journal of translational medicine ( ) ( ). doi: . /s - - -
massoni-badosa, r., iacono, g., moutinho, c., kulis, m., palau, n., marchese, d., rodríguez-ubreva, j., ballestar, e., rodriguez-esteban, g., marsal, s., et al.: sampling time-dependent artifacts in single-cell genomics studies. genome biology ( ) ( ). doi: . /s - - -
villani, a.-c., satija, r., reynolds, g., sarkizova, s., shekhar, k., fletcher, j., griesbeck, m., butler, a., zheng, s., lazo, s., et al.: single-cell rna-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. science ( ) ( ). doi: . /science.aah
zheng, g.x.y., terry, j.m., belgrader, p., ryvkin, p., bent, z.w., wilson, r., ziraldo, s.b., wheeler, t.d., mcdermott, g.p., zhu, j., et al.: massively parallel digital transcriptional profiling of single cells. nature communications ( ) ( ). doi: . /ncomms
zhu, l., yang, p., zhao, y., zhuang, z., wang, z., song, r., zhang, j., liu, c., gao, q., xu, q., et al.: single-cell sequencing of peripheral mononuclear cells reveals distinct immune response landscapes of covid- and influenza patients. immunity ( ) ( ). doi: . /j.immuni. . .
hafemeister, c., satija, r.: normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. genome biology ( ) ( ). doi: . /s - - -
lytal, n., ran, d., an, l.: normalization methods on single-cell rna-seq data: an empirical survey. frontiers in genetics ( ). doi: . /fgene. .
nestorowa, s., hamey, f.k., sala, b.p., diamanti, e., shepherd, m., laurenti, e., wilson, n.k., kent, d.g., göttgens, b.: a single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. blood ( ) ( ). doi: . /blood- - -
tran, h.t.n., ang, k.s., chevrier, m., zhang, x., lee, n.y.s., goh, m., chen, j.: a benchmark of batch-effect correction methods for single-cell rna sequencing data. genome biology ( ) ( ). doi: . /s - - -
yip, s.h., sham, p.c., wang, j.: evaluation of tools for highly variable gene discovery from single-cell rna-seq data. briefings in bioinformatics ( ), – ( ). doi: . /bib/bby
qiu, p.: embracing the dropouts in single-cell rna-seq analysis. nature communications ( ). doi: . /s - - -
kvalseth, t.o.: on normalized mutual information: measure derivations and properties. entropy ( ) ( ). doi: . /e
liu, x., cheng, h.-m., zhang, z.-y.: evaluation of community detection methods ( ).
jaccard, p.: the distribution of the flora in the alpine zone. the new phytologist ( ) ( ). doi: . /j. - . .tb .x
demšar, j.: statistical comparisons of classifiers over multiple data sets. journal of machine learning research , – ( )
tables
table percentage of ribosomal protein (rbp) and mitochondrial (mt) genes appearing within the selected genes by each fs method.
method       mereu % rbp   mereu % mt   ding % rbp   ding % mt
triku        .             .            .            .
m drop       .             .            .            .
nbumi        .             .            .            .
scanpy       .             .            .            .
std          .             .            .            .
scry         .             .            .            .
brennecke    .             .            .            .
figures
[figure axes: expression; log mean expression; percentage of zeros]
figure distribution of gene expression in three scenarios. there are three main patterns of expression for any particular gene in a single-cell dataset: a) the gene is expressed evenly across cells in the dataset, which probably means it does not define any particular cell type. b) a gene shows an unexpected distribution of zeros, because it is only expressed by a subset of cells. within case b, there are two possible patterns. b ) the gene is highly expressed by a subset of transcriptionally different cells (i.e. cells that are not colocalized in the dimensionally reduced map) and b ) the gene is highly expressed by cells that share an overall transcriptomic profile. triku preferentially selects the genes shown in the b pattern. when looking at the proportion of zeros, genes in cases b and b show an increased proportion of zeros with respect to a, but they are indistinguishable from each other by that metric.
figure graphical abstract of triku workflow. a) dr representation of the gene expression from the count matrix from a dataset, where each dot represents a cell. b) knn graph representation with neighbors. for each cell the k transcriptomically most similar cells are selected ( in this example). c ) considering the graph in b), for each cell with positive expression, the expression of its k neighbors is summed to yield the knn distribution in blue. c ) with the distribution of reads (blue line), the null distribution is estimated by sampling k random cells.
d) the null and knn distributions of each gene are compared using the wasserstein distance. e) for each gene, its distance is plotted against the log mean expression, and divided into w windows ( in this example). for each window, the median of the distances is calculated and subtracted from the distances in that window. f) all corrected distances are ranked and the cutoff point is selected.
figure comparison of nmi for fs methods on artificial datasets. barplots of the nmi for all fs methods with different artificial datasets, using the top (top) and (bottom) features of each fs method. the probability of the selected genes being differentially expressed between clusters (de.prob) is shown in the x axis. higher nmi values mean better recovery of the cell populations. note that in category all, all features are selected, not the top or , therefore their nmi values are the same in both graphs.
figure nmi for annotated cell types in mereu and ding datasets. barplots of nmi for mereu (top) and ding (bottom) datasets. each barplot represents the mean over runs, and the vertical bar is the standard deviation. the plot on the left is a critical difference diagram, where each horizontal bar represents the mean rank for all datasets. if two or more bars are linked by a vertical bar, the mean ranks for those fs methods are not significantly different (quade test, α = . ).
figure silhouette coefficients for annotated cell types in mereu and ding datasets. barplots of silhouette coefficient for mereu (top) and ding (bottom) datasets. each barplot represents the mean of seeds, and the vertical bar is the standard deviation. the plot on the left is a critical difference diagram, where each horizontal bar represents the mean rank for all datasets and all seeds. if two or more bars are linked by a vertical bar, the mean ranks for those fs methods are not significantly different (quade test, α = . ).
figure heatmaps of overlap of features between pairs of methods. for each pair of methods, the value represents the proportion of features that are shared between the two methods. the number of genes selected in each method is the automatic cutoff by triku.
[figure panels: triku, scanpy, std, scry, brennecke, m drop, nbumi]
figure barplot of p-values of goea. each bin represents the number of features selected for each method, in mereu et al. mouse dropseq dataset. the y value is the -log adjusted p-value for the best ontologies. on the bottom, the bar plot shows the names of the ontology terms for the case with the best features. in immune datasets, gray dots at the left of each term indicate that the term is directly related to an immune process.
non-dotted terms refer to more general processes that may or may not be related to immune processes. .cc-by . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / abstract background results triku efficiently recovers cell populations present in sc-rnaseq datasets nmi silhouette coefficient genes selected by different fs methods show limited overlap triku selects genes that are biologically relevant selection of locally-expressed genes avoidance of mitochondrial and ribosomal genes selection of genes based on gene ontologies discussion methods artificial and benchmarking datasets fs methods fs and dataset preprocessing nmi calculation in artificial and benchmarking datasets silhouette coefficient in benchmarking datasets overlap between gene lists performance of gene selection and locality measures proportion of ribosomal and mitochondrial genes go enrichment analysis ranking and cd abbreviations the landscape of precision cancer combination therapy: a single-cell perspective the landscape of precision cancer combination therapy: a single-cell perspective saba ahmadi , ^, pattara sukprasert , ^, rahulsimham vegesna , sanju sinha , fiorella schischlik , natalie artzi , , , samir khuller , , alejandro a. schäffer *, eytan ruppin * dept. of computer science, university of maryland, college park md usa dept. of computer science, northwestern university, evanston il usa cancer data science laboratory, national cancer institute, bethesda, md usa dept. 
of medicine, engineering in medicine division, brigham and women’s hospital, harvard medical school, boston, ma usa broad institute of harvard and mit, cambridge, ma usa institute for medical engineering and science, mit, cambridge, ma usa part of this research was done while at dept. of computer science, northwestern university, evanston il usa part of this research was done while at dept. of computer science, university of maryland, college park md usa
^ equally contributing first authors
* equally contributing corresponding authors
correspondence should be addressed to alejandro.schaffer@nih.gov and eytan.ruppin@nih.gov.
physical address: cancer data science laboratory, national cancer institute, bldg. -c , bethesda, md usa
keywords: targeted cancer therapy, combination therapy, personalized medicine, combinatorial optimization, hitting set, single-cell transcriptomics
and is also made available for use under a cc license. (which was not certified by peer review) is the author/funder. this article is a us government work. it is not subject to copyright under usc the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . .
abbreviations:
cts: cohort target set, synonym of global hitting set
geo: gene expression omnibus
ghs: global hitting set, synonym of cohort target set
gtex: genotype-tissue expression (project or consortium)
hpa: human protein atlas
hugo: human genome organization
ihs: individual hitting set, synonym of individual target set
ilp: integer linear programming
its: individual target set
lb: lower bound on fraction of tumor cells killed
rme: receptor-mediated endocytosis
tpm: transcripts per million
ub: upper bound on fraction of non-tumor cells killed
abstract
the availability of single-cell transcriptomics data opens new opportunities for rational design of combination cancer treatments in a systematic manner. mining such data, we employed combinatorial optimization techniques to explore the landscape of optimal combination therapies in solid tumors, including brain, head and neck, melanoma, lung, breast and colon cancers. we assume that each individual therapy can target any one of genes encoding cell surface receptors, which may be targets of car-t, conjugated antibodies or coated nanoparticle therapies. in most cancer types, personalized combinations composed of at most four targets are sufficient to kill at least % of the tumor cells while killing at most % of the non-tumor cells in each patient. the number of distinct targets needed to do that for all patients in of the cohorts we studied is at most , while one larger melanoma cohort requires over distinct targets. further requiring that the target genes be lowly expressed across many different healthy tissues uncovers qualitatively similar trends. however, as one requires either more stringent killing thresholds or more stringent sparing of non-cancerous tissues beyond these baseline values, the number of targets needed rises rapidly. emerging promising targets include the gene ptprz , which is frequently found in the optimal combinations for brain and head and neck cancers, and egfr, a recurring target in multiple tumor types. in sum, this is the first systematic single-cell based characterization of the landscape of combinatorial receptor-mediated cancer treatments, identifying promising targets for future development.
introduction
personalized oncology offers hope that each patient's cancer can be treated based on its genomic characteristics , . several trials have suggested that it is possible to collect genomics data fast enough to inform treatment decisions - . meta-analysis of phase i clinical trials completed during - showed that overall, trials that used molecular biomarker information to influence treatment plans gave better results than trials that did not . however, most precision oncology treatments utilize only one or two medicines, and resistant clones frequently emerge, emphasizing the need to deliver personalized medicine as multiple agents combined - . important opportunities to combine systems biology and design of nanomaterials have been recognized to deliver medicines in combination to overcome drug resistance and combine biological effects . here, we propose and rigorously study a new conceptual framework for designing future precision oncology treatments. it is motivated by the growing recognition that tumors typically have considerable intra-tumor heterogeneity (ith) , and thus need to be targeted with a combination of medicines such that as many tumor cells as possible are hit by at least one medicine. our analysis is based on two recently emerging technologies: ( ) the advancement of single-cell transcriptomics and proteomics measurements from patients’ tumors, which is anticipated to gradually enter into clinical use , and ( ) the introduction of “modular” treatments that target specific overexpressed genes/proteins to recognize cells in a specific manner and then use either the t cell immune response or a lethal toxin to kill the tumor cells preferentially.
based on these two foundations, we formulate and systematically answer two basic questions. first, how many targeted treatments are needed to selectively kill most tumor cells while sparing most of the non-tumor cells in a given patient? and second, given a cohort of patients to treat, how many distinct single-target treatments need to be prepared beforehand so that there is a combination that kills at least a specified proportion of the tumor cells of each patient? we focus our analysis on genes encoding receptors on the cell surface, as these may be precisely targeted by any one of at least six technologies: car-t therapy , immunotoxins ligated to antibodies - , immunotoxins ligated to mimicking peptides , conventional chemotherapy ligated to nanoparticles , degraders associated with ubiquitin e ligases and designed ankyrin repeat proteins (darpins) , . these treatments are all “modular”, including one part that specifically targets the tumor cell via one gene/protein and another part, the cytotoxic mechanism that kills the cells. two recent genome-wide analyses of modular therapies have focused on car-t therapy , , so we focus first on this technology to put our work in context. in the original formulation, car-t therapy used one cell surface target that marks the cells of interest, such as cd as a marker for b cells. to date, car-t therapy has been effective in achieving remissions for some blood cancers , , but less effective for solid tumors. mackay et al. focused primarily on single targets, also looked at combinations of two targets, and did all analysis in silico.
dannenfelser et al. focused on predicting combinations of two and three targets and did most of their work in silico, with in vitro validation of two high-scoring predicted combinations in renal cancer. importantly, these studies have analyzed bulk tumor and normal expression data to identify likely targets. here we present the first analysis that aims to identify modular targets based on the analysis of tumor single-cell transcriptomics. this enables the research questions to be studied at a higher resolution but presents new analytical challenges that need to be addressed. two related difficulties with car-t therapy are i) toxicity to non-cancer cells , and ii) difficulty in finding single targets that are sufficiently selective . to address the toxicity problem, mackay et al. selected targets that had low expression in most tissues in the genotype-tissue expression (gtex) data; however, their analysis did not require that the targets be cell surface proteins. we proceed in a stepwise manner; we start with a formal analysis of a space of candidate cell surface receptors. then, we add a low-expression requirement like that of mackay et al., parameterized by a transcripts per million (tpm) expression threshold. for completeness, we also tested their set of genes. to address the selectivity problem, various groups have engineered composite forms of car-t treatments that implement boolean and, or, and not gates that have been tested for combinations of up to three target proteins - . both mackay et al. and dannenfelser et al. presented in silico methods focusing on and gates and pairs or trios of targets; dannenfelser et al. analyzed likely cell surface proteins that are not necessarily receptors. we have chosen to focus on the simpler logical or construction because it can be achieved not only by car-t technology , , but can also be implemented via other modular treatment technologies by combining multiple single-target treatments, assuming that the composite treatment kills a cell if any one of the single treatments kills the cell. conceptually, such a logical or combination treatment can still achieve selectivity by choosing targets, each of which is expressed on a much higher proportion of cancer cells than non-cancer cells. one of our key contributions is to show that by using techniques from combinatorial optimization, one can find such effective combinations involving a large number of targets, while previous studies were limited to at most three targets. beyond car-t, our analysis applies to several additional types of modular treatment technologies that rely instead on receptor-mediated endocytosis (rme) to deliver a toxin into the cell via a targeted receptor , . like car-t, these rme-based technologies do not downregulate the target receptor. for rme technologies and other technologies that work intracellularly, we anticipate combining modular treatments from one technology such that all treatments use the same toxin or mechanism of cell killing, thereby mitigating the need to test for interaction effects between pairs of different treatments. to address these research questions, we designed and implemented a computational approach named madhitter (after the mad hatter from alice in wonderland) to identify optimal precision combination treatments that target membrane receptors (figure , a-c).
we define three key parameters related to the stringency of killing the tumor and protecting the non-tumor cells and explore how the optimal treatments vary with those parameters (figure b, c). solving this problem is analogous to solving the classical “hitting set problem” in combinatorial algorithms , which is formally defined in the methods (see also supplementary materials ). unlike the previous studies on car-t targets, we define the problem in a personalized manner, intending that each patient will get optimal treatments for her or his tumor from among a collection of available treatments.
figure . conceptual schematic example of madhitter analysis of single-cell transcriptomics data from three cancer patients. (a) a cohort of patients (three in this example) arrives for a study in which single-cell tumor microenvironment (tme) transcriptomics data are collected from each patient; the data are analyzed with madhitter and each patient receives an optimal personalized combination of targeted therapies from a pre-specified set (pill bottle). madhitter is aimed at optimizing combinations of targeted therapies that are modular, that is, having a recognition unit that is gene/protein-specific, and a joint killing subunit (similar for all gene targets). icons of four such modular therapies are shown; for three of these, the target protein must be on the cell surface and for two it must be a receptor, so we focus our analyses on cell surface receptors. three main algorithm parameters are denoted near the madhitter icon in panel a and explained in the later panels. (b) the single-cell tme data are represented in two matrices with the genes as rows and cells as columns, partitioned into tumor (t) and non-tumor (n) cells. the expression ratio r determines by how much a gene must be overexpressed for a cell to be considered targeted. a gene is considered ‘overexpressed’ in either a non-tumor cell or a tumor cell if its expression is at least r times the mean reference level; e.g., the reference level for flt is ( + + )/ = and only cell t has flt expression above × = . the matrices on the right side show a boolean representation of which targets kill which cells, based on the expression values presented in this toy problem in matrix b and taking r= . accordingly, the combination of egfr and kdr would kill all tumor cells and would spare all non-tumor cells. (c) the main algorithm in madhitter seeks a combination of targets that is as small as possible and would kill many tumor cells and few non-tumor cells, in a patient-specific manner. the 𝑙𝑏 and 𝑢𝑏 parameters are the lower bound on the fraction of tumor cells killed and the upper bound on the fraction of non-tumor cells whose killing is tolerated, respectively. baseline settings used in our analyses are 𝑟 = , 𝑙𝑏 = . and 𝑢𝑏 = . , and are varied in some of the analyses. the right side of the panel shows a hypothetical example of the tradeoff between killing tumor cells and sparing non-tumor cells. while target set a could kill a larger fraction of tumor cells than target set b, madhitter would select target set b since only it satisfies both our baseline settings and kills at most a . fraction of the non-tumor cells.
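The thresholding described in panel (b) can be sketched in a few lines of Python. This is a minimal illustration and not the MadHitter implementation: the expression values are invented, the function names (`reference_level`, `kill_matrix`, `fraction_killed`) are hypothetical, and the reference level is assumed to be the mean over non-tumor cells with non-zero expression.

```python
from statistics import mean


def reference_level(non_tumor_values):
    """Mean expression over non-tumor cells with non-zero expression."""
    nonzero = [v for v in non_tumor_values if v > 0]
    # Assumption: if no non-tumor cell expresses the gene, use 0.0 as reference.
    return mean(nonzero) if nonzero else 0.0


def kill_matrix(expr, non_tumor_expr, r):
    """Map each gene to a list of booleans, one per cell in `expr`.

    A cell counts as killed by a gene when it expresses the gene at all and
    its expression is at least r times the gene's reference level.
    """
    killed = {}
    for gene, values in expr.items():
        threshold = r * reference_level(non_tumor_expr[gene])
        killed[gene] = [v > 0 and v >= threshold for v in values]
    return killed


def fraction_killed(killed, targets):
    """Fraction of cells hit by at least one chosen target (logical OR)."""
    n_cells = len(next(iter(killed.values())))
    hit = sum(any(killed[g][c] for g in targets) for c in range(n_cells))
    return hit / n_cells


# Invented toy data: genes are keys, each list holds one value per cell.
non_tumor = {"EGFR": [1.0, 0.0, 3.0], "KDR": [2.0, 2.0, 0.0]}
tumor = {"EGFR": [5.0, 1.0, 4.0], "KDR": [0.0, 6.0, 1.0]}

tumor_kill = kill_matrix(tumor, non_tumor, r=2.0)
non_tumor_kill = kill_matrix(non_tumor, non_tumor, r=2.0)

print(fraction_killed(tumor_kill, ["EGFR", "KDR"]))      # tumor cells hit
print(fraction_killed(non_tumor_kill, ["EGFR", "KDR"]))  # non-tumor cells hit
```

With these invented values, a candidate target set satisfies the constraints when the first fraction is at least lb and the second is at most ub.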
results
the data and the combinatorial optimization framework
we focused our analysis on searching for optimal treatment combinations in nine single-cell rnaseq data sets that include tumor cells and non-tumor cells from at least three patients and that were publicly available at the onset of our investigation (methods; table ). those data sets include four brain cancer data sets and one each from head and neck, melanoma, lung, breast and colon cancers. most analyses were done for all data sets, but for clarity of exposition, we focus in the main text on four data sets from four different cancer types (brain, head and neck, melanoma, lung) that are larger than the other five and hence make the optimization problems more challenging. results on the other five data sets are provided in supplementary materials . analyzing each of these data sets separately, we ask how many targets are needed to kill most cells of a given tumor and what the tradeoff is between cancer cells killed and non-cancer cells spared. figure shows a small schematic example in which there are alternative target sets of sizes two and three. one would prefer the target set of size two because the patients would need to receive only two distinct treatments rather than three treatments.
figure . a small schematic example of killing four tumor cells, illustrating why choosing a minimum-size combination of targets may be non-trivial. the schematic tumor has four cancer cells (a, b, c, d in separate columns), which may express any of five cell-surface receptor genes (rows) that may be targeted selectively by modular treatments (pills).
if one targets {app, kdr, met}, all cancer cells will be killed (left panel). however, if one instead targets {cr , tek}, then all cancer cells in the given example tumor will be killed with just two targets (right panel) instead of three, providing a smaller solution.
to formalize our questions as combinatorial optimization hitting set problems, we define the following parameters and baseline values and explore how the optimal answers vary as functions of these parameters: we specify a lower bound on the fraction of tumor cells that should be killed, 𝑙𝑏, which ranges from to . similarly, we define an upper bound on the fraction of non-tumor cells killed, 𝑢𝑏, which also ranges from to . our baseline settings are 𝑙𝑏 = . and 𝑢𝑏 = . . to represent the concept that only cells that overexpress the target are killed, we introduce an additional parameter 𝑟. the expression ratio 𝑟 defines which cells are killed, as follows (figure b): denote the mean expression of a gene 𝑔 in non-cancer cells that have non-zero expression by 𝐸(𝑔). a given cell is considered killed if gene 𝑔 is targeted and its expression level in that cell is at least 𝑟 × 𝐸(𝑔). higher values of 𝑟 thus model more selective killing. having 𝑟 as a modifiable parameter anticipates that in the future one could experimentally tune the overexpression level at which cell killing occurs . in this respect, technologies that rely on rme to get a toxin into the cell are particularly tunable because there is known to be a non-linear relationship between the number of protein copies on the cell surface and the probability that rme occurs successfully .
in these technologies, the toxin or other therapy delivered by the modular treatment enters cells in a gene-specific manner , while car-t therapy activates t-cell killing against cells in a gene-specific manner , . for most of our analyses, the expression ratio 𝑟 is varied from . to . , with a baseline of . , based on experiments in the lab of n.a. and related to combinatorial chemistry modeling ; in one analysis, we varied r up to . (supplementary table s ). given these definitions, we solve the following combinatorial optimization hitting set problem (methods): given an input of a single-cell transcriptomics sample of non-tumor and tumor cells for each patient in a cohort of multiple patients, bounds 𝑢𝑏 and 𝑙𝑏, ratio 𝑟, and a set of target genes, we seek a minimum-size combination of targets for each individual patient, while additionally minimizing the size of the set of all targets given to the patient cohort. the latter is termed the global minimum-size hitting set (ghs) in computer science terminology or the cohort target set (cts) in terminology specific to our problem, while the optimal hitting set of genes targeting one patient is termed the individual target set (its). this optimum hitting set problem with constraints can be solved to optimality using integer linear programming (ilp) (methods).
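For orientation, the single-patient part of this problem can be written as a small integer program. The formulation below is one standard way to encode a hitting set with coverage bounds, given as a sketch under stated assumptions; it is not taken from the paper's methods, and the symbols $K_{g,c}$, $x_g$, $y_c$ and $z_c$ are introduced here for illustration. $G$ is the candidate gene set, $T$ and $N$ are the sampled tumor and non-tumor cells, and $K_{g,c} = 1$ when the expression of $g$ in cell $c$ is at least $r \times E(g)$.

```latex
\begin{align*}
\min \quad & \sum_{g \in G} x_g \\
\text{s.t.} \quad
 & y_c \le \sum_{g \in G} K_{g,c}\, x_g && \forall c \in T
   && \text{(a tumor cell counts as killed only if some chosen target hits it)} \\
 & \sum_{c \in T} y_c \ge \mathit{lb}\cdot |T| \\
 & z_c \ge K_{g,c}\, x_g && \forall g \in G,\ \forall c \in N
   && \text{(a non-tumor cell is killed if any chosen target hits it)} \\
 & \sum_{c \in N} z_c \le \mathit{ub}\cdot |N| \\
 & x_g,\, y_c,\, z_c \in \{0, 1\}
\end{align*}
```

The cohort-level objective (minimizing the union of the chosen targets across patients) would add a second layer of variables on top of this and is left out of the sketch.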
we solve different optimization problem instances, each of which considers a different set of candidate target genes: genes encoding cell surface receptor proteins, a subset of out of these genes that already have published ligand-mimicking peptides, and a nested collection of sets of - out of the genes that are lowly expressed below a series of decreasing gene expression thresholds . from a computational standpoint, there is no inherent limit on the size of the candidate gene set. our formulation is personalized as each patient receives the minimum possible number of treatments. the global optimization comes into play only when there are multiple solutions of the same size to treat a patient. for example, suppose we have two patients such that patient a could be treated by targeting either {egfr, fgfr } or {met, fgfr } and patient b could be treated by targeting either {egfr, cd } or {anpep, cd }. then we prefer the cts {egfr, fgfr , cd } of size and we treat patient a by targeting {egfr, fgfr } and patient b by targeting {egfr, cd }. as the number of cells per patient varies by three orders of magnitude across data sets, we use random sampling to obtain hitting set instances that are of comparable sizes and yet adequately capture tumor heterogeneity. we found that sampling hundreds of cells from the tumor is sufficient to get enough data to represent all cells. in most of the experiments shown, the number of cells sampled, which we denote by 𝑐, was . in some smaller data sets, we had to sample smaller numbers of cells (methods). as shown (supplementary materials , figures s -s ), cells, when available, are roughly sufficient for the cts size to plateau for our baseline parameter settings, 𝑙𝑏 = . , 𝑢𝑏 = . , 𝑟 = . . for each individual within a data set, we performed independent sampling of 𝑐 cells times and summarized the results.
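The two-patient tie-breaking example can be reproduced by plain enumeration. The sketch below assumes the equally sized optimal target sets for each patient are already known and simply picks one per patient so that their union is smallest. The function name is hypothetical, and "FGFR-x" and "CD-x" are placeholders for the two gene names whose numeric suffixes are missing from the text.

```python
from itertools import product


def smallest_cohort_target_set(options_per_patient):
    """Pick one optimal target set per patient so the union is as small as possible.

    options_per_patient maps a patient label to a list of alternative target
    sets, all of the same (minimum) size for that patient.
    """
    best_union, best_choice = None, None
    for choice in product(*options_per_patient.values()):
        union = set().union(*choice)
        if best_union is None or len(union) < len(best_union):
            best_union, best_choice = union, choice
    return best_union, dict(zip(options_per_patient, best_choice))


options = {
    "A": [{"EGFR", "FGFR-x"}, {"MET", "FGFR-x"}],
    "B": [{"EGFR", "CD-x"}, {"ANPEP", "CD-x"}],
}

cts, assignment = smallest_cohort_target_set(options)
print(sorted(cts))   # cohort target set
print(assignment)    # target set chosen for each patient
```

Enumeration is only practical for a handful of patients and alternatives; it is shown here to make the objective concrete, and it stands in for the ILP that the text uses.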
cohort and individual target set sizes as functions of tumor killing and non-tumor sparing goals
given the single-cell tumor data sets and the ilp optimization framework described above, we first studied how the resulting optimal cohort target set (cts) may vary as a function of the parameters defining the optimization objectives in different cancer types. figures and s -s in supplementary materials show heatmaps of cts sizes when varying lb, ub, and r around the baseline values of . , . , and . , respectively. the cts sizes for melanoma were largest, partly due to the larger number of patients in that data set (table ). indeed, when we sampled subsets of or patients uniformly, we observed that the mean cts sizes grew from . ( patient subsets) to . ( patient subsets) to . (all patients, as shown in figure ). encouragingly, for most data sets and parameter settings, the optimal cts sizes are in the single digits. however, in several data sets, we observe a sharp increase in cts size as 𝑙𝑏 values are increased above . and/or as the 𝑢𝑏 is decreased below . , with a more pronounced effect of varying 𝑙𝑏. this transition is more discernible at the lowest value of 𝑟 ( . ), probably because when 𝑟 is lower, it becomes harder to find genes that are individually selective in killing tumor cells and sparing non-tumor cells (supplementary figures s -s ). the qualitative transition observed in cts sizes occurs robustly regardless of the threshold for filtering out low expressing cells when preprocessing the data (supplementary materials , figures s -s ).
figure . heat maps showing how the cohort target set size (cts) varies as a function of 𝒍𝒃, 𝒖𝒃, 𝒓 and across data sets. for each plot the x-axis and y-axis represent lb and ub parameter values, respectively. the color scale on the right shows the cohort target set sizes. we show separate plots for 𝑟 = . , . here and a larger set { . , . , . , . } in supplementary materials . individual values are not necessarily integers because each value represents the mean of replicates of sampling 𝑐 ( for each of the data sets shown here) cells (figure s ).
we next examined the resulting individual target set (its) sizes obtained in the optimal combinations under the same conditions. in all data sets, the mean its sizes are in the single digits for most values of 𝑙𝑏 and 𝑢𝑏. the distributions of its sizes are shown for four data sets and two combinations of (𝑙𝑏, 𝑢𝑏) (figure ) and for additional data sets in supplementary materials , figure s . overall, the mean its sizes with the baseline parameter values (𝑟 = . , 𝑙𝑏 = . , 𝑢𝑏 = . ) range from . to . among the nine data sets studied (supplementary table s ); on average targets per patient should hence suffice if enough single-target treatments are available in the cohort target set. however, there is considerable variability across patients. evidently, as we make the treatment requirements more stringent (by increasing 𝑙𝑏 from . to . and decreasing 𝑢𝑏 from . to . ), the variability in its size across patients becomes larger. importantly, this analysis provides rigorous quantifiable evidence grounding the prevailing observation that among tumors of the same type, some individual tumors may be much harder to treat than others. taken together, these results show that we can compute precise estimates of the number of targets needed for cohorts (in the tens) and individual patients (usually in the single digits) and that these estimates are sensitive to the killing stringency, especially when the 𝑙𝑏 increases above . . the variation for more aggressive killing regimes, with values of 𝑙𝑏 up to . for the baseline 𝑟 = . is displayed in figures s -s in supplementary materials . for fixed 𝑙𝑏 = . , 𝑢𝑏 = . and varying 𝑟, the smallest cts sizes are typically obtained for 𝑟 values close to . , further motivating our choice of 𝑟 = as the default value (supplementary materials , figures s -s , supplementary table s ). finally, we show that, as expected, a ‘control’ greedy heuristic algorithm searching for small and effective target combinations finds its sizes substantially larger than the optimal its sizes identified using our optimization algorithm (figure ). the greedy cts size is greater than the ilp optimal cts size for eight out of nine data sets (table s in supplementary materials , methods).
figure . the distribution of optimal and greedy individual treatment combination sizes (its) values in four different cancer types. we study both our baseline parameter setting (upper row panels) and a markedly more stringent one (middle row plots).
for the more stringent parameter setting, we compare the its sizes obtained using madhitter (middle row plots) and a greedy algorithm that tries to add pairs of genes at a time (bottom row plots). in each plot, the patients are sorted from left to right according to their mean its values in the optimal stringent regime. additional comparisons between its sizes at different parameter settings can be found in supplementary materials . description of the greedy algorithm and more comparisons between the optimal and greedy algorithms are provided in supplementary materials .
the landscape of combinations achievable with receptors currently targetable by published ligand-mimicking peptides
to get a view of the combination treatments that are possible with receptor targets for which there are already existing modular targeting reagents, we conducted a literature search identifying out of the genes with published ligand-mimicking peptides that have been already tested in in vitro models, usually cancer models (methods; tables and ). we asked whether we could find feasible optimal combinations in this case and, if so, how the optimal cts and its sizes compare with those computed for all genes.
figure . comparison of individual target set sizes with or targets for three out of the six data sets that have feasible solutions. we attempted to find feasible solutions for all patients using cell surface receptors that have published ligand-mimicking peptides that have been tested in vitro or in pre-clinical models.
there are feasible solutions for all patients in six data sets, but not for the brain (gse ), melanoma (gse ), and lung (e-mtab- ), which were displayed in previous figures. instead, we show here results for breast and colorectal cancers, for which other analyses, such as those in figures and , are in the supplementary materials. some of the optimal solutions obtained on the -receptors restricted set are of the same size as those obtained on the whole receptors set and some are larger.
computing the optimal cts and its solutions for this basket of targets, we found feasible solutions for six of the data sets across all parameter combinations we surveyed and three of these six are illustrated for each patient in figure . however, for three data sets, in numerous parameter combinations we could not find optimal solutions that satisfy the optimization constraints (supplementary materials , figures s -s ). that is, the currently available targets do not allow one to design treatments that may achieve the specified selective killing objectives, underscoring the need to develop new targeted cancer therapies to make personalized medicine more effective for more patients. overall, comparing the optimal solutions obtained with targets to those we have obtained with the targets, three qualitatively different behaviors are observed (supplementary materials , figures s -s ): ( ) in some datasets, it is only slightly more difficult to find optimal its and cts solutions with the -gene pool, while in others, the restriction to a smaller pool can be a severe constraint making the optimization problem infeasible.
( ) the smaller basket of gene targets may force more patients to receive similar individual treatment sets and thereby reduce the size of the cts. ( ) unlike the cts size, the its size must stay the same or increase when the pool of genes is reduced, because we find the optimal its size for each patient. overall, the average its sizes across each cohort using the pool of genes for baseline settings range from . to . . among cases that have any solution, the average increases in the its sizes at baseline settings in the genes case vs. that of the case were moderate, ranging from . to . .
optimal fairness-based combination therapies for a given cohort of patients
until now we have adhered to a patient-centered approach that aims to find the minimum-size its for each patient, first and foremost. we now study a different, cohort-centered approach, where given a cohort of patients, we seek to minimize the overall cts size, while allowing for some increase in the its sizes. the key question is how much larger the resulting its sizes are if we optimize for minimizing the cohort (cts size), rather than the individuals (its size). this challenge is motivated by a ‘fairness’ perspective (supplementary materials ), where we seek solutions that are beneficial for the entire community from a social or economic perspective (in terms of cts size) even if they are potentially sub-optimal at the individual level (in terms of its sizes). here, the potential benefit is economic since running a basket trial would be less expensive if one reduces the size of the basket of available treatments (figure a-b).
We formalized this ‘fair CTS problem’ by adding a cost parameter 𝛼 that specifies the limit on the excess number of (ITS) targets selected for any individual patient, compared to the number selected in the individual-based approach studied up to this point (formally, the latter corresponds to setting 𝛼 = ). We formulated this fair CTS problem as an ILP and solved it for up to possible targets on all nine data sets (Methods). We fixed 𝑟 = and 𝑢𝑏 = . while varying 𝛼 and 𝑙𝑏. Figure c and Figures S -S in Supplementary Materials show the optimal CTS and ITS sizes for 𝛼 = , . . . , .

Figure . A schematic example demonstrating the rationale and workings of fairness-based solutions. (a, b) Let us assume that each of three patients has two tumor cells (columns), each displaying five membrane receptors that are highly expressed only on the tumor cells and not on the non-tumor ones (rows). If we target {APP, MET} (panel a, 𝛼 = ) in all patients, then this achieves a CTS size of , which is the minimum possible. Employing the original individual-based optimization objective, each patient could instead be treated by an ITS of size by targeting the distinct receptors called target (specific to patient ), target and target , respectively, but this would result in an optimal CTS of size (panel b, 𝛼 = ). The solution in panel a has an unfairness value 𝛼 = because the worst difference among all patients is that a patient receives more treatment than necessary. (c) Heatmaps showing how the CTS size varies as 𝛼 increases (y-axis), starting from its baseline value of where each patient is assigned a minimum-size individual treatment set (top row). The lower bound on tumor cells killed (x-axis) is also varied while the upper bound on non-tumor cells killed is kept fixed at . . We are particularly interested in finding the smallest value on the y-axis at which the CTS size reaches its minimum value, which is circled for the baseline 𝑙𝑏 = . , because this bounds the tradeoff between the achievable reduction in the number of targets needed to treat the whole cohort and the number of extra targets above the ITS minimum that any patient might need to receive.

For out of data sets, we encouragingly find that the unfairness cost parameter 𝛼 is bounded by a constant of ; i.e., it is sufficient to increase 𝛼 by no more than to obtain the smallest CTS sizes in the optimally fair solutions. For the largest data set (melanoma), 𝛼 = . As we show in Supplementary Materials , empirically, even if one requires lower 𝛼 values, then as those approach , the size of the fairness-based CTS grows fairly moderately and remains in the lower double digits, and the mean number of treatments given to each patient (their ITS) is overall < . Theoretically, we show that one can design instances for which 𝛼 would need to be at least √𝑛 − to get a CTS of size less than the overall number of targets 𝑛 (Supplementary Materials ). However, in practice, we find that given the current tumor single-cell expression data, fairness-based treatment strategies are likely to be a reasonable economic option in the future.
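To make the role of 𝛼 concrete, the following is a minimal brute-force sketch of the fair CTS idea on a toy cohort loosely modeled on the schematic figure. It is not the ILP formulation used by MadHitter: it assumes a simplified kill rule (a tumor cell is killed if it expresses at least one chosen target), requires every tumor cell to be hit, and ignores non-tumor cells and the parameters 𝑟, 𝑙𝑏 and 𝑢𝑏. The function names and the receptor labels T1, T2 and T3 are hypothetical.

```python
from itertools import combinations


def hits_all(targets, cells):
    """Return True if every tumor cell expresses at least one chosen target."""
    return all(cell & targets for cell in cells)


def min_its_size(cells, pool):
    """Size of the smallest target set from `pool` that hits every cell of one patient."""
    for k in range(len(pool) + 1):
        if any(hits_all(set(combo), cells) for combo in combinations(pool, k)):
            return k
    raise ValueError("no feasible treatment set for this patient")


def fair_cts(patients, pool, alpha):
    """Smallest cohort set from which each patient can be treated using at most
    (that patient's minimum ITS size + alpha) targets. Exhaustive search, toy sizes only."""
    budgets = {name: min_its_size(cells, pool) + alpha
               for name, cells in patients.items()}
    for k in range(len(pool) + 1):
        for cohort in combinations(pool, k):
            if all(
                any(hits_all(set(sub), cells)
                    for size in range(budgets[name] + 1)
                    for sub in combinations(cohort, size))
                for name, cells in patients.items()
            ):
                return set(cohort)
    raise ValueError("no feasible cohort treatment set")


# Hypothetical toy cohort: each patient has two tumor cells; APP and MET each
# cover one cell in every patient, while T1..T3 are patient-specific receptors.
patients = {
    "patient_1": [{"APP", "T1"}, {"MET", "T1"}],
    "patient_2": [{"APP", "T2"}, {"MET", "T2"}],
    "patient_3": [{"APP", "T3"}, {"MET", "T3"}],
}
pool = ["APP", "MET", "T1", "T2", "T3"]

cts_alpha0 = fair_cts(patients, pool, alpha=0)  # every patient keeps a minimum-size ITS
cts_alpha1 = fair_cts(patients, pool, alpha=1)  # one extra target per patient is allowed
print(sorted(cts_alpha0), sorted(cts_alpha1))
```

In this toy cohort, the search returns the three patient-specific receptors when 𝛼 = 0 and {APP, MET} when 𝛼 = 1, which is the kind of trade-off between cohort set size and per-patient excess that the fairness parameter controls. Exhaustive search is exponential in the pool size, so realistic instances would need an ILP solver or a comparable method.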
The landscape of optimal solutions targeting receptors that are lowly expressed across many healthy tissues

We turn to examine the space of optimal solutions when restricting the set of eligible surface receptor gene targets to those that have lower expression across many noncancerous human tissues (Methods), aiming to mitigate potential damage to tissues unrelated to the tumor site. To this end, we selected subsets of the cell surface receptor targets in which the genes have overall lower expression across multiple normal tissues, by mining GTEx and the Human Protein Atlas (HPA) (Methods). Varying the selectivity expression threshold (expressed in transcripts per million (TPM)) used to filter out genes whose mean expression across the normal adult tissues is above values of , , , , . , and . (i.e., employing more and more extensive filtering as this threshold is decreased) decreases the size of the target cell surface receptor gene list by more than half (Table ). As shown in Figures a, b (and Supplementary Figures S -S ), MadHitter identifies very different cohort target sets (which are larger than the original optimal solutions, as expected) as the TPM selectivity threshold value is decreased. Furthermore, different ITS instances may become infeasible (Supplementary Figure S ). At the individual patient level, using lower selectivity thresholds, which leaves a smaller space of membrane receptors to choose from, also leads to increased mean ITS sizes (Supplementary Figures S , S ). Across the nine data sets, the selectivity threshold at which the CTS problem became infeasible varied (Supplementary Figure S ).
The differences observed could result from expression heterogeneity of the cancer, the number of patients within the data set, the size of the target gene set, lack of expression of available gene targets, and other unknown factors. Further experimentation is required to identify tissue-specific optimal gene expression thresholds that minimize side effects while allowing cancer cells to be killed by combinations of targeted therapies. Finally, for completeness, we also tested MadHitter on the set of lowly expressed genes suggested by MacKay et al. All instances with the default settings of 𝑟, 𝑙𝑏, 𝑢𝑏 have feasible solutions for all patients. Mean ITS sizes are below for eight of nine data sets, but close to for the brain cancer data set GSE . More details can be found in Supplementary Materials and Table S .

Figure . Variation in the CTS size and composition as a function of the magnitude of filtering of genes expressed in noncancerous human tissues, for different tumor types. (a-b) The number of times a gene (cell-surface receptor) is included in the CTS (out of replicates, which is therefore the maximum count in panels a-b), where each column presents the CTS solutions when the input target gene sets are filtered using a specific TPM filtering threshold (Methods), for (a) a breast cancer and (b) a brain cancer data set. These data sets were selected due to their relatively small cohort target set sizes, permitting their visualization.
(c-f) Circos plots of the genes occurring most frequently in optimal CTS solutions (length of arc along the circumference) and their pairwise co-occurrence (thickness of the connecting edge) for the four main cancer types, in our original target space of encoding cell-surface receptors. For each data set, we sampled up to optimal CTS solutions. Network representations of the most common target genes out of encoding cell-surface receptors (with greater than % frequency of occurrence) are represented in a cancer-specific manner for (c) brain cancer, (d) head and neck cancer (only seven genes have a frequency of % or more across optimal solutions), (e) melanoma, and (f) lung cancer. Genes and connections have distinct colors for improved visibility.

Key targets composing optimal solutions across the space of membrane receptors

To identify the genes that occur most often in optimal solutions for our baseline settings, and because there may be multiple distinct optimal solutions composed of different target genes, we sampled up to optimal solutions for each optimization instance solved and recorded how often each gene occurs and how often each pair of genes occurs together (Methods). We analyzed and visualized these gene (co-)occurrences in three ways. First, we constructed co-occurrence Circos plots in which arcs around the circle represent frequently occurring genes and edges connect targets that frequently co-occur in optimal CTS solutions. Figure c-f shows the co-occurrence visualizations for optimal CTS solutions obtained with the original, unfiltered target space of genes and in baseline parameter settings.
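The occurrence and co-occurrence tallies described above can be sketched in a few lines. This is an illustrative sketch only, assuming each sampled optimal solution is available as a set of gene symbols; the function name and the three example solutions are hypothetical and are not taken from the MadHitter results.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrence(solutions):
    """Tally how often each gene, and each unordered pair of genes,
    appears across a collection of sampled solutions."""
    gene_counts = Counter()
    pair_counts = Counter()
    for solution in solutions:
        genes = sorted(set(solution))  # sort so each pair has one canonical order
        gene_counts.update(genes)
        pair_counts.update(combinations(genes, 2))
    return gene_counts, pair_counts


# Hypothetical sampled solutions (gene symbols used purely as labels).
solutions = [
    {"EGFR", "INSR"},
    {"EGFR", "PTPRJ"},
    {"EGFR", "INSR", "PTPRJ"},
]
gene_counts, pair_counts = count_cooccurrence(solutions)
print(gene_counts.most_common())
print(pair_counts.most_common())
```

Gene counts of this kind would correspond to arc lengths in a co-occurrence plot, and pair counts to edge thicknesses.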
The genes frequently occurring in optimal solutions are quite specific and distinct between different cancer types. In melanoma, the edges form a clique-like network because virtually all optimal solutions include the same clique of genes (Figure e). The head and neck cancer data set has only one commonly co-occurring pair, {GPR , CXADR} (Figure d). Of the cancer types not depicted in Figure , the breast cancer data set has a commonly co-occurring set of size , {CLDN , INSR, P RY , SORL}, and the colorectal cancer data set has a different commonly co-occurring set of size , {GABRE, GPRR, LGR , PTPRJ} (data not shown).

We next tabulated sums of how often each gene occurred in optimal solutions for all nine data sets (Supplementary Materials , Tables S , S and S ), obtained when solving for either gene targets or gene targets. Strikingly, one gene, PTPRZ (protein tyrosine phosphatase receptor zeta ), appears far more frequently than others, especially in three brain cancer data sets (GSE , GSE , GSE , Supplementary Table S ). PTPRZ also occurs commonly in optimal solutions for the head and neck cancer data set (Figure d). The brain cancer finding coincides with previous reports that PTPRZ is overexpressed in glioblastoma (GBM). PTPRZ also forms a fusion with the nearby oncogene MET in some brain tumors that overexpress the fused MET. Notably, various cell line studies and mouse studies have shown that inhibiting PTPRZ , for example by shRNAs, can slow glioblastoma tumor growth and migration. There have been some attempts to inhibit PTPRZ pharmacologically in brain cancer and other brain disorders.
In the four brain cancer data sets, PTPRZ is expressed selectively above the baseline 𝑟 = . in . (GSE ), . (GSE ), . (GSE ) and . (GSE ) proportion of cells in each cohort. The much lower relative level of PTPRZ expression in GSE is likely due to the heterogeneity of brain cancer types in this data set. Among the genes with known ligand-mimicking peptides, EGFR stands out as most common in optimal solutions (Supplementary Table S ). Even when all genes are available, EGFR is most commonly selected for the brain cancer data set (GSE ) in which PTPRZ is not as highly overexpressed (Figure c).

PTPRZ was the fifth most frequently occurring gene in optimal solutions for the head and neck cancer data set (GSE ). The two most common genes by a large margin are CXADR and GPR . CXADR has been studied primarily by virologists and immunologists because it encodes a receptor for coxsackieviruses and adenoviruses. In one breast cancer study, CXADR was found to play a role in regulating PTEN in the AKT pathway, but CXADR was underexpressed in breast cancer whereas it is overexpressed in the head and neck cancer data we analyzed. GPR is a rarely studied G protein-coupled receptor with an unknown natural ligand. In the context of cancer, GPR has previously been reported as overexpressed in several tumor types including lung and liver, and its overexpression may play an oncogenic role via the p pathway, the NF-κB pathway, or other pathways.

Finally, we analyzed the set of genes in optimal solutions via the STRING database and associated tools to perform several types of gene set and pathway enrichment analyses. Figures S -S (Supplementary Materials ) show STRING-derived protein-protein interaction networks for the most common genes in the same four data sets for which we showed co-occurrence graphs in Figure c-f. Again, EGFR stands out as being a highly connected protein node in the solution networks for both the brain cancer and head and neck cancer data sets.
Among the genes in the -gene set most commonly in optimal solutions (Supplementary Table S ), there are six kinases (out of total human transmembrane kinases with a catalytic domain, STRING gene set enrichment 𝑝 < 𝑒 − ), namely {EGFR, EPHB , ERBB , FGFR , INSR, NTRK }, and two phosphatases, {PTPRJ, PTPRZ }. The KEGG pathways most significantly enriched, all at 𝐹𝐷𝑅 < . , are (“proteoglycans in cancer”) represented by {CD , EGFR, ERBB , FGFR , PLAUR}, (“adherens junction”) represented by {EGFR, FGFR , INSR, PTPRJ}, and (“calcium signaling pathway”) represented by {EDNRB, EGFR, ERBB , GRPR, P RX }. The one gene in the intersection of all these pathways and functions is EGFR.

Discussion

In this multi-disciplinary study, we harnessed techniques from combinatorial optimization to analyze publicly available single-cell tumor transcriptomics data and chart the landscape of future personalized combinations based on ‘modular’ therapies, including CAR-T therapy. We showed that, for most tumors we studied, four modular medications targeting different overexpressed receptors may suffice to selectively kill most tumor cells while sparing most of the non-cancerous cells (Figures and and Table S ). For the more restricted sets of low-expression genes or the receptors with validated ligand-mimicking peptides (Tables and ), some patients do not have feasible solutions, especially as we reduce the TPM expression threshold used for filtering the gene set to avoid targeting non-cancerous tissues. These findings indicate, on one hand, that researchers designing ligand-mimicking peptides have been astute in choosing targets relevant to cancer.
On the other hand, these results suggest that there is a need to extend the set of cell surface receptors that can be targeted to enter tumor cells with ligated chemotherapy agents.

Remarkably, we found that if one designs the optimal set of treatments for an entire cohort adopting a fairness-based policy, then the size of the projected treatment combinations for individual patients is at most targets larger, and in most data sets at most target receptor larger, than the optimal solutions that would have been assigned to these patients based on an individual-centric policy (Figure , Supplementary Materials ). This suggests that the extent to which the personalized treatment for any individual is suboptimal solely because that individual happens to have registered for a cohort trial is tightly bounded.

Like the study of MacKay et al., our study is a conceptual computational investigation. We studied nine single-cell expression data sets for the first time, but it would be helpful to analyze more and larger data sets in the future. Even among four data sets of the same (brain) cancer type, we observed considerable variability in CTS and ITS sizes. Since our approach is general and the software is freely available as source code, other researchers can test the method on new data sets or add new variables, constraints, and optimality criteria. These investigations may of course lead others to further improve our method and broaden its applicability.
In future work, we plan to apply our approach to study ways of selectively killing specific populations of immune cells, such as myeloid-derived suppressor cells, because they inhibit tumor killing, while sparing most other non-cancer cells.

We compared gene expression levels between non-cancer cells and cancer cells sampled from the same patient, which avoids inter-patient expression variability. However, we did little to account for “expression dropout” beyond the normalization performed by the providers of the data sets, aiming to preserve the public data as it was submitted to GEO or ArrayExpress. To achieve some uniformity and also to take as cautious an approach as possible, we added a step to filter low-expressing cells because some data sets had already been filtered in this way. One could instead apply imputation methods such as MAGIC or scImpute to infer denser gene expression matrices and then apply our method to the adjusted input data. Therefore, our results about the sizes of optimal ITS should be viewed as estimated upper bounds that are likely to decrease if the dropout rate decreases or if cells expressing few genes are eliminated from the analysis more stringently.

Another limitation of our method is that we viewed the measured gene expression as being valid over all time, even though gene expression is known to be a stochastic process. In the future, we would like to extend our approach to use the single-cell data to infer how stochastic the expression of each gene is and to prefer targets whose expression is more stable.

Even though the combinatorial optimization problems solved here are exponentially hard in the worst case (NP-complete in computer science terminology), the actual instances that arise from the single-cell data could be either formally solved to optimality or shown to be infeasible with modern optimization software.
Of note, Delaney et al. have recently formalized a related optimization problem in the analysis of single-cell clustered data for immunology. Their optimization problem is also NP-complete in the worst case, and they could solve sets of up to size four using heuristic methods. We have shown that the optimal ILP solutions we obtained are often substantially smaller than solutions obtained via a greedy heuristic (Figure , Supplementary Materials including Table S ).

On the cautionary side, experiments with target gene sets that were further filtered by low expression in normal tissues showed that the individual target set problem can become infeasible in many instances. Even when the instance remained feasible, optimal cohort treatment set sizes increased rapidly as the allowed expression levels decreased (Figure ), pointing to potential inherent limitations of applying such combination approaches to patients in the clinic and the need to carefully monitor their putative safety and toxicity in future applications.

Finally, functional enrichment analysis of genes commonly occurring in the optimal target sets reinforced the central role of the widely studied oncogene EGFR and other transmembrane kinases. We also found that the less-studied phosphatase PTPRZ is a useful target, especially in brain cancer.

In summary, this study is the first to harness combinatorial optimization tools to analyze emerging single-cell data to portray the landscape of feasible personalized combinations in cancer medicine.
Our findings uncover promising membranal targets for the development of future oncology medicines that may serve to optimize the treatment of cancer patient cohorts in several cancer types. The MadHitter approach presented and the accompanying software made public can be readily applied to address additional fundamental related research questions and analyze additional cancer data sets as they become available.

Methods

Data sets

We retrieved and organized data sets from NCBI’s Gene Expression Omnibus (GEO), EMBL-EBI’s ArrayExpress, and the Broad Institute’s Single Cell Portal (https://portals.broadinstitute.org/single_cell). Nine data sets had sufficient tumor and non-tumor cells and were used in this study; an additional five data sets had sufficient tumor cells only and were used in testing early versions of MadHitter. Suitable data sets were identified by searching scRNASeqDB, CancerSEA, GEO, ArrayExpress, Google Scholar, and the x Genomics list of publications (https://www. xgenomics.com/resources/publications/). We required that each data set contain measurements of RNA expression on single cells from human primary solid tumors of at least two patients and that the metadata be consistent with the primary data. We are grateful to several of the data depositing authors for resolving metadata inconsistencies by e-mail correspondence and by sending additional files not available at GEO or ArrayExpress. We excluded blood cancers and data sets with single patients.
When it was easily possible to separate cancer cells from non-cancer cells of a similar type, we did so. The main task in organizing each data set was to separate the cells from each sample or each patient into one or more single files. Representations of the expression as binary, as read counts, or as normalized quantities such as transcripts per million (TPM) were retained from the original data. When the data set included cell type assignments, we retained those to classify cells as “cancer” or “non-cancer”, except in the data set of Karaayvaz et al., where it was necessary to reapply filters described in the paper to exclude cells expressing few genes and to identify likely cancer and likely non-cancer cells. If cell types were not distinguished, all cells were treated as cancer cells.

To achieve partial consistency in the genes included, we filtered data sets to include only those gene labels recognized as valid by the HUGO Gene Nomenclature Committee (http://genenames.org), but otherwise we retained whatever recognized genes the data submitters chose to include. After filtering out the non-HUGO genes, but before reducing the set of genes to or or or , we filtered out cells as follows. Some data sets came with low-expressing cells filtered out. To achieve some homogeneity, we filtered out any cells expressing fewer than % of all genes before we reduced the number of genes. In Supplementary Materials , we tested the robustness of this % threshold. Finally, we retained all available genes from among either our set of genes encoding cell-surface receptors that met additional criteria on low expression or those with available ligand-mimicking peptides.

Table . Summary descriptions of single-cell data sets from solid tumors used either for analysis ( ) or preliminary testing ( additional). Data sets are ordered so that those from the same or similar tumor types are on consecutive rows.
The first data sets were obtained either from GEO or the Broad Institute Single Cell Portal, but the GEO code is shown. The data set on the last row was obtained from ArrayExpress. In some data sets that have both cancer and non-cancer cells, there may be samples for which only one type or the other is provided. Hence, the numbers in parentheses in the third and fourth columns may differ. Data set GSE supersedes and partly subsumes GSE .

Data set code | Cancer type(s) | Cancer cells (samples) | Non-cancer cells (samples) | Clinical follow-up | Reference(s)
GSE | breast | ( ) | -- | metastasis or not |
GSE | breast | ( ) | ( ) | metastasis or not |
GSE | brain (glioma) | ( ) | ( ) | no |
GSE | brain (glioma) | ( ) | -- | no |
GSE | brain (glioma) | ( ) | ( ) | no |
GSE | brain (glioma) | ( ) | -- | no |
GSE | brain ( glioma and glioblastoma) | ( ) | ( ) | no |
GSE | brain (glioblastoma) | ( ) | ( ) | no |
GSE | colorectal | ( ) | ( ) | no |
GSE | head and neck | ( ) | ( ) | no |
GSE | melanoma | ( ) | ( ) | yes, immunotherapy |
GSE | ovarian | ( primary) ( metastasis) | ( ) | no |
GSE | prostate | ( ) | -- | metastasis or not |
E-MTAB- | lung | ( ) | ( ) | no |
Sampling process to generate replicates of data sets

As shown in Table , the number of cells available in the different single-cell data sets varies by three orders of magnitude; to enable us to compare the findings across different data sets and cancer types on a more equal footing, we sampled from the larger sets to reduce this difference to one order of magnitude. This is in keeping with the data collection process in the real world, as we might get measurements from different samples at different times. Suppose that for a data set we have 𝑛 genes and 𝑚 cells comprising tumor cells and non-tumor cells. We want to select a subset of 𝑚′ < 𝑚 cells. We select a set of 𝑚′ cells uniformly at random without replacement from among all cells. Then we partition the selected cells into 𝑚′ₜ tumor cells and 𝑚′ₙ non-tumor cells to define one replicate. In most of the computational experiments shown we used replicates, and we report either the arithmetic mean or the entire distribution of quantities such as the CTS size.

Considering a previously defined set of target genes and of HPA gene expression across different normal tissues

The general aim of our methods is to target the cancer cells while sparing the adjacent non-cancer cells as much as possible. A related concern is that genes within the target set could be expressed at high levels in other normal tissues that are not part of the non-cancer cells from the tumor microenvironment included in the input data sets. One way to address this problem is to identify genes that have low expression in the majority of the tissues and to use them to obtain a target set. This approach was pioneered in a recent paper on selecting gene targets suitable for CAR-T therapy. The authors selected candidate genes that they judged could be reasonable targets for CAR-T.
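Returning briefly to the replicate sampling procedure described above (draw 𝑚′ of the 𝑚 cells uniformly at random without replacement, then partition the draw into tumor and non-tumor cells), a minimal sketch might look as follows. The function name, the representation of cells as (cell_id, is_tumor) pairs, the example sizes and the fixed seed are all assumptions made for illustration and are not taken from the MadHitter code.

```python
import random


def sample_replicate(cells, m_prime, rng):
    """Draw one replicate: m_prime cells uniformly at random without replacement,
    then split the draw into tumor and non-tumor cells.

    cells: list of (cell_id, is_tumor) pairs (hypothetical layout).
    """
    if not 0 < m_prime < len(cells):
        raise ValueError("m_prime must satisfy 0 < m_prime < number of cells")
    chosen = rng.sample(cells, m_prime)  # sampling without replacement
    tumor = [cell_id for cell_id, is_tumor in chosen if is_tumor]
    non_tumor = [cell_id for cell_id, is_tumor in chosen if not is_tumor]
    return tumor, non_tumor


# Hypothetical data set: 60 tumor cells and 40 non-tumor cells.
cells = [(f"t{i}", True) for i in range(60)] + [(f"n{i}", False) for i in range(40)]
rng = random.Random(0)  # fixed seed so the replicates are reproducible
replicates = [sample_replicate(cells, 20, rng) for _ in range(5)]
```

Because the partition is made after the draw, the numbers of tumor and non-tumor cells vary from replicate to replicate while their sum stays fixed at 𝑚′.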
They made this selection based on expression data from the Human Protein Atlas and the Genotype-Tissue Expression consortium (GTEx), which provide expression information from multiple tissues that was used to identify lowly expressed target genes. MacKay et al. used a threshold of TPM units of expression (written in their work as log (TPM+ ) ≤ ), but they allowed a small number of tissues to exceed this threshold. Instead, we used quantitative levels of expression for finer granularity in analysis, as described in the next subsection. One clinical difference is that we looked only at adult tissues because we are analyzing adult tumors, while CAR-T therapy can be used for either childhood or adult tumors. The reason to focus on cell-surface receptors, as suggested by Dannenfelser et al., is that CAR-T therapy requires a cell-surface target that may or may not be a receptor, antibody technologies require a cell surface receptor, and the ligand-mimicking peptide nanotechnology that we summarized in the introduction also requires cell surface receptor targets.

Construction of target gene sets that are lowly expressed in normal tissues

To analyze the tissue specificity of the candidate target genes, RNA-seq-based multiple-tissue expression data were obtained from the Human Protein Atlas (HPA) database (https://www.proteinatlas.org/about/download; date: May , ). The HPA database includes expression values (in units of transcripts per million (TPM)) for tissues from HPA (rna_tissue_hpa.tsv.zip) and tissues from the Genotype-Tissue Expression consortium (rna_tissue_gtex.tsv.zip).
Next, to identify target genes with low or no expression within the majority of adult human tissues, for the candidate genes we identified genes whose average expression across tissues is below a certain threshold value ( . , . , , , , and TPM) in both the HPA and GTEx data sets. Using the intersection of low-expression candidate genes from the HPA and GTEx data sets, we generated lists of high-confidence targets. The size of the resulting high-confidence target gene sets varied from (average expression less than . TPM) to (average expression across tissues less than TPM) genes (Table ). While the total number of genes decreases slowly, the decrease is much steeper if one excludes olfactory receptors and taste receptors (Table ). These sensory receptors are not typically considered as cancer targets, although a few of these receptors are selected in optimal target sets when there are few alternatives (Figure ). MadHitter was run on all nine data sets using the expression information from the high-confidence gene lists.

Table : Size of high-confidence target gene sets for different thresholds.
Threshold: expression across tissues (TPM) | Size of gene set | No. of genes that are not olfactory (OR*) or taste receptors (TAS*) | No. of genes with ligand-mimicking peptides (intersection with Tables and )

Assembling lists of membrane target genes

We are interested in the set of genes 𝐺 that i) have the encoded protein expressed on the cell surface and ii) for which some biochemistry lab has found a small peptide (i.e.
amino acid sequences of - amino acids) that can attach itself to the target protein and get inside the cell carrying a tiny cargo of a toxic drug that will kill the cell and iii) encode proteins that are receptors. the third condition is needed because many proteins that reside on the cell surface are not receptors that can undergo rme. the first condition can be reliably tested using a recently published list of genes encoding human predicted cell surface proteins; we reduced the list to by requiring that the proteins be receptors, which is necessary for rme-based therapies but not for car-t therapy. for condition ii), we found two review articles in the chemistry literature that list targets effectively meeting this condition. intersecting the lists meeting conditions i) and ii) gave us genes/proteins that could be targeted (table ). most of the data sets listed in table had expression data on - of these genes because the list of includes many olfactory receptor genes that may be omitted from standard genome-wide expression experiments. among the genes in table , / data sets have all genes, but gse was substantially filtered and has only / genes; since gse lacks non-tumor cells, we did not use this data set in any analyses shown.

because the latter review was published in , we expected that there are now additional genes for which ligand-mimicking peptides are known. we found additional genes and those are listed in table . thus, our target set analyses restricted to genes with known ligand-mimicking peptides use = + targets.

table .
single proteins that can be targeted by peptides based on references , and are expressed on the cell surface . for easier correspondence with the gene expression data, the entries are listed in alphabetical order by gene symbol. in this table, we follow the clinical genetics formatting convention that proteins are in roman and gene symbols are in italics.

protein | gene symbol
apn/cd | anpep
app | app
pd-l | cd
cd | cd
p /gc qr | cd
e-cadherin | cdh
n-cadherin | cdh
cd | cr
egfr | egfr
epha | epha
ephb | ephb
her | erbb
fgfr | fgfr
fgfr | fgfr
fgfr | fgfr
fgfr | fgfr
vegfr | flt
vegfr | flt
psma | folh
gpc | gpc
il- ra | il ra
il- rα | il ra
il- rα | il ra
il- rα | il r
gp | il st
vegfr | kdr
muc | mcam
met | met
mmp | mmp
thomsen-friedenreich carbohydrate antigen | muc
nrp- | nrp
pdgfrβ | pdgfrb
cd | prom
ptprj | ptprj
hspg | sdc
e-selectin | sele
tie | tek
vpac | vipr

table . single proteins that can be targeted by ligand-mimicking peptides but are not included in the two principal reviews that we consulted and are among cell surface receptors . since the evidence that these genes have ligand-mimicking peptides is scattered in the literature, we include at least one pubmed id of a paper describing a suitable peptide.
protein | gene symbol | at least one pubmed id
actriib | acvr b |
cd | cd |
cxcr | cxcr |
ephrin a | epha |
ephrin b | ephb |
ephrin b | ephb |
ephrin b | ephb |
gonadotrophin releasing hormone receptor | gnrhr |
g protein coupled receptor | gpr |
bombesin receptor | grpr |
il receptor | il r |
low density lipoprotein receptor | ldlr |
leptin receptor | lepr |
lrp | lrp |
melanocortin receptor | mc r |
melanocortin receptor | mc r |
cd | mrc |
urokinase plasminogen activator receptor | plaur |
neurokinin- receptor | tacr |
vpac | vipr |

definition of the minimum hitting set problem and solution feasibility

one of karp’s original np-complete problems is called “hitting set” and is defined as follows. let $U$ be a finite universal set of elements. let $S_1, S_2, S_3, \ldots, S_k$ be subsets of $U$. is there a small subset $H \subseteq U$ such that for $i = 1, 2, 3, \ldots, k$, $S_i \cap H$ is non-empty? in our setting, $U$ is the set of target genes and the subsets $S_i$ are the single cells. in reference , numerous applications for hitting set and the closely related problems of subset cover and dominating set are described; in addition, practical algorithms for hitting set are compared on real and synthetic data. among the applications of hitting set and closely related np-complete problems in biology and biochemistry are stability analysis of metabolic networks, identification of critical paths in gene signaling and regulatory networks, and selection of a set of drugs to treat cell lines or single patients. more information about related work can be found in supplementary materials .
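as an illustration of the definition above, the following is a minimal brute-force sketch, not the ilp-based approach used in madhitter. the function names and the toy gene labels (g1 to g4) are hypothetical, and the exhaustive search is only practical for very small universes.

```python
from itertools import combinations


def is_hitting_set(candidate, subsets):
    """Return True if `candidate` shares at least one element with every subset."""
    chosen = set(candidate)
    return all(chosen & set(s) for s in subsets)


def minimum_hitting_set(universe, subsets):
    """Smallest hitting set by exhaustive search, or None if none exists.

    Exponential in len(universe); intended for toy instances only.
    """
    elements = sorted(universe)
    for size in range(len(elements) + 1):
        for candidate in combinations(elements, size):
            if is_hitting_set(candidate, subsets):
                return set(candidate)
    return None  # some subset cannot be hit at all (infeasible instance)


# Toy instance: U is a set of hypothetical target genes, each S_i is one cell.
U = {"g1", "g2", "g3", "g4"}
cells = [{"g1", "g2"}, {"g2", "g3"}, {"g4"}]
print(sorted(minimum_hitting_set(U, cells)))  # -> ['g2', 'g4']
```

a cell whose set is empty makes the instance infeasible, in which case the sketch returns None rather than a set.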
two different difficulties arising in problems such as hitting set are that 1) an instance may be infeasible, meaning that there does not exist a solution satisfying all constraints, and 2) an instance may be intractable, meaning that in the time available, one cannot either i) prove that the instance is infeasible or ii) find an optimal feasible solution. all instances of minimum hitting set that we considered were tractable on the nih biowulf system. many instances were provably infeasible; in almost all cases, we did not plot the infeasible parameter combinations. however, in figure , the instance for the melanoma data set with the more stringent parameters was infeasible because of only one patient sample, so we omitted that patient for both parameter settings in figure .

basic optimal target set formulation

given a collection $S = \{S_1, S_2, S_3, \ldots\}$ of subsets of a set $U$, the hitting set problem is to find the smallest subset $H \subseteq U$ that intersects every set in $S$. the hitting set problem is equivalent to the set cover problem and hence is np-complete. the following ilp formulates this target set problem:

$$\min \sum_{g \in U} x(g)$$
$$\sum_{g \in S_i} x(g) \ge 1 \quad \forall S_i \in S \tag{1}$$

in this formulation, there is a binary variable $x(g)$ for each element $g \in U$ that denotes whether the element $g$ is selected or not. constraint (1) makes sure that from each set $S_i$ in $S$, at least one element is selected. for any data set of tumor cells, we begin with the model that we specify a set of genes that can be targeted, and that is $U$. each cell is represented by the subset of genes in $U$ whose expression is greater than zero.
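the representation just described can be sketched as follows. this is an illustrative sketch under stated assumptions, not code from madhitter: the function names, the dictionary layout of the expression data, and the expression values are all hypothetical; only the rule "a cell is the set of target genes with expression greater than zero" and the hit constraint of the ilp come from the text.

```python
def cells_as_gene_sets(expression, target_genes):
    """Map each cell to the set of target genes with expression greater than zero.

    `expression` is {cell_id: {gene: value}}; genes missing from a cell count as zero.
    """
    return {
        cell: {g for g in target_genes if values.get(g, 0.0) > 0.0}
        for cell, values in expression.items()
    }


def satisfies_hit_constraints(x, cell_sets):
    """Check that the sum of x(g) over the genes of each cell is at least 1."""
    return all(sum(x.get(g, 0) for g in genes) >= 1 for genes in cell_sets.values())


def objective(x):
    """Number of selected genes, i.e. the quantity the ILP minimizes."""
    return sum(x.values())


# Made-up expression values for three tumor cells and three candidate targets.
expression = {
    "cell_a": {"EGFR": 3.2, "MET": 0.0},
    "cell_b": {"MET": 1.1, "KDR": 0.4},
    "cell_c": {"EGFR": 0.7, "KDR": 2.0},
}
cell_sets = cells_as_gene_sets(expression, {"EGFR", "MET", "KDR"})

x = {"EGFR": 1, "MET": 0, "KDR": 1}  # a 0/1 assignment selecting two genes
print(satisfies_hit_constraints(x, cell_sets), objective(x))  # -> True 2
```

dropping KDR from the assignment leaves cell_b without a selected gene, so the hit constraints fail for that smaller selection.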
in biological terms, a cell is killed (hit) if it expresses, at any level, one of the genes selected as a target (i.e., in the optimal target set) in the treatment. in this initial formulation, all tumor cells are combined as if they come from one patient because we model that the treatment goal is to kill (hit) all tumor cells (all subsets). in a later subsection, we consider a fair version of this problem, taking into account that each patient is part of a cohort. before that, we model the oncologist’s intuition that we want to target genes that are overexpressed in the tumor.

combining data on tumor cells and non-tumor cells

to make the hitting set formulation more realistic, we would likely model that a cell (set) is killed (hit) only if one of its targets is overexpressed compared to typical expression in non-cancer cells. such modeling can be applied in the nine single-cell data sets that have data on non-cancer cells to reflect the principle that we would like the treatment to kill the tumor cells and spare the non-tumor cells. let $NT$ be the set of non-tumor cells. for each gene $g$, define its average expression $E(g)$ as the arithmetic mean among all the non-zero values of the expression level of $g$ over the cells in $NT$. the zeroes are ignored because many of these likely represent dropouts in the expression measurement. following the design of experiments in the lab of n. a., we define an expression ratio threshold factor $r$ whose baseline value is . . we adjust the formulation of the previous subsection, so that the set representing a cell (in the tumor cell set) contains only those genes $g$ such that the expression of $g$ is greater than $r \times E(g)$ instead of greater than zero. we keep the
objective function limited to the tumor cells, but we also store a set to represent each non-tumor cell, and we tabulate which non-tumor cells (sets) would be killed (hit) because, for at least one of the genes in the optimal target set, the expression of that gene in that non-tumor cell exceeds the threshold $r \times E(g)$. we add parameters $lb$ and $ub$, each in the range $[0, 1]$, representing respectively a lower bound on the proportion of tumor cells killed and an upper bound on the proportion of non-tumor cells killed. the parameters $lb$ and $ub$ are used only in two constraints, and we do not favor optimal solutions that kill more tumor cells or fewer non-tumor cells, so long as the solutions obey the constraints.

the fair cohort target set problem for a multi-patient cohort

we want to formulate an integer linear program that selects a set of genes $S^*$ from the available genes in such a way that, for each patient, there exists an individual target set $H_i^{S^*} \subseteq S^*$ of relatively small size (compared to the optimal individual target set (its) of that patient alone, which is denoted by $H(i)$). let $U = \{g_1, g_2, \ldots, g_{|U|}\}$ be the set of genes. there are $n$ patients. for the $i$-th patient, we denote by $S_{P(i)}$ the set of tumor cells related to patient $i$. we describe each tumor cell $C \in S_{P(i)}$ as the set of genes that are known to be targetable in cell $C$. that is, $g \in C$ if and only if a drug containing $g$ can target the cell $C$. in the ilp, there is a variable $x(g)$ corresponding to each gene $g \in U$ that shows whether the gene $g$ is selected or not. there is a variable $x(g, P(i))$ which shows whether a gene $g$ is selected in the target set of patient $P(i)$. the objective function is to minimize the total number of genes selected, subject to having a target set of size at most $H(i) + \alpha$ for patient $P(i)$, where $1 \le i \le n$. constraint (4) ensures that, for patient $P(i)$, we do not select any gene $g$ that is not selected in the global set.
constraint (5) ensures that all the sets corresponding to tumor cells of patient $P(i)$ are hit.

$$\min \sum_{g \in U} x(g) \tag{2}$$
$$\sum_{g \in U} x(g, P(i)) \le H(i) + \alpha \quad \forall i \tag{3}$$
$$x(g, P(i)) \le x(g) \quad \forall i, \forall g \in U \tag{4}$$
$$\sum_{g \in C} x(g, P(i)) \ge 1 \quad \forall i, \forall C \in S_{P(i)} \tag{5}$$

parameterization of the fair cohort target set problem

in the fair cohort target set ilp shown above, we give preference to minimizing the number of genes needed in the cohort target set (cts). however, we do not take into account the number of non-tumor cells killed. killing (covering) too many non-tumor cells potentially hurts patients. to avoid that, we add an additional constraint to both the ilp for the local instances and the ilp for the global instance. intuitively, for patient $P(i)$, given an upper bound $UB$ on the proportion of non-tumor cells killed, we want to find the smallest cohort target set $H(i)$ with the following properties:
1. $H(i)$ covers all the tumor cells of patient $P(i)$.
2. $H(i)$ covers at most $UB \cdot |NT_{P(i)}|$ non-tumor cells, where $NT_{P(i)}$ is the set of non-tumor cells known for patient $P(i)$; the number of non-tumor cells killed is represented by the variable $y$.
the ilp can be formulated as follows:

$$\min \sum_{g \in U} x(g) \tag{6}$$
$$\sum_{g \in C} x(g) \ge 1 \quad \forall C \in S_{P(i)} \tag{7}$$
$$y(C) \ge \max_{g \in C} x(g) \quad \forall C \in NT_{P(i)} \tag{8}$$
$$\sum_{C \in NT_{P(i)}} y(C) \le UB \cdot |NT_{P(i)}| \tag{9}$$

with this formulation, the existence of a feasible solution is not guaranteed. however, covering all tumor cells might not always be necessary either. this statement can be justified because 1) measured data are not always accurate, and some tumor cells could be missing, and 2) in some cases, it might be possible to handle uncovered tumor cells using different methods.
hence, we add another parameter $LB$ to let us model this scenario. at a high level, this is the proportion of the tumor cells we want to cover. the ilp can be formulated as follows:

$$\min \sum_{g \in U} x(g) \tag{10}$$
$$\sum_{C \in S_{P(i)}} y(C) \ge LB \cdot |S_{P(i)}| \tag{11}$$
$$y(C) \ge \max_{g \in C} x(g) \quad \forall C \in S_{P(i)} \cup NT_{P(i)} \tag{12}$$
$$\sum_{C \in NT_{P(i)}} y(C) \le UB \cdot |NT_{P(i)}| \tag{13}$$

notice that the constraint (11) here is different from the one above as we only care about the total number of tumor cells covered. even with both $UB$ and $LB$, the feasibility of the ilp is still not guaranteed. however, modeling the ilp in this way allows us to parameterize the ilp for various other scenarios of interest. while the two ilps above are designed for one patient, one can extend these ilps to a multi-patient cohort:

$$\min \sum_{g \in U} x(g) \tag{14}$$
$$\sum_{g \in U} x(g, P(i)) \le H(i) + \alpha \quad \forall i \tag{15}$$
$$x(g, P(i)) \le x(g) \quad \forall i, \forall g \in U \tag{16}$$
$$y(C, P(i)) \ge \max_{g \in C} x(g, P(i)) \quad \forall i, \forall C \in S_{P(i)} \cup NT_{P(i)} \tag{17}$$
$$\sum_{C \in S_{P(i)}} y(C, P(i)) \ge LB \cdot |S_{P(i)}| \quad \forall i \tag{18}$$
$$\sum_{C \in NT_{P(i)}} y(C, P(i)) \le UB \cdot |NT_{P(i)}| \quad \forall i \tag{19}$$

implementation note, accounting for multiple optima and software availability

we implemented the above fair cohort target set formulations in python, with the expression ratio $r$ as an option when non-tumor cells are available. the parameters $\alpha$, $lb$, $ub$ can be set by the user on the command line. to solve the ilps to optimality we usually used the scip library and its python interface. to obtain multiple optimal solutions of equal size we used the gurobi library (https://www.gurobi.com) and its python interface.
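to make the roles of the expression ratio and the two bounds concrete, here is a small single-patient sketch. it is an illustration under stated assumptions and not the madhitter implementation: it replaces the ilp solvers with exhaustive search, every function name is hypothetical, the expression values are made up, and r = 2.0 is an arbitrary choice for the example. treating $E(g)$ as zero when a gene has no non-zero value in the non-tumor cells is also an assumption of this sketch.

```python
from itertools import combinations


def mean_nonzero_expression(non_tumor, gene):
    """E(g): arithmetic mean of the non-zero values of `gene` over non-tumor cells."""
    nonzero = [c.get(gene, 0.0) for c in non_tumor.values() if c.get(gene, 0.0) > 0.0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0  # assumption: 0 if never seen


def thresholded_sets(cells, baseline, r):
    """Represent each cell by the genes whose expression exceeds r * E(g)."""
    return {
        cell: {g for g, v in values.items() if v > r * baseline.get(g, 0.0)}
        for cell, values in cells.items()
    }


def fraction_hit(target_set, cell_sets):
    """Proportion of cells containing at least one selected gene."""
    if not cell_sets:
        return 0.0
    return sum(1 for genes in cell_sets.values() if genes & target_set) / len(cell_sets)


def smallest_target_set(genes, tumor_sets, non_tumor_sets, lb, ub):
    """Smallest gene set hitting >= lb of tumor cells and <= ub of non-tumor cells."""
    ordered = sorted(genes)
    for size in range(len(ordered) + 1):
        for candidate in combinations(ordered, size):
            chosen = set(candidate)
            if (fraction_hit(chosen, tumor_sets) >= lb
                    and fraction_hit(chosen, non_tumor_sets) <= ub):
                return chosen
    return None  # infeasible for these parameter values


# Made-up expression values; r = 2.0 is an arbitrary illustrative ratio.
genes = ["EGFR", "MET"]
non_tumor = {"n1": {"EGFR": 2.0, "MET": 0.0}, "n2": {"EGFR": 4.0, "MET": 1.0}}
tumor = {"t1": {"EGFR": 9.0, "MET": 0.5}, "t2": {"EGFR": 1.0, "MET": 3.0}}

baseline = {g: mean_nonzero_expression(non_tumor, g) for g in genes}
tumor_sets = thresholded_sets(tumor, baseline, r=2.0)
non_tumor_sets = thresholded_sets(non_tumor, baseline, r=2.0)
print(sorted(smallest_target_set(genes, tumor_sets, non_tumor_sets, lb=1.0, ub=0.0)))
```

with these toy values, relaxing lb from 1.0 to 0.5 lets a single gene suffice, which mirrors how the lower bound trades coverage for a smaller target set.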
when evaluating multiple optima, for all feasible instances, we sampled optimal solutions that may or may not be distinct, using the gurobi function select_solution(). to determine how often each gene or pair of genes occurs in optimal solutions, we computed the arithmetic mean of gene frequencies and gene pair frequencies over all sampled optimal solutions. the software package is called madhitter. the main program is called hitting_set.py. we include in madhitter a separate program to sample cells and generate replicates, called sample_columns.py. so long as one seeks only a single optimal solution for each instance, either one of scip and gurobi is sufficient to use madhitter. we verified that scip and gurobi give optimal solutions of the same size. if one wants to sample multiple optima, this can be done only with the gurobi library. the choice between scip and gurobi and the number of optima to sample are controlled by the command-line parameters use_gurobi and num_sol, respectively. the madhitter software is available on github at https://github.com/ruppinlab/madhitter

acknowledgements

this research is supported in part by the intramural research program of the national institutes of health, national cancer institute. this research is supported in part by the university of maryland year of data science program. this research is supported in part by start-up funds from northwestern university and a research award from amazon to support the research of s.k. this work utilized the computational resources of the nih hpc biowulf cluster (http://hpc.nih.gov). thanks to e.
michael gertz for technical assistance with scip, gurobi, and biowulf. thanks to allon wagner, keren yizhak and sushant patkar for assistance in identifying and retrieving suitable single-cell rnaseq data sets. thanks to leandro hermida for technical advice.

competing interests

the authors declare that they have no competing interests.

references

. von hoff d.d., et al. pilot study using molecular profiling of patients’ tumors to find potential targets and select treatments for their refractory cancers. j. clin. oncol. ( ), - ( ). . schütte m., et al. cancer precision medicine: why more is more and dna is not enough. pub. health genomics ( ): - ( ). . jameson, g.s., et al. a pilot study utilizing multi-omic molecular profiling to find potential targets and select individualized treatments for patients with previously treated metastatic breast cancer. breast cancer res. treat. ( ), - ( ). . saulnier sholler, g.l., et al. feasibility of implementing molecular-guided therapy for the treatment of patients with relapsed or refractory neuroblastoma. cancer med. ( ): - ( ). . byron, s.a., et al. prospective feasibility trial for genomics-informed treatment in recurrent and progressive glioblastoma. clin. cancer res. ( ), - ( ). . schwaederle, m., et al. association of biomarker-based treatment strategies with response rates and progression-free survival in refractory malignant neoplasms: a meta-analysis. jama oncology ( ), - ( ). . arnedos, m., vielh, p., soria, j.c. & andre, f.
the genetic complexity of common cancers and the promise of personalized medicine: is there any hope? j. pathol. ( ): - ( ). . nikanjam, m., liu, s., yang, j. & kurzrock, r. dosing three-drug combinations that include targeted anti-cancer agents: analysis of , patients. oncologist ( ), - ( ). . rebollo, j. et al. gene expression profiling of tumors from heavily pretreated patients with metastatic cancer for the selection of therapy: a pilot study. am. j. clin. oncol. ( ), - ( ). . sureda, m., et al. determining personalized treatment by gene expression profiling in metastatic breast carcinoma patients: a pilot study. clin. trans. oncol. ( ), - ( ). . sicklick, j.k., et al. molecular profiling of cancer patients enables personalized combination therapy: the i-predict study. nat. med. ( ), - ( ). . joo, j.i., et al. realizing cancer precision medicine by integrating systems biology and nanomaterial engineering. adv. mater. ( ):e ( ). . marusyk, a. & polyak, k. tumor heterogeneity: causes and consequences. biochimica et biophysica acta ( ), - ( ). . mcgranahan, n. & swanton, c. biological and therapeutic impact of intratumor heterogeneity in cancer evolution. cancer cell ( ): - ( ). . yofe, i., dahan, r. & amit i. single-cell genomic approaches for developing the next generation of immunotherapies. nat. med. ( ), - ( ). . neelapu, s.s., et al. axicabatagene ciloleucel car t-cell therapy in refractory large b-cell lymphoma. new engl. j. med. ( ): - , . . bjorn, m.j., ring, d. & frankel, a. evaluation of monoclonal antibodies for the development of breast cancer immunotoxins. cancer res. ( ), - ( ). . pastan, i., willingham, m.c. & fitzgerald d.j.p. immunotoxins. cell ( ), - ( ). . gray, b.p. & brown, k.c. combinatorial peptide libraries: mining for cell-binding peptides. chem. rev. ( ), - ( ). . liu, r., li, x., xiao, w. & lam, k.s. tumor-targeting peptides from combinatorial libraries. adv. drug delivery rev. - , - ( ). . fisher, s.l. & phillips, a.j. 
targeted protein degradation and the enzymology of degraders. curr. opin. chem. biol. , - ( ). . plückthun a. designed ankyrin repeat proteins (darpins): binding proteins for research, diagnostics, and therapy. annu. rev. pharmacol. toxicol. , - ( ). . sokolova, e.a., et al. her -specific targeted toxin darpin-lope: immunogenicity and antitumor effect on intraperitoneal ovarian cancer xenograft model. int. j. mol. sci. ( ). pii: e ( ). . dannenfelser, r., et al., discriminatory power of combinatorial antigen recognition in cancer t cell therapies. cell syst. ( ): - ( ). . mackay, m., et al. the therapeutic landscape for cells engineered with chimeric antigen receptors. nat. biotech. ( ), - ( ). . maude, s.l., et al. tisagenlecleusel in children and young adults with b-cell lymphoblastic leukemia. new engl. j. med. ( ): - ( ). . lamers, c.h., et al. treatment of metastatic renal cell carcinoma with caix car- engineered t cells: clinical evaluation and management of on-target toxicity. mol. ther. ( ): - ( ). . thistlethwaite, f.c., et al.
the clinical efficacy of first-generation carcinoembryonic antigen (ceacam )-specific car t cells is limited by poor persistence and transient pre- conditioning-dependent respiratory toxicity. cancer immunol. immunother. ( ): - ( ). . fedorov, v.d., themeli, m., sadelain, m. pd- - and ctla- -based inhibitory chimeric antigen receptors (icars) divert off-target immunotherapy responses. sci. transl. med. ( ): ra ( ). . grada, z., et al. tancar: a novel bispecific chimeric antigen receptor for cancer immunotherapy. mol. ther. nucl. acids :e ( ). . hegde, m., et al. combinational targeting offsets antigen escape and enhances effector function of adoptively transferred t cells in glioblastoma. mol. ther. ( ): - ( ). . roybal, k.t., et al. engineering t cells with customized therapeutic response using synthetic notch receptors. cell ( ): - .e ( ). . williams, j.z., et al. precise t cell recognition programs designed by transcriptionally linking multiple receptors. science ( ): - ( ). . Říhová, b. receptor-mediated targeted drug or toxin delivery. adv. drug deliv. rev. ( ), - ( ). . tortorella, s. & karagiannis, t.c. transferrin receptor-mediated endocytosis: a useful target for cancer therapy. j. membr. biol. ( ), - ( ). . karp, r.m. reducibility among combinatorial problems. in complexity of computer computations, pp. - (plenum press, new york, ). . martinez-veracoechea, f.j. & frenkel, d. designing super selectivity in multivalent nano- particle binding. proc. natl acad. sci. usa , - ( ). . delaney, c., et al. combinatorial prediction of marker panels from single-cell transcriptomic data. mol. syst. biol. ( ), e ( ). . müller, s., et al. a role for receptor tyrosine phosphatase zeta in glioma cell migration. oncogene ( ), - ( ). . ulbricht, u. et al. expression and function of the receptor protein tyrosine phosphatase zeta and its ligand pleiotrophin in human astrocytomas. j. neuropathol. exp. neurol. ( ), - ( ).
. chen, h.m., et al. enhanced expression and phosphorylation of the met oncoprotein by glioma-specific ptprz -met fusions. febs lett. ( ): - ( ). . ulbricht, u., eckerich, c., fillbrandt, r., westphal, m. & lamszus, k. rna interference targeting protein tyrosine phosphatase zeta/receptor-type protein tyrosine phosphatase beta suppresses glioblastoma growth in vitro and in vivo. j. neurochem. , – ( ). . bourgonje. a.m., et al. intracellular and extracellular domains of protein tyrosine phosphatase ptprz-b differentially regulate glioma cell growth and motility. oncotarget ( ), - ( ). . fujikawa, a., et al. targeting ptprz inhibits stem cell-like properties and tumorigenicity in glioblastoma cells. sci. rep. : ( ). . pastor, m., et al. development of inhibitors of receptor protein tyrosine phosphatase β/ζ (ptprz ) as candidates for cns disorders. eur. j. med. chem. : - ( ). . darmanis, s., et al. single-cell rna-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. cell rep. , - ( ). . bergelson, j.m., et al. isolation of a common receptor for coxsackie b viruses and adenoviruses and . science ( ): - ( ). . nilchian, a., et al. cxadr-mediated formation of an akt inhibitory signalosome at tight junctions controls epithelial-mesenchymal plasticity in breast cancer. cancer res. ( ): - ( ). . arfelt k.n., et al. signaling via g proteins mediates tumorigenic effects of gpr . cell. signal. : - ( ). . zhang, y., qian, y., lu, w., chen, x. the g protein-coupled receptor is necessary for p -dependent cell survival in response to genotoxic stress. cancer res. ( ): - ( ). . wang, l., et al.
overexpression of g protein-coupled receptor gpr promotes pancreatic cancer aggressiveness and activates nf-κb signaling pathway. mol. cancer : ( ). . szklarczyk, d., et al. string v : protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. nucleic acids res. ; (d ), d -d ( ). . seoane, j. & de mattos-arruda l. the challenge of intratumour heterogeneity in precision medicine. j. intern. med. ( ), - ( ). . van dijk d, et al. recovering gene interactions from single-cell data using data diffusion. cell ( ): - .e ( ). . li, w.v. & li, j.j. an accurate and robust imputation method scimpute for single-cell rna- seq data. nat. comm. , ( ). . raj, a. & van oudenaarden, a. stochastic gene expression and its consequences. cell ( ): - ( ). . kim, j.y. & marioni, j.c. inferring the kinetics of stochastic gene expression from single- cell rna-sequencing data. genome biol. , r ( ). . clough, e. & barrett, t. the gene expression omnibus database. meth. mol. biol. , - ( ). . kolesnikov, n., et al. arrayexpress update--simplifying data submissions. nucleic acids res. (database issue), d -d ( ). . cao, y., zhu, j., jia, p. & zhao z. scrnaseqdb: a database for rna-seq based gene expression profiles in human single cells. genes ( ), ( ). . yuan j, et al. single-cell transcriptome analysis of lineage diversity in high-grade glioma. genome med. ( ), ( ). . karaayvaz, m., et al. unravelling subclonal heterogeneity and aggressive disease states in tnbc through single-cell rna-seq. nat. comm. ( ), ( ). . jerby-arnon, l., et al.
a cancer cell program promotes t cell exclusion and resistance to checkpoint blockade. cell ( ): - .e ( ). . tirosh, i., et al. dissecting the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. science ( ), - ( ). . chung, w., et al. single-cell rna-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. nat. comm. , ( ). . venteicher, a.s., et al. decoupling genetics, lineages, and microenvironment in idh-mutant gliomas by single-cell rna-seq. science ( ): pii:eaai ( ). . tirosh, i., et al. single-cell rna-seq supports a developmental hierarchy in human oligodendroglioma. nature ( ), - ( a). . patel, a.p., et al. single-cell rna-seq highlights intratumoral heterogeneity in primary glioblastoma. science ( ), - ( ). . filbin, m.g., et al. developmental and oncogenic programs in h k m gliomas dissected by single-cell rna-seq. science ( ), - ( ). . li, h., et al. reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. nat. genet. ( ), - ( ). . puram, s.v., et al. single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer. cell ( ), - .e ( ). . shih, a.j., et al. identification of grade and origin specific cell populations in serous epithelial ovarian cancer by single cell rna-seq. plos one ( ), e ( ). . miyamoto, d.t., et al. rna-seq of single prostate ctcs implicates noncanonical wnt signaling in antiandrogen resistance. science , - ( ). . lambrechts, d., et al. phenotype molding of stromal cells in the lung tumor microenvironment. nat. med. , - ( ). . uhlén, m., et al. tissue-based map of the human proteome. science ( ), ( ). . the gtex consortium. the genotype-tissue expression (gtex) project. nat. genet. ( ), - ( ). . bausch-fluck, d., et al. the in silico human surfaceome. proc. natl. acad. sci usa , e -e ( ). . gainer-dewar, a. & vera-lincona, p. 
the minimal hitting set generation problem: algorithms and computation. siam j. discr. math. , - ( ). . haedlicke, o. & klamt, s. computing complex metabolic intervention strategies using constrained minimal cut sets. metabolic eng. , - ( ). . haus, u.-u., klamt, s. & stephen, t. computing knock-out strategies in metabolic networks. j. comput. biol. ( ), - ( ). . jarrah, a.s., laubenbacher, r., stigler, b. & stillman, m. reverse-engineering of polynomial dynamical systems. adv. appl. math. , - ( ). . klamt, s. & gilles, e.d. minimal cut sets in biochemical reaction networks. bioinformatics , - ( ). . trinh, c.t., wlaschin, a. & srienc f. elementary mode analysis: a useful metabolic pathway analysis tool for characterizing cellular metabolism. appl. microbiol. biotech. , - ( ). . ideker, t. discovery of regulatory interactions through perturbation: inference and experimental design. pac. symp. biocomput. , - ( ). . wang, r.s. & albert, r. elementary signaling modes predict the essentiality of signal transduction network components. bmc syst. biol. , ( ). . zvedei-oancea, i. & schuster, s. a theoretical framework for detecting signal transfer routes in signaling networks. comput. chem. engineer. , - ( ). . vazquez a. optimal drug combinations and minimal hitting sets. bmc syst. biol., , ( ). . mellor, d., prieto, e., mathieson, l. & moscato, p. a kernelisation approach for multiple d- hitting set and its application in optimal multi-drug therapeutic combinations. plos one ( ), e ( ). . vera-licona, p., bonnet, e., brillot, e & zinovyev, a. ocsana: optimal combinations of interventions from network analysis.
ribovore: ribosomal rna sequence analysis for genbank submissions and database curation

alejandro a. schäffer, richard mcveigh, barbara robbertse, conrad l. schoch, anjanette johnston, beverly a. underwood, ilene karsch-mizrachi and eric p. nawrocki*

biorxiv preprint; this version posted february , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. this article is a us government work. it is not subject to copyright under usc and is also made available for use under a cc license.

abstract

background: the dna sequences encoding ribosomal rna genes (rrnas) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. the s small subunit ribosomal rna (ssu rrna) gene is typically used to identify bacterial and archaeal species. the nuclear s ssu rrna gene, and s large subunit (lsu) rrna gene have been used as dna barcodes and for phylogenetic studies in different eukaryote taxonomic groups. because of their popularity, the national center for biotechnology information (ncbi) receives a disproportionate number of rrna sequence submissions and blast queries. these sequences vary in quality, length, origin (nuclear, mitochondria, plastid), and organism source and can represent any region of the ribosomal cistron.

results: to improve the timely verification of quality, origin and loci boundaries, we developed ribovore, a software package for sequence analysis of rrna sequences.
the ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal ssu rrna. the ribodbmaker program is used to create high-quality datasets of rrnas from different taxonomic groups. key algorithmic steps include comparing candidate sequences against rrna sequence profile hidden markov models (hmms) and covariance models of rrna sequence and secondary-structure conservation, as well as other tests. at least nine freely available blastn rrna databases created and maintained with ribovore are used either for checking incoming genbank submissions or by the blastn browser interface at ncbi or both. since , ribovore has been used to analyze more than million prokaryotic ssu rrna sequences submitted to genbank, and to select at least , fungal rrna refseq records from type material of , taxa.

conclusion: ribovore combines single-sequence and profile-based methods to improve genbank processing and analysis of rrna sequences. it is a standalone, portable, and extensible software package for the alignment, classification and validation of rrna sequences. researchers planning on submitting ssu rrna sequences to genbank are encouraged to download and use ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into genbank.

keywords: ribosomal rna; annotation; alignment; ncrna

* correspondence: nawrocke@ncbi.nlm.nih.gov
national center for biotechnology information, national library of medicine, national institutes of health, bethesda, md, usa
full list of author information is available at the end of the article

background

in , carl woese and george fox proposed the archaebacteria (later renamed archaea) as a third domain of life distinct from bacteria and eukaryota based on analysis of small subunit ribosomal rna (ssu rrna) oligonucleotide fragments from microbes [ ].
the use of ssu rrna to elucidate phylogenetic relationships continued and dramatically expanded in the late s when norm pace and colleagues developed a technique to pcr amplify potentially unculturable microbes from environmental samples by targeting so-called universal primer sites [ ]. the technique was later refined by pace and others including ward, weller [ ] and giovannoni and colleagues [ ]. environmental studies targeting ssu rrna as a phylogenetic marker gene that seek to characterize the diversity of life in a given environment have remained common ever since, and consequently there are now millions of prokaryotic ssu rrna sequences in public databases. when rrna sequences are submitted to public databases, such as genbank, it is important to do quality control, so that subsequent data analyses are not misled by errors in sequencing and sequence annotation. because rrna gene sequences do not code for proteins, but have been studied so extensively, specialized checks for correctness and completeness are feasible and desirable. the focus of this paper is the description of ribovore, a software package for validating incoming rrna sequence submissions to genbank and for curating rrna sequence collections.

ssu rrna was initially chosen by woese and fox for inferring a universal phylogenetic tree of life because it existed in all cellular life, was large enough to provide enough data (about nucleotides (nt) in bacteria), and had evolved slowly enough to be comparable across disparate groups [ ].
the first environmental surveys targeted ssu rrna, but studies targeting lsu rrna, which is roughly twice as long as ssu rrna, followed soon after [ , ]. these types of analyses eventually began to target eukaryotes, especially fungi. in eukaryotes, the . s rrna gene is surrounded by two internal transcribed spacers (its and its ). this region is sometimes collectively referred to as the its region and it has been selected as the primary fungal barcode since it has the highest probability of successful identification for the broadest range of fungi [ ]. however, the lsu rrna gene [ ] is a popular phylogenetic marker in certain fungal groups [ ]. in general, the nuclear ssu rrna has poor species-level resolution in most fungi and other eukaryote taxonomic groups [ , ], but remains useful at species level in some rapidly evolving groups such as the diatoms [ ]. species identification in protists takes a two-step barcoding approach, which uses the ∼ bp variable v region of the ssu rrna gene as a variable marker and then uses a group-specific barcode for species-level assignments, some of which include the lsu rrna gene and its region [ ].

specialized analysis tools and databases have been developed to help researchers analyze their rrna sequences. many of these specialized tools are based on comparing sequences to either profile hidden markov models (profile hmms) or covariance models (cms). cms are profile stochastic context-free grammars, akin to profile hmms of sequence conservation [ , ], with additional complexity to model the conserved secondary structure of an rna family [ , , ]. like profile hmms, cms are probabilistic models with position-specific scores, determined based on the frequencies of nucleotides at each position of the input training alignment used to build the model. unlike hmms, cms also model well-nested secondary structure, provided as a single, fixed consensus secondary structure for each model and annotated in the input training alignment.
a cm includes scores for each of the possible ( x ) basepairs for basepaired positions and both paired positions are considered together by scoring algorithms. the incorporation of secondary structure has been shown to significantly improve remote homology detection of structural rnas [ ], and for ssu rrna considering structure has been shown to offer a small improvement to alignment accuracy versus profile hmms [ , ]. for eukaryotes, where ssu and lsu rrna sequences are often more divergent at the sequence level than for bacteria and archaea, harnessing structural information during alignment may be more impactful.

the specialized tools for rrna sequence analysis include: databases, some of which are integrated with software, rrna prediction software, and multiple alignment software. the integrated and highly curated databases include the arb workbench software package for rrna database curation [ ], the comparative rna website (crw) [ ], the ribosomal database project (rdp) [ , ], and the greengenes [ ], and silva [ ] databases. these databases differ in their scope and methodology. crw contains tens of thousands of sequences and corresponding alignments of ssu, lsu and s rrna from all three domains as well as from organelles, along with secondary structure predictions for selected sequences. greengenes, which is seemingly no longer maintained as its last update was in , includes ssu rrna sequences for bacteria and archaea, but not for eukarya, nor does it contain any lsu rrna sequences.
rdp also includes ssu rrna for bacteria and archaea, as well as fungal lsu rrna and its sequences, but no other lsu sequences. silva, which split off from the arb project starting in [ ], includes bacterial, archaeal and eukaryotic (fungal and non-fungal) ssu and lsu rrna sequences. rdp includes more than million ssu rrna and , fungal lsu rrna sequences as of its latest release ( . ), and silva includes more than million ssu rrna and million lsu rrna sequences (release . ).

available rrna prediction software packages include rnammer [ ], rrnaselector [ ], and barrnap (https://github.com/tseemann/barrnap), all of which use some version of the profile hmm software hmmer [ ] to predict the locations of rrnas in contigs or whole genomes.

both rdp and silva make available multiple alignments of all sequences for each gene and taxonomic domain, and all include several sequence analysis tools for tasks such as classification. the alignment methodology differs: silva uses sina, which implements a graph-based alignment algorithm that computes a sequence-only based alignment of an input sequence to one or more similar sequences selected from a fixed reference alignment [ ]. rdp uses infernal [ ], which computes alignments using cms. per-domain cms for ssu and lsu rrna are freely available in the rfam database, a collection of more than rna families each represented by a consensus secondary structure annotated reference alignment called a seed alignment and corresponding cm built from that alignment [ ]. rfam includes five full length ssu and four full length lsu rrna families and cms. although rdp uses cms for rrna alignment, the cms are not from rfam. users can download and use rfam cms to annotate their own sequences using infernal, thus offering a distinct strategy from silva or rdp for rrna analysis.

the rfam database includes a model (rf ) for ssu rrna from microsporidia, a phylum of particular interest within the kingdom of fungi.
more than years ago, woese and colleagues discovered that microsporidia have a distinctive ribosome that is smaller and more primitive than the ribosomes of most if not all other eukaryotes [ ]. recently, barandun and colleagues presented the first crystal structure of the ribosome of microsporidia, confirming that both the ssu and lsu rrna are smaller than in other fungi [ ]. most of the sequence analysis and curation to date in microsporidia has focused on ssu rather than lsu rrna.

genbank processing of rrna sequences

data submitted to genbank are subject to review by ncbi staff to prevent incorrect data from entering ncbi databases. over the past three decades, personnel called genbank indexers have spent a large proportion of their time validating incoming submissions of thousands to millions of rrna sequences due to the large number of rrna sequences generated in phylogenetic and environmental studies. similarity searches with blastn have been used to compare submitted rrna sequences against one of several databases of trusted, high-quality rrna sequences depending on the taxonomic domain and gene. the blastn query results were a primary source of evidence used to determine if rrna sequences would be accepted to genbank or not. prior to the ribovore project, suitable blastn databases did not exist for validating submissions of eukaryotic ssu rrna or lsu rrna sequences, making checking for those genes especially difficult and time-consuming.
starting in , a system with predefined criteria for per-sequence blastn results was deployed at ncbi; submissions in which all sequences met those criteria have been automatically accepted into genbank without any indexer review. ribovore-based tests began being used in conjunction with or instead of blastn-based tests for some submissions in this system in june . although the engine inside the pre-validation system, blast, is freely available and portable, the system as a whole was internal to genbank and not portable, preventing researchers who wish to submit sequence data to genbank (henceforth, called “submitters”) from replicating the tests on their local computers.

for rrna sequences as well as other sequences of high biological interest, genbank indexers and other ncbi personnel want to carry out two related and recurrent processes: quick identification of which submitted sequences should be accepted into genbank, and the construction of non-redundant collections of trusted, full length sequences that have no or few errors. the second problem is the motivation behind the entire refseq project [ ]. towards addressing the first problem, the development of an alternative sequence validation system for rrna included four design goals offering potential improvements over the existing system. first, the system should be as deterministic and as reproducible as possible in deciding whether sequences are accepted or not, which we refer to as passing (accepted) or failing (not accepted), allowing submissions with zero failing sequences to be automatically added to genbank without the need for any manual genbank indexer intervention. some non-determinism over time is unavoidable because various inputs to the system, such as the ncbi taxonomy tree, change over time. second, the system should be available as a standalone tool that submitters can run on their sequences prior to submission, saving time for both the genbank indexers and submitters.
third, the system should be general enough to facilitate extension to additional taxonomic groups and rrna genes. fourth, the system should be capable of increasing the stringency of tests for quality and adding tests to avoid redundancy to enable producing collections of high quality non-redundant sequences for other applications, such as serving as blastn databases.

because none of the existing databases or specialized rrna tools listed above address all of these design goals, we implemented the freely available and portable ribovore software package for the analysis of ssu rrna and lsu rrna sequences from bacteria, archaea, and eukarya as well as mitochondria from some eukaryotic groups. ribovore includes several programs designed for related but distinct tasks, each of which has specific rules dictating whether a sequence passes or fails based on deterministic criteria described in detail in the implementation section and in the ribovore documentation. rrna_sensor is a simplified, standalone version of the previous blastn-based system that is more portable and faster for bacterial and archaeal ssu rrna than the previous system owing to a smaller blastn target database constructed by removing redundancy from the pre-existing blastn database. ribotyper is similar to rrna_sensor but compares each input sequence against a library of profile hmms and/or cms offering an alternative, and in some cases, more powerful approach than the single sequence-based blastn algorithm.
additionally, ribotyper can be used to validate the taxonomic domain each sequence belongs to because it compares a set of models from different taxonomic groups against each sequence. to take advantage of both single-sequence and profile-based approaches, and partly to ease the transition from the previous blastn-based system towards profile-based analysis, we implemented ribosensor that runs both rrna_sensor and ribotyper and then combines the results. up to this point, rrna_sensor and ribotyper are deliberately designed to accept both partial and complete sequences of moderate quality or better. to more selectively identify full-length rrna sequences that extend up to, but not beyond the gene boundaries, we implemented riboaligner which runs ribotyper as a first pass validation, and then creates multiple alignments and selects sequences that pass based on those alignments. finally, to make ribovore capable of generating datasets of trusted sequences from different taxonomic groups for wider use by the community, we developed ribodbmaker, which chooses a non-redundant set of high-quality, full-length sequences based on a series of tests. the pipeline of tests includes some specific to rrna, including analysis by ribotyper and riboaligner, some more general tests, such as counting ambiguous nucleotides and vector contamination screening, and some tests that require connection to the ncbi taxonomy database to validate the taxonomy assignment of sequences.

implementation

ribovore is written in perl and available at https://github.com/ncbi/ribovore. the ribovore installation procedure also installs the program rrna_sensor, which is described here as well. the rrna_sensor program includes a shell script and perl scripts and is available at https://github.com/aaschaffer/rrna_sensor. these packages use existing software as listed in table .
each of the four ribovore programs takes as input two command-line arguments: the path to an input sequence file in fasta format and the name of an output directory to create and store output files in. command-line options exist to change default parameters and behavior of the programs in various ways. the options as well as example usage can be found as part of the source distribution and on github in the form of markdown files in the ribovore documentation subdirectory (e.g. https://github.com/ncbi/ribovore/blob/master/documentation/ribotyper.md). central to each of the scripts is the concept of sequences passing or failing. if a sequence meets specific criteria, many of which are changeable with command-line options, then it will pass and otherwise it will fail, as discussed more below. an overview of the four ribovore programs and rrna_sensor is shown in figure .

table software packages and libraries used within ribovore v . . ∗: the esl-cluster executable from infernal v . . , which is absent in v . . , is also installed and used within ribovore.
software and website | used within | purpose in ribovore
sequip v . (github.com/nawrockie/sequip) | all ribovore programs | option handling, output file handling and other utilities
infernal v . . ∗ (github.com/eddyrivaslab/infernal) | all ribovore programs | build and use profile hmms and cms to classify, validate and align rrna sequences
blast+ v . . (ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ . . ) | ribodbmaker | build blast databases and validate rrna sequences
vecscreen_plus_taxonomy v . (github.com/aaschaffer/vecscreen_plus_taxonomy) | ribodbmaker | screen for vector contamination
gnu time (not required) | all ribovore programs | determine running time if -p option is used

table command-line arguments for rrna_sensor
argument index | argument name | description
1 | min length | lower bound on sequence length
2 | max length | upper bound on sequence length
3 | seq file | input sequence file in fasta format
4 | output file name | name for summary output file
5 | min id percentage | lower bound on percent identity
6 | max evalue | upper bound on e-value
7 | nprocessors | number of threads for blastn
8 | output dir | output directory path
9 | blastdb | blastn database

rrna_sensor

the rrna_sensor program compares input sequences to a blastn database of verified rrna sequences using blastn. the program takes nine command-line arguments specified in table . each input sequence is classified into one of five classes based on its length and blastn results. a sequence is classified as too long or too short if its length is greater than the maximum length or less than the minimum length specified in the command by the user. to allow partial sequences and flexibility in the length, genbank indexers were typically using a length interval of [ , ] nt for prokaryotic s ssu rrna. empirical analysis shows that more than . % of the full-length validated prokaryotic sequences have lengths in the range [ , ], so this narrower range is recommended if one wants to check that sequences are typically full-length sequences. sequences within the allowed length range are classified as either no if there are zero blastn hits, yes if they have at least one blastn hit that has an e-value of e- or less and a percent identity of % or more, or imperfect match if there is at least one hit but the e-value or percent identity thresholds are not met for any hits.
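The five-class rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the rrna_sensor implementation (which is a shell script plus perl scripts): the `BlastHit` container, the function name `classify_sequence`, and every numeric threshold in `LIMITS` are assumptions made for the example, since the actual values are not recoverable from the text here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlastHit:
    """One blastn hit for a query sequence (hypothetical container)."""
    evalue: float
    percent_identity: float


def classify_sequence(
    length: int,
    hits: Iterable[BlastHit],
    min_length: int,
    max_length: int,
    max_evalue: float,
    min_identity: float,
) -> str:
    """Return one of the five classes described in the text."""
    # Length is checked first; out-of-range sequences are not judged on hits.
    if length > max_length:
        return "too long"
    if length < min_length:
        return "too short"
    hits = list(hits)
    if not hits:
        return "no"
    # "yes" requires a single hit that satisfies BOTH thresholds.
    if any(h.evalue <= max_evalue and h.percent_identity >= min_identity
           for h in hits):
        return "yes"
    return "imperfect match"


# Placeholder thresholds chosen only for this example (assumptions).
LIMITS = dict(min_length=400, max_length=2000,
              max_evalue=1e-40, min_identity=80.0)

print(classify_sequence(1500, [BlastHit(1e-100, 97.2)], **LIMITS))  # yes
print(classify_sequence(1500, [BlastHit(1e-10, 97.2)], **LIMITS))   # imperfect match
```

Passing the thresholds as arguments mirrors the fact that the length bounds, identity bound and e-value bound are all supplied on the command line.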
sequences that are too long are probably either incorrect or containing extra flanking sequence that should be trimmed, while sequences that are too short may be valid partial sequences. the other tests based on quality of blastn matches codify the tests that genbank indexers were doing internally before rrna_sensor was implemented. submitted sequences of a suitable length now classified as no would have been rejected in the past framework; sequences now classified as yes would have been accepted into genbank in the past framework. in the current testing framework, rrna_sensor is used as part of the ribosensor program as described below, not by itself.

figure schematic summarizing the use cases for the four ribovore programs and rrna_sensor. programs listed in white boxes underneath the black boxes are important external programs executed from within the program in the attached black box.
[figure content, recovered from the extracted text. each program takes a sequence file as input.
validate and classify ribosomal rna sequences:
ribotyper (executes cmsearch); sequences compared to: profiles; output per sequence: pass/fail definition, classification to best-matching model (e.g. ssu.bacteria), list of unexpected features, if any.
rrna_sensor (executes blastn); sequences compared to: single sequences; output per sequence: classification into one of five classes: yes, no, too long, too short or imperfect match.
ribosensor (executes ribotyper, rrna_sensor); sequences compared to: profiles and single sequences; output per sequence: pass/fail definition, list of ribotyper, rrna_sensor and genbank errors, if any.
analyze lengths of ribosomal rna sequences:
riboaligner (executes ribotyper, cmalign); sequences compared to: profiles; output per sequence: alignment to best-matching model, length classification based on alignment.
create high-quality reference database of ribosomal rna sequences:
ribodbmaker (executes srcchk, vecscreen, blastn, ribotyper, riboaligner, esl-cluster); sequences compared to: profiles and single sequences; output per sequence: overall pass/fail definition, per-test pass/fail definition for tests: ambiguous nucleotides, vector contamination, repetitive sequences, validation by ribotyper and riboaligner, reference model span, taxonomic ingroup analysis.]

there are two target blastn databases included with rrna_sensor, one for prokaryotic s ssu rrna and one for eukaryotic s ssu rrna. the prokaryotic database includes sequences, of which are bacterial and the remaining are archaeal. the eukaryotic database includes sequences. additional, user-created blastn databases can also be used with the program. the prokaryotic database was updated most recently on june , by filtering and clustering the pre-existing database of , sequences used by genbank indexers for s ssu rrna analysis.
one could repeat the same procedure with the larger version of the s ssu rrna database described in results. the initial database was filtered to remove sequences outside the length range [ , ]. the remaining , sequences were clustered using uclust [ ] so that the surviving sequences were no more than % identical, leaving sequences. the eukaryotic s ssu rrna database of sequences was updated most recently on september , by running version . of the ribovore program ribodbmaker on an input set of , genbank sequences returned from the eukaryotic ssu rrna e-utilities (eutils) query provided in results and discussion with command-line options --skipfribo --model ssu.eukarya --ribo hmm.

table profile models used by ribovore. ’#seqs’ is the number of sequences in the multiple alignment used to build the model. ’length’ is the number of reference model positions. abbreviations in ’taxonomy group’ column: ’bac’ is bacteria, ’euk’ is eukarya and ’mito’ is mitochondria. (the #seqs and length values and the rfam accession numbers were lost in extraction.)
model name | gene | taxonomy group | #seqs | length | rfam
ssu_rrna_archaea | ssu rrna | archaea | | | rf
ssu_rrna_bacteria | ssu rrna | bacteria | | | rf
ssu_rrna_eukarya | ssu rrna | eukarya | | | rf
ssu_rrna_microsporidia | ssu rrna | euk-microsporidia | | | rf
lsu_rrna_archaea | lsu rrna | archaea | | | rf
lsu_rrna_bacteria | lsu rrna | bacteria | | | rf
lsu_rrna_eukarya | lsu rrna | eukarya | | | rf
ssu_rrna_mitochondria_metazoa | ssu rrna | mito-metazoa | | | -
ssu_rrna_mitochondria_amoeba | ssu rrna | mito-amoeba | | | -
ssu_rrna_mitochondria_chlorophyta | ssu rrna | mito-chlorophyta | | | -
ssu_rrna_mitochondria_fungi | ssu rrna | mito-fungi | | | -
ssu_rrna_mitochondria_kinetoplast | ssu rrna | mito-kinetoplast | | | -
ssu_rrna_mitochondria_plant | ssu rrna | mito-plant | | | -
ssu_rrna_mitochondria_protist | ssu rrna | mito-protist | | | -
ssu_rrna_chloroplast | ssu rrna | chloroplast | | | -
ssu_rrna_chloroplast_pilostyles | ssu rrna | chloroplast | | | -
ssu_rrna_cyanobacteria | ssu rrna | bac-cyanobacteria | | | -
ssu_rrna_apicoplast | ssu rrna | euk-apicoplast | | | -

ribotyper

the ribotyper program is also designed to validate ribosomal rna sequences but it differs from rrna_sensor in the method of sequence comparison and the taxonomic breadth over which it applies. instead of using blastn, ribotyper uses a profile hmm and optionally a covariance model (cm) to compare against input sequences. the profile hmm and cms were built either from rfam rrna seed alignments (see table ) or from alignments created specifically for ribovore by the authors for taxonomic groups not covered by the rfam models.

sequence processing by ribotyper proceeds over two main stages. in stage , each sequence is compared against all profiles using a truncated version of the hmmer pipeline [ ] optimized for speed. only the first three stages of the hmmer pipeline are employed to compute a score for each sequence/profile comparison but without calculating accurate alignment endpoints. for each sequence, the best-scoring model is selected and used in the second stage where the hmmer pipeline is used again but this time in its entirety to compute likely endpoints of high-scoring hits to each model. these two stages are very similar to the classification and coverage determination stages of the vadr software package for viral sequence annotation [ ]. the results of the stage comparison are then post-processed to determine if any unexpected features exist for each sequence. there are types of unexpected features, listed in table .
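The control flow of the two stages and the pass/fail decision can be sketched as follows. This is an illustrative outline only, not ribotyper itself (which is written in perl and delegates the scoring to infernal and hmmer): the function names `best_model` and `passes`, the `extra_fatal` parameter, and the example scores are assumptions. The set of features treated as fatal is copied from the asterisked rows of the paper's table of unexpected features.

```python
from typing import Iterable, Mapping

# Features marked fatal by default (asterisked) in the table of unexpected
# features. tooshort/toolong are only reported when --shortfail/--longfail
# are used, per the table caption.
FATAL_BY_DEFAULT = frozenset({
    "nohits", "unacceptablemodel", "multiplefamilies", "bothstrands",
    "duplicateregion", "inconsistenthits", "questionablemodel",
    "tooshort", "toolong",
})


def best_model(stage_one_scores: Mapping[str, float]) -> str:
    """Stage one: choose the highest-scoring model for one sequence."""
    if not stage_one_scores:
        raise ValueError("no stage-one scores: the sequence has no hits")
    return max(stage_one_scores, key=stage_one_scores.get)


def passes(features: Iterable[str], extra_fatal: Iterable[str] = ()) -> bool:
    """A sequence passes unless it carries at least one fatal feature.

    extra_fatal is a hypothetical hook for options that promote a
    non-fatal feature (for example lowscore) to fatal.
    """
    fatal = FATAL_BY_DEFAULT | {name.lower() for name in extra_fatal}
    return not any(name.lower() in fatal for name in features)


# Made-up stage-one scores for a single sequence.
scores = {"ssu.bacteria": 812.4, "ssu.archaea": 430.1}
print(best_model(scores))                           # ssu.bacteria
print(passes(["minusstrand", "lowscore"]))          # True
print(passes(["lowscore"], extra_fatal=["lowscore"]))  # False
```

Only the model returned by the first stage would be carried into the slower second stage, which is what keeps the full hmmer pipeline from being run against every profile.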
table attributes of the types of ribotyper unexpected features. unexpected features labelled with * in the first column are fatal by default, in that they cause a sequence to fail. unacceptablemodel and questionablemodel can only potentially be reported if the --inaccept option is used. evaluescorediscrepancy can only be reported if the --evalues option is used. tooshort and toolong can only be reported if the --shortfail or --longfail options are used, respectively.
unexpected feature name | description
nohits* | no stage hits above threshold to any models
unacceptablemodel* | best stage hit is to a model that is unacceptable as defined in --inaccept input file
multiplefamilies* | stage hits exist to more than one family (e.g. ssu and lsu)
bothstrands* | stage hits above threshold exist on both strands
duplicateregion* | at least two stage or hits on same strand overlap
inconsistenthits* | not all hits are in the same order in sequence and model coordinates
questionablemodel* | best stage hit is to a model that is questionable as defined in --inaccept input file
minusstrand | best stage hit is on the minus strand
lowscore | the bits per nucleotide value (total bit score divided by total length of sequence) is below threshold of .
lowcoverage | sequence coverage of all hits is below threshold of .
lowscoredifference | difference between top two models in different domains is below . bits per position
verylowscoredifference | difference between top two models in different domains is below . bits per position
multiplehits | there is more than one hit to the best scoring model on the same strand
evaluescorediscrepancy | if hits were sorted by e-value due to --evalues, best hit has lower bit score than second best hit
tooshort* | sequence length is less than and --shortfail used
toolong* | sequence length is greater than and --longfail used

ribosensor

the ribosensor program is a wrapper script that runs both ribotyper and rrna_sensor and combines the results to determine if each sequence should pass or fail. this script was motivated partly by an effort to ease the transition for genbank indexers between the pre-existing blastn-based system and a system based on profiles. additionally, in some cases, the profile models in ribotyper allow some valid rrna sequences that would fail blastn and rrna_sensor to pass, and conversely some valid sequences pass rrna_sensor and fail ribotyper, making a combination of the two programs potentially more accurate.

the ribosensor program can be run in one of two modes: s mode is the default mode and should be used for bacterial and archaeal s ssu rrna sequences, and s mode should be used (by specifying the option -m s on the command-line) for eukaryotic s ssu rrna. all sequences are first processed by ribotyper using command-line options --scfail --covfail --tshortcov . --tshortlen to fail sequences for which lowscore and lowcoverage unexpected features are reported, and to specify that the threshold for lowcoverage is % for sequences of nt or less. these options were selected based on results of internal testing by genbank indexers. next, rrna_sensor is run, potentially up to three separate times, on partitions of the input sequence file separated based on length and using custom thresholds for each length range. sequences that are shorter than nt or longer than nt are considered too short or too long and are not analyzed. for sequences between and nt, a minimum percent identity of % and minimum coverage of % is enforced.
for sequences between and nt, the minimum thresholds used are % identity and % coverage, and for sequences between and nt the minimum thresholds used are % identity and % coverage. these thresholds can be changed via command-line options.

the results of ribotyper and rrna sensor are combined and each sequence is separated into one of four outcome classes depending on whether it passed or failed each program: rpsp (passed ribotyper and rrna sensor), rpsf (passed ribotyper and failed rrna sensor), rfsp (failed ribotyper and passed rrna sensor), and rfsf (failed both). additionally, the reasons for failing each program are reported. for ribotyper, these are the unexpected features described above, each prefixed with “r_” (e.g. r_multiplefamilies). the possible errors for rrna sensor are listed in table and the possible errors for ribotyper are listed in table . finally, these errors are mapped to a different set of errors created for use within the pre-existing context of genbank’s sequence processing pipeline, which has its own error naming and usage conventions. this mapping is shown in table . the “fails to” column is of practical importance because it indicates which errors cause a submission to not be accepted. more positively, if a submitter runs ribosensor before actually trying to submit and the submitter sees that the errors in the first seven rows and the third column of table do not occur, then, assuming the metadata for the submission are complete and valid, the submitter can have confidence that the submission to genbank will be accepted.
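The four outcome classes are a direct function of the two pass/fail results. The following is a minimal sketch of that combination, assuming each program's verdict has already been reduced to a boolean. The function names `outcome_class` and `tally_outcomes` are hypothetical and are not part of ribovore; the class labels are the ones named in the text, written here in upper case.

```python
from collections import Counter


def outcome_class(ribotyper_passed: bool, sensor_passed: bool) -> str:
    """Combine two pass/fail verdicts into one of RPSP, RPSF, RFSP, RFSF."""
    ribotyper_part = "RP" if ribotyper_passed else "RF"  # R = ribotyper
    sensor_part = "SP" if sensor_passed else "SF"        # S = rrna sensor
    return ribotyper_part + sensor_part


def tally_outcomes(results):
    """Count outcome classes over (ribotyper_passed, sensor_passed) pairs."""
    return Counter(outcome_class(r, s) for r, s in results)


if __name__ == "__main__":
    # Hypothetical verdicts for three sequences.
    print(tally_outcomes([(True, True), (False, True), (True, True)]))
```

The sketch only derives the class label. Deciding which reported errors then count against a sequence depends on the exceptions given in the error tables and is not modelled here.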
table descriptions of rrna sensor errors within ribosensor and mapping to the genbank errors they trigger. ’*’: the first four rrna sensor errors do not trigger genbank errors and are ignored by ribosensor if either (a) the sequence is ’rpsf’ (passes ribotyper and fails rrna sensor) and the -c option is not used with ribosensor or (b) the sequence is ’rfsf’ (fails both ribotyper and rrna sensor) and r_unacceptablemodel or r_questionablemodel ribotyper errors are also reported.

rrna sensor error | associated genbank error | cause/explanation
s_nohits* | seq_hom_notssuorlsurrna | no hits reported (’no’ column )
s_nosimilarity* | seq_hom_lowsimilarity | coverage (column ) of best blast hit is < %
s_lowsimilarity* | seq_hom_lowsimilarity | coverage (column ) of best blast hit is < % (≤ nt) or % (> nt)
s_lowscore* | seq_hom_lowsimilarity | either id percentage below length-dependent threshold ( %, %, %) or e-value above e- (’imperfect match’ column )
s_bothstrands | seq_hom_misasbothstrands | hits on both strands (’mixed’ column )
s_multiplehits | seq_hom_multiplehits | more than hit reported (column > )

table descriptions of ribotyper errors within ribosensor and mapping to the genbank errors they trigger. ’+’: these errors do not trigger a genbank error if sequence is ’rfsp’ (fails ribotyper and passes rrna sensor).

ribotyper error | associated genbank error | cause/explanation
r_nohits | seq_hom_notssuorlsurrna | no hits reported
r_multiplefamilies | seq_hom_ssuandlsurrna | ssu and lsu hits
r_lowscore | seq_hom_lowsimilarity | bits/position score is < .
r_bothstrands | seq_hom_misasbothstrands | hits on both strands
r_inconsistenthits | seq_hom_misashitorder | hits are in different order in sequence and model
r_duplicateregion | seq_hom_misasdupregion | hits overlap by or more model positions
r_unacceptablemodel | seq_hom_taxnotexpectedssurrna | best hit is to model other than expected set; s expected set: ssu.archaea, ssu.bacteria, ssu.cyanobacteria, ssu.chloroplast; s expected set: ssu.eukarya
r_lowcoverage | seq_hom_lowcoverage | coverage of all hits is < . (if ≤ nt) or . (if > nt)
r_questionablemodel+ | seq_hom_taxquestionablessurrna | best hit is to a ’questionable’ model (if mode is s: ssu.chloroplast)
r_multiplehits+ | seq_hom_multiplehits | more than hit reported

table mapping of genbank errors to the rrna sensor and ribotyper errors that trigger them. there are two classes of exceptions marked by two different superscripts in the table: ’*’: these rrna sensor errors do not trigger a genbank error if: (a) the sequence is ’rpsf’ (passes ribotyper and fails rrna sensor) and the -c option is not used with ribosensor, or (b) the sequence is ’rfsf’ (fails both ribotyper and rrna sensor) and r_unacceptablemodel or r_questionablemodel are also reported. ’+’: these ribotyper errors do not trigger a genbank error if sequence is ’rfsp’ (fails ribotyper and passes rrna sensor).

genbank error | fails to | triggering rrna sensor/ribotyper errors
seq_hom_notssuorlsurrna | submitter | s_nohits*, r_nohits
seq_hom_lowsimilarity | submitter | s_nosimilarity*, s_lowsimilarity*, s_lowscore*, r_lowscore
seq_hom_ssuandlsurrna | submitter | r_multiplefamilies
seq_hom_misasbothstrands | submitter | s_bothstrands, r_bothstrands
seq_hom_misashitorder | submitter | r_inconsistenthits
seq_hom_misasdupregion | submitter | r_duplicateregion
seq_hom_taxnotexpectedssurrna | submitter | r_unacceptablemodel
seq_hom_taxquestionablessurrna | indexer | r_questionablemodel+
seq_hom_lowcoverage | indexer | r_lowcoverage
seq_hom_multiplehits | indexer | s_multiplehits, r_multiplehits+

riboaligner

the riboaligner program was designed to help genbank indexers to evaluate whether ribosomal rna sequences are full length and do not extend past the boundaries of the gene. one application for a set of full length rrnas is as part of the blastn database for screening and validating incoming sequences using rrna sensor, ribosensor or other blastn-based methods.

riboaligner first calls ribotyper to determine the best matching model for each sequence using special command-line options. the --minusfail, --scfail and --covfail options are used to specify that sequences with unexpected features of minusstrand, lowscore and lowcoverage will fail. additionally, the --inaccept option is used to specify that the names of the desired models to use are listed in an input file; only sequences that match best to one of these models are eligible to pass. the default set of acceptable models is ssu.archaea and ssu.bacteria. all sequences that score best to one of the acceptable models are aligned to that model using the cmalign program of infernal, which takes into account both sequence and secondary structure conservation. the alignment is then parsed to determine the length classification of each sequence based on the alignment.
there are possible length classes which are defined based on whether the alignment of each sequence extends to or past the first and final model reference position as well as how many insertions and deletions occur in the first and final ten model reference positions. more information on these classes can be found in the ribovore documentation. only sequences that pass ribotyper will be aligned by riboaligner, and the per-sequence ribotyper pass/fail designation is not changed by riboaligner. the riboaligner summary output file is identical to the ribotyper output summary file with additional per-sequence information on the length class, start and stop model reference position of each aligned sequence and number of insertions/deletions in the first and final ten model positions.

ribodbmaker

the ribodbmaker program is designed to create high quality datasets of rrna sequences, which may be useful as reference datasets or blastn databases. it takes as input a set of candidate sequences and a specified rrna model (e.g. ssu.bacteria) and applies numerous quality control tests or filters such that only high quality sequences pass. the program performs the following steps:

1. fail sequences with too many ambiguous nucleotides
2. fail sequences that do not have a specified species taxid in the ncbi taxonomy database
3. fail sequences that have non-weak vecscreen hits, suggesting the presence of vector contamination, as calculated by the vecscreen plus taxonomy software package [ ]
4. fail sequences that have unexpected internal repeats as determined by comparing each sequence against itself using blastn and finding off-diagonal local alignments with an e-value of no more than and length at least for the plus strand and for the minus strand
5. fail sequences that fail ribotyper, including matching best to a model other than the specified one, using non-default options --minusfail --lowppossc . --scfail to specify that sequences with best hits on the minus strand or with scores below . bits per nucleotide will fail
6. fail sequences that fail riboaligner, including matching best to a model other than the specified one, using non-default options --lowppossc . --tcov . to specify that sequences with scores below . bits per nucleotide or for which less than % of the sequence length is covered by hits will fail
7. fail sequences that do not cover a specified span of model positions (are too short)
8. fail sequences that survive all above steps but do not meet expected criteria of an ingroup analysis based on taxonomy and alignment identity

in the riboaligner step, riboaligner outputs multiple sequence alignments of all sequences. these alignments are used for further scrutiny of each sequence in the final step, the ingroup analysis. at this stage, sequences that do not cluster (based on alignment identity) with other sequences in their taxonomic group fail. finally, sequences that survive all stages are clustered based on alignment identity and centroids for each cluster are selected for the final set of surviving sequences.

the taxid, vector contamination, and ingroup analysis steps require access to the ncbi taxonomy database and further that each input sequence be assigned in the nucleotide database to a unique organism in the taxonomy database. this restricts the use of ribodbmaker to sequences already present in genbank. the taxonomy criterion excludes, for example, some chimeric sequences that have been engineered and patented. users can run ribodbmaker on other sequences, but must bypass these steps using the --skipftaxid, --skipfvecsc, and --skipingrup options. the vecscreen plus taxonomy package is only available for linux and so is not installed with ribovore on mac/osx. consequently, the following ribodbmaker options must be used on mac/osx: --skipftaxid --skipfvecsc --skipingrup --skipmstbl. in general, ribodbmaker is highly customizable via command-line option usage, and can be run using many different subsets of tests. for more information on command-line options see the ribodbmaker.md file in the ribovore documentation subdirectory.

as described above, riboaligner calls ribotyper, so ribotyper is actually called twice by ribodbmaker, once in the ribotyper step and once in the riboaligner step. in the riboaligner step, ribotyper is called with options that differentiate its usage from the ribotyper step, making the criteria for passing more strict in several ways. the --difffail and --multfail options are used to specify that sequences with unexpected features of lowscoredifference and verylowscoredifference will fail. additionally, a cm is used instead of a profile hmm for the second stage (ribotyper -- slow option) and any sequence for which less than % of the nucleotides are covered by a hit in the second stage will fail (--tcov . option). finally, the --scfail option, which is used in the ribotyper call in the ribotyper step, is not used in the riboaligner step.
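The platform-specific options described above can be captured when assembling a ribodbmaker command line. The sketch below only builds an argument list and does not run anything. The four skip options and `--model` are taken from the text, while the helper name `ribodbmaker_args`, the executable name on the search path, and the order of the two positional arguments (input fasta file, then output directory) are assumptions that should be checked against ribodbmaker.md.

```python
import platform

# Options the text says must be used on mac/osx, where the
# vecscreen plus taxonomy package is not installed.
MAC_SKIP_OPTIONS = ["--skipftaxid", "--skipfvecsc", "--skipingrup", "--skipmstbl"]


def ribodbmaker_args(fasta, outdir, model, system=None):
    """Build a ribodbmaker argument list (hypothetical helper).

    `system` defaults to platform.system(), which is "Darwin" on macOS.
    The positional argument order is an assumption, not confirmed here.
    """
    if system is None:
        system = platform.system()
    args = ["ribodbmaker", "--model", model]
    if system == "Darwin":
        args += MAC_SKIP_OPTIONS
    args += [fasta, outdir]
    return args


if __name__ == "__main__":
    print(" ".join(ribodbmaker_args("candidates.fa", "db-out", "ssu.bacteria")))
```

Passing `system` explicitly makes the behavior testable on any machine. The returned list could be handed to `subprocess.run` once the assumptions above are confirmed.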
ribovore reference model library and blastn databases

the ribovore package includes sequence- and structure-based alignments and corresponding cms, listed in table . seven of the alignments are from rfam, and the other were created during development of the package. rrna sensor includes two blastn databases: one of bacterial and archaeal s ssu rrna sequences created by clustering and filtering the blastn database already in use at genbank in when development of the script began, and one of eukaryotic s ssu rrna sequences created by filtering a sequence dataset generated by ribodbmaker.

all of the ribovore model alignments are the end products of a multi-step model refinement procedure using the valuable secondary structure data available from crw [ ] and sequences from genbank. for each gene and taxonomic group (e.g. ssu rrna eukarya), an initial alignment with consensus secondary structure was created based on combining alignments and individual sequence secondary structure predictions from crw as described in [ , ], and used to build a cm using the infernal program cmbuild. that cm was then calibrated for database sequence search using cmcalibrate and searched against all currently available rrna sequences in genbank. the resulting high-scoring hits were then filtered for redundancy and manually examined, and surviving sequences were realigned to the model to create a new alignment. in some cases the consensus secondary structure was modified slightly based on the new alignment. some models were further refined by additional iterations of building, searching, and realigning.

eight of the ribovore models are ssu models with fewer than sequences in the training alignment (table ). these are for taxonomic groups with relatively few known example sequences for which the consensus secondary structure is distinct but not as well understood as for other groups, like s ssu rrna.
six of these eight are non-metazoan mitochondrial models, one is a chloroplast model for the pilostyles plant genus, and one is for apicoplasts. these eight models are less mature than the other ten models, but they are included in the package for completeness and we plan to improve them in future versions. currently, users should be cautious when interpreting results that involve any of these eight models.

from each of the ribovore model alignments, two separate cms were constructed using different command-line parameters to the cmbuild program of infernal. one model was built using cmbuild’s default entropy weighting feature that controls the average entropy per model position [ , ], and one was built using the cmbuild --enone option, which turns off entropy weighting. the non-entropy weighted models, which perform better at sequence classification in our internal testing (results not shown), are used by ribotyper, and the entropy weighted models are used by riboaligner for sequence alignment because they are slightly more accurate at getting alignment endpoints correct based on our own internal test results (not shown).

timing measurements

timing measurements of rrna sensor, ribotyper, riboaligner, and ribodbmaker were done primarily on an intel(r) xeon(r) gold cpu @ . ghz with cores and running the centos . . version of linux. we used one thread except for tests of rrna sensor that measured the effect on wall-clock time of increasing the number of threads. for the runs of ribodbmaker, we used the ncbi compute farm to parallelize some time intensive steps.
the reason for using the compute farm only for the ribodbmaker tests is that ribodbmaker is intended primarily for curation of databases at ncbi, while the other modules are intended to be used both by submitters around the world and genbank indexers at ncbi.

results and discussion

ribovore is used directly or indirectly by ncbi and genbank in various ways: as part of its submission pipelines for rrna sequences, through the blast web server (https://blast.ncbi.nlm.nih.gov/blast.cgi?program=blastn&page_type=blastsearch&link_loc=blasthome) and by facilitating the validation of sequences from type material to be incorporated into new records in the refseq database (https://www.ncbi.nlm.nih.gov/bioproject/ ). we detail each of these uses below and then compare the capability of ribovore for fungal rrna sequence validation to related projects.

rrna sequence submission checking

submitters of rrna sequences to genbank who use the ncbi submission portal can choose between different subtypes, listed in table . for most submission subtypes, the sequences are analyzed via a blastn-based pipeline by comparing each submitted sequence against a blastn database for the specific submission subtype. three of these submission subtypes (its and its and s- s igs) are for non-rrna sequences. for four of the remaining nine subtypes, the blastn database currently used was created with the help of ribodbmaker, as discussed more below. the ribosensor program is used instead of the blastn pipeline to analyze s prokaryotic ssu rrna submissions of or more sequences for which the submitter chooses the attribute uncultured to describe the sequences.

for ribosensor, the default parameters are used to determine if sequences should pass or fail as discussed in the implementation section. for blastn, sequences are evaluated based on the average percentage identity, the average percentage query coverage, and the percentage of gaps in the alignments for the top target sequences.
additionally, using a blastn-based method that predates and inspired rrna sensor, sequences that are suspected to be misassembled or incorrectly labelled taxonomically fail. specifically, the query sequence is tested with blastn against the s ssu rrna database described below and the matches are ranked in increasing order of e-value. a sequence passes the misassembly test if and only if the best matches each have exactly one local alignment. the taxonomy tests are based on a comparison of the proposed taxonomy from the submitter and the taxonomic information of the top matches, taking into account variant spellings and synonyms in ncbi taxonomy. the exact thresholds for these pre-ribovore, blastn-related comparisons vary according to the type of submission and are outside the scope of this paper.

table ncbi rrna and its sequence submission types and attributes. the ’submission type’ column indicates the three possible rrna sequence related options for genbank submissions available at https://submit.ncbi.nlm.nih.gov/subs/genbank/. there is not an intergenic spacer type of eukaryotes at this time. the ’submission subtype’ column indicates the more specific possible sequence types a submitter can select after choosing one of the options in the first column. the ’validation method’ column indicates whether incoming sequences are processed with ribosensor, blastn using a database constructed with ribodbmaker (’blastn(ribo)’) or blastn using either the general non-redundant (nr) database or a database constructed by other means (’blastn’). the ’percentage of accepted submissions’ and ’percentage of accepted sequences’ reflect rrna/igs/its submissions published between jan , and may , . note that the percentages for rows and are summed and reported in row , and for rows through (all eukaryotic submission types) are summed and reported in row . counts pertain only to submissions that advanced through enough preliminary checks to be assigned an internal submission code.

submission type | submission subtype | validation method | percentage of accepted submissions | percentage of accepted sequences
prokaryotic rrna/igs | ssu rrna only ( s) ≥ seqs | ribosensor | . % | . %
prokaryotic rrna/igs | ssu rrna only ( s) < seqs | blastn(ribo) | . % | . %
prokaryotic rrna/igs | lsu rrna only ( s) | blastn(ribo) | . % | . %
prokaryotic rrna/igs | intergenic spacer ( s- s igs) | blastn | (sum of rows) |
eukaryotic nuclear rrna/its | contains rrna-its region | blastn | . % | . %
eukaryotic nuclear rrna/its | ssu rrna only ( s) | blastn(ribo) | (sum of eukaryotic rows) |
eukaryotic nuclear rrna/its | lsu rrna only ( s) | blastn(ribo) | |
eukaryotic nuclear rrna/its | its only | blastn | |
eukaryotic nuclear rrna/its | its only | blastn | |
eukaryotic organellar rrna | mitochondrial ssu rrna ( s) | blastn | |
eukaryotic organellar rrna | mitochondrial lsu rrna ( s) | blastn | |
eukaryotic organellar rrna | chloroplast ssu rrna ( s) | blastn | |
eukaryotic organellar rrna | chloroplast lsu rrna ( s) | blastn | |

for both the blastn and ribosensor pipelines, submissions in which all sequences pass and which have the required metadata are automatically deposited into genbank. all other submissions fail and are either sent back to the submitter with automated error reports or manually examined further by genbank indexers, depending on the specific reason for the failure. a key objective of distributing the ribovore software is to permit submitters to do on their own computers similar checks to those done by the genbank submission pipeline.

in , we began using earlier versions of ribosensor to analyze large-scale s prokaryotic ssu rrna submissions. this remains the only submission type for which ribosensor is employed in an automated way, although we plan to expand to additional genes and taxonomic domains in the future.
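The blastn-based misassembly test described earlier in this subsection (rank matches by increasing e-value, then require exactly one local alignment for each of the best matches) can be illustrated with a short sketch. This is not the GenBank pipeline code. The function name, the input shape (one `(subject_id, evalue)` pair per local alignment), the `top_n` parameter standing in for the number of best matches, and the choice to fail a query with no matches at all are assumptions made for illustration.

```python
from collections import defaultdict


def passes_misassembly_test(alignments, top_n):
    """Return True if each of the top_n best subjects has one local alignment.

    alignments: iterable of (subject_id, evalue), one entry per local alignment.
    top_n: how many best-ranked subjects to inspect (hypothetical parameter).
    """
    best_evalue = {}
    alignment_count = defaultdict(int)
    for subject_id, evalue in alignments:
        alignment_count[subject_id] += 1
        previous = best_evalue.get(subject_id, evalue)
        best_evalue[subject_id] = min(previous, evalue)

    # Rank subjects by their best (lowest) e-value, in increasing order.
    ranked = sorted(best_evalue, key=best_evalue.get)
    top = ranked[:top_n]

    # Assumption: a query with no matches does not pass this test.
    return bool(top) and all(alignment_count[s] == 1 for s in top)


if __name__ == "__main__":
    # Hypothetical hits: subject "b" has two local alignments.
    hits = [("a", 1e-50), ("b", 1e-40), ("b", 1e-10), ("c", 1e-30)]
    print(passes_misassembly_test(hits, top_n=1))
    print(passes_misassembly_test(hits, top_n=2))
```

Multiple local alignments to the same subject are used as the signal of a possible misassembly, which is why the count is kept per subject rather than per alignment.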
parts of ribovore are also used manually by genbank indexers to evaluate some submissions.

the most common type of rrna submission by far is s prokaryotic ssu rrna (table ). between july , and may , , , submissions of s ssu rrnas with fewer than sequences (or for which the submitter indicated the sequences were from cultured organisms) were handled by the blastn pipeline. the total number of sequences in these submissions was , , for an average of . sequences per submission. in the same time interval, ribosensor processed s ssu rrna submissions comprising , , sequences for an average of , . sequences per submission. in the first six months of , ribosensor processed more than . % of the sequences deposited in genbank via any of the rrna or its submission pipelines (table ).

construction and usage of rrna databases for blastn

four of the ten rrna blastn databases used for submission checking were created by ribodbmaker as indicated in table . an additional three blastn databases, available in the web server version of blast, are described below. all blastn databases we mention can be retrieved for local use from the directory https://ftp.ncbi.nlm.nih.gov/blast/db/. the databases are (re)generated semi-automatically by extracting large sets of plausible sequences from entrez and providing them as input to ribodbmaker. the ribodbmaker program is run so that the tests for ambiguous nucleotides, specified species, vector contamination, self-repeats, ribotyper, riboaligner, and model span are all executed.
however, the databases are allowed to contain more than one sequence per taxid and the ingroup analysis is skipped (--skipingrup option). only sequences that pass ribodbmaker tests are eligible to be in the blastn databases. to keep the s prokaryotic ssu, eukaryotic ssu, and eukaryotic lsu databases to a reasonable size, the sequences that pass ribodbmaker are clustered with uclust [ ] at a threshold of % identity and all other parameters at default values. the clustering stage of ribodbmaker is not used for this purpose and is skipped by using the --skipclustr option.

the s prokaryotic ssu rrna blast database is generated starting from all sequences in the genbank nucleotide database that match ncbi bioproject ids prjna or prjna using the eutils query in the first row of table . prjna has the title “bacterial s ribosomal rna refseq targeted loci project”. prjna has the title “archaeal s ribosomal rna refseq targeted loci project”. this formal query is supplemented by manual searches of the journal international journal of systematic and evolutionary microbiology (https://www.microbiologyresearch.org/content/journal/ijsem), where many new bacterial species are announced and peer-reviewed, along with their s ssu rrna sequences. among the databases described here, the s ssu rrna database is the only one restricted to sequences from “type material” that have been more stringently vetted before curation for refseq. the fungal refseq records described in a later subsection are also restricted to be from “type material”.

table also lists the eutils queries to the nucleotide database that are used for s prokaryotic lsu rrna, eukaryotic ssu rrna, eukaryotic lsu rrna, microsporidia ssu rrna, and microsporidia lsu rrna. when we seek sequences that are likely to be complete, not larger genome pieces, and not partial, we add a constraint on the length with an extra term such as : [slen] for eukaryotic ssu rrna.
the main attribute that distinguishes microsporidia is that the lower bound on slen for complete sequences is set about - nucleotides lower as explained in background. to find possibly partial lsu sequences that are long enough to cover the variable regions, we add the condition : [slen]. these queries rely on standardized nomenclature and structure of the definition line of genbank sequence records, which contain information about the source organism, feature content, completeness and location. since , these definition lines have been constructed formulaically during the processing of submissions. for example, the sequence mt . has the title: “staphylococcus epidermidis strain ra s ribosomal rna gene, partial sequence”, and the sequence mn . has the title “tetrahymena rostrata strain traus s ribosomal rna gene, internal transcribed spacer , . s ribosomal rna gene, internal transcribed spacer , and s ribosomal rna gene, complete sequence”.

table queries used in command-line eutils to collect input datasets for ribodbmaker.

gene | eutils query
archaeal and bacterial ssu rrna | prjna [bioproject] or prjna [bioproject]
bacterial lsu rrna | bacteria[orgn] and ( s [ti] or large subunit ribosomal rna [ti]) not uncultured[orgn] not s rrna methyltransferase[ti] not srcdb pdb [prop] not srcdb pat [prop] not wgs [filter] not mrna [filter] [ti] not refseq [filter] not mrna not “mitochondrion”[filter] not tls [ti]
archaeal lsu rrna | archaea[orgn] and ( s [ti] or large subunit ribosomal rna [ti]) not uncultured[orgn] not s rrna methyltransferase[ti] not srcdb pdb [prop] not srcdb pat [prop] not wgs [filter] not mrna [filter] [ti] not refseq [filter] not mrna not “mitochondrion”[filter] not tls [ti]
eukaryotic ssu rrna | eukaryota[orgn] ( s [ti] or small subunit ribosomal rna [ti]) not wgs [filter] not mrna [filter] not “mitochondrion” [filter] not plastid [filter] not chloroplast [filter] not plastid [ti] not chloroplast [ti] not mitochondrial [ti] not refseq [filter] not ( . s [ti] or internal [ti]) not s [ti] not wgs not mrna not “mitochondrion”[filter] not tls [ti] not srcdb pdb[prop]
eukaryotic lsu rrna | eukaryota[orgn] and ( s [ti] or s [ti] or s [ti] or large subunit ribosomal rna [ti]) not wgs [filter] not mrna [filter] not “mitochondrion” [filter] not plastid [filter] not chloroplast [filter] not plastid [ti] not chloroplast [ti] not mitochondrial [ti] not refseq [filter] not ( . s [ti] or internal [ti]) not tls [ti] not partial cds [ti] not chain [ti] not s [ti] not srcdb pdb[prop]
microsporidia ssu rrna | microsporidia[orgn] and ( s [ti] or small subunit ribosomal rna [ti]) not wgs [filter] not mrna [filter] not “mitochondrion”[filter] not plastid [filter] not chloroplast [filter] not plastid [ti] not chloroplast [ti] not mitochondrial [ti] not refseq [filter] not ( . s [ti] or internal [ti]) not s [ti] not wgs not mrna not “mitochondrion”[filter] not tls [ti] not srcdb pdb[prop]
microsporidia lsu rrna | microsporidia[orgn] and ( s [ti] or s [ti] or s [ti] or large subunit ribosomal rna [ti]) not wgs [filter] not mrna [filter] not “mitochondrion” [filter] not plastid [filter] not chloroplast [filter] not plastid [ti] not chloroplast [ti] not mitochondrial [ti] not refseq [filter] not ( . s [ti] or internal [ti]) not tls [ti] not partial cds [ti] not chain [ti] not s [ti] not srcdb pdb[prop]
fungal ssu rrna refseq records | fungi [orgn] and : [slen] and sequence from type [filter] and ( s [ti] or small subunit ribosomal rna [ti]) not wgs [filter] not mrna [filter] not mitochondrion [filter] not plastid [filter] not chloroplast [filter] not mitochondrial [ti] not ( . s [ti] or internal [ti]) not s [ti] not s [ti] not s [ti] not s [ti] not s [ti] not s [ti] not wgs not mrna not refseq [filter] not tls [ti]
fungal lsu rrna refseq records | fungi[orgn] and : [slen] and sequence from type[filter] and ( s[ti] or s[ti] or s[ti] or large subunit ribosomal rna[ti]) not wgs[filter] not mrna[filter] not “mitochondrion”[filter] not plastid[filter] not chloroplast[filter] not mitochondrial[ti] not ( . s[ti] or internal[ti]) not tls[ti] not partial cds[ti] not chain[ti] not s[ti] not srcdb pdb[prop] not refseq[filter]

creation of fungal rrna refseq entries using ribodbmaker

ncbi’s refseq project seeks to create a representative, non-redundant set of annotated genomes, transcripts, proteins and nucleotide records including rrna sequences [ ]. since , ribodbmaker has been used to screen the set of fungal s ssu rrna and s lsu rrna sequences. table lists the queries used to identify candidates to be new fungal ssu and lsu rrna refseq records.
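As an illustration of how a length constraint is attached to such a query, the sketch below appends an `[SLEN]` range term and formats a request URL for the NCBI E-utilities `esearch` endpoint. It only constructs strings and performs no network access. The helper names and the example length bounds (1000 and 2000) are hypothetical placeholders and are not the values used for any database in this paper; the short base query is likewise a stand-in for the full queries in the table.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint; "nuccore" is the nucleotide database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def with_length_range(query, min_len, max_len):
    """Add a sequence-length constraint to an Entrez query string."""
    return f"({query}) AND {min_len}:{max_len}[SLEN]"


def esearch_url(query, db="nuccore", retmax=20):
    """Format an esearch URL for the given query (no request is sent)."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder base query and bounds, for illustration only.
    query = with_length_range("Fungi[ORGN]", 1000, 2000)
    print(query)
    print(esearch_url(query))
```

Wrapping the base query in parentheses keeps a trailing chain of `NOT` terms from changing how the added length term is grouped.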
studies that target fungal rrnas frequently attempt to obtain ssu rrna sequences that span most of the v and part of the v variable regions, or lsu rrna sequences that span the d and d variable regions as these have been shown to be phylogenetically informative [ , ]. these regions correspond to rfam rf model positions to (ssu) and rf model positions to (lsu). correspondingly, ribodbmaker is run with command-line options (--fmlpos and --fmrpos) that enforce that only sequences that span these model coordinates can pass. the following ribodbmaker options are used for ssu: --fione --fmnogap --fmlpos --fmrpos -f --model ssu.eukarya --skipclustr, and for lsu: --fione --fmnogap --fmlpos --fmrpos -f --model lsu.eukarya --skipclustr.

table web blastn usage of specialized rrna databases that are curated using ribovore to decide which sequences are valid. usage was measured during november , - june , .

database | runs | runs/day | visits | visits/day
s rrna sequences (ssu) from bacteria and archaea | , | . | , | .
s rrna sequences (ssu) from fungi and reference type material | , | . | , | .
s rrna sequences (lsu) from fungi and reference type material | , | . | , | .

ncbi blast webpage rrna target databases

for many years, ncbi has been offering searches of nucleotide and protein databases with various modules of blast [ ] through the ncbi blast webpage. most commonly, searches of nucleotide queries use a comprehensive “nonredundant (nr)” database of nucleotide sequences or databases of whole genomes. a disproportionate number of queries are rrna sequences.
when blastn users know that their queries are of these special types, searching smaller targeted databases that exclude sequences not expected to have a significant match to the query reduces running time and leads to more focused results. the blast webpage now allows users to select from three ribodbmaker-derived rrna target databases, listed in table . the s ssu rrna database is identical to the one used by the blastn submission pipeline. the other two are specific to fungi, due to the popularity of the analysis of rrna sequences for studies of that kingdom. the fungal ssu and fungal lsu blast databases are effectively equivalent to the sets of curated refseq records described below. the availability of these databases was announced in late , and the number of blastn runs and unique blastn visitors who selected each database during the seven-month period november , - june , are reported in table . the usage suggests that there is sufficient user demand to justify the curation effort. as of november , , there are , fungal ssu refseq records and , fungal lsu refseq records, almost all of which were curated with ribovore.

comparison to curated sets of fungal rrna sequences from silva

we iteratively compared our ribodbmaker approach to curating fungal refseq records with other curatorial efforts that are part of the silva project [ ]. the purposes of this comparison were: to identify new candidate refseq records and possible weaknesses in our procedures for choosing refseq records, to test whether ribovore works on curated data sets and to correct errors associated with some sequences in ncbi databases, such as misleading definition lines, and
to characterize what proportion of sequences curated by others pass the ribovore criteria and why sequences fail.

as noted above, fungal lsu sequences submitted since or should have definition lines that match this query. in the course of doing the silvaparc lsu tests described below, we corrected the definition lines of older sequences that are fungal lsu and passed all ribodbmaker tests, but did not match the above eutils query. the current fungal refseq ssu and lsu sequences can be obtained with the queries “prjna [bioproject]” and “prjna [bioproject]”, respectively, or both from the ftp site: https://ftp.ncbi.nlm.nih.gov/refseq/targetedloci/fungi/.

for our comparison of fungal sequences, we used a curated set of , ssu sequences from silva [ ], a set of , ssu sequences from silva in the phylum microsporidia, a set of , high-quality lsu reference sequences from silva, and a much larger set of , sequences from silva called parc [ , ]. we denote these four sets as yarza, silvamicrosporidia, silvaref, and silvaparc, respectively. to set up the silvamicrosporidia set, we downloaded the fasta files for all of silvaparc ssu and extracted , sequences labeled as being from the phylum microsporidia; of these, , sequences were in genbank with sufficient taxonomy information to be considered for ribodbmaker. silvaparc contains fewer than microsporidia lsu sequences, supporting our previous assertion that the ssu has been much more studied than the lsu in microsporidia. similarly, to set up the silvaref and silvaparc sets, we downloaded fasta files for all silvaref lsu sequences for all taxa in version on july , and all lsu parc sequences in version on august , . we filtered for all sequences that had the token “fungi” in the definition line.
a small number of sequences had to be dropped subsequently because 1) they are not from the kingdom fungi (e.g., they may be from a pathogen of a fungus), 2) they were absent from the nuccore database of genbank either due to being “unverified” or from certain types of patents, or 3) due to phylogenetic discrepancies (cf. [ ]) that we subsequently fixed as part of the first objective listed above. in all data sets, we retrieved the most recent version of all genbank accessions, which differs from the curated version for a very small number of sequences since version of silva is recent. in our analysis of the silvaparc set, non-fungal sequences were inadvertently included in the analysis, and excluded only while checking the results.

the results of the three comparisons are shown in table . the main steps in these tests consisted of:

download and uncompress a fasta file of source sequences from the supplementary information of [ ] or from the silva ftp retrieval site, which we denote file .fa.

as explained in the ribovore documentation, retrieve and condense the current version of ncbi’s taxonomy tree. an important and subtle column is the boolean ( / ) specified species column for each taxon; a in this column for the row of taxon t means that according to ncbi’s taxonomy group the taxon name is valid and currently preferred; a in this column is a necessary condition for a sequence from taxon t to be eligible to be in the rrna databases or to be a refseq. call the resulting file taxonomy.txt.
table summary of ribodbmaker pass/fail outcomes for yarza(ssu), silvamicrosporidia(ssu), silvaref(lsu), and silvaparc(lsu) datasets. all tests except the ingroup analysis depend only on the sequence being tested. the four tests for ambiguous nucleotides, specified species, vector contamination, and self-repeats are done on all sequences, so sequences may fail more than one test. only sequences that pass the ribotyper test are eligible as input to riboaligner. only sequences that pass the riboaligner test are eligible to be tested for length and alignment span. only sequences that pass all -sequence tests are eligible for ingroup analysis. the ingroup analysis can be done allowing many sequences from the same taxon to pass or limiting to the number of sequences that pass from each taxon (argument --fione). the many option is a more meaningful test; we show the option just for comparison.
dataset/test yarza silvamicrosporidia silvaref silvaparc (each column reports pass/fail)
ambiguous nucleotides / / / /
specified species / / / /
vector contamination / / / /
self-repeats / / / /
ribotyper / / / /
riboaligner / / / /
length in range? / / / /
expected span? / / / /
all -sequence tests? / / / /
ingroup analysis(many) / / / /
ingroup analysis( ) / / / /

(for tests of silva data only) extract the definition lines for sequence identifiers of interest with the command: grep fungi file .fa | grep -v bacteria or grep microsporidia file .fa | grep -v bacteria, redirecting the output to an intermediate file. the second command in the pipe removes most sequences from fungal pathogens that are not actually from the kingdom of fungi. for the fungal ssu set, all sequences are of interest, so the simpler command grep ">" file .fa extracts the definition lines.

extract genbank accessions without the versions from the definition lines obtained at the previous step. versions are removed because for some sequences, the version in silva has been superseded in genbank with a newer version.
use the ncbi package eutils [ ] to retrieve from the nucleotide database of genbank all currently live accessions from the accession sets derived at the previous step. some sequences get dropped at this step because they are no longer live. call the fasta file at this step file .fa.

use the ncbi standalone tool srcchk (available at: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/srcchk/) to check which sequences in file .fa have a valid and fully consistent taxonomy entry. remove sequences that do not get a normal result from srcchk because they will cause ribovore to halt. a small number (well below %) of sequences get removed at this step either because they are engineered sequences from patents or because there are transient inconsistencies between the ncbi taxonomy tree and the organism values in the genbank nucleotide records. call the resulting file file .fa.

run ribodbmaker --taxin taxonomy.txt --skipclustr --model --fmlpos --fmrpos --fmnogap --fione --pidmax --indiffseqtax -f -p . the value of was either ssu.eukarya, ssu.microsporidia, or lsu.eukarya, depending on the test being done. the values of and are set in a model-specific manner according to the recommended values in the ribovore documentation (ribodbmaker.md file in the documentation subdirectory). the is an upper bound on the number of sequences input in our various tests. version . of ribovore was used.
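The definition-line filtering and accession-extraction steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names are hypothetical, the match is made case-insensitive for convenience, and the assumed definition-line layout (an identifier of the form accession.version, optionally followed by further dot-separated coordinates, then a taxonomy string) is an assumption about the input files.

```python
def accession_without_version(defline: str) -> str:
    """Return the accession from a FASTA definition line, dropping the version
    (and any further dot-separated suffix) from the first token."""
    fields = defline.lstrip(">").split()
    if not fields:
        return ""
    return fields[0].split(".", 1)[0]


def select_accessions(lines, keep="fungi", drop="bacteria"):
    """Emulate `grep keep | grep -v drop` on definition lines (case-insensitive
    here, which is an assumption), then strip versions and de-duplicate."""
    accessions = []
    seen = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a definition line
        lowered = line.lower()
        if keep in lowered and drop not in lowered:
            acc = accession_without_version(line)
            if acc and acc not in seen:
                seen.add(acc)
                accessions.append(acc)
    return accessions


if __name__ == "__main__":
    # Made-up definition lines, for illustration only.
    example = [
        ">AB000001.1.1780 Eukaryota;Fungi;Ascomycota",
        "ACGTACGT",
        ">XY000002.2.1500 Bacteria;pathogen of Fungi",
        ">AB000001.2.1780 Eukaryota;Fungi;Ascomycota",
    ]
    print(select_accessions(example))
```

Dropping the version before retrieval means the lookup returns whatever version is currently live, which matches the stated reason for removing versions in the step above.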
for the large test of silvaparc, we ran the last ribodbmaker step separately on various subsets of file .fa, then collected all sequences that passed all sequence-specific tests, and did a final run of ribodbmaker to include the ingroup analysis. this split into multiple runs achieves better parallelism and throughput for large versions of file .fa because the ingroup analysis is the only step in which the results for any single sequence depend on which other sequences are included in the input. the numbers of eligible sequences reported in table are those included in file .fa. the results are slightly sensitive to changes in the ncbi taxonomy tree, which is updated daily. for the yarza tests, we used the ncbi taxonomy tree as of august , and for the silva tests we used the tree as of september , .

it was not our intent to compare sets of “passing” sequences because the criteria for fungal refseq records are deliberately more stringent than for inclusion in silva. most notable are the two taxonomic tests: 1) that each sequence should come from a specified species and 2) that in selecting sequences for refseq, we may choose to keep only one sequence per species taxid, as specified by the command-line option --fione, to avoid redundancy. indeed, the analysis of fungal sequences from silva yielded some new fungal refseq records; specifically, we added ssu sequences from the yarza set, and lsu sequences from the silvaref and silvaparc data sets. however, the test results also show some possible improvements in the silva curation. it appears that a small number of silva sequences have vector contamination and more than % may be misassembled as indicated by self-repeats, which are not expected in fungal ssu and lsu rrna genes (see methods, subsection ribodbmaker for the self-repeat criteria).
it appears that the yarza ssu data set was carefully curated for sequences to be full-length and not too long, but in the silvaref and silvaparc data sets more than % of sequences have either a length that is out of the range of typical eukaryotic lsu sequences or do not span the range [ , ] that includes the d /d regions typically covered for species differentiation. thus, it appears that the silva resource curation could arguably be improved by checking sequence ends, so as to trim long sequences, remove short sequences, and remove sequences that are unlikely to be full lsu sequences. the sequences could be too long either because they were not trimmed to the lsu boundary or possibly because they contain introns. in the silvamicrosporidia test, we tested for the presence of the most conserved v and v regions only with a permissive expected span of [ , ].

partial comparison to rnammer

to our knowledge, there is no other software that solves the rrna validation problem as we have formulated it for genbank submissions. one widely used software package that solves a related problem is rnammer [ ]. the problem that rnammer solves is to find likely rrna sequences within larger sequences by using an old version of hmmer (v . . ) to compare against one of six profile hmms. to accelerate searches, rnammer first utilizes a small spotter profile hmm that models only the most conserved consecutive positions of the overall rrna alignment to detect rrna regions, padding those regions with extra sequence on each end, and then
using a profile of the full rrna to determine gene boundaries within those padded regions. so far as we could determine, rnammer does not work with the up-to-date hmmer version nor with arbitrary profile hmms or cms.

nevertheless, one can provide as input to rnammer fasta files of putative rrna sequences, pretending that they were larger contigs. the module of ribovore that is closest in purpose to this usage of rnammer is ribotyper. to allow comparison between rnammer and ribotyper, one can define that a sequence passes rnammer if rnammer produces in the output at least one hmmer-based prediction for that sequence using the intended rrna model (e.g., eukaryotic ssu for the yarza set) and fails if there are zero such predictions. this comparison is unfair to rnammer because it does not use the predicted intervals, which are the most useful part of the output when rnammer is used with large contigs as inputs.

we compared the performance of ribotyper versus rnammer on the yarza ssu set and the silvaref lsu set. we used the ribotyper results obtained from the ribodbmaker tests described above and summarized in table . among the , ssu sequences in the yarza set: , passed both ribotyper and rnammer, failed both ribotyper and rnammer, passed rnammer and failed ribotyper, and passed ribotyper and failed rnammer. among the set of , sequences include “internal transcribed spacer” or “its” in the definition line, and of the other have lengths above , nt, indicating that all but at least of the sequences likely include sequence outside the ssu rrna sequence (which is rarely more than kb) and so are expected to fail ribotyper. of the sequences that passed ribotyper and failed rnammer, of them would pass rnammer if the e-value and bit score thresholds for the spotter hmm, which are hard-coded at e- and , were changed to and - in the rnammer perl script, indicating that these sequences do not match well to the spotter profile hmm used for eukaryotic ssu rrna.
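The four-way breakdown used in this comparison (passed both, failed both, passed only one tool) is a simple 2x2 tally over per-sequence pass/fail calls. The sketch below is illustrative only: the function name and the toy identifiers are hypothetical, and it assumes the pass sets have already been derived from each tool's output using the pass definition given above.

```python
from collections import Counter


def concordance(all_ids, ribotyper_pass, rnammer_pass):
    """Tally a 2x2 pass/fail table for two tools over the same sequence ids."""
    counts = Counter(
        (seq_id in ribotyper_pass, seq_id in rnammer_pass) for seq_id in all_ids
    )
    return {
        "passed_both": counts[(True, True)],
        "failed_both": counts[(False, False)],
        "rnammer_only": counts[(False, True)],    # passed rnammer, failed ribotyper
        "ribotyper_only": counts[(True, False)],  # passed ribotyper, failed rnammer
    }


if __name__ == "__main__":
    # Toy identifiers, not real accessions.
    ids = ["a", "b", "c", "d", "e"]
    table = concordance(ids, ribotyper_pass={"a", "b"}, rnammer_pass={"b", "c"})
    for category, n in table.items():
        print(category, n)
```

The two off-diagonal cells are the ones examined individually in the text, since they show where the tools disagree.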
among the , lsu sequences in the silvaref set: , passed both rnammer and ribotyper, failed both rnammer and ribotyper, passed rnammer and failed ribotyper, and passed ribotyper and failed rnammer. among the sequences that passed rnammer and failed ribotyper, / can be explained because they have one of three errors that ribotyper looks for and would not necessarily lead rnammer to have no output matches: r_duplicateregion ( sequences), r_bothstrands ( sequences), r_multiplefamilies ( sequences). many of the sequences are described on the definition lines as a “shotgun assembly”; accordingly, the r_duplicateregion and r_bothstrands errors indicate two different errors that occur commonly in assembling nucleotide sequences into contigs. some of the sequences have both ssu and lsu in the definition lines and if accurate, that should lead to an r_multiplefamilies error. these sequences that match both genes could have been trimmed before inclusion in the silvaref lsu set. in principle, one could detect the presence of an ssu match and an lsu match in the same sequence with rnammer, but one would have to add error rules to rnammer to decide when the occurrence of matches to both ssu and lsu models is an error. that need for error semantics exemplifies how ribotyper differs in functionality from rnammer. of the sequences that passed ribotyper and failed rnammer, of them would pass rnammer if, as for the yarza set, the
e-value and bit score thresholds for the spotter hmms were changed to and - respectively, indicating that these sequences do not match well to the eukaryotic lsu rrna spotter profile hmm.

in general, there are a large number of discrepant outcomes and neither rnammer nor ribotyper is consistently more restrictive than the other. we infer merely that the two pieces of software solve different problems and there is not a straightforward way to modify rnammer to solve the problem of checking rrna submissions to genbank. this helps to justify why we developed the new software ribovore. as explained above, ribovore also has additional modules, such as ribodbmaker, that are even less comparable to rnammer and solve other problems in rrna sequence validation and curation.

limitations and future directions

ribovore includes profile models (table ), only two of which are used for automated submission checking (bacterial ssu rrna and archaeal ssu rrna), and seven of which (the first seven rows in table ) have been used in the context of ribodbmaker to generate one or more blastn databases or refseq records. eight of the remaining models were created from alignments of fewer than sequences, and need to be improved by adding more sequences. however, some of the models, especially those based on rfam alignments such as eukaryotic ssu and lsu rrna, could in principle be used for submission checking by ribosensor and we plan to investigate those possibilities based on empirical testing in the future. beyond the existing models, more models are needed for other rrna genes and taxonomic domains, such as mitochondrial lsu rrna, microsporidia lsu rrna, eukaryotic . s rrna and s rrna. rfam includes alignments for some of these (e.g. . s and s rrna) and future versions of ribovore could include models based on those, but manual curation effort will be required to create others.
one limitation of ribovore is that there are many parameters and the user may need to choose the settings carefully for each distinct purpose. for example, the usage of ribodbmaker should be tuned for each gene and taxonomic domain, as we have reported here for fungal ssu and lsu rrna to require the commonly targeted regions of those respective genes to be present in the sequences. another limitation is that we do not model introns, simply expecting any introns to be unaligned in the ribotyper and riboaligner analysis. additionally, minimum criteria (e.g. minimum score and coverage values) for passing sequences in the ribotyper, ribosensor and riboaligner tools should be set based on empirical testing, and the default values for those programs are currently tailored to prokaryotic ssu rrna based on our internal testing. expansion to other genes and taxonomic domains will require additional testing of those values.

for some applications, the running time of ribovore programs can be a significant limitation. profile-based cm or profile hmm methods that compare a few profiles (in this case, at most ) to each input sequence can be more efficient than single-sequence based methods like blastn which typically compare many database sequences (in this case, more than ) to each input sequence, but of course this depends on the relative speed of each profile to sequence and sequence to sequence comparison. cm methods that score both sequence and secondary structure conservation are computationally complex. on a single cpu, alignment of a single
full length lsu rrna sequence typically takes several seconds. for this reason, ribotyper and ribosensor, which are intended to handle sequence submissions of up to millions of sequences, do not, by default, compute an alignment using the cm, but rather use only more efficient profile hmm algorithms. the riboaligner and ribodbmaker programs, however, do compute alignments using cms and so take longer per sequence, although these programs need to be run less frequently, at least for genbank.

the ribosensor program, which runs both ribotyper and rrna_sensor, which is blastn-based, combines both profile and single-sequence methods. we measured the running time of rrna_sensor by itself and ribosensor on s sequences using , , , , and processors. the rrna_sensor program took s, s, s, s, and s, respectively; ribosensor took s, s, s, s, s, respectively. thus, a submission of million sequences, which is on the high end, would take hours to process given processors on the host computer. the programs ribotyper, ribosensor, riboaligner, and ribodbmaker all include a command-line option -p that enables finer-grained parallelization by splitting the input file into roughly equal sized chunks and processing each independently on nodes on a compute cluster. however, for ribodbmaker, only the ribotyper and riboaligner steps are parallelized in this way.

while doing the comparison of curated fungal datasets, we also timed ribotyper, riboaligner, and ribodbmaker on the , sequence yarza fungal ssu set and the , sequence silvaref fungal lsu set as described in implementation. the ribotyper program required m s ( . s per sequence) on the yarza set and m s ( . s per sequence) on the silvaref set. the riboaligner program took m s ( . s per sequence) on the yarza set and m s ( . s per sequence) on the silvaref set.
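The idea behind the -p option described above, splitting the input into roughly equal sized chunks that can be processed independently, can be illustrated with a short Python sketch. This is not ribovore's implementation: the function name is hypothetical, records are treated as opaque items, and chunks are formed from contiguous runs, which is only one of several reasonable choices.

```python
def split_into_chunks(records, n_chunks):
    """Split a list of records into at most n_chunks contiguous chunks whose
    sizes differ by no more than one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    n_chunks = min(n_chunks, len(records)) or 1  # never more chunks than records
    base, extra = divmod(len(records), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(records[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    sequences = [f"seq{i}" for i in range(10)]  # placeholder sequence names
    for chunk in split_into_chunks(sequences, 3):
        print(len(chunk), chunk)
```

This kind of split is only safe for steps whose result for one sequence does not depend on the other sequences in the input, which is why a step such as the ingroup analysis has to be run on the combined set afterwards.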
the ribodbmaker program took m s wall-clock time and m s cumulative time for all processors on the yarza set and m s wall-clock time and m s cumulative time for the silvaref set. in general, we conclude that these analyses are tractable for tens of thousands of sequences at a time.

conclusions

our primary contribution described herein is the software package ribovore for rrna sequence analysis. at ncbi since july , ribovore has been used to check the quality of incoming submissions and to curate datasets of high quality sequences for refseq or to use as blastn databases. in the submission checking context, ribovore has been used to check nearly million s bacterial and archaeal ssu rrna sequences through may , and millions more after that date. ribovore has also been used manually by genbank indexers when blastn analyses gave uncertain results for other rrnas. a subset of the blastn databases created by ribovore are selectable by users of the blast webpage as target databases, and are used in over , web blastn runs per day. we also are using ribovore internally to curate fungal refseq records for ssu and lsu rrna from type material. we showed that this curation effort is complementary to the larger silva effort, as it selects only the best sequences that pass a larger battery of tests. furthermore, the refseq records are linked within entrez to other ncbi resources including biocollections, bioprojects, taxonomy, and blast. with this formal report of how
ribovore is designed and implemented, we hope that both producers and consumers of rrna sequence data will achieve a new understanding of how rrna sequences are curated in genbank, refseq, and associated resources.

availability and requirements

project name: ribovore
project home page: https://github.com/ncbi/ribovore
operating system(s): linux, mac/osx
programming language: perl
other requirements: blast+ v . . , infernal v . . , sequip v . , vecscreen plus taxonomy v . (see table )
license: public domain
any restrictions to use by non-academics: none

abbreviations

ncbi: national center for biotechnology information; rrna: ribosomal rna; ssu rrna: small subunit ribosomal rna; lsu rrna: large subunit ribosomal rna; cm: covariance model; hmm: hidden markov model; nt: nucleotides; kb: kilobase ( nucleotides);

ethics approval and consent to participate

not applicable.

consent for publication

not applicable.

availability of data and materials

all data generated or analyzed during this study are included in this published article, its supplementary material, or ncbi’s genbank database. code is available on github (https://github.com/ncbi/ribovore). blast databases are available in the directory https://ftp.ncbi.nlm.nih.gov/blast/db/. the supplementary material includes instructions for reproducing the comparisons reported in the article.

competing interests

the authors declare that they have no competing interests.

funding

this research was supported by the intramural research of the national institutes of health, national library of medicine (nlm) and national cancer institute.

author’s contributions

aas and epn conceived of and designed the software and wrote most of the paper. rm, br, cs assisted by writing passages about fungi and about practical usages of ribovore within ncbi. epn wrote most of the ribovore code and aas wrote rrna_sensor. all authors participated in the design and user interface of at least one ribovore module.
epn, aas, rm, br, aj, bau formally tested the software. rm curated the rrna databases. br selected and curated the fungal refseqs. cs guided the multiple usages of ncbi taxonomy in ribovore and corrected taxonomy inconsistencies as they were detected. epn, rm, aj, and bau collected data on ribovore usage. rm, aj, and bau used ribovore to evaluate submissions to genbank. ik-m supervised the work of rm, aj, and bau. all authors read and edited multiple versions of the manuscript and approved the final version.

acknowledgements

thanks to our ncbi colleagues alex kotliarov and sergiy gotvyanskyy for assistance in integrating ribovore into genbank processing pipelines and for collecting data on ribovore usage. thanks to our ncbi colleague richa agarwala for providing access to an isolated linux computer on which we could do sole-user measurements of running time.

author details

cancer data science laboratory, national cancer institute, national institutes of health, bethesda, md, usa. national center for biotechnology information, national library of medicine, national institutes of health, bethesda, md, usa.

references

. woese cr, fox ge. phylogenetic structure of the prokaryotic domain: the primary kingdoms. proc natl acad sci usa. ; : – . . pace nr, stahl da, lane dj, olsen gj. analyzing natural microbial populations by rrna sequences. asm news. ; : – . . weller r, ward dm. selective recovery of s rrna sequences from natural microbial communities in the form of cdna. appl environ microbiol. ; : – . . giovannoni sj, britschgi tb, moyer cl, field kg. genetic diversity in sargasso sea bacterioplankton.
additional files

additional file
we provide ribovore-paper-supplemental-material.tar.gz, a gzipped tar archive with sequence files and instructions for reproducing the tests of ribovore and rnammer described in results and discussion, that includes a readme.txt with file descriptions. unpack with the command ’tar xf ribovore-paper-supplementary-material.tar.gz’.
the covid-19 pharmacome: a method for the rational selection of drug repurposing candidates from multimodal knowledge harmonization

bruce schultz, andrea zaliani, christian ebeling, jeanette reinshagen, denisa bojkova, vanessa lage-rupprecht, reagon karki, sören lukassen, yojana gadiya, neal g. ravindra, sayoni das, shounak baksi, daniel domingo-fernández, manuel lentzen, mark strivens, tamara raschka, jindrich cinatl, lauren nicole delong, phil gribbon, gerd geisslinger, sandra ciesek, david van dijk, steve gardner, alpha tom kodamullil, holger fröhlich, manuel peitsch, marc jacobs, julia hoeng, roland eils, carsten claussen, and martin hofmann-apitius*

fraunhofer institute for algorithms and scientific computing scai, department of bioinformatics, institutszentrum birlinghoven, sankt augustin, germany
fraunhofer institute for translational medicine and pharmacology itmp, screeningport, hamburg, germany
fraunhofer cluster of excellence for immune mediated diseases, cimd, external partner site, hamburg, germany
precisionlife ltd. unit b bankside, hanborough business park, long hanborough, oxfordshire, ox lj, united kingdom
philip morris international r&d, biological systems research, r&d innovation cube t .
, quai jeanrenaud , ch- neuchatel, switzerland
causality biomodels pvt ltd., kinfra hi-tech park, kerala technology innovation zone- ktiz, kalamassery, cochin, -india
center for digital health, charité universitätsmedizin berlin & berlin institute of health (bih)
center for biomedical data science, yale school of medicine, yale university, cedar street, new haven, ct , usa
pharmazentrum frankfurt/zafes, institut für klinische pharmakologie, klinikum der goethe-universität frankfurt, frankfurt am main, germany
fraunhofer institute for translational medicine and pharmacology itmp, frankfurt am main, germany
institute for medical virology, university hospital frankfurt, frankfurt am main, germany
dzif, german centre for infection research, external partner site, frankfurt am main, germany
* martin.hofmann-apitius@scai.fraunhofer.de

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted february , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license (http://creativecommons.org/licenses/by-nc-nd/ . /).

abstract

the sars-cov-2 pandemic has challenged researchers at a global scale. the scientific community’s massive response has resulted in a flood of experiments, analyses, hypotheses, and publications, especially in the field of drug repurposing. however, many of the proposed therapeutic compounds obtained from sars-cov-2 specific assays are not in agreement and thus demonstrate the need for a singular source of covid-19 related information from which a rational selection of drug repurposing candidates can be made. in this paper, we present the covid-19 pharmacome, a comprehensive drug-target-mechanism graph generated from a compilation of separate disease maps and sources of experimental data focused on sars-cov-2/covid-19 pathophysiology.
by applying our systematic approach, we were able to predict the synergistic effect of specific drug pairs, such as remdesivir and thioguanosine or nelfinavir and raloxifene, on sars-cov-2 infection. experimental validation of our results demonstrates that our graph can be used not only to explore the involved mechanistic pathways, but also to identify novel combinations of drug repurposing candidates.

introduction and motivation

covid-19 is the term coined for the pandemic caused by sars-cov-2. unprecedented in the history of science, this pandemic has elicited a worldwide, collaborative response from the scientific community. in addition to the strong focus on the epidemiology of the virus, experiments aimed at understanding mechanisms underlying the pathophysiology of the virus have led to new insights in a comparably short amount of time. in the field of computational biology, several initiatives have started generating disease maps that represent the current knowledge pertaining to covid-19 mechanisms. such disease maps have previously proven valuable in diverse areas of research. when taken together with related work including cause-and-effect modeling, entity relationship graphs, and pathways, these disease maps represent a considerable number of highly curated “knowledge graphs” which focus primarily on covid-19 biology. here, we use the term “mechanism” to describe a single or multiple cause-and-effect relationships (i.e.
a subgraph), “pathways” to refer to a well-established series of interactions resulting in cellular change or a defined product, and “models” to describe a collection of experimental data or known interactions defined in the context of a particular biological process or pathology. as of july , a collection consisting of models representing core knowledge about the pathophysiology of sars-cov-2 and its primary target, the lung epithelium, was shared with the public. with the rapidly increasing generation of data (e.g. transcriptome, interactome, and proteome data), we are now in a position to challenge and validate these covid-19 pathophysiology knowledge graphs with experimental data. this is of particular interest as validation of these knowledge graphs bears the potential to identify those disease mechanisms highly relevant for targeting in drug repurposing approaches.

the concept of drug repurposing (the secondary use of already developed drugs for therapeutic uses other than those they were designed for) is not new. the major advantage of drug repurposing over conventional drug development is the massive decrease in time required for development, as important steps in the drug discovery workflow have already been successfully passed for these compounds. our group and many others have already begun performing assays to screen for experimental compounds and approved drugs to serve as new therapeutics for covid-19.
dedicated drug repurposing collections, such as the broad institute library and the even more comprehensive reframe library, were used to experimentally screen either for viral proteins as targets for functional inhibition or for virally infected cells in phenotypic assays. in our own work, compounds were assessed for their inhibition of virus-induced cytotoxicity using the human cell line caco-2 and a sars-cov-2 isolate. a total of compounds with ic50 < µm were identified, of which % have not previously been reported as being active against sars-cov-2. out of the active compounds, are approved drugs, are in phases - and are preclinical candidate molecules. the described mechanisms of action for the inhibitors included kinase signaling, pde activity modulation, and long chain acyl transferase inhibition (e.g. “azole class antifungals”).

the approach presented here integrates experimental results and the output from other informatic pipelines, and combines proprietary and public data to provide a comprehensive overview of the therapeutic efficacy of candidate compounds, the mechanisms targeted by these candidate compounds, and a rational approach to test the drug-mechanism associations for their potential in combination therapy.
methodology

generation of the covid-19 pharmacome

disparate covid-19 disease maps focus on different aspects of covid-19 pathophysiology. based on comparisons of the covid-19 knowledge graphs, we found that no single disease map covers all aspects relevant for the understanding of the virus, host interaction, and the resulting pathophysiology. thus, we optimized the representation of essential covid-19 pathophysiology mechanisms by integrating several public and proprietary covid-19 knowledge graphs, disease maps, and experimental data (supplementary table ) into one unified knowledge graph, the covid-19 supergraph. to this end, we converted all knowledge graphs and interactomes into openbel, a language that is both ideally suited to capture and represent “cause-and-effect” relationships in biomedicine and fully interoperable with major pathway databases.

in order to ensure that molecular interactions were correctly normalized, individual pipelines were constructed for each model to convert the raw data to the openbel format. for example, the covid-19 disease map contained separate files, each of which represented a specific biological focus of the virus. each file was parsed individually, and the entities and relationships that did not adhere to the openbel grammar were mapped accordingly. whilst most of the entities and relationships in the source disease maps could be readily translated into openbel, a small number of triples from different source disease maps required a more in-depth transformation. when classic methods of naming objects in triples failed, the recently generated covid-19 ontology as well as other available standard ontologies and vocabularies were used to normalize and reference these entities.
in addition to combining the listed models, we also performed a dedicated curation of the covid-19 supergraph in order to annotate the mechanisms pertaining to selected targets and the biology around prioritized repurposing candidates. the resulting bel graphs were quality controlled and subsequently loaded into a dedicated graph database system underlying the biomedical knowledge miner (bikmi), which allows for comparison and extension of biomedical knowledge graphs (see http://bikmi.covid -knowledgespace.de).

once the models were converted to openbel and imported into the database, the resulting nodes from each mechanism-based model were compared (figure ). even when separated by data origin type, the covid-19 knowledge graphs had very little overlap ( shared nodes between all manually curated models and no shared nodes between all models derived from interaction databases), but by unifying the models, our covid-19 supergraph improves the coverage of essential virus- and host-physiology mechanisms substantially.

figure : venn diagrams comparing major mechanistic models in the covid-19 supergraph. mechanism-based models were divided, and their entities compared within their resulting subgroups. model abbreviations are defined in supplementary table . a) manual node comparison shows the overlap of entities in the models that are knowledge-based, manually curated relationships that have been directly encoded in openbel. b) automated node comparison shows the overlap of entities in models re-encoded into openbel from other formats (e.g. sbml models).
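the node comparison summarized in the venn diagrams can be illustrated with plain set operations. the following is a minimal sketch, not the authors' pipeline: the model names and node labels are hypothetical toy values, and the helper functions are introduced here for illustration only.

```python
# Toy node sets for three hypothetical models (labels are illustrative only,
# not taken from the COVID-19 supergraph).
models = {
    "model_a": {"ACE2", "TMPRSS2", "IL6"},
    "model_b": {"ACE2", "STAT1"},
    "model_c": {"ACE2", "IL6", "NFKB1"},
}


def shared_nodes(models):
    """Return the nodes that occur in every model (centre of the Venn diagram)."""
    return set.intersection(*models.values())


def supergraph_nodes(models):
    """Return the union of all model nodes, i.e. the node set of a unified graph."""
    return set.union(*models.values())


def coverage(models):
    """Fraction of the unified node set that each individual model covers."""
    total = supergraph_nodes(models)
    return {name: len(nodes) / len(total) for name, nodes in models.items()}


common = shared_nodes(models)
unified = supergraph_nodes(models)
per_model_coverage = coverage(models)
```

in this toy case a small shared core coexists with a much larger union, which is the situation the text describes: each model alone covers only part of the unified node set.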
additionally, by enriching the covid-19 supergraph with drug-target information linked from highly curated drug-target databases (drugbank, chembl, pubchem), we created an initial version of the covid-19 pharmacome, a comprehensive drug-target-mechanism graph representing covid-19 pathophysiology mechanisms that includes both drug targets and their ligands (figure ). in order to maximize its utility, this network includes both experimentally validated drug-target relationships and a wide distribution of biological entities and concepts (supplementary figure ). the entire covid-19 pharmacome was manually inspected and re-curated; this graph database is openly accessible to the scientific community at http://graphstore.scai.fraunhofer.de.

figure : the covid-19 supergraph integrates drug-target information to form the covid-19 pharmacome. a) an aggregate of constituent covid-19 computable models covering a wide spectrum of pathophysiological mechanisms associated with sars-cov-2 infection was harmonized to generate the mechanism-based covid-19 supergraph. b) the covid-19 supergraph is annotated with drug-target information from a variety of curated sources to generate the covid-19 pharmacome composed of nodes (representing proteins,
pathologies, and other biological entities/concepts) and edges (indicating relationships or interactions between the pair of nodes they connect).

systematic review and integration of information from phenotypic screening

at the time of writing, six phenotypic cellular screening experiments had been shared via archive servers and journal publications (supplementary table ). although only a limited number of these manuscripts have been officially accepted and published, we were able to extract their primary findings from the pre-publication archive servers. the significant number of reports on drug repurposing screenings in the covid-19 context demonstrates how appealing the concept of drug repurposing is as a quick answer to the challenge of a global pandemic. drug repurposing screenings were all performed with compounds for which a significant amount of information on safety in humans and primary mechanism of action is available. we generated a list of “hits” from cellular screening experiments, while results derived from publications that reported on in-silico screening were ignored. we thereby keep a strict focus on well-characterized, well-understood candidate molecules in order to ensure that one of the pivotal advantages of this knowledge base is its use for drug repurposing.

subgraph annotation

the covid-19 pharmacome contains several subgraphs, three of which correspond to major views on the biology of sars-cov-2 as well as the clinical impact of covid-19:
- the viral life cycle subgraph focuses on the stages of viral infection, replication, and spreading.
- the host response subgraph represents essential mechanisms active in host cells infected by the virus.
- the clinical pathophysiology subgraph illustrates major pathophysiological processes of clinical relevance.

these subgraphs were annotated by identifying nodes within the covid-19 pharmacome that represent specific biological processes or pathologies associated with each subgraph category and traversing out to their first-degree neighbors. for example, a biological process node representing “viral translation” would be classified as a starting node for the viral life cycle subgraph, while a node defined as “defense response to virus” would be categorized as belonging to the host response subgraph. though the viral life cycle and host response subgraphs contain a wide variety of node types, the pathophysiology subgraph is restricted to pathology nodes associated with either the sars-cov-2 virus or the covid-19 pathology.

mapping of gene expression data onto the covid-19 pharmacome

two single cell sequencing data sets representing infected and non-infected cells directly derived from human samples and cultured human bronchial epithelial cells (hbecs) were used to identify the areas of the covid-19 pharmacome responding at gene expression level to sars-cov-2 infection. details of the gene expression data processing and mapping are available in the supplementary material (section gene expression data analysis).

pathway enrichment

associated pathways for subgraphs and significant targets were identified using the enrichr feature of the gseapy python package.
briefly, gene symbol lists were assembled from their respective subgraph or dataset and compared against multiple pathway gene set libraries including reactome, kegg, and wikipathways. to account for multiple comparisons, p-values were corrected using the benjamini-hochberg method and results with p-values < . were considered significantly enriched.

drug repurposing screening

we performed phenotypic assays to screen for repurposing drugs that inhibit the replication and the cytopathic effects of virus infection. a derivative of the broad repurposing library was used to incubate caco-2 cells before infecting them with an isolate of sars-cov-2 (ffm- isolate, see ). survival of cells was assessed using a cell viability assay and measured by high-content imaging using the operetta cls platform (perkinelmer). details of the drug repurposing screening are described in the supplemental material.

drug combinations assessment with anti-cytopathic effect measured in caco-2 cells

as described in ellinger et al., we challenged four combinations of five different compounds with the sars-cov-2 virus in four -well plates containing two drugs each. eight drug concentrations were chosen ranging from µm to . µm, diluted by a factor of and positioned orthogonally to each other in rows and columns. no pharmacological control was used, only cells with and without exposure to sars-cov-2 virus at . moi. in addition, recently published data from the work of bobrowski et al. were mapped to the covid-19 pharmacome and compared to the results of the combinatorial treatment experiments performed here.
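the benjamini-hochberg correction mentioned under pathway enrichment can be sketched in a few lines. this is a standalone illustration of the procedure itself, not the gseapy implementation used by the authors; the function names, the example p-values, and the `alpha` cut-off are assumptions made for the example (the threshold used in the paper is not recoverable from the text).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values, alpha):
    """Flag each test whose adjusted p-value falls below alpha (hypothetical cut-off)."""
    return [p_adj < alpha for p_adj in benjamini_hochberg(p_values)]


# Hypothetical raw p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
flags = significant(raw, alpha=0.03)
```

each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone, which is what distinguishes this procedure from a plain bonferroni scaling.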
results

comparative analysis of the hits from different repurposing screenings

data from six published drug repurposing screenings were downloaded, and extensive mapping and curation was performed in order to harmonize chemical identifiers. the curated list of drug repurposing “hits” together with an annotation of the assay conditions is available at http://chembl.blogspot.com/ / /chembl -sars-cov- -release.html

initially, we analyzed the overlap between compounds identified in the reported drug repurposing screening experiments. figure a shows no overlap between experiments, which is not surprising, as we are comparing highly specific candidate drug experiments with screenings based on large drug repositioning libraries. however, the overlap is still quite marginal for those screenings where large compound collections (broad library, reframe library) have been used.

figure : overlap of compound hits between different drug repurposing screening experiments. a) direct comparison of overlapping hits in drug repurposing screenings revealed no overlap between the experiments. these experiments were performed using different cell types (vero e6 cells and caco-2 cells). b) protein target space overlap between different covid-19 drug repurposing screenings. drug targets were identified by confidence level >= and single protein targets according to the chembl database. comparison of experiments indicates over one hundred common protein targets.
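the contrast between panels a and b, no shared compounds but many shared protein targets, follows from projecting each hit list onto its targets before intersecting. a minimal sketch under stated assumptions: the screen names, compound names, and compound-to-target mapping below are hypothetical placeholders, whereas the paper takes its mapping from chembl.

```python
# Hypothetical hit lists from two screens (placeholder names only).
screen_hits = {
    "screen_1": {"compound_a", "compound_b"},
    "screen_2": {"compound_c"},
}

# Hypothetical compound-to-target annotation; most drugs bind several targets.
compound_targets = {
    "compound_a": {"target_x", "target_y"},
    "compound_b": {"target_z"},
    "compound_c": {"target_y", "target_w"},
}


def target_space(hits, compound_targets):
    """Collect every annotated target of the compounds in one hit list."""
    targets = set()
    for compound in hits:
        targets |= compound_targets.get(compound, set())
    return targets


# Overlap at the compound level (panel a style comparison).
compound_overlap = set.intersection(*screen_hits.values())

# Overlap at the target level (panel b style comparison).
target_overlap = set.intersection(
    *(target_space(hits, compound_targets) for hits in screen_hits.values())
)
```

here the two screens share no compound, yet both reach `target_y`, so the target-level comparison recovers an overlap that the compound-level comparison misses.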
mapping of repurposing hits to target proteins

in order to identify which proteins are targeted by the repurposing hits, and to investigate the extent to which there are overlaps between repurposing experiments at the target/protein level, we mapped all the identified compounds from the drug repurposing experiments to their respective targets. as most drugs bind to more than one target, we increase the likelihood of overlaps between the drug repurposing experiments when we compare them in the protein/target space. indeed, figure b shows an overlap of targets between all the drug repurposing experiments, thereby creating a list of potential proteins for therapeutic intervention when the compound targets are considered rather than the compounds themselves.

the covid-19 pharmacome associates pathways derived from drug repurposing targets with pathophysiology mechanisms

a non-redundant list of drug repurposing candidate molecules that display activity in phenotypic (cellular) assays was generated and mapped to the covid-19 pharmacome. figure shows the distribution of repurposing drugs in the covid-19 cause-and-effect graph, the “responsive part” of the graph that is characterized by changes in gene expression associated with sars-cov-2 infection, and the overlap between the two subgraphs. this overlap analysis allows for the identification of repurposing drugs targeting mechanisms that are modulated by viral infection.
a total of mechanisms were identified as being targeted by most of the drug repurposing candidates (see section “associated pathway identification” in supplementary materials). when compared to the annotated subgraphs in the covid-19 pharmacome, of the determined associated pathways found for the viral life cycle subgraph overlapped with those for the drug repurposing targets while the host response subgraph shared of its pathways.

mapping of drug repurposing signals to hypervariable regions of the covid-19 pharmacome

one of the key questions arising from the network analysis is whether the repurposing drugs target mechanisms that are specifically activated during viral infection. in order to establish this link, we mapped differential gene expression analyses from two single-cell sequencing studies to our covid-19 pharmacome (see section “differential gene expression” in supplementary material). an overlay of differential gene expression data (adjusted p-value ≤ . and abs(log fold-change) > . ) on the covid-19 pharmacome reveals a distinct pattern characterized by high responsiveness (expressed by variation of regulation of gene expression) to the viral infection (figure a).
figure : identification of suitable targets for combination therapy by comparing subgraphs within the covid-19 pharmacome. incorporation of gene expression data into the covid-19 pharmacome resulted in a subgraph characterized by the entities (genes/proteins) that respond to viral infection (a). mapping of the filtered results obtained from drug repurposing screenings (ic50 < µm) to the pharmacome resulted in a subgraph enriched for drug repurposing targets (b). the intersection between subgraphs presented in (a) and (b) is highly enriched for drug repurposing targets directly linked to the viral infection response (c).

virus-response mechanisms are targets for repurposing drugs

in the next step, we analyzed which areas of the covid-19 graph respond to sars-cov-2 infection (indicated by significant variance in gene expression) and are targets for repurposing drugs. to this end, we mapped signals from the drug repurposing screenings to the subgraph that showed responsiveness to sars-cov-2 infection (figure b). figure c depicts the resulting subgraph that is characterized by the transcriptional response to sars-cov-2 infection and the presence of target proteins of compounds that have been identified in drug repurposing screening experiments.

the covid-19 pharmacome supports rational targeting strategies for covid-19 combination therapy

we mapped existing combinatorial therapy data to the covid-19 pharmacome in order to evaluate its potential in guiding rational approaches towards combination therapy using repurposing drug candidates.
combinatorial treatment data obtained from the results published by bobrowski et al. and ellinger et al. were mapped to the covid-19 pharmacome. figure provides an overview of the mapped compounds, their protein targets, and the interaction mechanisms. analysis of the overlaps between the drug repurposing screening data showed that four of the ten compounds reported in the synergistic treatment approach by drug repurposing data were represented in our initial non-redundant set of candidate repurposing drugs.

figure : visualization of drug repurposing candidates (and their targets) used in combination treatment experiments. the subgraph depicts the drug repurposing candidate molecules in relation to each other and their targets. shortest path lengths between drug combinations were calculated from this subgraph and are available in the supplementary material (supplementary table ).

based on the association between repurposing drug candidates and the areas of the covid-19 pharmacome that respond to sars-cov-2 infection (figure ), we hypothesized that the number of edges between a pair of drug nodes may be linked to the effectiveness of the drug combination (supplementary figure ). in order to evaluate whether the determined outcome of a combination of drugs correlated with the distance between said drug nodes, we compared distances for combinations of drugs within the covid-19 pharmacome for which
their effect was known (supplementary tables & ). of the drug combinations we were able to check within the covid-19 pharmacome, we found that the pairs of drugs known to have a synergistic effect in the treatment of sars-cov-2 had an average shortest path length of . , while antagonistic combinations were found to be farther apart with an average shortest path length of . (supplementary table ). based on our calculations, we formulated three categories for predicting the outcome of new drug combinations on infection using the shortest path lengths between them within the covid-19 pharmacome. drug combinations with shortest path lengths of indicate a synergistic relationship between the compounds, was determined to be inconclusive as our calculations did not justify a specific outcome, and those with a shortest path length of or more were predicted to have an antagonistic relationship.

in order to test our ability to predict the outcome of novel drug combinations, we selected five compounds: remdesivir (a virus replicase inhibitor), nelfinavir (a virus protease inhibitor), raloxifene (a selective estrogen receptor modulator), thioguanosine (a chemotherapy compound interfering with cell growth), and anisomycin (a pleiotropic compound with several pharmacological activities, including inhibition of protein synthesis and nucleotide synthesis). these compounds were used in four different combinations (remdesivir/thioguanosine, remdesivir/raloxifene, remdesivir/anisomycin and nelfinavir/raloxifene) to test the potency of these drug pairings in phenotypic, cellular assays.
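the three-category rule described above can be sketched as a breadth-first search over an undirected graph followed by a threshold lookup. this is a minimal sketch, not the authors' implementation: the node names, the toy edges and the two cutoffs (`synergy_max`, `antagonism_min`) are hypothetical, because the actual path-length cutoffs are not recoverable from the text as extracted.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from an edge list."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(
    graph: Dict[str, Set[str]], source: str, target: str
) -> Optional[int]:
    """Number of edges on a shortest path, or None if no path exists."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def predict_combination(
    length: Optional[int], synergy_max: int, antagonism_min: int
) -> str:
    """Map a path length to one of the three categories (cutoffs are assumed)."""
    if length is None:
        return "unknown"
    if length <= synergy_max:
        return "synergistic"
    if length >= antagonism_min:
        return "antagonistic"
    return "inconclusive"


# Hypothetical toy subgraph: drugs connect to targets, targets to other proteins.
toy_graph = build_graph([
    ("drug_a", "target_1"), ("drug_b", "target_1"),
    ("target_1", "protein_x"), ("protein_x", "protein_y"),
    ("protein_y", "target_2"), ("drug_c", "target_2"),
    ("protein_x", "target_3"), ("drug_d", "target_3"),
])

for partner in ("drug_b", "drug_d", "drug_c"):
    n_edges = shortest_path_length(toy_graph, "drug_a", partner)
    label = predict_combination(n_edges, synergy_max=3, antagonism_min=5)
    print("drug_a", partner, n_edges, label)
```

keeping the cutoffs as parameters rather than constants makes it easy to re-run the classification once the values from the supplementary tables are at hand.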
figure shows the results of these combinatorial treatments on the virus-induced cytopathic effect in caco-2 cells.

figure : dose-response curves (drc) depicting viral inhibition of sars-cov-2 by select drug combinations. a) a threshold effect can be seen with the remdesivir/anisomycin combination when anisomycin reaches µm, well beyond anisomycin’s ic alone. remdesivir activity does not appear to be affected by anisomycin, while remdesivir seems to be equally affected (de-potentiated) by low to high concentrations of raloxifene. b) viral inhibition for remdesivir/thioguanosine can be seen only at lower thioguanosine concentrations; at higher concentrations the clear curve shift of remdesivir at lower concentration (effect beyond loewe’s additivity formula) could not be appreciated. c) raloxifene had an antagonistic effect on remdesivir’s viral replication inhibition activity. d) a clear shift in nelfinavir’s drc can be observed when combined with raloxifene, but the data also suggest a threshold effect when raloxifene concentrations are higher than . µm.

our results indicate that compound combinations acting on different viral mechanisms, such as remdesivir and thioguanosine (figure b) or nelfinavir and raloxifene (figure d), showed synergy, while compounds acting on host mechanisms, for instance anisomycin or raloxifene, when combined with remdesivir (figure a and figure c, respectively), resulted in neither synergistic nor additive effects.
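the figure caption above refers to loewe's additivity formula. for readers unfamiliar with it, the sketch below computes the loewe combination index, d1/D1 + d2/D2, where d1 and d2 are the doses used in combination and D1 and D2 are the single-agent doses that would give the same effect. an index of 1 indicates additivity, below 1 synergy and above 1 antagonism. the hill-type single-agent curve, the function names and all numeric values are assumptions made for this example; the text does not state how the dose-response curves in the study were modelled.

```python
def hill_inverse(effect: float, ic50: float, hill: float = 1.0) -> float:
    """Single-agent dose giving a fractional effect under an assumed Hill curve.

    Assumes effect = d**hill / (ic50**hill + d**hill), with 0 < effect < 1.
    """
    if not 0.0 < effect < 1.0:
        raise ValueError("effect must be strictly between 0 and 1")
    return ic50 * (effect / (1.0 - effect)) ** (1.0 / hill)


def loewe_combination_index(
    d1: float,
    d2: float,
    effect: float,
    ic50_1: float,
    ic50_2: float,
    hill_1: float = 1.0,
    hill_2: float = 1.0,
) -> float:
    """Loewe combination index for doses (d1, d2) that jointly give `effect`."""
    dose_1_alone = hill_inverse(effect, ic50_1, hill_1)
    dose_2_alone = hill_inverse(effect, ic50_2, hill_2)
    return d1 / dose_1_alone + d2 / dose_2_alone


# Hypothetical values: both drugs have IC50 = 1.0 (arbitrary units).
# Half-doses of each reaching 50 % inhibition is exactly additive.
additive = loewe_combination_index(0.5, 0.5, effect=0.5, ic50_1=1.0, ic50_2=1.0)
# Quarter-doses reaching the same effect would indicate synergy.
synergy = loewe_combination_index(0.25, 0.25, effect=0.5, ic50_1=1.0, ic50_2=1.0)
# Needing a full dose of each for the same effect would indicate antagonism.
antagonism = loewe_combination_index(1.0, 1.0, effect=0.5, ic50_1=1.0, ic50_2=1.0)
print(additive, synergy, antagonism)
```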
interestingly, our experiments revealed that the hiv-protease inhibitor nelfinavir, which already appeared to be active against viral post-entry fusion steps of both sars-cov and sars-cov-2, displayed synergistic effects when combined with high concentrations of raloxifene. this result agrees with our predictions generated using the covid-19 pharmacome, in which the drug combination with the shortest distance, raloxifene and nelfinavir (supplementary table ), would have a synergistic effect on sars-cov-2 pathology.

discussion

by combining a significant number of knowledge graphs that represent various aspects of covid-19 pathophysiology and drug-target information, we were able to generate the covid-19 pharmacome, a unique resource that covers a wide spectrum of cause-and-effect knowledge about sars-cov-2 and its interactions with the human host. based on a systematic review of the results derived from published drug repurposing screening experiments, as well as our own drug repurposing screening results, we were able to identify mechanisms targeted by a variety of compounds showing virus inhibition in phenotypic, cellular assays.
with the covid-19 pharmacome, we are now able to link repurposing drugs, their targets and the mechanisms modulated by said drugs within one computable data structure, thereby enabling us to target, in a combinatorial treatment approach, different, independent mechanisms. by challenging the covid-19 pharmacome with gene expression data, we have identified subgraphs that are responsive (at gene expression level) to virus infection. network analysis along with the overview of previous repurposing experiments provided us with the insights needed to select the optimal repurposing drug candidates for combination therapy. experimental verification showed that this systematic approach is valid; we were able to identify two drug-target-mechanism combinations that demonstrated synergistic action of the repurposed drugs targeting different mechanisms in combinatorial treatments.

we are aware that the covid-19 pharmacome combines experimental results generated in different assay conditions. in the course of our work, we accumulated evidence that assay responses recorded using vero e6 cells in comparison to caco-2 cells may only partially overlap. comparative analysis of the responses of both assay systems to virus infection by means of transcriptome-wide gene expression analysis is one of the experiments we plan to perform next. however, for the identification of meaningful combinations of repurposing drugs, the current model-driven information fusion approach was shown to work well despite the putative differences between drug repurposing screening assays.
given the urgent need for treatments that work in an acute infection situation, our approach described here paves the way for systematic and rational approaches towards combination therapy of sars-cov-2 infections. we want to encourage all our colleagues to make use of the covid-19 pharmacome, improve it, and add useful information about pharmacological findings (e.g. from candidate repurposing drug combination screenings). in addition to vaccination and antibody therapy, (combination) treatment with small molecules remains one of the key therapeutic options for combatting covid-19. the covid-19 pharmacome will therefore be continuously improved and expanded to serve integrative approaches in anti-sars-cov-2 drug discovery and development.

acknowledgements

in part, this project is supported by the european union’s horizon research and innovation program under grant agreement no , project exscalate cov.

references

xu, b., gutierrez, b., mekaru, s., sewalk, k., goodwin, l., loskill, a., ... & zarebski, a. e. ( ). epidemiological data from the covid-19 outbreak, real-time case information. scientific data, ( ), - .
lipsitch, m., swerdlow, d. l., & finelli, l. ( ). defining the epidemiology of covid-19—studies needed. new england journal of medicine, ( ), - .
holmdahl, i., & buckee, c. ( ). wrong but useful—what covid-19 epidemiologic models can and cannot tell us. new england journal of medicine.
cao, w., & li, t. ( ). covid-19: towards understanding of pathogenesis. cell research, - .
liao, m., liu, y., yuan, j., wen, y., xu, g., zhao, j., ... & liu, l. ( ). single-cell landscape of bronchoalveolar immune cells in patients with covid-19. nature medicine, - .
tay, m. z., poh, c. m., rénia, l., macary, p. a., & ng, l. f. ( ). the trinity of covid-19: immunity, inflammation and intervention. nature reviews immunology, - .
gervasoni, s.; vistoli, g.; talarico, c.; manelfi, c.; beccari, a.r.; studer, g.; tauriello, g.; waterhouse, a.m.; schwede, t.; pedretti, a. a comprehensive mapping of the druggable cavities within the sars-cov-2 therapeutically relevant proteins by combining pocket and docking searches as implemented in pockets . . int. j. mol. sci. , , .
ostaszewski, m., mazein, a., gillespie, m. e., kuperstein, i., niarakis, a., hermjakob, h., ... & schreiber, f. ( ). covid-19 disease map, building a computational repository of sars-cov-2 virus-host interaction mechanisms. scientific data, ( ), - .
domingo-fernandez, d. et al. covid-19 knowledge graph: a computable, multi-modal, cause-and-effect knowledge model of covid-19 pathophysiology. bioinformatics. btaa ( ).
gysi, d. m., valle, Í. d., zitnik, m., ameli, a., gan, x., varol, o., ... & barabási, a. l. ( ). network medicine framework for identifying drug repurposing opportunities for covid-19. arxiv preprint arxiv: . .
khan, j. y., khondaker, m., islam, t., hoque, i. t., al-absi, h., rahman, m. s., ... & rahman, m. s. ( ). covid-19base: a knowledgebase to explore biomedical entities related to covid-19. arxiv preprint arxiv: . .
kuperstein, i., bonnet, e., nguyen, h. a., cohen, d., viara, e., grieco, l., ... & dutreix, m. ( ). atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with google maps. oncogenesis, ( ), e -e .
kodamullil, a. t., younesi, e., naz, m., bagewadi, s., & hofmann-apitius, m. ( ). computable cause-and-effect models of healthy and alzheimer's disease states and their mechanistic differential analysis. alzheimer's & dementia, ( ), - .
fujita, k. a., ostaszewski, m., matsuoka, y., ghosh, s., glaab, e., trefois, c., ... & diederich, n. ( ). integrating pathways of parkinson's disease in a molecular interaction map. molecular neurobiology, ( ), - .
matsuoka, y. et al. a comprehensive map of the influenza a virus replication cycle. bmc syst. biol. , (
khan, j. y., khondaker, m., islam, t., hoque, i. t., al-absi, h., rahman, m. s., ... & rahman, m. s. ( ). covid-19base: a knowledgebase to explore biomedical entities related to covid-19. arxiv preprint arxiv: . .
ostaszewski, m., mazein, a., gillespie, m. e., kuperstein, i., niarakis, a., hermjakob, h., ... & schreiber, f. ( ). covid-19 disease map, building a computational repository of sars-cov-2 virus-host interaction mechanisms. scientific data, ( ), - .
blanco-melo, d., nilsson-payant, b. e., liu, w. c., uhl, s., hoagland, d., møller, r., ... & wang, t. t. ( ). imbalanced host response to sars-cov-2 drives development of covid-19. cell.
gordon, d. e., jang, g. m., bouhaddou, m., xu, j., obernier, k., white, k. m., ... & tummino, t. a. ( ). a sars-cov-2 protein interaction map reveals targets for drug repurposing. nature, - .
bojkova, d., klann, k., koch, b., widera, m., krause, d., ciesek, s., ... & münch, c. ( ). proteomics of sars-cov-2-infected host cells reveals therapy targets. nature, - .
ashburn, t. t., & thor, k. b. ( ). drug repositioning: identifying and developing new uses for existing drugs. nature reviews drug discovery, ( ), - .
pushpakom, s., iorio, f., eyers, p. a., escott, k. j., hopper, s., wells, a., ... & norris, a. ( ). drug repurposing: progress, challenges and recommendations. nature reviews drug discovery, ( ), - .
http://rdcu.be/qkds
https://doi.org/ . /pnas.
https://reframedb.org/assays/a
https://reframedb.org/assays/a
preprint, doi:. /rs. .rs- /v
slater, t. ( ). recent advances in modeling languages for pathway maps and computable biological networks. drug discovery today, ( ), - .
domingo-fernández, d., mubeen, s., marín-llaó, j., hoyt, c. t., & hofmann-apitius, m. ( ). pathme: merging and exploring mechanistic pathway knowledge. bmc bioinformatics, ( ), .
domingo-fernández, d., hoyt, c. t., bobis-Álvarez, c., marín-llaó, j., & hofmann-apitius, m. ( ). compath: an ecosystem for exploring, analyzing, and curating mappings across pathway databases. npj systems biology and applications, ( ), - .
astghik, s. et al., submitted, bioinformatics journal (oup)
chua, r. l., lukassen, s., trump, s., hennig, b. p., wendisch, d., pott, f., debnath, o., thürmann, l., kurth, f., völker, m.t., kazmierski, j., timmermann, b., twardziok, s., schneider, s., machleidt, f., müller-redetzky, h., maier, m., krannich, a., schmidt, s., balzer, f., liebig, j., loske, j., suttorp, n., eils, j., ishaque, n., liebert, u.g., von kalle, c., witzenrath, m., goffinet, c., drosten, c., laudi, s., lehmann, i., conrad, c., sander, l-e. and eils, r. ( ). covid-19 severity correlates with airway epithelium–immune cell interactions identified by single-cell analysis. nature biotechnology, ( ), - .
ravindra, n. g., alfajaro, m. m., gasque, v., habet, v., wei, j., filler, r. b., huston, n. c., wan, h., szigeti-buck, k., wang, b., wang, g., montgomery, r.r., eisenbarth, s. c., williams, a., pyle, a.m., iwasaki, a., horvath, t.l., foxman, e.f., pierce, r.w., van dijk, d., and wilen, c.b. ( ). single-cell longitudinal analysis of sars-cov-2 infection in human bronchial epithelial cells. biorxiv.
kuleshov mv, jones mr, rouillard ad, et al. enrichr: a comprehensive gene set enrichment analysis web server update. nucleic acids res. ; (w ):w -w . doi: . /nar/gkw
https://pypi.org/project/gseapy/
benjamini y. discovering the false discovery rate: false discovery rate. j. r. stat. soc. ser. b stat. methodol. ; ( ): – . doi: . /j. - . . .x.
hoehl, s., rabenau, h., berger, a., kortenbusch, m., cinatl, j., bojkova, d., behrens, p., böddinghaus, b., götsch, u., naujoks, f., neumann, p., schork, j., tiarks-jungk, p., walczok, a., eickmann, m., vehreschild, m., kann, g., wolf, t., gottschalk, r., & ciesek, s. ( ). evidence of sars-cov-2 infection in returning travelers from wuhan, china. new england journal of medicine, ( ), - .
ellinger, b., bojkova, d., zaliani, a., cinatl, j., claussen, c., westhaus, s., ... & gribbon, p. ( ). identification of inhibitors of sars-cov-2 in-vitro cellular toxicity in human (caco-2) cells using a large scale drug repurposing collection. manuscript under review
bobrowski, t., chen, l., eastman, r. t., itkin, z., shinn, p., chen, c., guo, h., zheng, w., michael, s., simeonov, a., hall, m., zakharov, a.v., and muratov, e.n. ( ). discovery of synergistic and antagonistic drug combinations against sars-cov-2 in vitro. biorxiv.
bobrowski, t., chen, l., eastman, r. t., itkin, z., shinn, p., chen, c., guo, h., zheng, w., michael, s., simeonov, a., hall, m., zakharov, a.v., and muratov, e.n. ( ). discovery of synergistic and antagonistic drug combinations against sars-cov-2 in vitro. biorxiv.
ellinger, b et al. ( ). identification of inhibitors of sars-cov-2 in-vitro cellular toxicity in human (caco-2) cells using a large scale drug repurposing collection. preprint. https://doi.org/ . /rs. .rs- /v .
yamamoto, n., yang, r., yoshinaka, y., amari, s., nakano, t., cinatl, j., ... & tamamura, h. ( ). hiv protease inhibitor nelfinavir inhibits replication of sars-associated coronavirus. biochemical and biophysical research communications, ( ), - .
musarrat, f., chouljenko, v., dahal, a., nabi, r., chouljenko, t., jois, s. d., & kousoulas, k. g. ( ). the anti‐hiv drug nelfinavir mesylate (viracept) is a potent inhibitor of cell fusion caused by the sars‐cov‐2 spike (s) glycoprotein warranting further evaluation as an antiviral against covid‐19 infections. journal of medical virology.
[figure panels (a, b, c, d): dose-response curves. panel titles: remdesivir drc in presence of anisomycin; remdesivir drc in presence of thioguanosine; remdesivir drc in presence of raloxifene; nelfinavir drc in presence of raloxifene. x-axis: log c (m); y-axis: % inhibition. legend entries give the concentration (µm) of the second compound.]
title: niagads alzheimer’s genomicsdb: a resource for exploring alzheimer’s disease genetic and genomic knowledge

authors
emily greenfest-allen, conor klamann, prabhakaran gangadharan, amanda kuzma, yuk yee leung, otto valladares, gerard schellenberg, christian j. stoeckert jr., li-san wang

affiliations
penn neurodegeneration genomics center, perelman school of medicine, university of pennsylvania, philadelphia, pa , usa
institute for biomedical informatics, perelman school of medicine, university of pennsylvania, philadelphia, pa , usa
department of pathology and laboratory medicine, perelman school of medicine, university of pennsylvania, philadelphia, pa , usa
department of genetics, perelman school of medicine, university of pennsylvania, philadelphia, pa , usa

corresponding author
emily greenfest-allen allenem@pennmedicine.upenn.edu
li-san wang lswang@pennmedicine.upenn.edu

biorxiv preprint doi: https://doi.org/ . / . . . ; this version posted february , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

abstract

introduction: the niagads alzheimer’s genomics database (genomicsdb) is an interactive knowledgebase for alzheimer’s disease (ad) genetics that provides access to gwas summary statistics datasets deposited at niagads, a national genetics data repository for ad and related dementia (adrd).

methods: the website makes available > genome-wide summary statistics datasets from gwas and genome sequencing analysis for ad/adrd.
variants identified from these datasets are mapped to up-to-date variant and gene annotations from a variety of resources and linked to functional genomics data. the database is powered by a big-data-optimized relational database and uses ontologies to consistently annotate study designs and phenotypes, facilitating data harmonization and efficient real-time data analysis and variant or gene report generation.

results: detailed variant reports provide tabular and interactive graphical summaries of known adrd associations and highlight variants flagged by the alzheimer’s disease sequencing project (adsp). gene reports provide summaries of co-located adrd risk-associated variants and have been expanded to include meta-analysis results from aggregate association tests performed by the adsp, allowing us to flag genes with genetic evidence for ad.

discussion: the genomicsdb makes available > million variant annotations, including ~ million ( million novel) variants identified as ad-relevant by adsp, for browsing and real-time mining via the website. with a newly redesigned, efficient search interface and comprehensive record pages linking summary statistics to variant and gene annotations, this resource makes these data both accessible and interpretable, establishing itself as a valuable tool for ad research.

background

alzheimer’s disease (ad) is a progressive neurodegenerative disorder that affects . million people in the us in , is effectively untreatable, and invariably progresses to complete incapacitation and death or more years after onset.
early work in the s identified mutations in the amyloid precursor protein (app) gene, presenilins and that cause ad, and alleles of the apolipoprotein e gene (apoe) that increase (ε ) or decrease (ε ) susceptibility to late-onset alzheimer’s disease (load). heritability of ad is high, ranging from near % to % in the best fitting model [ , ]. however, apart from apoe, there is no simple pattern of inheritance for load. instead, it is likely caused by a complex combination of common, polygenic variants [ ] acting together with a small number of rare variants with a large effect [ , ].

our current understanding of genetic risk for ad has resulted mainly from massive genotyping and sequencing efforts such as the alzheimer’s disease genetics consortium (adgc), the international genomics of alzheimer’s project (igap), and the alzheimer’s disease sequencing project (adsp). large-scale genome-wide association studies (gwas) and gwas-derived meta-analyses have been performed by each of these groups [ – ], the results of which are deposited at the national institute on aging (nia) genetics of alzheimer’s disease data storage site (niagads) at the university of pennsylvania [ ].

niagads is an nia-designated essential national infrastructure, providing a one-stop access portal for alzheimer’s disease ′omics datasets. qualified investigators can submit data use requests to access protected personal genetic information. niagads also disseminates unrestricted meta-analysis results and gwas summary statistics to promote data reuse, allowing researchers to explore known evidence for ad genetic risk. however, substantive bioinformatics expertise and compute power are required to annotate and mine these datasets, which are significant hurdles for many researchers planning to explore this large and ever-increasing volume of data. assembly of unrestricted genomic knowledge into an integrated, interactive web resource would help overcome this barrier.
here, we introduce the niagads alzheimer’s genomics database (genomicsdb), which was developed in collaboration with the adgc and adsp with this goal in mind. the genomicsdb is a user-friendly workspace for data sharing, discovery, and analysis designed to facilitate the quest for better understanding of the complex genetic underpinnings of ad neurodegeneration and accelerate the progress of research on ad and ad related dementias (adrd). it accomplishes this by making summary genetic evidence for ad/adrd both accessible to and interpretable by molecular biologists, clinicians and bioinformaticians alike regardless of computational skills.

methods

genomics datasets

niagads gwas summary statistics

as of december , the niagads genomicsdb provides unrestricted access to genome-wide summary statistics p-values from > gwas and adsp meta-analysis. summary statistic results are linked to > million adsp annotated single-nucleotide variants (snvs) and indels. gwas summary statistics datasets deposited at niagads are integrated into the genomicsdb as they become publicly available via publication or permission of the submitting researchers. these include studies that focus specifically on ad and late-onset ad (load), as well as those on adrd-related neuropathologies and biomarkers. a full listing of the summary statistics datasets currently available through the niagads genomicsdb is provided in supplementary table s . prior to loading in the database, the datasets are annotated (e.g. provenance, phenotypes, study design) and variant representation normalized to ensure consistency with adsp analysis pipelines and facilitate harmonization with third-party annotations.
to ensure the privacy of personal health information, the niagads genomicsdb website only makes p-values from the summary statistics available for browsing (on dataset, gene, and variant reports and as genome browser tracks) and analysis. access to the full summary statistics (including genome-wide allele frequencies and effect sizes) and corresponding gwas or sequencing results is managed via formal data-access requests made to niagads. all datasets included in the genomicsdb are properly credited to the submitting researchers or sequencing project.

nhgri-ebi gwas catalog

variants and summary statistics curated in the nhgri-ebi gwas catalog [ ] are listed in niagads genomicsdb variant reports and a track is available on the genome browser. variants linked to ad/adrd are highlighted.

adsp meta-analysis results

the niagads genomicsdb has recently expanded its scope to include meta-analysis results offering genetic evidence for gene-level and single-variant risk associations for ad. currently available are case/control association results recently published by the adsp [ ] and deposited at niagads (accession no. ng ).

variant annotation

variant identification

single nucleotide polymorphisms (snps) and short-indels are uniquely identified by position and allelic variants. this allows accurate mapping of risk-association statistics to specific mutations and to external variant annotations from resources such as gnomad (https://gnomad.broadinstitute.org/) [ ] and gtex (https://www.gtexportal.org/home/) [ ]. all variants are mapped to dbsnp (https://www.ncbi.nlm.nih.gov/snp/) [ ] and linked to refsnp identifiers when possible.
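the idea of identifying a variant by its position and alleles can be illustrated with a small sketch. everything named here (the `Variant` type, the `chrom:pos:ref:alt` key layout, the `attach_refsnp` helper, and the coordinates and identifier in the usage note) is hypothetical and chosen only for illustration; the text does not state the identifier format the genomicsdb actually uses.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Variant(NamedTuple):
    """A short variant described by chromosome, position, and alleles."""
    chrom: str
    pos: int   # position of the first reference base
    ref: str   # reference allele
    alt: str   # alternative allele


def variant_key(v: Variant) -> str:
    """Build a position-plus-allele identifier (hypothetical layout)."""
    # Drop an optional "chr" prefix so "chr19" and "19" give the same key.
    chrom = v.chrom[3:] if v.chrom.lower().startswith("chr") else v.chrom
    return f"{chrom}:{v.pos}:{v.ref.upper()}:{v.alt.upper()}"


def attach_refsnp(keys: Iterable[str],
                  refsnp_lookup: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each key to a refSNP identifier when one is known, else None."""
    return {key: refsnp_lookup.get(key) for key in keys}
```

because the alleles are part of the key, two variants at the same position stay distinct: `variant_key(Variant("chr19", 100, "t", "c"))` and `variant_key(Variant("19", 100, "T", "G"))` give different identifiers, so a statistic reported for one allele is never attached to the other.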
adsp variant annotations

annotated variants in the niagads genomicsdb include the > million snps and ~ , short-indels identified during the adsp discovery phase whole-genome (wgs) and whole-exome sequencing (wes) efforts [ ]. these variants are highlighted in variant and dataset reports and their quality control status is provided. as part of this sequencing effort, the adsp developed an annotation pipeline that builds on ensembl’s vep software [ ] to efficiently integrate standard annotations and rank potential variant impacts according to predicted effect (such as codon changes, loss of function, and potential deleteriousness) [ , ]. variant tracks annotated by these results are available for both the wes and wgs variants on the genomicsdb genome browser. the pipeline has been applied to all variants in the genomicsdb. these annotations can be browsed on variant reports or used to filter search results. user-uploaded lists of variants are automatically annotated in real-time.

allele frequencies

the niagads genomicsdb includes allele frequency data from genomes (phase , version ) (https://www.internationalgenome.org/home) [ ], exac (http://exac.broadinstitute.org/) [ ], and gnomad [ ].

linkage disequilibrium

linkage-disequilibrium (ld) structure around annotated variants is estimated using phase version ( may ) of the genomes project [ ]. ld estimates were made using plink v . b i -bit [ ]. only ld-scores meeting a correlation threshold of r ≥ . are stored in the database. locuszoom.js [ , ] is used to render ld-scores in the context of the gwas summary statistics datasets.

gene and transcript annotation

gene identification

gene and transcript models are obtained from the gencode release (grch .p ) reference gene annotation [ ]. a grch version of the niagads genomicsdb is planned for .
standard gene nomenclature is imported from the hugo gene nomenclature committee at the european bioinformatics institute [ ] and used to link annotated genes to external resources such as uniprot (https://www.uniprot.org/) [ ], the ucsc genome browser (http://genome.ucsc.edu) [ ], and online mendelian inheritance in man (omim) database (https://omim.org/) [ , ].

functional annotation

annotations of the functions of genes and gene products are taken from packaged releases of the gene ontology (go; http://geneontology.org) and go-gene associations [ ] and are updated regularly. go-gene associations are reported in summary tables on gene reports and include details on annotation sources, as well as new information from the go causal modeling (go-cam) framework that allows better understanding of how different gene products work together to effect biological processes [ ]. users can run functional enrichment analysis on gene search results or uploaded gene lists. geneset enrichment and semantic similarity scores are calculated using the goatools python library for go analysis [ ].

pathways

gene membership in molecular and metabolic pathways is provided from the kyoto encyclopedia of genes and genomes (kegg) (https://www.genome.jp/kegg/) [ ] and reactome (https://reactome.org/) [ ]. users can run pathway enrichment analysis on gene search results or uploaded gene lists. pathway enrichment statistics are calculated using a multiple hypothesis corrected fisher’s exact test implemented using the scipy, pandas, and statsmodels python packages.
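a multiple-hypothesis-corrected fisher’s exact test for pathway enrichment can be sketched as follows. this is not the genomicsdb implementation: it uses only the python standard library rather than scipy, pandas, and statsmodels, and it assumes a one-sided (over-representation) test and the benjamini-hochberg correction, neither of which is specified in the text. all function names are hypothetical.

```python
from math import comb
from typing import Dict, Iterable, List, Tuple


def fisher_right_tail(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact p-value, P(X >= k), for a hypergeometric X.

    N = genes in the background, K = background genes in the pathway,
    n = genes in the query list, k = query genes in the pathway.
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probabilities of seeing k or more pathway genes in the query.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def pathway_enrichment(query: Iterable[str],
                       pathways: Dict[str, Iterable[str]],
                       background: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Return {pathway: (raw p-value, adjusted p-value)} for a gene list."""
    universe = set(background)
    genes = set(query) & universe
    names = list(pathways)
    raw = []
    for name in names:
        members = set(pathways[name]) & universe
        raw.append(fisher_right_tail(len(genes & members), len(genes),
                                     len(members), len(universe)))
    return dict(zip(names, zip(raw, benjamini_hochberg(raw))))
```

the correction matters because every pathway is tested against the same gene list: with hundreds of pathways, some raw p-values fall below a nominal cut-off by chance alone, so results are better ranked by the adjusted value.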
functional genomics

hundreds of functional genomics tracks have been integrated into the niagads genomicsdb and mapped against ad/adrd-associated variants. these tracks are queried from the niagads functional genomics repository (filer), which provides harmonized functional genomics datasets that have been giggle indexed [ ] for quick lookups [ ]. filer tracks made available through the genomicsdb have been pulled from established functional genomics resources, including the encyclopedia of dna elements (encode) [ , ], the functional annotation of the mouse/mammalian genome (fantom ) enhancer atlas [ ], and the nih roadmap epigenomics mapping consortium [ ]. genome browser tracks are available for all functional genomics datasets and are organized by data source, biotype (e.g., cell, tissue, or cell line), type of functional annotation (e.g., expressed enhancers, transcription factor binding sites, histone modifications) and platform or assay type to facilitate track selection.

overview of database design

an overview of the niagads genomicsdb systems architecture is provided in figure . the genomicsdb is powered by a postgresql relational database system that has been optimized for parallel big data querying, allowing for efficient real-time data mining. data are organized using the modular genomics unified schema version (gus ), designed for scalable integration and dissemination of large-scale ′omics datasets. loading of all data is managed by the gus application layer (https://github.com/veupathdb/gusappframework), which ensures the accuracy of data integration.
overview of website design and organization

the niagads genomicsdb is powered by an open-source database system and web-development kit (wdk; https://github.com/veupathdb/wdk) developed and successfully deployed by the eukaryotic pathogen, vector and host informatics (veupathdb) bioinformatics resource center [ , ]. the veupathdb wdk provides a query engine that ties the database system to the website via an easily extensible xml data model. the data model is used to automatically generate and organize searches, search results, and reports, with concepts and data organized by topics from the embrace data and methods (edam) ontology, which defines a comprehensive set of concepts that are prevalent within bioinformatics [ ]. this facilitates updates of third-party data and rapid integration of new datasets as they become publicly available.

the wdk also provides a framework for lightweight java/jersey representational state transfer (rest) services for data querying. this allows search results and reports to be returned in multiple file formats (e.g., delimited-text, xml, and json) in addition to browsable, interactive web pages. this new feature of genomicsdb has enabled the inclusion of sophisticated visualizations for summarizing search results and annotations in gene and variant reports. api development is ongoing, with plans for a flexible api that allows researchers to integrate genomicsdb datasets and annotations into analysis pipelines.

the genomicsdb uses a combination of an in-house javascript genomics visualization toolkit and established third-party visualization tools, including the highcharts.js (https://www.highcharts.com/) charting library for rendering scatter, pie, and bar charts, ideogram.js (https://github.com/eweitz/ideogram) for chromosome visualization, locuszoom.js for rendering ld structure in the context of niagads gwas summary statistics datasets, and an igv.js powered genome browser [ ].
all code used to generate the wdk website, including the javascript genomics visualizations, is available on github (https://github.com/niagads).

overview of the niagads genome browser

the niagads genome browser enables researchers to visually inspect and browse gwas summary statistics datasets in a genomic context. the genome browser allows users to compare niagads gwas summary statistics tracks to each other, against annotated gene or variant tracks, or to the functional genomics tracks from the niagads filer functional genomics repository. this tool is powered by igv.js, with track data queried in real-time by niagads genomicsdb rest services. the browser also provides a track selection tool that allows users to easily find tracks of interest by keyword search, data source, biotype (e.g., cell, tissue, or cell line) or type of functional annotation (fig. ).

results

the niagads alzheimer’s genomicsdb creates a public forum for sharing, discovery, and analysis of genetic evidence for alzheimer’s disease that is made accessible via an interface designed for easy mastery by biological researchers, regardless of background. the genomicsdb provides four main routes for data exploration and mining. first, detailed reports compile all available data concerning summary statistics datasets and genetic evidence linking ad/adrd to genes and variants. second, datasets can be mined in real-time to isolate a refined set of variants that share biological characteristics of interest.
third, visualization tools such as locuszoom.js and the niagads genome browser offer the ability to quickly view and draw conclusions from comparisons of summary statistics or adsp annotated variants to different types of sequence data in a genomic area of interest. fourth, and finally, tools such as enrichment analyses offer opportunities for users to link variants to biological processes via impacted genes.

finding variants, genes, and datasets

the genomicsdb homepage and navigation menu contain a site search allowing users to quickly find variants, genes, and datasets of interest by identifier or keyword. this search is paired with interactive graphics found throughout the site that provide shortcuts to resources and annotations of interest to the ad/adrd research community (fig. a, b). the genomicsdb also provides a dataset browser that allows users to search for gwas summary statistics datasets by ad/adrd phenotype, population, genotype, attribution, and sequencing center.

browsing and mining niagads gwas summary statistics

a detailed report is provided for each of the gwas summary statistics and adsp meta-analysis datasets in the niagads genomicsdb (fig. a). these reports allow users to browse the genetic variants with genome-wide significance in the dataset (p-value ≤ × - to account for false positives due to testing associations of millions of variants simultaneously) via tables and interactive plots that provide an overview of the distribution and potential functional or regulatory impacts of the top variants (and proximal gene-loci) across the genome. all genes and variants listed in a dataset report are linked to reports in the genomicsdb that provide detailed information about genetic evidence for ad for the sequence feature (see next sections).
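the reasoning behind a genome-wide significance cut-off can be made concrete with a short sketch. the exact threshold in the sentence above is not legible in this copy, so the value used in the usage note (the conventional 5 × 10^-8, which is a bonferroni correction of α = 0.05 for roughly one million independent tests) is an assumption rather than a figure taken from the text; the function names and example identifiers are likewise hypothetical.

```python
from typing import Dict


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


def significant_variants(pvalues: Dict[str, float],
                         threshold: float) -> Dict[str, float]:
    """Keep only the variants whose p-value is at or below the threshold."""
    return {variant: p for variant, p in pvalues.items() if p <= threshold}
```

for example, `bonferroni_threshold(0.05, 1_000_000)` gives a threshold of about 5 × 10^-8, and `significant_variants({"v1": 1e-9, "v2": 1e-3}, 5e-8)` keeps only `v1`. passing the threshold as a parameter mirrors the inline search described for the dataset reports, where users set their own p-value cut-off.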
dataset reports also provide quick links back to their parent accession in the niagads repository where users can download the complete p-values or make formal data access requests for the full summary statistics, related gwas, expression, or sequencing data associated with the accession. the reports also provide an inline search allowing users to mine the summary statistics in real-time via the website, setting their own p-value cut-off (see section . for more information).

detailed variant reports

variant reports include a basic summary about the variant (alleles, variant type, flanking sequence, genomic location) and a graphical overview of niagads gwas summary statistics datasets in which the variant has genome-wide significance (fig. a). all other information in the report is subdivided into multiple sections that can be expanded or hidden at the user’s discretion. these sections include sub-reports on genetic variation (e.g., allele population frequencies and ld), function prediction determined via the adsp annotation pipeline (incl. transcript and regulatory consequences), and comprehensive listings of gwas inferred disease or trait associations from both niagads summary statistics and the nhgri-ebi gwas catalog. tables listing summary statistics results can be dynamically filtered by p-value, dataset, phenotypes, or covariates, and the filtered results are downloadable. links to the source datasets for each reported statistic are also provided, leading to detailed dataset reports (e.g., niagads gwas summary statistics) or to the source publication (e.g., curated variant catalogs).
these tables are paired with browsable locuszoom.js views of the ld structure surrounding the variant in the context of selected gwas summary statistics datasets. links to the niagads alzheimer’s disease variant portal (advp) and external resources for additional information (e.g., dbsnp, clinvar) are also provided.

detailed gene reports

like the variant reports, gene reports provide basic summary information about the gene (nomenclature, gene type, genomic span) and a graphical overview of niagads gwas summary statistics-linked variants proximal to or within the footprint of the gene (fig. b). two types of gene-linked genetic evidence for ad are provided in the genomicsdb gene reports. first, we have surveyed the top risk-associated variants from the niagads gwas summary statistics datasets and provide a comprehensive listing of and links to those contained within ± kb of each gene (fig. c). second, we report meta-analysis results from gene-based rare variant aggregation tests performed as part of the adsp discovery phase case/control analysis [ ]. genes found to have a significant p-value in these results are flagged as being associated with genetic evidence for ad. also provided on the gene report are sections reporting function prediction (gene ontology associations and evidence) and pathway membership (kegg and reactome). tables reporting these results or annotations can be dynamically filtered or downloaded. links to the niagads advp and to external resources (e.g., uniprotkb, omim, and exac) are also provided.

workspaces

the genomicsdb provides an interactive workspace for exploring a dataset in more depth. as an example, dataset reports provide an inline search allowing users to mine the summary statistics. variants meeting the search criterion are reported in an interactive workspace that includes both tabular and graphical summaries.
users are initially presented with a table that can be sorted or filtered by annotations (e.g., variant type, predicted effect, deleteriousness) (fig. b). a per-chromosome genome view is also available allowing users to explore an interactive ideogram depicting the distribution of variants meeting the search and filter criteria across the genome and allowing inspection of ld structure among proximal variants (fig. c). tables of results can be downloaded or requested via the api for programmatic processing. registered users also have the option to save and share search results both privately and publicly; publicly shared search results are assigned a stable url that can be referenced in publications.

genome browser

the niagads genome browser can be used to visually inspect any of the niagads gwas summary statistics datasets in a broader genomic context and compare against annotated adsp variant tracks or other ′omics tracks in the genomicsdb or filer (see section . , fig. b).

discussion

the niagads alzheimer’s genomics database is a user-friendly platform for interactive browsing and real-time in-depth mining of published genetic evidence and genetic risk-factors for ad. it provides open, real-time access to summary statistics datasets from genome-wide association analysis (gwas) of alzheimer’s disease and related neuropathologies. flexible search options allow users to easily retrieve ad risk-associated variants, conditioned on phenotypes such as ethnicity and age of onset. users can compare the niagads datasets against personal gene or variant lists.
every entry in the genomicsdb has been linked with relevant external resources and functional genomics annotations to supply further information and assist researchers in interpreting the potential functional or regulatory role of risk-associated variants and susceptibility loci. the genomicsdb is updated periodically with enhanced features and new datasets and annotations when they are reported. the ad research community is actively encouraged through outreach and collaboration to submit data to niagads to keep this public platform updated and timely.

the genomicsdb is integrated with other resources available at niagads. users can follow links back to the niagads repository to view comprehensive details about all gwas summary statistics datasets from niagads accession or request access to the primary data. the rest services used to query the database and generate data or feature reports provide the foundation of an api that allows programmatic access to the database, which we plan to integrate with cloud based niagads analysis pipelines.

the genomicsdb is regularly updated to keep up with advances in alzheimer’s disease genomics research. new ad-related gwas summary statistics datasets and meta-analysis results from the adsp are added as they become available. reference databases are updated yearly. all genomics data in the current version of the genomicsdb are aligned and mapped to the grch .p genome build. a grch version of the database is planned for release in early , which will include variants from the ongoing adsp sequencing effort, including k wes in and k wgs in .

genomicsdb is a potent platform for the ad genetics community to host comprehensive ad genetic and genomic findings. it uses the latest web and database technologies to allow integration with new tools, and niagads is constantly improving. as more data and tools become available the niagads alzheimer’s genomics database will become a central hub for ad/adrd research and data analysis.

conflicts of interest

the authors have no financial interests to disclose.

acknowledgements and funding information

this work is supported by the nih national institute on aging (grant number u -ag ). the adsp discovery phase analysis of sequence data is supported through uf ag (to drs. schellenberg, farrer, pericak-vance, mayeux, and haines); u ag to dr. seshadri; u ag to dr. boerwinkle; u ag to dr. wijsman; and u ag to dr. goate. additional funding and acknowledgement statements for the adsp can be found in the supplement.

references

[ ] gatz m, reynolds ca, fratiglioni l, johansson b, mortimer ja, berg s, et al. role of genes and environments for explaining alzheimer disease. arch gen psychiatry ; : – . https://doi.org/ . /archpsyc. . . .
[ ] jansen ie, savage je, watanabe k, bryois j, williams dm, steinberg s, et al. genome-wide meta-analysis identifies new loci and functional pathways influencing alzheimer’s disease risk. nature genetics ; : – . https://doi.org/ . /s - - - .
[ ] hollingworth p, harold d, sims r, gerrish a, lambert j-c, carrasquillo mm, et al. common variants in abca , ms a a/ms a e, epha , cd and cd ap are associated with alzheimer’s disease. nat genet ; : – . https://doi.org/ . /ng. .
[ ] lambert j-c, ibrahim-verbaas ca, harold d, naj ac, sims r, bellenguez c, et al. meta-analysis of , individuals identifies new susceptibility loci for alzheimer’s disease. nature genetics ; : – . https://doi.org/ . /ng. .
[ ] kunkle bw, grenier-boley b, sims r, bis jc, damotte v, naj ac, et al. genetic meta-analysis of diagnosed alzheimer’s disease identifies new risk loci and implicates aβ, tau, immunity and lipid processing. nat genet ; : – . https://doi.org/ . /s - - - .
[ ] naj ac, jun g, beecham gw, wang l-s, vardarajan bn, buros j, et al. common variants at ms a /ms a e , cd ap , cd and epha are associated with late-onset alzheimer’s disease. nature genetics ; : – . https://doi.org/ . /ng. .
[ ] bis jc, jian x, kunkle bw, chen y, hamilton-nelson kl, bush ws, et al. whole exome sequencing study identifies novel rare and common alzheimer’s-associated variants involved in immune response and transcriptional regulation. molecular psychiatry : – . https://doi.org/ . /s - - - .
[ ] kuzma a, valladares o, cweibel r, greenfest-allen e, childress dm, malamon j, et al. niagads: the nia genetics of alzheimer’s disease data storage site. alzheimer’s & dementia ; : – . https://doi.org/ . /j.jalz. . . .
[ ] buniello a, macarthur jal, cerezo m, harris lw, hayhurst j, malangone c, et al. the nhgri-ebi gwas catalog of published genome-wide association studies, targeted arrays and summary statistics . nucleic acids res ; :d – . https://doi.org/ . /nar/gky .
[ ] karczewski kj, francioli lc, tiao g, cummings bb, alföldi j, wang q, et al. variation across , human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. biorxiv : . https://doi.org/ . / .
[ ] gamazon er, segrè av, van de bunt m, wen x, xi hs, hormozdiari f, et al. using an atlas of gene regulation across human tissues to inform complex disease- and trait-associated variation. nature genetics ; : – . https://doi.org/ . /s - - - .
[ ] sherry st, ward m-h, kholodov m, baker j, phan l, smigielski em, et al. dbsnp: the ncbi database of genetic variation. nucleic acids res ; : – .
[ ] butkiewicz m, blue ee, leung yy, jian x, marcora e, renton ae, et al. functional annotation of genomic variants in studies of late-onset alzheimer’s disease. bioinformatics ; : – . https://doi.org/ . /bioinformatics/bty .
[ ] mclaren w, gil l, hunt se, riat hs, ritchie grs, thormann a, et al. the ensembl variant effect predictor. genome biol ; . https://doi.org/ . /s - - - .
[ ] wheeler nr, benchek p, kunkle bw, hamilton-nelson kl, warfe m, fondran jr, et al. hadoop and pyspark for reproducibility and scalability of genomic sequencing studies. pac symp biocomput ; : – .
[ ] auton a, abecasis gr, altshuler dm, durbin rm, abecasis gr, bentley dr, et al. a global reference for human genetic variation. nature ; : – . https://doi.org/ . /nature .
[ ] lek m, karczewski kj, minikel ev, samocha ke, banks e, fennell t, et al. analysis of protein-coding genetic variation in , humans. nature ; : – . https://doi.org/ . /nature .
[ ] purcell s, neale b, todd-brown k, thomas l, ferreira mar, bender d, et al. plink: a tool set for whole-genome association and population-based linkage analyses. am j hum genet ; : – . https://doi.org/ . / .
[ ] pruim rj, welch rp, sanna s, teslovich tm, chines ps, gliedt tp, et al. locuszoom: regional visualization of genome-wide association scan results. bioinformatics ; : – . https://doi.org/ . /bioinformatics/btq .
[ ] clark cp, flickinger m, welch r, vandehaar p, taliun d, boehnke m, et al. locuszoom.js: web-based plugin for interactive analysis of genome and phenome wide association studies. presented at the th annual meeting of the american society of human genetics, vancouver: , p. t.
[ ] frankish a, diekhans m, ferreira a-m, johnson r, jungreis i, loveland j, et al. gencode reference annotation for the human and mouse genomes. nucleic acids res ; :d – . https://doi.org/ . /nar/gky .
[ ] braschi b, denny p, gray k, jones t, seal r, tweedie s, et al. genenames.org: the hgnc and vgnc resources in . nucleic acids res ; :d – . https://doi.org/ . /nar/gky .
[ ] uniprot: a worldwide hub of protein knowledge. nucleic acids res ; :d – . https://doi.org/ . /nar/gky .
[ ] kent wj, sugnet cw, furey ts, roskin km, pringle th, zahler am, et al. the human genome browser at ucsc. genome res ; : – . https://doi.org/ . /gr. .
[ ] amberger js, bocchini ca, schiettecatte f, scott af, hamosh a. omim.org: online mendelian inheritance in man (omim®), an online catalog of human genes and genetic disorders. nucleic acids res ; :d - . https://doi.org/ . /nar/gku .
[ ] amberger js, bocchini ca, scott af, hamosh a. omim.org: leveraging knowledge across phenotype-gene relationships. nucleic acids res ; :d – . https://doi.org/ . /nar/gky .
[ ] the gene ontology resource: years and still going strong. nucleic acids res ; :d – . https://doi.org/ . /nar/gky .
[ ] thomas pd, hill dp, mi h, osumi-sutherland d, auken kv, carbon s, et al. gene ontology causal activity modeling (go-cam) moves beyond go annotations to structured descriptions of biological functions and systems. nature genetics ; : – . https://doi.org/ . /s - - - .
[ ] klopfenstein dv, zhang l, pedersen bs, ramírez f, warwick vesztrocy a, naldi a, et al. goatools: a python library for gene ontology analyses. scientific reports ; : – . https://doi.org/ . /s - - -z.
[ ] kanehisa m, goto s. kegg: kyoto encyclopedia of genes and genomes. nucleic acids res ; : – . https://doi.org/ . /nar/ . . .
[ ] jassal b, matthews l, viteri g, gong c, lorente p, fabregat a, et al. the reactome pathway knowledgebase. nucleic acids res ; :d – . https://doi.org/ . /nar/gkz .
[ ] layer rm, pedersen bs, disera t, marth gt, gertz j, quinlan ar. giggle: a search engine for large-scale integrated genome analysis. nat methods ; : – . https://doi.org/ . /nmeth. .
[ ] kuksa pp, gangadharan p, katanic z, kleidermacher l, amlie-wolf a, lee c-y, et al. filer: large-scale, harmonized functional genomics repository. biorxiv : . . . . https://doi.org/ . / . . . . [ ] encode project consortium. an integrated encyclopedia of dna elements in the human genome. nature ; : – . https://doi.org/ . /nature . [ ] davis ca, hitz bc, sloan ca, chan et, davidson jm, gabdank i, et al. the encyclopedia of dna elements (encode): data portal update. nucleic acids res ; :d – . https://doi.org/ . /nar/gkx . [ ] andersson r, gebhard c, miguel-escalada i, hoof i, bornholdt j, boyd m, et al. an atlas of active enhancers across human cell types and tissues. nature ; : – . https://doi.org/ . /nature . [ ] kundaje a, meuleman w, ernst j, bilenky m, yen a, heravi-moussavi a, et al. integrative analysis of reference human epigenomes. nature ; : – . https://doi.org/ . /nature . [ ] fischer s, aurrecoechea c, brunk bp, gao x, harb os, kraemer et, et al. the strategies wdk: a graphical search interface and web development kit for functional genomics databases. database (oxford) ; . https://doi.org/ . /database/bar . [ ] aurrecoechea c, barreto a, basenko ey, brestelli j, brunk bp, cade s, et al. eupathdb: the eukaryotic pathogen genomics database resource. nucleic acids res ; :d – . https://doi.org/ . /nar/gkw . [ ] ison j, kalas m, jonassen i, bolser d, uludag m, mcwilliam h, et al. edam: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. bioinformatics ; : – . https://doi.org/ . /bioinformatics/btt . [ ] robinson jt, thorvaldsdóttir h, turner d, mesirov jp. igv.js: an embeddable javascript implementation of the integrative genomics viewer (igv).
biorxiv : . . . . https://doi.org/ . / . . . . [ ] bis jc, jian x, kunkle bw, chen y, hamilton-nelson kl, bush ws, et al. whole exome sequencing study identifies novel rare and common alzheimer’s-associated variants involved in immune response and transcriptional regulation. mol psychiatry . https://doi.org/ . /s - - - .

[figure: schematic of the niagads genomicsdb. labels, in extraction order: gwas summary statistics; gus api (provides transaction management and ensures data harmonization and referential integrity); variant annotations; gene annotations; filer: functional genomics; gus database (modular, scalable and big-data optimized for quick look ups and real-time analysis); adsp meta-analysis results; genomicsdb website (scalable restful services and graphical front-end for interactively browsing detailed feature reports and real-time mining of datasets); {json} (programmatic access for integration with analysis pipelines); interactively browse or mine data and annotations using popular web-browsers; link back to the niagads repository to learn more about accessions and make formal data-access requests.]

[figure: genome browser view (panels a, b). tracks: ensembl genes; adsp single-variant risk association: european (model ) (bis et al. ); adsp variants (wes); igap: stage (kunkle et al. ); igap apoe-stratified analysis: apoeε non-carriers (jun et al. ); igap apoe-stratified analysis: apoeε carriers (jun et al. ); roadmap enh: nh-a astrocytes. color scale: -log p. gene labels include the ms a family, stx , zp , linc , tcn and gif.]

[figures: three further multi-panel figures (panels a to c); the only surviving annotation is "variant span containing multiple variants".]

biapss - bioinformatic analysis of liquid-liquid phase-separating protein sequences

aleksandra e. badaczewska-dawid and davit a. potoyan
department of chemistry, iowa state university, ames ia usa
department of biochemistry biophysics and molecular biology, iowa state university, ames ia usa
bioinformatics and computational biology program, iowa state university, ames ia usa

liquid-liquid phase separation (llps) has recently emerged as a foundational mechanism for order and regulation in biology. however, a quantitative molecular grammar of protein sequences underlying llps remains unclear.
comprehensive databases and associated computational infrastructure for biophysical and statistical analysis can enable rapid progress in the field. therefore, we have created a novel open-source web platform named biapss (bioinformatic analysis of liquid-liquid phase-separating protein sequences), which offers users interactive data-analytic tools that facilitate the discovery of statistically significant sequence signals for proteins with llps behavior.

availability: biapss is freely available online at https://biapss.chem.iastate.edu/. the website is implemented within the python framework using html, css, and plotly-dash graphing libraries, and supports all major browsers, including on mobile devices.

llps | biapss | plotly-dash
correspondence: abadacz@iastate.edu, potoyan@iastate.edu

introduction
in the past few years, llps of biomolecules has become a universal language for interpreting intracellular signaling, compartmentalization, and regulation ( – ). the ability to phase separate appears to be encoded primarily in the protein sequences, frequently containing disordered and low complexity domains, which are enriched in charged and multivalent interaction centers ( – ). nevertheless, the quantitative aspects of how amino acids encode and decode the phase separation remain largely unknown ( – ). this is because many different combinations of relevant interactions seem to contribute to phase separation without any one being universally necessary ( ). so far, however, with a few exceptions ( – ), mostly case-by-case studies of different sequences have been performed, and the broader context of many findings, including their statistical significance, remains unknown. to this end, we have developed a web framework, biapss: bioinformatic analysis of liquid-liquid phase-separating protein sequences.
the objective of biapss is to enable a rapid and on-the-fly deep statistical analysis of llps-driver proteins using the pool of sequences with empirically confirmed phase behavior.

implementation
the back-end processing pipeline of biapss is implemented in a python framework, where in-house developed algorithms parse pre-computed data and perform on-the-fly analysis. the basic front-end user interface of the biapss web platform is implemented with html, css, javascript, and bootstrap components, which support the responsiveness and mobile accessibility of the website. specifically, our cross-platform framework is designed to run on multiple operating systems and popular browsers. modern display-layer solutions improve user experience by enabling smooth loading of content and page transitions, and by supporting an in-depth presentation of the results. for instance, we included a lightbox slideshow with a brief overview of the features, a collapsed menu, modal images of the quick guide within individual applications, side navigation, and more. interactive graph plotting and data visualization, accessible through the web applications in the singleseq and multiseq tabs, were developed with the plotly-dash ( ) browser-based graphing libraries for python, which create a user-responsive environment and follow remote, customized instructions. thanks to the interactive interface, users can go directly from exploratory analytics to the creation of publication-ready high-quality images.

results
biapss is designed as a user-friendly web platform intended as a central resource for systematic and standardized statistical analysis of biophysical characteristics of known llps sequences.
the web service provides users with (i) a database of the superset of experimentally evidenced llps-driver protein sequences, (ii) a repository of pre-computed bioinformatics and statistics data, and (iii) two sets of web applications supporting the interactive analysis and visualization of physicochemical and biomolecular characteristics of llps proteins. the initial llps sequence set leverages the data from manually curated primary llps databases, namely phasepro ( ) and llpsdb ( ). given that the number of experimentally confirmed llps driver proteins is constantly growing, the biapss pre-computed repository is updated annually and released to the public, which saves users significant time by eliminating the need for exhaustive in-house calculations. the apps integrate the results from our extensive studies, described in more detail elsewhere ().
badaczewska-dawid et al. | biorχiv | february , | –
one of the aims of biapss is to gain insight into the overall characteristics of a sufficient non-redundant set of llps-driver protein sequences. the comparison to benchmarks of various protein groups enables statistical inference of specific phase-separating affinities. furthermore, the residue-resolution biophysical regularities inferred from biapss will help not only to accurately identify regions prone to phase separation but also to design sequence modifications targeting various biomedical applications.
the extended cross-references section is designed as a central navigation hub that lets researchers keep track of the corresponding entries in the primary llps databases along with the other external resources relevant to the phase separation field. since many users have specific single sequences of interest (natural or designed), our future efforts will be directed towards the creation of an upload section to parse user-defined cases and compare them with the benchmark of known llps-driver proteins.

the layout and main functionalities of biapss services are summarized in figure . the general outline of the platform is designed to provide clarity and intuitive navigation by avoiding an excess of permanently visible information. due to the multitude of analyses available to meet the needs of a diverse audience of scientists, the extensive content of biapss has been divided into main tabs. the home tab is a place where the user gets a high-level overview of the features of biapss services. next comes the singleseq tab, which is dedicated to the exploration of individual llps sequence characteristics. besides a case summary and cross-reference section, there are multiple web applications dedicated to the in-depth analysis of biomolecular features, such as sequence conservation with multiple sequence alignment (msa) ( ), various sequence-based predictions by state-of-the-art methods for secondary structure ( – ), solvent accessibility ( – ), structural disorder ( , – ), contact maps ( , , ), and uniquely proposed detection of numerous short linear motifs (slims) ( – ) recently highlighted as key regions for driving the llps ( ). the multiseq tab provides the user with a set of web applications for a broad array of statistics on a superset of llps sequences.
there, one may investigate the regularities and trends specific only to disordered regions, such as amino acid (aa) composition, including aa diversity or regions rich in a given aa, general physicochemical patterns of polarity, hydrophobicity, the distribution of aromatic or charged residues, including not only the overall net charge but also charge decoration parameters that emerged as a relevant factor for electrostatic interactions of intrinsically disordered proteins (idps) ( ), and more. also, a deeper focus on the general frequency of particular short linear motifs, including larks ( ), gars ( ), elms ( ), and steric zippers ( ), as well as pioneering identification of specific n-mers, can bring new perspectives in the field. the download tab facilitates accessing the biapss repository. the available data includes raw predictions pre-calculated using well-established tools as well as the findings of our deep statistical analysis. for the convenience of users, we have unified and integrated the pre-processed results into a standardized csv format accompanied by intuitive descriptors to facilitate reuse; specifically, this allows researchers to use the pre-computed data directly or carry out further analysis. finally, in the docs tab, the user can follow the detailed data-analytic workflow and learn more about the tools used, with corresponding references to the original literature. the documentation also includes an easy-to-use tutorial dedicated to individual web applications, where all of the features are presented graphically with detailed descriptions (see also the user’s manual attached in the supplementary information).

fig. . the overall layout of the biapss web platform (https://biapss.chem.iastate.edu/) for comprehensive sequence-based analysis of llps proteins. the core of the implemented web applications and data repository is contained in the singleseq, multiseq, and download tabs.

funding
a.e.b-d.
acknowledges generous financial support from the roy j. carver charitable trust through the iowa state university bioscience innovation postdoctoral fellowship. this work was supported by the national institute of general medical sciences of the national institutes of health [r gm to d.a.p.]. the content is solely the responsibility of the authors and does not necessarily represent the official views of the national institutes of health.

conflict of interest: none declared.

author contribution
conceptualization, a.e.b-d.; software development, a.e.b-d.; writing an original draft, a.e.b-d. and d.a.p.

clifford p brangwynne, christian r eckmann, david s courson, agata rybarska, carsten hoege, jöbin gharakhani, frank jülicher, and anthony a hyman. germline p granules are liquid droplets that localize by controlled dissolution/condensation. science, ( ): – , june .
clifford p brangwynne, timothy j mitchison, and anthony a hyman. active liquid-like behavior of nucleoli determines their size and shape in xenopus laevis oocytes. proc. natl. acad. sci. u. s. a., ( ): – , march .
iain a sawyer, jiri bartek, and miroslav dundr. phase separated microenvironments inside the cell nucleus are linked to disease and regulate epigenetic state, transcription and rna processing. semin. cell dev. biol., july .
sudeep banjade, qiong wu, anuradha mittal, william b peeples, rohit v pappu, and michael k rosen. conserved interdomain linker promotes phase separation of the multivalent adaptor protein nck. proc. natl. acad. sci. u. s. a., ( ):e – , november .
sudeep banjade and michael k rosen. phase transitions of multivalent proteins can promote clustering of membrane receptors. elife, , october .
jeong-mo choi, alex s holehouse, and rohit v pappu. physical principles underlying the complex biology of intracellular phase transitions. annu. rev. biophys., january .
jie wang, jeong-mo choi, alex s holehouse, hyun o lee, xiaojie zhang, marcus jahnel, shovamayee maharana, régis lemaitre, andrei pozniakovsky, david drechsel, ina poser, rohit v pappu, simon alberti, and anthony a hyman. a molecular grammar governing the driving forces for phase separation of prion-like rna binding proteins. cell, ( ): – .e , july .
gregory l dignon, robert b best, and jeetain mittal. biomolecular phase separation: from molecular driving forces to macroscopic properties. annu. rev. phys. chem., : – , april .
castrense savojardo, pier luigi martelli, and rita casadio. protein–protein interaction methods and protein phase separation. annu. rev. biomed. data sci., ( ): – , july .
wade borcherds, anne bremer, madeleine b borgia, and tanja mittag. how do intrinsically disordered protein regions encode a driving force for liquid-liquid phase separation? curr. opin. struct. biol., : – , october .
boris y zaslavsky, luisa a ferreira, and vladimir n uversky. driving forces of liquid-liquid phase separation in biological systems. biomolecules, ( ), september .
brian tsang, iva pritišanac, stephen w scherer, alan m moses, and julie d forman-kay. phase separation as a missing mechanism for interpretation of disease mutations. cell, ( ): – , december .
kadi l saar, alexey s morgunov, runzhang qi, william e arter, georg krainer, alpha albert lee, and tuomas knowles. machine learning models for predicting protein condensate formation from sequence determinants and embeddings. october .
kaiqiang you, qi huang, chunyu yu, boyan shen, cristoffer sevilla, minglei shi, henning hermjakob, yang chen, and tingting li. phasepdb: a database of liquid-liquid phase separation related proteins.
nucleic acids res., (d ):d –d , january .
bálint mészáros, gábor erdős, beáta szabó, Éva schád, Ágnes tantos, rawan abukhairan, tamás horváth, nikoletta murvai, orsolya p kovács, márton kovács, silvio c e tosatto, péter tompa, zsuzsanna dosztányi, and rita pancsa. phasepro: the database of proteins driving liquid-liquid phase separation. nucleic acids res., (d ):d –d , january .
qian li, xiaojun peng, yuanqing li, wenqin tang, jia’an zhu, jing huang, yifei qi, and zhuqing zhang. llpsdb: a database of proteins undergoing liquid–liquid phase separation in vitro. nucleic acids res., september .
plotly technologies inc. collaborative data science, .
jaina mistry, robert d finn, sean r eddy, alex bateman, and marco punta. challenges in homology search: hmmer and convergent evolution of coiled-coil regions. nucleic acids res., ( ):e , july .
damiano piovesan, ian walsh, giovanni minervini, and silvio c e tosatto. fells: fast estimator of latent local structure. bioinformatics, ( ): – , june .
rhys heffernan, kuldip paliwal, james lyons, jaswinder singh, yuedong yang, and yaoqi zhou. single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning. j. comput. chem., ( ): – , october .
mirko torrisi, manaz kaleel, and gianluca pollastri. porter : fast, state-of-the-art ab initio prediction of protein secondary structure in and classes. october .
zhiyong wang, feng zhao, jian peng, and jinbo xu. protein -class secondary structure prediction using conditional neural fields. proteomics, ( ): – , october .
daniel w a buchan and david t jones. the psipred protein analysis workbench: years on. nucleic acids res., (w ):w –w , july .
jack hanson, kuldip paliwal, thomas litfin, yuedong yang, and yaoqi zhou.
improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. bioinformatics, ( ): – , july .
bin xue, roland l dunbrack, robert w williams, a keith dunker, and vladimir n uversky. pondr-fit: a meta-predictor of intrinsically disordered amino acids. biochim. biophys. acta, ( ): – , april .
kang peng, predrag radivojac, slobodan vucetic, a keith dunker, and zoran obradovic. length-dependent prediction of protein intrinsic disorder. bmc bioinformatics, : , april .
jack hanson, kuldip k paliwal, thomas litfin, and yaoqi zhou. spot-disorder : improved protein intrinsic disorder prediction by ensembled deep learning. genomics proteomics bioinformatics, ( ): – , december .
david t jones and domenico cozzetto. disopred : precise disordered region predictions with annotated protein-binding activity. bioinformatics, ( ): – , march .
yang li, jun hu, chengxin zhang, dong-jun yu, and yang zhang. respre: high-accuracy protein contact prediction by coupling precision matrix with deep residual neural networks. bioinformatics, ( ): – , november .
manjeet kumar, marc gouw, sushama michael, hugo sámano-sánchez, rita pancsa, juliana glavina, athina diakogianni, jesús alvarado valverde, dayana bukirova, jelena čalyševa, et al. elm—the eukaryotic linear motif resource in . nucleic acids research, (d ):d –d , .
michael p hughes, michael r sawaya, david r boyer, lukasz goldschmidt, jose a rodriguez, duilio cascio, lisa chong, tamir gonen, and david s eisenberg. atomic structures of low-complexity protein segments reveal kinked β sheets that assemble networks. science, ( ): – , .
p andrew chong, robert m vernon, and julie d forman-kay. rgg/rg motif regions in rna binding and phase separation. journal of molecular biology, ( ): – , .
izzy owen and frank shewmaker.
the role of post-translational modifications in the phase transitions of intrinsically disordered proteins. int. j. mol. sci., ( ), november .
roland riek. the three-dimensional structures of amyloids. cold spring harb. perspect. biol., ( ), february .
simon alberti, amy gladfelter, and tanja mittag. considerations and challenges in studying liquid-liquid phase separation and biomolecular condensates. cell, ( ): – , .
greta bianchi, sonia longhi, rita grandori, and stefania brocca. relevance of electrostatic charges in compactness, aggregation, and phase separation of intrinsically disordered proteins. international journal of molecular sciences, ( ): , .

learning sparse log-ratios for high-throughput sequencing data
elliott gordon-rodriguez, thomas p. quinn, john p. cunningham

abstract
the automatic discovery of interpretable features that are associated with an outcome of interest is a central goal of bioinformatics. in the context of high-throughput genetic sequencing data, and compositional data more generally, an important class of features are the log-ratios between subsets of the input variables. however, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods.
as well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.

. introduction
much recent work has been devoted to designing differentiable relaxations of discrete latent variables. these relaxations can be used to learn class membership (jang et al., ; maddison et al., ; potapczynski et al., ), permutations (linderman et al., ; mena et al., ), subsets (xie & ermon, ; yang et al., ), and rankings (cuturi et al., ; blondel et al., ). depending on their use case, existing methods range in complexity, from the simple-but-effective straight-through estimator (bengio et al., ), to mathematically intricate schemes based on optimal transport (xie et al., ). however, the driving principle is always the same: to enable efficient gradient-based optimization on an otherwise intractable discrete space.

department of statistics, columbia university; applied artificial intelligence institute (a i ), deakin university. correspondence to: elliott gordon-rodriguez.

the goal of our work is to extend this principle to a novel setting where, to the best of our knowledge, no differentiable relaxations have yet been proposed. motivated by domain applications, our objective is to select log-ratios from a set of covariates, a problem that is equivalent to a discrete optimization over pairs of disjoint subsets (where the pair represents the numerator and denominator of the ratio, respectively). our novel relaxation will result in dramatic speedups over several recent state-of-the-art learning algorithms from the field of bioinformatics, thereby enabling the analysis of much larger datasets than previously possible.

log-ratios are an important class of features for analyzing high-throughput sequencing (hts) metagenomic data (wooley et al., ; gloor & reid, ; gloor et al., ; quinn et al., ).
for example, in microbiome count data, the relative weight between two sub-populations of related microorganisms can serve as a clinically useful biomarker (rahat-rozenbloom et al., ; crovesy et al., ; magne et al., ). more generally, log-ratios are of fundamental importance to the field of compositional data (coda), of which hts data can be seen as a special case (pawlowsky-glahn & egozcue, ; pawlowsky-glahn & buccianti, ). coda can be defined as simplex-valued data, or equivalently, non-negative vectors whose totals are uninformative, i.e., relative data. due to the nature of the recording technique, hts data represents the relative abundance of different microbial signatures in a given sample, and therefore is an instance of coda (gloor & reid, ; gloor et al., ; quinn et al., ). indeed, the application of coda methodology to hts data has become increasingly popular in recent years (fernandes et al., ; ; rivera-pinto et al., ; quinn et al., ; calle, ), with log-ratios serving as the basic building blocks for statistical analysis.

but why do log-ratios form the basis of coda methodology? unlike unconstrained real-valued data, the relative nature of hts data and coda results in each covariate becoming negatively correlated to all others (increasing one component of a composition implies a relative decrease of the other components). it is well known that, as a result, the usual measures of association and feature attribution are problematic when applied to coda (pearson, ; filzmoser et al., ; van den boogaart & tolosana-delgado, ; lovell et al., ). log-ratios account for this idiosyncratic structure by transforming coda onto unconstrained feature space, where the usual tools of statistical learning apply (aitchison, ; pawlowsky-glahn & egozcue, ). the choice
the choice learning sparse log-ratios for high-throughput sequencing data of the log-ratio transform offers the necessary property of scale invariance, but in the coda literature it holds primacy for a variety of other technical reasons, including so-called subcompositional coherence (aitchison, ; pawlowsky- glahn & buccianti, ; egozcue & pawlowsky-glahn, ). log-ratios can be taken over pairs of individual covariates (aitchison, ; greenacre, b) or aggrega- tions thereof, typically geometric means (aitchison, ; egozcue et al., ; egozcue & pawlowsky-glahn, ; rivera-pinto et al., ; quinn & erb, ) or summa- tions (greenacre, a; ; quinn & erb, ). the resulting features work well empirically, but also imply a clear interpretation: a log-ratio is a single composite score that expresses the overall quantity of one sub-population as compared with another. when the log-ratios are sparse, meaning they are taken over a small number of covariates, they define biomarkers that are particularly intuitive to un- derstand, a key desiderata for predictive models that are of clinical relevance (goodman & flaxman, ). thus, learning sparse log-ratios is a central problem in coda. this problem is especially challenging in the context of hts data, due to its high dimensionality (ranging from to over , covariates). existing methods rely on stepwise search or evolutionary algorithms (rivera-pinto et al., ; greenacre, b; quinn & erb, ; prifti et al., ), which scale very poorly with the dimension of the input. these algorithms are prohibitively slow for most hts datasets, and thus there is a new demand for sparse and interpretable models that scale to high dimensions (li, ; cammarota et al., ; susin et al., ). this demand motivates the present work, in which we present codacore, a novel learning algorithm for compositional data via continuous relaxations. 
the key idea behind codacore is to approximate a combinatorial optimization over the set of log-ratios (equivalent to the set of pairs of disjoint subsets of the covariates), by means of a continuous relaxation that can be optimized efficiently using gradient descent. to the best of our knowledge, codacore is the first coda method that scales to high dimensions, and that simultaneously produces sparse, interpretable, and accurate models. the main contributions of our method can be summarized as follows:

• computational efficiency. codacore scales linearly with the dimension of the input. it runs several orders of magnitude faster than its competitors.
• interpretability. codacore identifies a set of log-ratios that are sparse, biologically meaningful, and ranked in order of importance. our model is highly interpretable, and much sparser, relative to competing methods of similar accuracy and computational complexity.
• predictive accuracy. codacore achieves better out-of-sample accuracy than existing coda methods, and performs similarly to state-of-the-art black-box classifiers (which are neither sparse nor interpretable).
• optimization robustness. we leverage the functional form of our continuous relaxation to identify an adaptive learning rate that enables codacore to converge reliably, requiring no additional hyperparameter tuning when deployed on novel datasets.

our implementation can be downloaded from https://github.com/cunningham-lab/codacore.

. background
our work focuses on the supervised learning problem with compositional predictors. namely, we are given data $\{x_i, y_i\}_{i=1}^{n}$, where $x_i$ is compositional (e.g., hts data), and our goal is to learn an association $x_i \to y_i$. for many microbiome applications, $x_i$ represents a vector of frequencies of the different species of bacteria that compose the microbiome of the $i$th subject. in other words, $x_{ij}$ denotes the abundance of the $j$th species (of which there are $p$ total) in the $i$th subject.
the response $y_i$ is a binary variable indicating whether the ith subject belongs to the case or the control group (e.g., sick vs. healthy). due to the nature of hts, the input frequencies $x_{ij}$ arise from an inexhaustive sampling procedure, so that the totals $\sum_{j=1}^{p} x_{ij}$ are arbitrary and the components should only be interpreted in relative terms (i.e., as coda) (gloor & reid, ; gloor et al., ; quinn et al., ; calle, ). while we mainly consider applications to microbiome data, our method applies more generally to any high-dimensional coda, including those produced by liquid chromatography mass spectrometry (filzmoser & walczak, ).

in order to account for the compositional nature of $x_i$, we seek log-ratio transformed features that can be passed to a regression function downstream. as discussed, these log-ratios will result in interpretable features and scale-invariant models (that are also subcompositionally coherent). the simplest such choice is to take the pairwise log-ratios between input variables, i.e., $\log(x_{ij^+} / x_{ij^-})$, where $(j^+, j^-)$ indexes a pair of covariates (aitchison, ). note that the ratio cancels out any scaling factor applied to $x_i$, preserving only the relative information in the data, while the log transform ensures the output is (unconstrained) real-valued.

in order to select a good pair $(j^+, j^-)$ from the input covariates, greenacre ( b) proposed a step-wise algorithm for identifying pairwise log-ratios that explain the most variation in a dataset. this algorithm produces a sparse and interpretable set of features, but it is prohibitively slow on high-dimensional datasets, as a result of the step-wise search scaling poorly in the dimension of the input. a heuristic search algorithm that is less accurate but computationally faster has been developed as part of quinn et al. ( ), though its computational cost is still troublesome (as we shall see in section ).
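the scale invariance of a pairwise log-ratio can be checked numerically. the following is a minimal python sketch, not part of the codacore implementation; the function name `pairwise_log_ratio` and the read counts are hypothetical and chosen only for illustration.

```python
import math

def pairwise_log_ratio(x, j_plus, j_minus):
    """Return log(x[j_plus] / x[j_minus]) for one compositional sample x."""
    return math.log(x[j_plus] / x[j_minus])

# Hypothetical read counts for three taxa in one sample.
counts = [120.0, 30.0, 850.0]
# The same sample at ten times the sequencing depth: every count is scaled by 10.
deeper = [10.0 * c for c in counts]

lr_counts = pairwise_log_ratio(counts, 0, 1)
lr_deeper = pairwise_log_ratio(deeper, 0, 1)
```

both values equal log(4), because multiplying every count by the same depth factor leaves the ratio 120/30 unchanged.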
balances

recently, a class of log-ratios known as balances (egozcue & pawlowsky-glahn, ) has become of interest in microbiome applications, due to their interpretability as the relative weight between two sub-populations of bacteria (morton et al., b; quinn & erb, ). balances are defined as the log-ratios between geometric means of two subsets of the covariates:

$$b(x_i; J^+, J^-) = \log \frac{\left(\prod_{j \in J^+} x_{ij}\right)^{1/p_+}}{\left(\prod_{j \in J^-} x_{ij}\right)^{1/p_-}} = \frac{1}{p_+} \sum_{j \in J^+} \log x_{ij} - \frac{1}{p_-} \sum_{j \in J^-} \log x_{ij},$$

where $J^+$ and $J^-$ denote a pair of disjoint subsets of the indices $\{1, \ldots, p\}$, and $p_+$ and $p_-$ denote their respective sizes. for example, in microbiome data, $J^+$ and $J^-$ are groups of bacteria species that may be related by their environmental niche (morton et al., ) or genetic similarity (silverman et al., ; washburne et al., ). note that when $p_+ = p_- = 1$ (i.e., $J^+$ and $J^-$ each contain a single element), $b(x; J^+, J^-)$ reduces to a pairwise log-ratio. by allowing for the aggregation of more than one covariate in the numerator and denominator of the log-ratio, balances provide a richer set of features that allows for more flexible models than pairwise log-ratios. insofar as the balances are taken over a small number of covariates (i.e., $J^+$ and $J^-$ are sparse), they also provide highly interpretable biomarkers.

the selbal algorithm (rivera-pinto et al., ) has gained popularity as a method for automatically identifying balances that predict a response variable. however, this algorithm is also based on a step-wise search through the combinatorial space of subset pairs $(J^+, J^-)$, which scales poorly in the dimension of the input and becomes prohibitively slow for hts data (susin et al., ).

amalgamations

an alternative to balances, known as amalgamations, is defined by aggregating components through summation:

$$a(x_i; J^+, J^-) = \log \frac{\sum_{j \in J^+} x_{ij}}{\sum_{j \in J^-} x_{ij}},$$

where again $J^+$ and $J^-$ denote disjoint subsets of the input components.
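the two aggregation rules can be compared side by side. the sketch below is illustrative only: the function names `balance` and `amalgamation`, the index subsets, and the sample values are assumptions made for this example, and the normalization constant of the original balance definition is omitted, as in the text.

```python
import math

def balance(x, J_plus, J_minus):
    """Log-ratio of the geometric means of x over J_plus and J_minus."""
    mean_log_plus = sum(math.log(x[j]) for j in J_plus) / len(J_plus)
    mean_log_minus = sum(math.log(x[j]) for j in J_minus) / len(J_minus)
    return mean_log_plus - mean_log_minus

def amalgamation(x, J_plus, J_minus):
    """Log-ratio of the sums of x over J_plus and J_minus."""
    return math.log(sum(x[j] for j in J_plus) / sum(x[j] for j in J_minus))

# Hypothetical sample in which component 1 is rare.
x = [8.0, 0.5, 2.0, 4.0]
b = balance(x, [0, 1], [2, 3])       # geometric means: sqrt(8 * 0.5) vs sqrt(2 * 4)
a = amalgamation(x, [0, 1], [2, 3])  # sums: 8.5 vs 6.0
```

in this example the balance is negative while the amalgamation is positive: the rare component pulls the geometric mean of the numerator down but barely changes its sum, which is the sensitivity difference between the two aggregations.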
amalgamations have the advantage of reducing the dimensionality of the data through an operation, the sum, that some authors argue is more interpretable than a geometric mean (greenacre, a; greenacre et al., ). on the other hand, amalgamations can be less effective than balances for identifying components that are statistically important, but small in magnitude, e.g., rare bacteria species (since small terms will have less impact on a summation than on a product).

(note that the original definition of balances includes a “normalization” constant, which we omit for clarity. this constant is in fact unnecessary, as it will get absorbed into a regression coefficient downstream.)

recently, greenacre ( ) has advocated for the use of expert-driven amalgamations, using domain knowledge to construct the relevant features. by contrast, quinn & erb ( ) proposed amalgam, an evolutionary algorithm to automatically identify amalgamated log-ratios (eq. ) that are predictive of a response variable. however, this algorithm does not scale to high-dimensional data (albeit comparing favorably to selbal), nor does it produce sparse models (hindering interpretability of the results).

other related work

coda methodology has also recently attracted interest from the machine learning community (tolosana-delgado et al., ; quinn et al., ; gordon-rodriguez et al., a;b; templ, ). relevant to us is deepcoda (quinn et al., ), which combines self-explaining neural networks with log-ratio transformed features. in particular, deepcoda learns a set of log-contrasts, in which the numerator and denominator are defined as unequally weighted geometric averages of components. as a result of this weighting, deepcoda loses much of the interpretability and intuitive appeal of balances (or amalgamations), which is exacerbated by its lack of sparsity (in spite of regularization).
moreover, like most deep architectures, deepcoda is sensitive to initialization and optimization hyperparameters (which limits its ease of use) and is susceptible to overfitting (which can further compromise interpretability of the model).

the special case of a linear log-contrast model has been referred to as coda-lasso, and was separately proposed by lu et al. ( ). while coda-lasso scales better than selbal, it has been found to perform worse in terms of predictive accuracy (susin et al., ). more importantly, coda-lasso is still prohibitively slow on the high-dimensional hts data that we wish to consider. last, we highlight another common set of features that are also a special case of log-contrasts: centered-log-ratios, where individual covariates are divided by the geometric mean of all input variables (aitchison, ). models using these features, such as susin et al. ( ), can be accurate and computationally efficient; however, they are inherently not sparse and are difficult to interpret scientifically (greenacre, a).

table . qualitative comparison of the methods discussed, ordered from most sparse (top) to least (bottom). codacore is the only learning algorithm that performs well on all of our criteria. see table for a corresponding quantitative comparison.

method | scalability | interpretability | sparsity | accuracy
codacore (ours) | + | + | + | +
pairwise log-ratios (greenacre, b) | − | + | + | −
selbal (rivera-pinto et al., ) | − | + | + | ·
lasso | + | · | + | −
coda-lasso (lu et al., ) | − | · | · | ·
amalgam (quinn & erb, ) | − | + | − | ·
deepcoda (quinn et al., ) | · | · | − | ·
clr-lasso (susin et al., ) | + | − | − | +
black-box (random forest, xgboost) | + | − | − | +

methods

we now present codacore, a novel learning algorithm for hts data, and more generally, high-dimensional coda. unlike existing methods, codacore is simultaneously scalable, interpretable, sparse, and accurate. we compare the relative merits of codacore and its competitors in table .
continuous relaxation

in its basic formulation, codacore learns a regression function of the form:

$$f(x) = \alpha + \beta \cdot b(x; J^+, J^-),$$

where $b$ denotes a balance (eq. ), and $\alpha$ and $\beta$ are scalar parameters. for clarity, we will restrict our exposition to this formulation, but note that our algorithm can be applied equally to learn amalgamations instead of balances (see section . ), and that it generalizes straightforwardly to nonlinear functions (provided they are suitably parameterized and differentiable).

let $l(y, f)$ denote the cross-entropy loss, with $f \in \mathbb{R}$ given in logit space. the goal of codacore is to find the balance that is maximally associated with the response. mathematically, this can be written as an empirical risk minimization:

$$\min_{(J^+, J^-, \alpha, \beta)} \sum_i l\big(y_i, \alpha + \beta \cdot b(x_i; J^+, J^-)\big).$$

this objective involves a discrete optimization over pairs $(J^+, J^-)$ of disjoint subsets, a combinatorially hard problem. the key insight of codacore is to approximate this combinatorial optimization with a continuous relaxation that can be trained efficiently by gradient descent.

our relaxation is parameterized by an unconstrained vector of “assignment weights”, $w \in \mathbb{R}^p$, with one scalar parameter per input dimension (e.g., one weight per bacteria species). the weights are mapped to a vector of “soft assignments” via:

$$\tilde{w} = 2 \cdot \mathrm{sigmoid}(w) - 1 = \frac{2}{1 + \exp(-w)} - 1,$$

where the sigmoid is applied component-wise. eq. maps onto the interval $(-1, 1)$, which can be understood straightforwardly as a relaxation of the set $\{-1, 0, 1\}$, denoting membership in $J^-$, neither subset, or $J^+$, respectively. let us write $\tilde{w}^+ = \mathrm{relu}(\tilde{w})$ and $\tilde{w}^- = \mathrm{relu}(-\tilde{w})$ for the positive and negative parts of $\tilde{w}$, respectively. we approximate balances (eq. ) with the following relaxation:

$$\tilde{b}(x_i; w) = \frac{\sum_j \tilde{w}^+_j \log x_{ij}}{\sum_j \tilde{w}^+_j} - \frac{\sum_j \tilde{w}^-_j \log x_{ij}}{\sum_j \tilde{w}^-_j} = \frac{\tilde{w}^+ \cdot \log x_i}{\|\tilde{w}^+\|_1} - \frac{\tilde{w}^- \cdot \log x_i}{\|\tilde{w}^-\|_1}.$$

in other words, we approximate geometric averages over subsets of the inputs by weighted geometric averages over all components (compare equations and ). crucially, this relaxation is differentiable in $w$, allowing us to construct a surrogate objective function that can be optimized jointly in $(w, \alpha, \beta)$ by gradient descent:

$$\min_{(w, \alpha, \beta)} \sum_i l\big(y_i, \alpha + \beta \cdot \tilde{b}(x_i; w)\big).$$

we defer the details of our implementation of gradient descent to the supplement (section a), but we highlight two observations. first, the computational cost of the gradient of eq. is linear in the dimension of $w$. as a result, our algorithm scales linearly with the dimension of the input, and is fast to fit on large datasets (see section . ). second, knowledge of the functional form of our relaxation (eq. ) can be exploited in order to select the learning rate adaptively (i.e., without tuning), resulting in robust convergence across all real and simulated datasets that we considered.

discretization

while a set of features in the form of eq. may perform accurate classification, a weighted geometric average over all covariates is much harder for a biologist to interpret (and less intuitively appealing) than a bona fide balance over a small number of covariates. on these grounds, codacore implements a “discretization” procedure that exploits the information learned by the soft assignment vector $\tilde{w}$, in order to efficiently identify a pair of sparse subsets $(\hat{J}^+, \hat{J}^-)$.

the most straightforward way to convert the (soft) assignment $\tilde{w}$ into a (hard) pair of subsets is by fixing a threshold $t \in (0, 1)$:

$$\tilde{J}^+ = \{j : \tilde{w}_j > t\}, \qquad \tilde{J}^- = \{j : \tilde{w}_j < -t\}.$$

note that given a trained $\tilde{w}$ and a fixed threshold $t$, we can evaluate the quality of the corresponding balance $b(x; \tilde{J}^+, \tilde{J}^-)$ (resp. amalgamation) by optimizing eq. over $(\alpha, \beta)$ alone, i.e., fitting a linear model.
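the soft assignment, the relaxed balance, and the thresholding step can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the authors' implementation: it takes the soft assignment to be 2·sigmoid(w) − 1, it skips the gradient-descent training of w entirely, and the function names and the "trained" weights below are hypothetical.

```python
import math

def soft_assignments(w):
    """Map unconstrained weights to (-1, 1) via 2 * sigmoid(w) - 1."""
    return [2.0 / (1.0 + math.exp(-wj)) - 1.0 for wj in w]

def relaxed_balance(x, w):
    """Weighted-geometric-mean relaxation of a balance for one sample x.

    Assumes w has at least one positive and one negative entry;
    otherwise one of the normalizing sums is zero.
    """
    w_tilde = soft_assignments(w)
    pos = [max(v, 0.0) for v in w_tilde]    # relu(w_tilde)
    neg = [max(-v, 0.0) for v in w_tilde]   # relu(-w_tilde)
    log_x = [math.log(v) for v in x]
    num = sum(p * lx for p, lx in zip(pos, log_x)) / sum(pos)
    den = sum(n * lx for n, lx in zip(neg, log_x)) / sum(neg)
    return num - den

def discretize(w, t):
    """Threshold the soft assignments at t in (0, 1) to get index subsets."""
    w_tilde = soft_assignments(w)
    J_plus = [j for j, v in enumerate(w_tilde) if v > t]
    J_minus = [j for j, v in enumerate(w_tilde) if v < -t]
    return J_plus, J_minus

# Hypothetical weights for five covariates, standing in for a trained w.
w = [3.0, 0.2, -2.5, -0.1, 1.0]
sparse_subsets = discretize(w, t=0.5)   # only strongly assigned covariates
denser_subsets = discretize(w, t=0.3)   # a lower threshold admits more
```

raising the threshold can only shrink the selected subsets, which is why the threshold doubles as a sparsity control.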
computationally, fitting a linear model is much faster than optimizing eq. , and can be done repeatedly for a range of values of $t$ with little overhead. in codacore, we combine this strategy with cross-validation in order to select the threshold, $\hat{t}$, that optimizes predictive performance (see section a of the supplement for full detail). finally, the trained regression function is:

$$\hat{f}(x) = \hat{\alpha} + \hat{\beta} \cdot b(x; \hat{J}^+, \hat{J}^-),$$

where $(\hat{J}^+, \hat{J}^-)$ are the subsets corresponding to the optimal threshold $\hat{t}$, and $(\hat{\alpha}, \hat{\beta})$ are the coefficients obtained by regressing $y_i$ against $b(x_i; \hat{J}^+, \hat{J}^-)$ on the entire training set.

regularization

note from equations and that larger values of $t$ result in fewer covariates assigned to the balance $b(x; \tilde{J}^+, \tilde{J}^-)$, i.e., a sparser model. thus, codacore can be regularized simply by making $\hat{t}$ larger. similarly to lasso regression, our implementation of codacore uses the 1-standard-error rule: namely, to pick the sparsest model (i.e., the highest $t$) with mean cross-validated score within 1 standard error of the optimum (friedman et al., ). trivially, this rule can be generalized to a λ-standard-error rule, where λ becomes a regularization hyperparameter that can be tuned by the practitioner if so desired (with lower values trading off some sparsity in exchange for predictive accuracy). for consistency, we restrict our experiments to λ = 1; however, our results can be improved further by tuning λ on each dataset. in practice, we recommend choosing a lower value (e.g., λ = ) when the emphasis is on predictive accuracy rather than interpretability or sparsity, though our benchmarks still show competitive performance with the choice of λ = 1.

algorithm: codacore
inputs: training data $(x_i, y_i)_{i=1}^{n}$.
initialize $\hat{g}(x) = 0$.
repeat
    initialize a new relaxation $(w, \alpha, \beta)$.
    train $(w, \alpha, \beta)$ by gradient descent.
    use cross-validation to find the optimal threshold, $\hat{t}$.
    retrain $(\alpha, \beta)$ using $(\hat{J}^+, \hat{J}^-)$.
    update ensemble $\hat{g}(x) \leftarrow \hat{g}(x) + \hat{f}(x)$.
until $\hat{J}^+ = \emptyset$ or $\hat{J}^- = \emptyset$.
return $\hat{g}(x)$.

codacore algorithm

the computational efficiency of our continuous relaxation allows us to train multiple regressors of the form of eq. within a single model. in the full codacore algorithm, we ensemble multiple such regressors in a stage-wise additive fashion, where each successive balance is fitted on the residual from the current model. thus, codacore identifies a sequence of balances, in decreasing order of importance, each of which is sparse and interpretable. training terminates when an additional relaxation (eq. ) cannot improve the cross-validation score relative to the existing ensemble (equivalently, when we obtain $\hat{t} = 1$). typically, only a small number of balances is required to capture the signal in the data, and as a result codacore produces very sparse models overall, further enhancing interpretability. our procedure is summarized in algorithm .

amalgamations

codacore can be used to learn amalgamations (eq. ) much in the same way as for balances (the choice of which to use depending on the goals of the biologist). in this case, our relaxation is defined as:

$$\tilde{a}(x_i; w) = \log \frac{\sum_j \tilde{w}^+_j x_{ij}}{\sum_j \tilde{w}^-_j x_{ij}} = \log \frac{\tilde{w}^+ \cdot x_i}{\tilde{w}^- \cdot x_i},$$

i.e., we approximate summations over subsets of the inputs with weighted summations over all components (compare eq. and eq. ). the rest of the argument follows verbatim, replacing $b(\cdot)$ with $a(\cdot)$ and $\tilde{b}(\cdot)$ with $\tilde{a}(\cdot)$ in equations , , , and .

extensions

our model allows for a number of extensions:

• unsupervised learning. by means of a suitable unsupervised loss function, codacore can be extended to unlabelled datasets, $\{x_i\}_{i=1}^{n}$, as a method for identifying log-ratios that provide a useful low-dimensional representation.
such a method would automatically provide a scalable alternative to several existing dimensionality reduction techniques for coda (pawlowsky-glahn et al., ; mert et al., ; martı́n-fernández et al., ; greenacre, b; martino et al., ).
• incorporating confounders. in addition to $(x_i, y_i)_{i=1}^{n}$, in some applications the effect of additional (non-compositional) predictors, $z_i$, is also of interest. in this case, the effect of $z_i$ can be “partialled out” a priori by first regressing $y_i$ on $z_i$ alone, and using this regression as the initialization of the codacore ensemble. alternatively, $z_i$ can also be modeled jointly in equations and (e.g., by adding a linear term $\gamma \cdot z_i$) (forslund et al., ; noguera-julian et al., ; rivera-pinto et al., ).
• nonlinear regression functions. our method extends naturally to nonlinear regression functions of the form $f(x) = h_\theta(b(x; J^+, J^-))$, where $h_\theta$ is a parameterized differentiable family. these functions include neural networks, which have recently become of interest in microbiome research (morton et al., a; quinn et al., ).
• applications to non-compositional data. aggregations of parts can be useful outside the realm of coda; for example, an amalgamation applied to a categorical variable with many levels represents a grouping of the categories (bondell & reich, ; gertheiss & tutz, ; tutz & gertheiss, ).

experiments

we evaluate codacore on a collection of benchmark datasets including datasets from the microbiome learning repo (vangay et al., ), and microbiome, metabolite, and microrna datasets curated by quinn & erb ( ). these data vary in dimension from to , covariates (see section b of the supplement for a full description). for each dataset, we fit codacore on random / train/test splits, sampled with stratification by case-control (he & ma, ). we compare against:

• interpretable models (sections . and . ): pairwise log-ratios (greenacre, b), implemented using a heuristic search for improved computational efficiency (quinn et al., ); selbal (rivera-pinto et al., ); and amalgam (quinn & erb, ). we also consider lasso logistic regression (with regularization parameter chosen by cross-validation with the -standard-error rule).
• other coda models (section . ): coda-lasso (lu et al., ), deepcoda (quinn et al., ), and clr-lasso (susin et al., ). note that these methods learn (weighted) geometric averages over a large number of input variables, which are evidently not as straightforward to interpret as simple balances or amalgamations.
• black box classifiers: random forest and xgboost, where we tune the model complexity parameters by cross-validation (subsample size and early stopping, respectively).

[plot: x-axis runtime (s); y-axis accuracy gain over baseline (%); legend: codacore (ours), selbal, pairwise log-ratios, coda-lasso, amalgam; point size indicates the number of inputs.]
figure . classification accuracy (over baseline) against runtime. each point represents one of datasets, with size proportional to the input dimension. note the x-axis is drawn on the log-scale. codacore (with balances) is the only method that scales effectively to our larger datasets, while consistently achieving high predictive accuracy. moreover, its performance is broadly consistent across smaller and larger datasets.

results

we evaluate the quality of our models across the following criteria: computational efficiency (as measured by runtime), sparsity (as measured by the percentage of input variables that are active in the model), and predictive accuracy (as measured by out-of-sample accuracy and roc auc). table provides an aggregated summary of the results; codacore (with balances) is performant on all metrics. indeed, our method provides the only interpretable model that is simultaneously scalable, sparse, and accurate. detailed performance metrics on each of the datasets are provided in section c of the supplement.
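the sparsity and accuracy criteria above are simple to compute. the following python sketch shows one way to do so; the function names, the example subsets, and the scores are hypothetical, and counting a variable as active if it appears in any log-ratio of the ensemble is an assumption about how the percentage is defined.

```python
def percent_active(ratios, p):
    """Percentage of the p input variables used by any log-ratio in the model."""
    active = set()
    for J_plus, J_minus in ratios:
        active.update(J_plus)
        active.update(J_minus)
    return 100.0 * len(active) / p

def accuracy(y, scores, cutoff=0.5):
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    hits = sum(int(s > cutoff) == yi for yi, s in zip(y, scores))
    return hits / len(y)

def roc_auc(y, scores):
    """Probability that a random case outranks a random control (ties count 1/2)."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical two-ratio model over p = 20 covariates.
ratios = [([0, 4], [2]), ([4], [7, 9])]
sparsity = percent_active(ratios, p=20)

# Hypothetical held-out labels and predicted probabilities.
y_test = [1, 0, 1, 0]
p_test = [0.9, 0.2, 0.4, 0.6]
acc = accuracy(y_test, p_test)
auc = roc_auc(y_test, p_test)
```

the pairwise auc computation is quadratic in the number of samples, which is fine for a sketch; a rank-based formula would be preferable on large test sets.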
figure shows the average runtime of our classifiers on each dataset, with larger points denoting larger datasets. codacore trains orders of magnitude faster and scales better than existing interpretable coda methods. on our larger datasets ( , inputs), selbal runs in ∼ hours, pairwise log-ratios and amalgam both run in ∼ hours, and codacore runs in under seconds (full runtimes are provided in table in the supplement). all runs, including those involving gradient descent, were performed on identical cpu cores; codacore can be accelerated further using gpus, but we did not find it necessary to do so. it is also worth noting that the outperformance of codacore is not merely a result of the other methods failing on high-dimensional datasets. the consistent performance of codacore across smaller and larger datasets is demonstrated in supplementary tables , , and , which show a full breakdown of results across each dataset.

table . evaluation metrics shown for each method, averaged over datasets × random train/test splits. standard errors are computed independently on each dataset, and then averaged over the datasets. the models are ordered by sparsity, i.e., percentage of active input variables. codacore (with balances) is the only learning algorithm that is simultaneously fast, sparse, and accurate.
columns: runtime (s) | active vars (%) | accuracy (%) | auc (%)
rows: majority class; codacore - balances (ours); codacore - amalgamations (ours); selbal (rivera-pinto et al., ); pairwise log-ratios (greenacre, b); lasso; coda-lasso (lu et al., ); amalgam (quinn & erb, ); deepcoda (quinn et al., ); clr-lasso (susin et al., ); random forest; xgboost (active vars is not reported for random forest and xgboost).
not only is codacore sparser and more accurate than other interpretable models, it also performs on par with state-of-the-art black-box classifiers. by simply reducing the regularization parameter, from λ = 1 to λ = , codacore (with balances) achieved an average out-of-sample accuracy of . % and an auc of . %, on par with random forest and xgboost (bottom rows of table ), while only using . % of the input variables, on average. this result indicates, first, that codacore provides a highly effective algorithm for variable selection in high-dimensional hts data. second, the fact that codacore achieves predictive accuracy similar to that of best-in-class black-box classifiers suggests that our model may have captured a near-complete representation of the signal in the data. at any rate, we take this as evidence that log-ratio transformed features are indeed of biological importance in the context of hts data, corroborating previous microbiome research (rahat-rozenbloom et al., ; crovesy et al., ; magne et al., ).

interpretability

the codacore algorithm offers two kinds of interpretability. first, it provides the analyst with sets of covariates whose aggregated ratio predicts the outcome of interest. these sets are easy to understand because they are discrete, with each component making an equivalent (unweighted) contribution. they are also sparse, usually containing fewer than features per ratio, and can be made sparser by adjusting the regularization parameter λ. such ratios have a precedent in microbiome research; for example, the firmicutes-to-bacteroidetes ratio is used as a biomarker of gut health (crovesy et al., ; magne et al., ). second, codacore ranks predictive ratios hierarchically. due to the ensembling procedure, the first ratio learned is the most predictive, the second ratio predicts the residual from the first, and so forth.
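the stage-wise ordering can be illustrated with a small python sketch. it rests on two assumptions that are not confirmed by the text: “fitting on the residual” is interpreted as treating the current ensemble's logits as a fixed offset, and the log-ratio features of each stage are supplied precomputed rather than learned through the continuous relaxation. all names and data values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mean_loss(y, logits):
    """Mean cross-entropy of 0/1 labels y given logits."""
    return sum(math.log1p(math.exp(f)) - yi * f
               for yi, f in zip(y, logits)) / len(y)

def fit_stage(z, y, offset, lr=0.5, steps=500):
    """Fit alpha + beta * z on top of fixed offset logits by gradient descent."""
    alpha, beta = 0.0, 0.0
    n = len(y)
    for _ in range(steps):
        resid = [sigmoid(o + alpha + beta * zi) - yi
                 for zi, yi, o in zip(z, y, offset)]
        alpha -= lr * sum(resid) / n
        beta -= lr * sum(r * zi for r, zi in zip(resid, z)) / n
    return alpha, beta

def stagewise_fit(features, y):
    """Add one log-ratio regressor per stage, keeping earlier stages fixed."""
    logits = [0.0] * len(y)
    losses = [mean_loss(y, logits)]
    for z in features:
        alpha, beta = fit_stage(z, y, logits)
        logits = [o + alpha + beta * zi for o, zi in zip(logits, z)]
        losses.append(mean_loss(y, logits))
    return logits, losses

# Hypothetical labels and two precomputed log-ratio features.
y = [1, 1, 1, 0, 0, 0, 1, 0]
z1 = [1.2, 0.8, 0.3, -0.9, -1.1, 0.2, -0.4, -0.6]
z2 = [0.5, -0.3, 0.1, 0.4, -0.2, 1.0, -1.0, 0.3]
logits, losses = stagewise_fit([z1, z2], y)
```

because each stage starts from zero coefficients on top of the existing ensemble, the training loss after a stage is never worse than before it; the size of the drop at each stage gives a rough sense of how much that ratio adds.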
like principal components, the balances (or amalgamations) learned by codacore are naturally ordered in terms of their explanatory power. this ordering aids interpretability by decomposing a multivariable model into comprehensible “chunks” of information.

notably, we find a high degree of stability in the log-ratios selected by the model. we repeated codacore on independent training set splits of the crohn disease data provided by rivera-pinto et al. ( ), and found consensus among the learned models. figure shows which bacteria were included for each split, in both versions of codacore (balances and amalgamations). importantly, most of the bacteria that were selected consistently by codacore (notably dialister, roseburia and clostridiales) were also identified by rivera-pinto et al. ( ). differences between the sets selected by codacore with balances vs. codacore with amalgamations can be explained by differences in how the geometric mean vs. summation operations impact the log-ratio. the geometric mean, being more sensitive to small numbers, is more affected by the presence of rarer bacteria species like dialister and roseburia (as compared with the more common bacteria species like haemophilus and faecalibacterium).

[figure: two panels over independent % training set splits. codacore - balances: dialister, aggregatibacter, lactobacillales, streptococcus, parabacteroides, peptostreptococcaceae, faecalibacterium, lachnospira, clostridiales, roseburia. codacore - amalgamations: haemophilus, enterobacteriaceae, fusobacterium, blautia, streptococcus, dialister, lachnospiraceae, roseburia, prevotella, clostridiales, parabacteroides, bacteroides, faecalibacterium.]
figure . codacore variable selection for the first (most explanatory) log-ratio on the crohn disease data (rivera-pinto et al., ). for each of independent training set splits ( % of the data), we show which variables are selected in the numerator (blue) and denominator (orange) of the log-ratio. both versions of codacore, with balances (top) or amalgamations (bottom), learn remarkably consistent log-ratios across independent training sets.

scaling to liquid biopsy data

hts data generated from clinical blood samples can be described as a “liquid biopsy” that can be used for cancer diagnosis and surveillance (best et al., ; alix-panabières & pantel, ). these data can be very high-dimensional, especially when they include all gene transcripts as covariates. in a clinical context, the use of log-ratio predictors is an attractive option because they automatically correct for inter-sample sequencing biases that might otherwise limit the generalizability of the models (dillies et al., ). unfortunately, existing log-ratio methods like selbal and amalgam simply cannot scale to liquid biopsy data sets that contain as many as , or more input variables. the large dimensionality of such data has restricted its analysis to overly simplistic linear models, black-box models that are scalable but not interpretable, or suboptimal hybrid approaches where covariates must be pre-selected based on univariate measures (best et al., ; zhang et al., ; sheng et al., ). owing to its linear scaling, codacore can be fitted to these data at a similar computational cost to a single lasso regression, i.e., under a minute on a single cpu core. thus, codacore can be used to discover interpretable and predictive log-ratios that are suitable for liquid biopsy cancer diagnostics, among other similar applications.

table . evaluation metrics for the liquid biopsy data (best et al., ), averaged over independent / train/test splits. codacore (with balances) achieves the same predictive accuracy as competing methods, but with much sparser solutions. note that sparsity is expressed as an (integer) number of active variables in the model (not as a percentage of the total, as was done in table ). running time is shown in seconds (standard errors were small and are omitted for brevity).
columns: time | # vars | acc. (%) | auc (%)
rows: baseline; codacore; lasso; rf; xgboost (# vars is not reported for rf and xgboost).

we showcase the capabilities of codacore in this high-dimensional setting by applying our algorithm to the liquid biopsy data of best et al. ( ). these contain p = , genes sequenced in n = human subjects, of whom were healthy controls, the others having been previously diagnosed with cancer. averaging over random / train/test splits of this dataset, we found that codacore achieved the same predictive accuracy as competing methods (within error), but obtained a much sparser model. remarkably, codacore identified log-ratios involving just genes that were as predictive as both black-box classifiers and linear models with over covariates. this case study again illustrates the potential of codacore to derive novel biological insights, and also to develop learning algorithms for cancer diagnosis, a domain in which model interpretability, including sparsity, is of paramount importance (wan et al., ).

conclusion

our results corroborate the summary in table : codacore is the first sparse and interpretable coda model that can scale to high-dimensional hts data. it does so convincingly, with linear scaling that results in runtimes similar to linear models. our method is also competitive in terms of predictive accuracy, performing comparably to powerful black-box classifiers while remaining interpretable. our findings suggest that codacore could play a significant role in the future analysis of high-throughput sequencing data, with broad implications in microbiology, statistical genetics, and more generally, in the field of coda.

references

aitchison, j.
the statistical analysis of compositional data. journal of the royal statistical society: series b (methodological), ( ): – , .
alix-panabières, c. and pantel, k. clinical applications of circulating tumor cells and circulating tumor dna as liquid biopsy. cancer discovery, ( ): – , .
bengio, y., léonard, n., and courville, a. estimating or propagating gradients through stochastic neurons for conditional computation. arxiv preprint arxiv: . , .
best, m. g., sol, n., kooi, i., tannous, j., westerman, b. a., rustenburg, f., schellen, p., verschueren, h., post, e., koster, j., et al. rna-seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. cancer cell, ( ): – , .
blondel, m., teboul, o., berthet, q., and djolonga, j. fast differentiable sorting and ranking. in international conference on machine learning, pp. – . pmlr, .
bondell, h. d. and reich, b. j. simultaneous factor selection and collapsing levels in anova. biometrics, ( ): – , .
calle, m. l. statistical analysis of metagenomics data. genomics & informatics, ( ), .
cammarota, g., ianiro, g., ahern, a., carbone, c., temko, a., claesson, m. j., gasbarrini, a., and tortora, g. gut microbiome, big data and machine learning to promote precision medicine for cancer. nature reviews gastroenterology & hepatology, ( ): – , .
crovesy, l., masterson, d., and rosado, e. l. profile of the gut microbiota of adults with obesity: a systematic review. european journal of clinical nutrition, ( ): – , .
cuturi, m., teboul, o., and vert, j.-p. differentiable ranking and sorting using optimal transport. in advances in neural information processing systems, pp. – , .
dillies, m.-a., rau, a., aubert, j., hennequet-antier, c., jeanmougin, m., servant, n., keime, c., marot, g., castel, d., estelle, j., et al. a comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. briefings in bioinformatics, ( ): – , .
egozcue, j. j. and pawlowsky-glahn, v. groups of parts and their balances in compositional data analysis. mathematical geology, ( ): – , .
egozcue, j. j. and pawlowsky-glahn, v. compositional data: the sample space and its structure. test, ( ): – , .
egozcue, j. j., pawlowsky-glahn, v., mateu-figueras, g., and barcelo-vidal, c. isometric logratio transformations for compositional data analysis. mathematical geology, ( ): – , .
fernandes, a. d., macklaim, j. m., linn, t. g., reid, g., and gloor, g. b. anova-like differential expression (aldex) analysis for mixed population rna-seq. plos one, ( ):e , .
fernandes, a. d., reid, j. n., macklaim, j. m., mcmurrough, t. a., edgell, d. r., and gloor, g. b. unifying the analysis of high-throughput sequencing datasets: characterizing rna-seq, s rrna gene sequencing and selective growth experiments by compositional data analysis. microbiome, ( ): , .
filzmoser, p. and walczak, b. what can go wrong at the data normalization step for identification of biomarkers? journal of chromatography a, : – , .
filzmoser, p., hron, k., and reimann, c. univariate statistical analysis of environmental (compositional) data: problems and possibilities. science of the total environment, ( ): – , .
forslund, k., hildebrand, f., nielsen, t., falony, g., le chatelier, e., sunagawa, s., prifti, e., vieira-silva, s., gudmundsdottir, v., pedersen, h. k., et al. disentangling type diabetes and metformin treatment signatures in the human gut microbiota. nature, ( ): – , .
friedman, j., hastie, t., tibshirani, r., et al. the elements of statistical learning, volume . springer series in statistics new york, .
gertheiss, j. and tutz, g. sparse modeling of categorial explanatory variables. the annals of applied statistics, pp. – , .
gloor, g. b. and reid, g. compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. canadian journal of microbiology, ( ): – , .
gloor, g. b., macklaim, j. m., pawlowsky-glahn, v., and egozcue, j. j. microbiome datasets are compositional: and this is not optional. frontiers in microbiology, : , .
goodman, b. and flaxman, s. european union regulations on algorithmic decision-making and a “right to explanation”. ai magazine, ( ): – , .
gordon-rodriguez, e., loaiza-ganem, g., and cunningham, j. the continuous categorical: a novel simplex-valued exponential family. in international conference on machine learning, pp. – . pmlr, a.
gordon-rodriguez, e., loaiza-ganem, g., pleiss, g., and cunningham, j. p. uses and abuses of the cross-entropy loss: case studies in modern deep learning. arxiv preprint arxiv: . , b.
greenacre, m. comments on: compositional data: the sample space and its structure. test, ( ): – , a.
greenacre, m. variable selection in compositional data analysis using pairwise logratios. mathematical geosciences, ( ): – , b.
greenacre, m. amalgamations are valid in compositional data analysis, can be used in agglomerative clustering, and their logratios have an inverse transformation. applied computing and geosciences, : , .
greenacre, m., grunsky, e., and bacon-shone, j. a comparison of isometric and amalgamation logratio balances in compositional data analysis. computers & geosciences, pp. , .
he, h. and ma, y. imbalanced learning: foundations, algorithms, and applications. .
jang, e., gu, s., and poole, b. categorical reparameterization with gumbel-softmax. arxiv preprint arxiv: . , .
li, h. microbiome, metagenomics, and high-dimensional compositional data analysis. annual review of statistics and its application, : – , .
linderman, s., mena, g., cooper, h., paninski, l., and cunningham, j. reparameterizing the birkhoff polytope for variational permutation inference. in international conference on artificial intelligence and statistics, pp. – . pmlr, .
lovell, d., pawlowsky-glahn, v., egozcue, j. j., marguerat, s., and bähler, j. proportionality: a valid alternative to correlation for relative data. plos comput biol, ( ): e , .
lu, j., shi, p., and li, h. generalized linear models with linear constraints for microbiome compositional data. biometrics, ( ): – , .
maddison, c. j., mnih, a., and teh, y. w. the concrete distribution: a continuous relaxation of discrete random variables. in international conference on learning representations, .
magne, f., gotteland, m., gauthier, l., zazueta, a., pesoa, s., navarrete, p., and balamurugan, r. the firmicutes/bacteroidetes ratio: a relevant marker of gut dysbiosis in obese patients? nutrients, ( ): , .
martı́n-fernández, j., pawlowsky-glahn, v., egozcue, j., and tolosona-delgado, r. advances in principal balances for compositional data. mathematical geosciences, ( ): – , .
martino, c., morton, j. t., marotz, c. a., thompson, l. r., tripathi, a., knight, r., and zengler, k. a novel sparse compositional technique reveals microbial perturbations. msystems, ( ), .
mena, g., snoek, j., linderman, s., and belanger, d. learning latent permutations with gumbel-sinkhorn networks. in international conference on learning representations, .
mert, m. c., filzmoser, p., and hron, k. sparse principal balances. statistical modelling, ( ): – , .
morton, j. t., sanders, j., quinn, r. a., mcdonald, d., gonzalez, a., vázquez-baeza, y., navas-molina, j. a., song, s. j., metcalf, j. l., hyde, e. r., et al. balance trees reveal microbial niche differentiation. msystems, ( ), .
morton, j. t., aksenov, a. a., nothias, l. f., foulds, j. r., quinn, r. a., badri, m. h., swenson, t. l., van goethem, m. w., northen, t. r., vazquez-baeza, y., et al. learning representations of microbe–metabolite interactions. nature methods, ( ): – , a.
morton, j. t., marotz, c., washburne, a., silverman, j., zaramela, l. s., edlund, a., zengler, k., and knight, r.
establishing microbial composition measurement stan- dards with reference frames. nature communications, ( ): – , b. noguera-julian, m., rocafort, m., guillén, y., rivera, j., casadellà, m., nowak, p., hildebrand, f., zeller, g., par- era, m., bellido, r., et al. gut microbiota linked to sexual preference and hiv infection. ebiomedicine, : – , . pawlowsky-glahn, v. and buccianti, a. compositional data analysis: theory and applications. john wiley & sons, . pawlowsky-glahn, v. and egozcue, j. j. compositional data and their analysis: an introduction. geological society, london, special publications, ( ): – , . pawlowsky-glahn, v., egozcue, j. j., tolosana delgado, r., et al. principal balances. proceedings of codawork, pp. – , . learning sparse log-ratios for high-throughput sequencing data pearson, k. vii. mathematical contributions to the theory of evolution.—iii. regression, heredity, and panmixia. philo- sophical transactions of the royal society of london. series a, containing papers of a mathematical or physi- cal character, ( ): – , . potapczynski, a., loaiza-ganem, g., and cunningham, j. p. invertible gaussian reparameterization: revisiting the gumbel-softmax. advances in neural information processing systems, , . prifti, e., chevaleyre, y., hanczar, b., belda, e., danchin, a., clément, k., and zucker, j.-d. interpretable and accurate prediction models for metagenomics data. giga- science, ( ):giaa , . quinn, t., nguyen, d., rana, s., gupta, s., and venkatesh, s. deepcoda: personalized interpretability for composi- tional health data. in international conference on ma- chine learning, pp. – . pmlr, . quinn, t. p. and erb, i. using balances to engineer fea- tures for the classification of health biomarkers: a new approach to balance selection. biorxiv, pp. , . quinn, t. p. and erb, i. amalgams: data-driven amalga- mation for the dimensionality reduction of compositional data. nar genomics and bioinformatics, ( ):lqaa , . quinn, t. p., richardson, m. 
f., lovell, d., and crowley, t. m. propr: an r-package for identifying proportion- ally abundant features using compositional data analysis. scientific reports, ( ): – , . quinn, t. p., erb, i., richardson, m. f., and crowley, t. m. understanding sequencing data as compositions: an out- look and review. bioinformatics, ( ): – , . quinn, t. p., erb, i., gloor, g., notredame, c., richardson, m. f., and crowley, t. m. a field guide for the compo- sitional analysis of any-omics data. gigascience, ( ): giz , . rahat-rozenbloom, s., fernandes, j., gloor, g. b., and wolever, t. m. evidence for greater production of colonic short-chain fatty acids in overweight than lean humans. international journal of obesity, ( ): – , . rivera-pinto, j., egozcue, j. j., pawlowsky-glahn, v., pare- des, r., noguera-julian, m., and calle, m. l. balances: a new perspective for microbiome analysis. msystems, ( ), . sheng, m., dong, z., and xie, y. identification of tumor- educated platelet biomarkers of non-small-cell lung can- cer. oncotargets and therapy, : , . silverman, j. d., washburne, a. d., mukherjee, s., and david, l. a. a phylogenetic transform enhances analysis of compositional microbiota data. elife, :e , . susin, a., wang, y., lê cao, k.-a., and calle, m. l. vari- able selection in microbiome compositional data analysis. nar genomics and bioinformatics, ( ):lqaa , . templ, m. artificial neural networks to impute rounded zeros in compositional data. arxiv preprint arxiv: . , . tolosana-delgado, r., talebi, h., khodadadzadeh, m., and van den boogaart, k. on machine learning algorithms and compositional data. in proceedings of the th in- ternational workshop on compositional data analysis, terrassa, spain, pp. – , . tutz, g. and gertheiss, j. regularized regression for cate- gorical data. statistical modelling, ( ): – , . van den boogaart, k. g. and tolosana-delgado, r. ana- lyzing compositional data with r, volume . springer, . vangay, p., hillmann, b. m., and knights, d. 
microbiome learning repo (ml repo): a public repository of micro- biome regression and classification tasks. gigascience, ( ), . wan, j. c., massie, c., garcia-corbacho, j., mouliere, f., brenton, j. d., caldas, c., pacey, s., baird, r., and rosen- feld, n. liquid biopsies come of age: towards implemen- tation of circulating tumour dna. nature reviews cancer, ( ): , . washburne, a. d., silverman, j. d., leff, j. w., bennett, d. j., darcy, j. l., mukherjee, s., fierer, n., and david, l. a. phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets. peerj, :e , . wooley, j. c., godzik, a., and friedberg, i. a primer on metagenomics. plos comput biol, ( ):e , . xie, s. m. and ermon, s. reparameterizable subset sam- pling via continuous relaxations. in international joint conferences on artificial intelligence, . xie, y., dai, h., chen, m., dai, b., zhao, t., zha, h., wei, w., and pfister, t. differentiable top-k operator with optimal transport. advances in neural information processing systems, , . yang, j., zhang, q., ni, b., li, l., liu, j., zhou, m., and tian, q. modeling point clouds with self-attention and gumbel subset sampling. in proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. – , . learning sparse log-ratios for high-throughput sequencing data zhang, y.-h., huang, t., chen, l., xu, y., hu, y., hu, l.-d., cai, y., and kong, x. identifying and analyzing different cancer subtypes using rna-seq data of blood platelets. oncotarget, ( ): , . 
strainflair: strain-level profiling of metagenomic samples using variation graphs

kévin da silva*, nicolas pons, magali berland, florian plaza oñate, mathieu almeida, and pierre peterlongo

univ rennes, inria, cnrs, irisa - umr , f- rennes, france
université paris-saclay, inrae, mgp, jouy-en-josas, france

*corresponding author: kévin da silva. email address: kevin.da-silva@inria.fr

abstract

current studies are shifting from the use of single linear references to the representation of multiple genomes organised in pangenome graphs or variation graphs. meanwhile, in metagenomic samples, resolving strain-level abundances is a major step in microbiome studies, as associations between strain variants and phenotype are of great interest for diagnostic and therapeutic purposes. we developed strainflair with the aim of showing the feasibility of using variation graphs for indexing highly similar genomic sequences up to the strain level, and for characterizing a set of unknown sequenced genomes by querying this graph. on simulated data composed of mixtures of strains from the same bacterial species escherichia coli, results show that strainflair was able to distinguish and estimate the abundances of close strains, as well as to highlight the presence of a new strain close to a referenced one and to estimate its abundance. on a real dataset composed of a mix of several bacterial species and several strains for the same species, results show that in a more complex configuration strainflair correctly estimates the abundance of each strain. hence, results demonstrated how graph representation of multiple close genomes can be used as a reference to characterize a sample at the strain level.

availability: http://github.com/kevsilva/strainflair

introduction

the use of reference genomes has shaped the way genomics studies are currently conducted.
reference genomes are particularly useful for reference-guided genome assembly, variant calling, or mapping sequencing reads. for the latter, they provide a unique coordinate system to locate variants, allowing researchers to work on the same reference and easily share information. however, the usage of reference genomes represented as flat sequences reaches some limits (ballouz et al., ). close reference genomes or genomes of strains from the same species show a high sequence similarity. mapping sequencing reads on similar reference genomes results in mis-mapped reads or ambiguous alignments, generating noise in the downstream analysis that has yet to be clarified (na et al., ). this has led recent methods to provide a representation of multiple genomes as genome graphs, also called variation graphs, in which each path is a different known variation. such graph representations are well defined, and tools to build and manipulate graphs are under active development (garrison et al., ; kim et al., ; rakocevic et al., ; li et al., ). this graph structure provides obvious advantages such as the reduction of data redundancy, while highlighting variations (garrison et al., ). however, it also introduces novel difficulties. updating a graph with novel sequences, adapting existing efficient algorithms for read mapping, and, mainly, developing new ways to analyse sequence-to-graph mapping results for downstream analyses are among those new challenges. the work presented here primarily focuses on this last point and proposes to show the feasibility of using a variation graph for identifying and estimating abundances, at the strain level, from an unknown metagenomic read set.

[biorxiv preprint, this version posted february. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by international license.]

in the context of metagenomics, representing genomes in graphs is of particular interest for indexing microorganism genomes. microorganisms are predominant in almost every ecosystem, from ocean water (sunagawa et al., ) to the human body (clemente et al., ), and play major functional roles in them (new and brito, ). while studies in microbial ecology are facing a bottleneck due to the difficulty of isolating and cultivating most of those microbes in the laboratory, which prevents the analysis of the complex structure and dynamics of microbial communities (stewart, ), high-throughput sequencing in metagenomics offers the opportunity to study a whole ecosystem. in particular, shotgun sequencing allows a resolution up to the species level (jovel et al., ), and enables sample analysis in terms of population stratification, microbial diversity or biomarker identification (quince et al., ). the structure and dynamics of microbial communities are usually revealed by resolving the species present in samples and their relative abundances, which can then be associated with phenotypes, notably in the field of human health (ehrlich, ; vieira-silva et al., ; solé et al., ). characterizing samples at the strain level is now of growing interest, as it may highlight new associations with phenotypes, and a better understanding of the functional impact of strains in host-microbe interactions is crucial to new therapeutic strategies and personalized medicine. escherichia coli, which has a highly variable genome, is a well-known example since some strains are harmless commensals in the human gut microbiota while others are harmful pathogens (rasko et al., ; loman et al., ).
current approaches to handle multiple similar genomes, as with strains, use gene clustering and then select the representative sequence of each cluster, getting rid of the redundancy but also of the variations, which are nonetheless crucial to distinguish the strains of a species (qin et al., ). hence, indexation of a set of known strains is a good framework for testing the ability of a variation graph to capture the diversity while offering a way to correctly assign sequenced data to the strains they belong to. in this work, we present strainflair, a novel method and its implementation that uses a variation graph representation of gene sequences for strain identification and quantification. we propose novel algorithmic and statistical solutions for managing ambiguous alignments and computing an adequate abundance metric at the graph node level. results show that we could correctly identify and quantify strains present in a sample. notably, we could also identify close strains not present in the reference. strainflair is available at http://github.com/kevsilva/strainflair.

methods

we describe here our tool strainflair (strain-level profiling using variation graph). this method exploits various state-of-the-art tools and proposes novel algorithmic solutions for indexing bacterial genomes at the strain level. it also allows querying metagenomes to assess and quantify their content with regard to the indexed genomes. an overview of the index and query pipelines is presented in fig. . the rationale for the choice of third-party tools and their detailed usage are given in supplementary materials, section s . .

indexing strains

gene prediction

as non-coding dna represents % on average of bacterial genomes and is not well characterized in terms of structure, strainflair focuses on protein-coding genes in order to characterize strains by their gene content and the nucleotide variations of these genes.
moreover, non-coding dna regions can be highly variable (thorpe et al., ), and taking complete genomes into account would then lead to highly complex graphs and to combinatorial explosions when mapping reads. additionally, complete genomes are not always available. focusing on genes also makes it possible to use drafts and metagenome-assembled genomes or a pre-existing set of known genes (qin et al., ; li et al., ). hence, strainflair indexes genes instead of complete genomes in graphs. genes are predicted using prodigal, a tool for prokaryotic protein-coding gene prediction (hyatt et al., ). some reads map at the junction between a gene and the intergenic regions; when only gene sequences are conserved, the mapping results for such reads are biased towards deletions, which drastically lowers the mapping score. in order to alleviate this situation, we extend the predicted gene sequences at both ends. hence, strainflair conserves predicted genes plus their surrounding sequences. by default, and if the sequence is long enough, we conserve bp on the left and on the right of each gene.

figure . strainflair overview. a. indexation. input is a set of known reference genomes of various bacterial species and strains. strainflair uses a graph for indexing genes of those reference genomes. b. read mapping on the previously mentioned graph. c. mapped reads analysis. strainflair assigns and estimates species and strain abundances of a bacterial metagenomic sample represented as short reads.

gene clustering

genes are clustered into gene families using cd-hit (li and godzik, ).
for the clustering step, the genes without extensions are used in order to cluster strictly according to the exact gene sequences, without any part of the intergenic regions. cd-hit-est is used to perform the clustering with an identity threshold of . and a coverage of . on the shorter sequence. the local sequence identity is calculated as the number of identical bases in the alignment divided by the length of the alignment. sequences are assigned to the best fitting cluster satisfying these requirements.

graph construction

each gene family is represented as a variation graph (fig. ). variation graphs are bidirected dna sequence graphs that represent multiple sequences, including their genetic variation. each node of the graph contains sub-sequences of the input sequences, and successive nodes draw paths on the graph. paths corresponding to reference sequences are specifically called “colored paths”. each colored path corresponds to the original sequence of a gene in the cluster.

figure . illustration of a variation graph structure and colored paths. each node of the graph contains a sub-sequence of the input sequences and is integer-indexed. a path corresponding to an input sequence is called a colored path, and is encoded by its succession of node ids, e.g. , , , for the colored path in this example.

in the case of a cluster composed of only one sequence, vg toolkit (garrison et al., ) is used to convert the sequence into a flat graph. alternatively, when a cluster is composed of two sequences or more, minimap (li, ) is used to generate a multiple sequence alignment.
then seqwish (garrison, ) is used to convert this multiple sequence alignment into a variation graph. all the graphs computed this way (one per input cluster) are then concatenated to produce a single variation graph in which each cluster of genes is a connected component. the index is created once for a set of reference genomes. afterward, any set of sequenced reads can be profiled at the strain level based on this index.

querying variation graphs

mapping reads

for mapping reads on the previously described reference graph, we use the sequence-to-graph mapper vg mpmap from vg toolkit. it produces so-called “multipath alignments”. a multipath alignment is a graph of partial alignments and can be seen as a sub-graph (a subset of edges and vertices) of the whole variation graph (see fig. for an example). the mapping result describes, for each read, the nodes of the variation graph traversed by the alignment and the potential mismatches or indels between the read and the sequence of each traversed node.

reads attribution

when mapping a read on a graph with colored paths, two key issues arise, as illustrated in fig. . as mapping generates a sub-graph per mapped read, the most probable mapped path(s) have to be defined. meanwhile, the most probable mapped path(s) corresponding to a colored path also have to be defined. hence we developed an algorithm to analyse and convert, when possible, a mapping result into one or several continuous path(s) (successive nodes joined by only one edge) per mapped read. in addition, we propose an algorithm to attribute such a path to the most probable colored path(s).

path attribution

a breadth-first search on the multipath alignment is proposed. it starts at each node of the alignment with a user-defined threshold on the mapping score. a single path alignment with a mapping score below this threshold is ignored, and the single path alignment with the best mapping score is retained.
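the path attribution step above can be illustrated with a minimal sketch. it is not the strainflair implementation: the input layout (`subpaths`, `edges`), the function name `single_path_alignments`, and the choice of scoring a single path as the sum of its partial alignment scores are all assumptions made for illustration, and the search is simplified to start only from partial alignments that have no predecessor.

```python
from collections import deque


def single_path_alignments(subpaths, edges, min_score):
    """Enumerate single path alignments of one multipath alignment by BFS.

    subpaths: dict mapping a partial-alignment id to (node_ids, score)
              (hypothetical layout, not the vg output format)
    edges:    dict mapping a partial-alignment id to its successor ids
    Returns the node successions of the best-scoring single paths whose
    score is at least min_score (several paths are returned on ties).
    """
    has_predecessor = {t for targets in edges.values() for t in targets}
    starts = [s for s in subpaths if s not in has_predecessor]

    queue = deque((s,) for s in starts)
    complete = []
    while queue:
        chain = queue.popleft()
        successors = edges.get(chain[-1], [])
        if not successors:
            # Assumption: the score of a single path is the sum of its parts.
            score = sum(subpaths[s][1] for s in chain)
            if score >= min_score:
                nodes = tuple(n for s in chain for n in subpaths[s][0])
                complete.append((nodes, score))
            continue
        for nxt in successors:
            queue.append(chain + (nxt,))

    if not complete:
        return []
    best = max(score for _, score in complete)
    return [nodes for nodes, score in complete if score == best]


# Toy multipath alignment: two alternatives, one unambiguous part,
# then two alternatives again, giving four tied single paths.
toy_subpaths = {
    "a1": ((1,), 10), "a2": ((2,), 10),
    "b": ((3,), 20),
    "c1": ((4,), 10), "c2": ((5,), 10),
}
toy_edges = {"a1": ["b"], "a2": ["b"], "b": ["c1", "c2"]}
toy_paths = single_path_alignments(toy_subpaths, toy_edges, min_score=30)
```

with equal scores on both alternatives, all four single paths are kept, which is the tie situation that makes a read ambiguous in the following steps.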
additionally, for each alignment, nodes are associated with a so-called “horizontal coverage” value. the horizontal coverage of a node by a read corresponds to the proportion of bases of the node covered by the read. hence, a node has a horizontal coverage of if all its nucleotides are covered by the read, with or without mismatches or indels. because of possible ties in mapping score, the search can result in multiple single path alignments, as illustrated in fig. (a). this situation corresponds to a read whose sequence is found in several different genes, or to a read mapping onto the similar region of different versions of a gene. to take ambiguous mapping assignments into account, as shown below, the parsing of the mapping output is decomposed into two steps. the first step processes the reads that mapped to a single colored path (called “unique mapped reads” here), corresponding to a single gene. the second step processes the reads with multiple alignments (called “multiple mapped reads” here).

colored path attribution

once a read is assigned to one or several path alignment(s), it still has to be attributed, if possible, to a colored path. the following process attributes each mapped read to a colored path and computes various metrics for downstream analyses. in particular, an absolute abundance for each node of the variation graph, called the “node abundance”, is computed, first focusing on unique mapped reads (first step). for a given alignment, the successive nodes composing the path are compared to the existing colored paths of the variation graph. if the alignment matches part of a colored path, the number of mapped reads on this path is incremented by one (i.e. reads raw count). the node abundance for each node of the alignment is incremented with its horizontal node coverage defined by this alignment. alignments with no matching colored paths are skipped. then, we focus on multiple mapped reads (second step), as illustrated in fig. (b).
during this step, the alignment matches multiple colored paths. hence, the abundance is distributed to each matching colored path relative to the ratio between them. this ratio is determined from the reads raw count of each path from the first step. for example, if unique mapped reads were found for path and for path during the first step, a read matching ambiguously both path and path during the second step counts as . for path and . for path .

figure . illustration of the multipath alignment concept and the read attribution process. (a) path attribution. the region of the read in blue aligns unambiguously to a node of the graph, while the dark and light red parts can align either to the top or to the bottom nodes of their respective mapping localization (due to mismatches that can align on both nodes, for example), drawing an alignment as a sub-graph of the reference variation graph, and thus opening the possibility of four single path alignments. (b) colored path attribution. first, from the multipath alignment (all four read sub-paths), the breadth-first search finds the possible corresponding single path alignments while respecting the mapping score threshold imposed by the user. here, for the example, all four possible paths are considered valid. second, each single path is compared to the colored paths from the reference variation graph. two single path alignments matched the colored paths ( - - and - - ). as it mapped equally to more than one colored path, this read falls in the multiple mapped reads case and is processed during the second step of the algorithm.
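the two-step attribution described above can be sketched as follows. this is a minimal illustration under stated assumptions, not the strainflair code: the names `attribute_reads`, `matching_paths` and `contains_subpath`, the input layout, and the toy read counts are hypothetical, and the equal split used when no matching path received unique reads is an assumption, since the text does not cover that case.

```python
from collections import defaultdict


def contains_subpath(colored_path, nodes):
    """True if `nodes` appears as a contiguous run inside `colored_path`."""
    n = len(nodes)
    return any(tuple(colored_path[i:i + n]) == tuple(nodes)
               for i in range(len(colored_path) - n + 1))


def matching_paths(read, colored_paths):
    """Map each matched colored path id to the first alignment matching it."""
    matches = {}
    for alignment in read:
        nodes = [node for node, _ in alignment]
        for path_id, colored_path in colored_paths.items():
            if path_id not in matches and contains_subpath(colored_path, nodes):
                matches[path_id] = alignment
    return matches


def attribute_reads(reads, colored_paths):
    """reads: list of reads; a read is a list of single path alignments,
    each alignment being a list of (node_id, horizontal_coverage) pairs."""
    read_counts = defaultdict(float)
    node_abundance = defaultdict(float)
    ambiguous = []

    # Step 1: unique mapped reads count fully for their colored path.
    for read in reads:
        matches = matching_paths(read, colored_paths)
        if len(matches) == 1:
            (path_id, alignment), = matches.items()
            read_counts[path_id] += 1.0
            for node, coverage in alignment:
                node_abundance[node] += coverage
        elif len(matches) > 1:
            ambiguous.append(matches)
        # Reads matching no colored path are skipped.

    unique_counts = dict(read_counts)

    # Step 2: multiple mapped reads are split using the step-1 raw counts.
    for matches in ambiguous:
        total = sum(unique_counts.get(p, 0.0) for p in matches)
        for path_id, alignment in matches.items():
            if total > 0:
                weight = unique_counts.get(path_id, 0.0) / total
            else:
                weight = 1.0 / len(matches)  # assumption: equal split
            read_counts[path_id] += weight
            for node, coverage in alignment:
                node_abundance[node] += weight * coverage

    return dict(read_counts), dict(node_abundance)


# Toy data: 3 unique reads on p1, 1 on p2, and 1 read ambiguous on node 3.
toy_colored_paths = {"p1": [1, 3, 4], "p2": [2, 3, 5]}
toy_reads = (
    [[[(1, 1.0), (3, 1.0)]]] * 3
    + [[[(2, 1.0), (3, 1.0)]]]
    + [[[(3, 0.5)]]]
)
toy_counts, toy_abundance = attribute_reads(toy_reads, toy_colored_paths)
```

in this toy case the ambiguous read counts as 0.75 for p1 and 0.25 for p2, following the 3:1 ratio of unique mapped reads. the ratio is taken from a snapshot of the step-1 counts so that the result does not depend on the order in which ambiguous reads are processed.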
this ratio is applied to increment both the raw count of reads and the coverage of the nodes.

gene-level and strain-level abundances

strainflair output is decomposed into an intermediate result describing the queried sample and gene-level abundances, and the final result describing the strain-level abundances.

gene-level

after parsing the mapping result, the first output provides information for each colored path, i.e. each version of a gene. this first result thereby provides gene-level information, including abundances. an exhaustive description of these intermediate results is provided in section s . in supplementary materials. we describe here three major metrics output by strainflair: the mean abundance of the nodes composing the path. instead of solely counting reads, we make full use of the graph structure and we propose an abundance computation for each node as previously explained, and as already done for haplotype resolution (baaijens et al., ). hence, for each colored path, the gene abundance is estimated by the mean of the node abundances. in order not to underestimate the abundance in case of a lack of sequencing depth (which could result in certain nodes not being traversed by sequencing reads), the mean abundance without the nodes of the path never covered by a read is also output. the mean abundances with and without these non-covered nodes are computed using unique mapped reads only or all mapped reads. the ratio of covered nodes, defined as the proportion of nodes of the path whose abundance is strictly greater than zero.
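the gene-level metrics listed above can be computed from the node abundances as in the following minimal sketch. the function name `gene_level_metrics`, the dictionary keys and the toy values are hypothetical; the mean is taken unweighted over nodes, as the text describes a plain mean of the node abundances.

```python
def gene_level_metrics(colored_path, node_abundance):
    """Summarise one colored path (one version of a gene).

    colored_path:   succession of node ids of the path
    node_abundance: dict node id -> abundance (missing nodes count as 0)
    """
    values = [node_abundance.get(node, 0.0) for node in colored_path]
    covered = [v for v in values if v > 0]
    return {
        # Mean over all nodes of the path, covered or not.
        "mean_abundance": sum(values) / len(values),
        # Mean restricted to nodes covered by at least one read.
        "mean_abundance_covered_only":
            sum(covered) / len(covered) if covered else 0.0,
        # Proportion of nodes with abundance strictly greater than zero.
        "covered_node_ratio": len(covered) / len(values),
    }


# Toy path of four nodes, one of which was never covered by a read.
toy_metrics = gene_level_metrics([1, 2, 3, 4], {1: 4.0, 2: 2.0, 4: 6.0})
```

for this toy path the mean over all nodes is 3.0, the mean over covered nodes is 4.0, and the ratio of covered nodes is 0.75, which shows how an uncovered node pulls the first estimate down.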
strain-level

strain-level abundances are then obtained by exploiting the specific genes of each reference genome from these intermediate results. first, for each genome, the proportion of detected genes is computed as the proportion of specific genes on which at least one read maps. then, the global abundance of the genome is computed as the mean or median of all its specific gene abundances. however, if the proportion of detected genes is less than a user-defined threshold, the genome is considered absent and hence its abundance is set to zero. the final output of strainflair is a table in which each line corresponds to one of the reference genomes, with columns containing the proportion of detected specific genes and our proposed metrics to estimate their abundances (using mean or median, with or without never covered nodes, as described for the gene-level result). results presented in section s . in supplementary materials validate and motivate the proposed abundance metric by comparing it to the expected abundances and to other estimations using linear models.

results

we validated our method on both a simulated and a real dataset. all computations were performed using strainflair, version . . , with default parameters. the relative abundance estimation was based on the mean of the specific gene abundances, computed by taking into account all the nodes (including non-covered nodes), and using a threshold on the proportion of detected specific genes of %. results were compared to kraken (wood et al., ), considered one of the state-of-the-art tools dedicated to the characterization of read set content, and based on flat sequences as references. read counts given by kraken were normalized by the genome length and converted into relative abundances. computing setup and performance are indicated in supplementary materials, section s . .

validation on a simulated dataset

we first validated our method on simulated data, focusing on a single species with multiple strains.
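as a reminder of the strain-level procedure described in methods, here is a minimal sketch. the function names and the default `min_detected=0.5` are hypothetical (the threshold value used in the experiments is not reproduced here), and a specific gene is treated as detected when its abundance is strictly positive, which is an approximation of "at least one read maps".

```python
from statistics import mean, median


def strain_abundance(specific_gene_abundances, min_detected=0.5,
                     use_median=False):
    """Return (proportion of detected specific genes, strain abundance).

    specific_gene_abundances: abundances of the genes specific to one genome
    min_detected: hypothetical threshold on the proportion of detected genes
    """
    values = list(specific_gene_abundances)
    if not values:
        return 0.0, 0.0
    detected = sum(1 for v in values if v > 0) / len(values)
    if detected < min_detected:
        # Genome considered absent: abundance set to zero.
        return detected, 0.0
    aggregate = median if use_median else mean
    return detected, aggregate(values)


def relative_abundances(abundances):
    """Convert absolute strain abundances into proportions summing to 1."""
    total = sum(abundances.values())
    return {name: (value / total if total else 0.0)
            for name, value in abundances.items()}


# Toy genome with four specific genes, one of them undetected.
toy_detected, toy_abundance = strain_abundance([2.0, 4.0, 0.0, 6.0])
toy_relative = relative_abundances({"strain_a": 3.0, "strain_b": 1.0,
                                    "strain_c": 0.0})
```

setting the abundance of a weakly detected genome to zero before normalisation is what keeps absent strains from receiving a small spurious share of the relative abundances.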
our aim was to validate the strainflair ability to identify and quantify strains given sequencing data from a mixture of several strains of uneven abundances, and with one of them absent from the index. reference variation graph we selected complete genomes of escherichia coli, a predominant aerobic bacterium in the gut micro- biota (tenaillon et al., ), and a species known for its phenotypic diversity (pathogenicity, antibiotics resistance) mostly resulting from its high genomic variability (dobrindt, ). eight strains of e. coli were selected for this experiment from the ncbi . seven were used to construct a variation graph (e. coli iai , o :h str. c- , str. k- substr. mg , se , o :h str. santai, o :h str. sakai, o str. rm ), and one was used as an unknown strain in a strains mixture (e. coli bl -de ). mixtures and sequencing simulations our aim was to simulate the co-presence of several e. coli strains. two simulations with sequencing errors were conducted in order to highlight the detection and quantification of strains in a mixture. for each one, we tested our approach with various read coverage, as described below. we simulated the sequencing of three strains to mimic complex single species composition in metagenomic samples. one of the strain was in equal abundance of one of the two others, potentially making it more difficult to distinguish, or in lower abundance, potentially making it more difficult to detect at all. the first simulation was a mixture composed of three strains contributing in the reference graph: e. coli o :h c- , iai , and k- mg . the second simulation was a mixture composed of three strains: e. coli o :h c- , iai , and bl -de . the later being absent from the reference variation graph thus simulating a new strain to be identified and quantified. https://www.ncbi.nlm.nih.gov/genome/?term=txid [orgn] / .cc-by . 
international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . /

For both simulations, short sequencing reads of bp were simulated using vg sim from vg toolkit with a probability of errors set to . %: , reads for E. coli O :H C- (representing ≈ . x) and , reads for E. coli IAI (representing ≈ . x). For both simulations, various quantities of reads were generated for K- MG or BL -DE : , , , , , , , , , , , or , reads, representing approximately . x, x, . x, . x, . x, . x, and . x respectively for these two strains.

Strain-level abundances

As explained in Methods, we computed the strain-level abundances using the specific gene-level abundance table obtained by mapping the simulated reads onto the variation graph. We compared our results to the expected simulated relative abundances.

Table . Reference strain relative abundances, expected and computed by StrainFLAIR or Kraken, for each simulated experiment with variable coverage of the K- MG strain. Rows give, for each number of K- MG reads, the expected, StrainFLAIR, and Kraken values; columns are O :H , IAI , K- , Sakai, SE , Santai, and RM . Best results are shown in bold. Complete results are presented in Section S . of the Supplementary Materials.

Simulation : mixtures with K- MG , present in the reference graph

StrainFLAIR successfully estimated the relative abundances of the three strains present in the mixture (Table ): the sum of squared errors between the estimation given by our tool and the expected relative abundance was between and for all the experiments.
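The comparison used throughout this section relies on two small computations: converting read counts into length-normalized relative abundances (as done for the Kraken read counts) and measuring the sum of squared errors against the expected relative abundances. The sketch below is an illustration only; the function names and the toy numbers are hypothetical, and it assumes abundances are expressed as fractions summing to 1, which the text does not specify.

```python
def length_normalized_relative_abundances(read_counts, genome_lengths):
    """Divide each read count by its genome length, then rescale to sum to 1."""
    per_base = {s: read_counts[s] / genome_lengths[s] for s in read_counts}
    total = sum(per_base.values())
    if total == 0:
        return {s: 0.0 for s in per_base}
    return {s: v / total for s, v in per_base.items()}


def sum_of_squared_errors(expected, estimated):
    """Sum of squared differences; a strain missing on one side counts as 0."""
    strains = set(expected) | set(estimated)
    return sum((expected.get(s, 0.0) - estimated.get(s, 0.0)) ** 2
               for s in strains)


# Toy values, deliberately unrealistic and not taken from the paper.
counts = {"strain_A": 300, "strain_B": 100, "strain_C": 0}
lengths = {"strain_A": 3000, "strain_B": 1000, "strain_C": 2000}
estimated = length_normalized_relative_abundances(counts, lengths)
expected = {"strain_A": 0.6, "strain_B": 0.4}
sse = sum_of_squared_errors(expected, estimated)
print(estimated, sse)
```

Treating a strain that is missing from one side as having zero abundance means that abundance assigned to strains absent from the mixture is penalized in the same way as errors on strains that are present.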
However, it did not detect the very low-abundance strain in the mixture with , simulated reads for K- MG (coverage of ≈ . x). With our methodology, the threshold on the proportion of detected genes (see Methods) sets the relative abundance of likely absent strains to zero. This reduces both the underestimation of the relative abundances of the present strains and the overestimation of the absent strains.

In comparison, Kraken did not provide this resolution. Applied to our simulated mixtures, Kraken was slightly better for K- MG abundance estimation, but it overestimated the relative abundance of IAI and underestimated that of O , leading to an overall higher sum of squared errors (between and ) compared to the expected abundances. Moreover, it assigned non-zero relative abundances to all seven reference strains, whereas four of them were absent from the mixture. This was expected, as some reads (from intergenic regions, for example) can by chance be similar to regions of genes from absent strains.

Simulation : mixtures with BL -DE , absent from the reference graph

Here, BL -DE was considered an unknown strain, not contributing to the variation graph. The closest strain to BL -DE in the graph, according to FastANI (Jain et al., ), was K- MG ( . % identity, see Supplementary Materials, Section S . ). Thus, we expected to find the signal of BL -DE through the results on K- MG .

As with the K- MG mixtures, StrainFLAIR successfully estimated the relative abundances of the two known strains present in the mixture (Table ): the sum of squared errors between the estimation given by our tool and the expected relative abundance was between and for all the experiments. It also gave close estimations for BL -DE , labelled as K- . Again, it did not detect the very low-abundance strain in the mixtures with , , , , and , simulated reads for BL -DE .
Also similarly to the K- MG mixture experiments, Kraken overestimated the relative abundance of IAI and underestimated that of O (sum of squared errors between and ), even less precisely than in the previous experiment. With sufficient coverage (here from . x for BL -DE ), StrainFLAIR was closer to the expected values than Kraken for all the reference strains.

Table . Reference strain relative abundances, expected and computed by StrainFLAIR or Kraken, for each simulated experiment with variable coverage of the BL -DE strain, absent from the reference variation graph. Rows give, for each number of BL -DE reads, the expected, StrainFLAIR, and Kraken values; columns are O :H , IAI , K- , Sakai, SE , Santai, and RM . BL -DE strain expected abundances are given in parentheses in the K- column. Best results are shown in bold. Complete results are presented in Section S . of the Supplementary Materials.

Interestingly, the proportion of detected specific genes for each strain (Fig. ) seems to highlight a pattern that makes it possible to distinguish present strains, absent strains, and likely new strains close to a reference in the graph. In the experiments with enough coverage (from , simulated reads for BL -DE ), three groups of proportions could be observed: a proportion of almost % (O :H and IAI : strains present in the mixtures and in the reference graph), a proportion under - % (Sakai, SE , Santai, and RM : strains absent from the mixtures), and an in-between proportion of around - % for K- MG (closest strain to BL -DE ).
It was expected that an absent strain would have some specific genes detected, as StrainFLAIR considers a gene detected as soon as a single read maps on it. However, all absent strains had a proportion of around %, except K- MG , whose proportion was twice as high. Together with the non-null abundance estimated for the reference K- MG , this suggests the presence of a new strain whose genome is highly similar to K- MG .

Validation on a real dataset

We used a mock dataset available on the EBI-ENA repository under accession number PRJEB in order to validate our method on real sequencing data from samples composed of various species and strains. The mock dataset is composed of strains of bacterial species for which complete genomes or sets of contigs are available, including plasmids. Among the species, two were each represented by two different strains. Three mixes had been generated from the mock, and we used “Mix A” in the following results. Even though out of strains were absent from this mix, we indexed the full set of genomes. This was done in order to mimic a typical StrainFLAIR use case in which the queried data is mainly unknown and the reference graph contains species or strains that do not exist in the queried data. The metagenomic sample was sequenced using Illumina HiSeq technology and resulted in , , short paired-end reads.

We compared our results to the expected abundances of each strain in the sample, defined as the theoretical experimental DNA concentration proportion. As such, it has to be noted that potential contamination and/or experimental bias could have occurred and affected the expected abundances.

Strain detection

Among the strains used in the reference variation graph, StrainFLAIR detected strains. All of these strains were indeed sequenced in Mix A. Hence, StrainFLAIR produced no false positives. Of the strains considered absent by StrainFLAIR, were not present in the sample (true negatives) and should have been detected (false negatives).
However, the term false negative has to be softened, as the ground truth remains uncertain. All of those undetected strains had a theoretical abundance below . %.

Figure . Proportion of detected specific genes for each simulated experiment with variable coverage of the BL -DE strain, absent from the reference graph.

More precisely, among the strains undetected by StrainFLAIR, had some detected genes, but below the % threshold. In this case, by default, StrainFLAIR discards these strains. Finally, only one of the undetected strains (Desulfovibrio desulfuricans ND ) should theoretically have been detected (even if its expected coverage was below . %), but no specific gene was identified. Considering that StrainFLAIR uses a permissive definition of a detected gene (at least one read maps on the gene), having strictly no specific genes detected for Desulfovibrio desulfuricans ND suggests that this strain might in fact be absent from Mix A. This is also supported by the result from Kraken, which estimated a relative abundance of ≈ e− , almost times lower than the theoretical result. As in the simulated dataset validation, Kraken assigned non-null abundances to all the references and thus could not be used to conclude definitively on the presence or absence of strains in the sample.

Strain relative abundances

For the estimated relative abundances, StrainFLAIR gave results closer to those of the state-of-the-art tool Kraken than to the experimental values (Fig. ). The sum of squared errors between StrainFLAIR and Kraken was around .
StrainFLAIR and Kraken gave similar results compared to the experimental values, with sums of squared errors of around and , respectively.

Interestingly, Thermotoga petrophila RKU- is the only case where the results from StrainFLAIR and Kraken differ greatly, with, in addition, the theoretical abundance lying in between. Moreover, Thermotoga sp. RQ is the strain expected to be absent to which Kraken assigns the highest relative abundance among the expected absent strains, and the only one exceeding the relative abundances of two present strains. Considering the previous results on the simulated mixtures and the fact that Thermotoga petrophila RKU- and Thermotoga sp. RQ are close species (FastANI around . %), this could be an additional indicator of how tools like Kraken can be misled by species or strains that are too close.

Figure . Experimental relative abundance compared to relative abundance as computed by StrainFLAIR and Kraken. A selection of relevant results is shown here; see Supplementary Materials (Section S . ) for the complete results. (a) represents a case where StrainFLAIR and Kraken give similar results to the experimental value ( cases out of ). (b) represents a case where StrainFLAIR and Kraken give similar results, but lower than the experimental value ( cases out of ). (c) represents a case where StrainFLAIR and Kraken give similar results, but greater than the experimental value ( cases out of ). (d, e, f, g) represent the two species represented by two strains each. (h, i) represent two atypical cases.
In the sample, the species Methanococcus maripaludis was represented by two strains (S and C ), as was the species Shewanella baltica (OS and OS ). StrainFLAIR successfully distinguished and estimated the relative abundances of each strain of these two species. In this particular situation, and contrary to the results on E. coli strains, Kraken was also able to correctly estimate the abundances.

Discussion

Recent advances in sequencing technologies have provided large reference genome resources. Representation and integration of those multiple genomes, which are often highly similar, are under active development and have led to genome-graph-based tools. Integrating multiple genomes from the same species is particularly interesting as it provides new opportunities to characterize strains, a key resolution that, for instance, opens the field of precision medicine (Albanese and Donati, ; Marchesi et al., ).

In this context, we developed StrainFLAIR, a new computational approach for strain-level profiling of metagenomic samples, using variation graphs to represent all reference genomes. Our intention was, on the one hand, to test whether indexing highly similar genomes in a graph makes it possible to characterize queried samples at the strain level and, on the other hand, to provide an end-user tool able to perform the indexing of genomes and the querying of reads, including the analysis of mapping results. The method exploits state-of-the-art tools in addition to novel algorithmic and statistical solutions.
By indexing microbial species and/or strains in a graph, it enables the identification and quantification of strains from a sequenced sample mapped onto this graph.

We have demonstrated on simulated and real datasets the ability of our method to identify and correctly estimate the abundance of microbial strains in metagenomic samples. In addition, StrainFLAIR was able to highlight the presence, and to estimate the relative abundance, of a strain similar to existing references but absent from them. We also showed that StrainFLAIR tended to set the predicted abundance of low-abundance strains to zero, while a tool like Kraken was able to detect them. As a result, StrainFLAIR seems to lose the ability to detect very low-abundance strains. However, in our simulations, this situation corresponded to coverages of . x or less, hence simulating a strain for which not all genomic content was present. Ultimately, it might be more relevant to define such a strain as absent. Overall, there is a need to distinguish between low-abundance strains, insufficient sequencing depth, and reads from intergenic regions or other genes randomly matching genes. In this regard, StrainFLAIR integrates a threshold on the proportion of specific genes detected, which can be further explored to refine which strain abundances are set to zero. Importantly, results also showed that our graph-based tool made no false positive calls, contrary to the general-purpose tool Kraken, which detected % of the strains that were indexed but absent from the queried reads.

From the validation on real datasets, we showed that StrainFLAIR was still able to correctly estimate the relative abundances in a more complex context mixing both different species and different strains, without being biased by references absent from the sample. Our methodology, which takes into account all mapped reads and imposes a threshold that sets some strain abundances to zero, seems more adequate and closer to what is expected in reality.
Moreover, being able to detect some queried strains as absent is particularly interesting in the metagenomics context. Unlike mock datasets, which are of controlled and known composition, real metagenomic samples come with no prior knowledge. They require the most exhaustive references possible, which therefore include unnecessary genomes, that is, strains absent from the sample. StrainFLAIR is a new step towards taking those unnecessary genomes into account without biasing the downstream analysis.

Measured computation times show that StrainFLAIR can analyse million reads in a few hours. Even if this opens the door to routine analyses of small read sets, new development efforts will have to be made to reduce computation time in order to scale up to very large datasets.

While StrainFLAIR focuses on gene-based profiling of metagenomic samples at the strain level, it opens the way to pangenomic studies. Genome graphs are used to capture all the information on variation or similarity of sequences, which makes them particularly suited to representing the diversity of the gene repertoire and the set of nucleotide variations found between the different genomes of a species. This work highlights the importance of continuing to work on pangenome graph representation. The presence of queried unknown strain(s) is revealed both by reads mapping to non-colored paths and by the amount of nucleotide variation (indels and substitutions). The natural continuation will be the dynamic update of the graph when novel strains are detected in this way. This dynamic updating will also be particularly useful considering the future flow of newly sequenced metagenomes and the development of clinical metagenomics, which will help to quickly and efficiently characterize in silico emerging strains of interest for human health.

Acknowledgments

This work used the GenOuest bioinformatics core facility (https://www.genouest.org).
We acknowledge Mircea Podar for providing the mock dataset in premium access. Finally, we thank Mahendra Mariadassou, Rayan Chikhi, Olivier Jaillon and David Vallenet for all their advice throughout this work.

References

albanese, d. and donati, c. ( ). strain profiling and epidemiology of bacterial species from metagenomic sequencing. nature communications, ( ): – . baaijens, j. a., der roest, b. v., köster, j., stougie, l., and schönhuth, a. ( ). full-length de novo viral quasispecies assembly through variation graph construction. biorxiv, page . ballouz, s., dobin, a., and gillis, j. ( ). is it time to change the reference genome? biorxiv, page . clemente, j. c., ursell, l. k., parfrey, l. w., and knight, r. ( ). the impact of the gut microbiota on human health: an integrative view. dobrindt, u. ( ). (patho-)genomics of escherichia coli. ehrlich, s. d. ( ). metahit: the european union project on metagenomics of the human intestinal tract. in metagenomics of the human body, pages – . springer new york. garrison, e. ( ). ekg/seqwish: alignment to variation graph inducer. https://github.com/ekg/seqwish. garrison, e., novak, a., hickey, g., eizenga, j., dawson, e., jones, w., buske, o., and lin, m. ( ). sequence variation aware references and read mapping with vg : the variation graph toolkit. biorxiv. garrison, e., sirén, j., novak, a. m., hickey, g., eizenga, j. m., dawson, e. t., jones, w., garg, s., markello, c., lin, m. f., paten, b., and durbin, r. ( ). variation graph toolkit improves read mapping by representing genetic variation in the reference. hyatt, d., chen, g.
l., locascio, p. f., land, m. l., larimer, f. w., and hauser, l. j. ( ). prodigal: prokaryotic gene recognition and translation initiation site identification. bmc bioinformatics, : . jain, c., rodriguez-r, l. m., phillippy, a. m., konstantinidis, k. t., and aluru, s. ( ). high throughput ani analysis of k prokaryotic genomes reveals clear species boundaries. nature communications, ( ): – . jovel, j., patterson, j., wang, w., hotte, n., o’keefe, s., mitchel, t., perry, t., kao, d., mason, a. l., madsen, k. l., and wong, g. k. ( ). characterization of the gut microbiome using s or shotgun metagenomics. frontiers in microbiology, (apr): . kim, d., paggi, j. m., park, c., bennett, c., and salzberg, s. l. ( ). graph-based genome alignment and genotyping with hisat and hisat-genotype. nature biotechnology, ( ): – . li, h. ( ). minimap : pairwise alignment for nucleotide sequences. bioinformatics, ( ): – . li, h., feng, x., and chu, c. ( ). the design and construction of reference pangenome graphs with minigraph. genome biology, ( ): . li, j., wang, j., jia, h., cai, x., zhong, h., feng, q., sunagawa, s., arumugam, m., kultima, j. r., prifti, e., nielsen, t., juncker, a. s., manichanh, c., chen, b., zhang, w., levenez, f., wang, j., xu, x., xiao, l., liang, s., zhang, d., zhang, z., chen, w., zhao, h., al-aama, j. y., edris, s., yang, h., wang, j., hansen, t., nielsen, h. b., brunak, s., kristiansen, k., guarner, f., pedersen, o., doré, j., ehrlich, s. d., and bork, p. ( ). an integrated catalog of reference genes in the human gut microbiome. nature biotechnology, ( ): – . li, w. and godzik, a. ( ). cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. bioinformatics, ( ): – . loman, n. j., constantinidou, c., christner, m., rohde, h., chan, j. z.-m., quick, j., weir, j. c., quince, c., smith, g. p., betley, j. r., aepfelbacher, m., and pallen, m. j. ( ). 
a culture-independent sequence-based metagenomics approach to the investigation of an outbreak of shiga-toxigenic escherichia coli o :h . jama, ( ): . marchesi, j. r., adams, d. h., fava, f., hermes, g. d., hirschfield, g. m., hold, g., quraishi, m. n., kinross, j., smidt, h., tuohy, k. m., thomas, l. v., zoetendal, e. g., and hart, a. ( ). the gut microbiota and host health: a new clinical frontier. gut, ( ): – . na, j. c., kim, h., park, h., lecroq, t., léonard, m., mouchard, l., and park, k. ( ). fm-index of alignment: a compressed index for similar strings. theoretical computer science, : – . new, f. n. and brito, i. l. ( ). what is metagenomics teaching us, and what is missed? paten, b., eizenga, j. m., rosen, y. m., novak, a. m., garrison, e., and hickey, g. ( ). superbubbles, ultrabubbles, and cacti. in journal of computational biology, volume , pages – . mary ann liebert inc. qin, j., li, r., raes, j., arumugam, m., burgdorf, k. s., manichanh, c., nielsen, t., pons, n., levenez, f., yamada, t., mende, d. r., li, j., xu, j., li, s., li, d., cao, j., wang, b., liang, h., zheng, h., xie, y., tap, j., lepage, p., bertalan, m., batto, j.-m., hansen, t., le paslier, d., linneberg, a., nielsen, h. b., pelletier, e., renault, p., sicheritz-ponten, t., turner, k., zhu, h., yu, c., li, s., jian, m., zhou, y., li, y., zhang, x., li, s., qin, n., yang, h., wang, j., brunak, s., doré, j., guarner, f., kristiansen, k., pedersen, o., parkhill, j., weissenbach, j., metahit consortium, m., bork, p., ehrlich, s. d., and wang, j. ( ).
a human gut microbial gene catalogue established by metagenomic sequencing. nature, ( ): – . quince, c., walker, a. w., simpson, j. t., loman, n. j., and segata, n. ( ). shotgun metagenomics, from sampling to analysis. rakocevic, g., semenyuk, v., lee, w. p., spencer, j., browning, j., johnson, i. j., arsenijevic, v., nadj, j., ghose, k., suciu, m. c., ji, s. g., demir, g., li, l., toptaş, b., dolgoborodov, a., pollex, b., spulber, i., glotova, i., kómár, p., stachyra, a. l., li, y., popovic, m., källberg, m., jain, a., and kural, d. ( ). fast and accurate genomic analyses using genome graphs. nature genetics, ( ): – . rasko, d. a., rosovitz, m. j., myers, g. s., mongodin, e. f., fricke, w. f., gajer, p., crabtree, j., sebaihia, m., thomson, n. r., chaudhuri, r., henderson, i. r., sperandio, v., and ravel, j. ( ). the pangenome structure of escherichia coli: comparative genomic analysis of e. coli commensal and pathogenic isolates. journal of bacteriology, ( ): – . solé, c., guilly, s., da silva, k., llopis, m., le-chatelier, e., huelin, p., carol, m., moreira, r., fabrellas, n., de prada, g., napoleone, l., graupera, i., pose, e., juanola, a., borruel, n., berland, m., toapanta, d., casellas, f., guarner, f., doré, j., solà, e., ehrlich, s. d., and ginès, p. ( ). alterations in gut microbiome in cirrhosis as assessed by quantitative metagenomics: relationship with acute-on-chronic liver failure and prognosis. gastroenterology, ( ): – .e . stewart, e. j. ( ). growing unculturable bacteria. sunagawa, s., coelho, l. p., chaffron, s., kultima, j. r., labadie, k., salazar, g., djahanschiri, b., zeller, g., mende, d. r., alberti, a., cornejo-castillo, f. m., costea, p. i., cruaud, c., d’ovidio, f., engelen, s., ferrera, i., gasol, j. m., guidi, l., hildebrand, f., kokoszka, f., lepoivre, c., lima-mendez, g., poulain, j., poulos, b. 
t., royo-llonch, m., sarmento, h., vieira-silva, s., dimier, c., picheral, m., searson, s., kandels-lewis, s., boss, e., follows, m., karp-boss, l., krzic, u., reynaud, e. g., sardet, c., sieracki, m., velayoudon, d., bowler, c., de vargas, c., gorsky, g., grimsley, n., hingamp, p., iudicone, d., jaillon, o., not, f., ogata, h., pesant, s., speich, s., stemmann, l., sullivan, m. b., weissenbach, j., wincker, p., karsenti, e., raes, j., acinas, s. g., and bork, p. ( ). structure and function of the global ocean microbiome. science, ( ). tenaillon, o., skurnik, d., picard, b., and denamur, e. ( ). the population genetics of commensal escherichia coli. thorpe, h. a., bayliss, s. c., hurst, l. d., and feil, e. j. ( ). comparative analyses of selection operating on nontranslated intergenic regions of diverse bacterial species. genetics, ( ): – . vieira-silva, s., falony, g., belda, e., nielsen, t., aron-wisnewsky, j., chakaroun, r., forslund, s. k., assmann, k., valles-colomer, m., nguyen, t. t. d., proost, s., prifti, e., tremaroli, v., pons, n., le chatelier, e., andreelli, f., bastard, j. p., coelho, l. p., galleron, n., hansen, t. h., hulot, j. s., lewinter, c., pedersen, h. k., quinquis, b., rouault, c., roume, h., salem, j. e., søndertoft, n. b., touch, s., alves, r., amouyal, c., galijatovic, e. a. a., barthelemy, o., batisse, j. p., berland, m., bittar, r., blottière, h., bosquet, f., boubrit, r., bourron, o., camus, m., cassuto, d., ciangura, c., collet, j. p., dao, m. c., debedat, j., djebbar, m., doré, a., engelbrechtsen, l., fellahi, s., fromentin, s., giral, p., graine, m., hartemann, a., hartmann, b., helft, g., hercberg, s., hornbak, m., isnard, r., jaqueminet, s., jørgensen, n. r., julienne, h., justesen, j., kammer, j., kerneis, m., khemis, j., krarup, n., kuhn, m., lampuré, a., lejard, v., levenez, f., lucas-martini, l., massey, r., maziers, n., medina-stamminger, j., moitinho-silva, l., montalescot, g., moutel, s., le pavin, l. 
p., poitou-bernert, c., pousset, f., pouzoulet, l., schmidt, s., silvain, j., svendstrup, m., swartz, t., vanduyvenboden, t., vatier, c., verger, e., walther, s., dumas, m. e., ehrlich, s. d., galan, p., gøtze, j. p., hansen, t., holst, j. j., køber, l., letunic, i., nielsen, j., oppert, j. m., stumvoll, m., vestergaard, h., zucker, j. d., bork, p., pedersen, o., bäckhed, f., clément, k., and raes, j. ( ). statin therapy is associated with lower prevalence of gut microbiota dysbiosis. nature, ( ): – . wood, d. e., lu, j., and langmead, b. ( ). improved metagenomic analysis with kraken . genome biology, ( ): .

S Supplementary Materials

S . Third-party tools usage and rationale

We present here the motivations for, and the precise usage of, the third-party tools employed in StrainFLAIR.

S . . Graph construction

vg toolkit makes it possible to modify the graph, including through a normalization step. Normalization consists of deleting redundant nodes (nodes containing the same sub-sequence and having the same parent and child nodes), removing edges that do not introduce new paths, and merging nodes separated by only one edge. For each cluster, if the colored paths of the corresponding graph still describe their respective input sequences, the graph is normalized.

After the concatenation of all computed graphs (one for each cluster), the final single variation graph is indexed using vg toolkit. Indexing a graph allows fast querying of the graph when mapping reads.
Indexing uses two file formats: XG, a succinct graph index that provides a static index of the nodes, edges and paths of a variation graph, and GCSA, a generalization of the FM-index to directed acyclic graphs. A snarls file is also generated, describing snarls (a generalization of the superbubble concept (Paten et al., )) in the variation graph and similarly allowing faster querying.

S . . Mapping reads

vg toolkit offers two sequence-to-graph mappers. The first one, vg map, outputs one or several final paths for each alignment. However, in the case of several alignments with equal mapping scores, only one is chosen, at random. In order to get more exhaustive and accurate results, StrainFLAIR uses vg mpmap to map reads on the variation graph. The mapping results are given in GAMP format, then converted into JSON format with vg toolkit, describing, for each read, the nodes of the graph traversed by the alignment.

S . Gene-level output by StrainFLAIR

Here we present the exhaustive description of the information provided by StrainFLAIR at the gene level (before strain-level computations). For each colored path, StrainFLAIR provides the following items:

• The corresponding gene identifier.
• For each reference genome, the number of copies of the gene. Since each unique version of a gene is represented once in the graph, whereas it can exist in several copies in the genome (duplicate genes), the counts and abundances computed correspond to the sum of those copies. Keeping track of the number of copies is important to normalize the counts.
• The cluster identifier to which the colored path belongs.
• For unique mapped reads: their raw number and their number normalized by the sequence length (see section Querying variation graphs in Methods).
• For unique plus multiple mapped reads: their raw number and their number normalized by the sequence length (see section Querying variation graphs in Methods).
• The mean abundance of the nodes composing the path, as defined in the manuscript.
• The mean abundance without the nodes of the path never covered by a read, as defined in the manuscript.
• The ratio of covered nodes, as defined in the manuscript.

S . Abundance metrics validation

The output of StrainFLAIR provides several metrics to estimate the abundance of the genes detected in the sample. For validation, we used a combination of a LASSO (least absolute shrinkage and selection operator) model and a linear model on the simulated dataset to estimate the abundances at the strain level, as the abundance of a gene is a linear combination of the abundances of the strains it belongs to. As such, we expect no intercept value for those models and have forced the intercept to zero for the following modeling.

First, a LASSO model was used to perform strain selection. The response variable of the model was the presence or absence of the genes according to the selected metric, while the strains, described by their gene content (number of copies), were the predictors. Then, a linear model was constructed with the raw selected metric as the response variable and only the strains selected by the LASSO model as the predictors. The estimates of the strains' relative abundances were thus the coefficients of the linear model associated with the strains, transformed into relative values.

For each metric, the sum of squared errors between the real relative abundances and the estimated relative abundances from the linear model was computed. The best metric was then defined as the one minimizing this sum of squared errors.

For the mixtures containing E.
coli k- mg , the three expected strains were selected and thus detected using lasso, except for the mixture containing only , reads of k- mg (representing . % of the mixture, hence negligible). for all the mixtures, the best metric was the mean abundance computed from the node abundances, taking into account the multiple mapped reads.

for the mixtures containing e. coli bl -de , bl -de being absent from the reference but very close to k- mg , we expected to get some detection of k- in the results. the three expected strains were selected and thus detected using lasso, except for the mixture containing only , reads of bl -de (representing . % of the mixture, hence negligible). for the mixtures at , , , , and , reads of bl -de , the best metric was the mean abundance computed from the node abundances without the abundances at zero, taking into account the multiple mapped reads. for the others, the best metric was the mean abundance computed from the node abundances (including the abundances at zero), taking into account the multiple mapped reads.

this approach using linear models was particularly appropriate for this situation, where the reference variation graph and the sample contained a small number of strains and thus a small number of predictors for the model. however, it can hardly be transposed to a whole metagenomic sample with various species and strains, which would lead to too many predictors and would probably confuse the heuristics behind the models. this was confirmed by applying the same methodology to the mock dataset, which led to abundance estimates hardly comparable to the expected ones. compared to kraken results, the sum of squared errors of our methodology was approximately whereas for the results with the lasso model it was around .
nevertheless, those results highlighted the relevance of (i) using a metric taking into account the multiple mapped reads and not only the unique mapped reads, and (ii) using our metric of abundance based on the node abundances rather than raw read counts.

s . performances

our benchmarks were performed on the genouest platform on a machine with xeon e - . ghz with gb of memory and cpus. time results (table s ) are wall-clock times. we provide rough computation times, mainly to show that strainflair can be applied to usual datasets.

columns: dataset; step; items processed; time, disk used (gb), max mem. (gb)
simulated; gene prediction; genomes; m .
simulated; gene clustering; , genes; m .
simulated; graph construction; , clusters; m . .
simulated; graph concatenation; , graphs; m .
simulated; graph indexation; graph; m . .
simulated; mapping reads; , short reads; m . .
simulated; json conversion; gamp file; m . .
simulated; json parsing; json file + gfa file + pickle file; m .
simulated; abundance computing; gene abundances table; m .
mock; gene prediction; genomes; m . .
mock; gene clustering; , genes; m . .
mock; graph construction; , clusters; m . .
mock; graph concatenation; , graphs; m .
mock; graph indexation; graph; m . .
mock; mapping reads; , , short read pairs; m .
mock; json conversion; gamp file; m .
mock; json parsing; json file + gfa file + pickle file; m .
mock; abundance computing; gene abundances table; m .
table s . strainflair performances on simulated and mock datasets.

s . distance between the selected genomes in the simulated experiment

we estimated the distance between the complete genomes of the selected strains using fastani (average nucleotide identity). fastani uses an alignment-free algorithm to estimate the average nucleotide identity between pairs of sequences.

columns: k- ; iai ; o :h ; sakai; se ; santai; bl -de ; rm
k- ; . . . . . . .
iai ; . . . . . . .
o :h ; . . . . . . .
sakai; . . . . . . .
se ; . . . . . . .
santai; . . . . . . .
bl -de ; . . . . . . .
rm ; . . . . . . .
table s . distance between each pair of complete genome sequences from eight strains of e. coli as computed by fastani.

all pairs showed a distance of at least %, highlighting the strong similarities between the strains. as a threshold, we nevertheless considered that beyond %, sequences were too similar to be distinguished, in addition to the effect of sequencing errors. the fastani results showed that none of the pairs exceeded this similarity threshold.

the strain e. coli bl -de was chosen as the unknown strain, while the seven others were used to build the reference pangenome graph. according to the results of fastani, the closest genome to strain bl -de among the present references is strain k- , with a similarity of . %. hence we expected to find evidence of strain k- when analyzing a sample containing the unknown strain bl -de .

s . detailed results from simulated datasets

columns: #reads k- ; method; o :h ; iai ; k- ; sakai; se ; santai; rm
expected . . .
, strainflair . .
kraken . . . . . . .
expected . . .
, strainflair . . .
kraken . . . . . . .
expected . . .
, strainflair . . .
kraken . . . . . . .
expected . . .
, strainflair . . .
kraken . . . . . . .
expected . . .
, strainflair . . .
kraken . . . . . . .
expected . .
, strainflair . . .
kraken . . . . . . .
expected . . .
, strainflair . . .
kraken . . . . . . .
table s . reference strains relative abundances expected and computed by strainflair or kraken for each simulated experiment with variable coverage of the k- mg strain. best results are shown in bold.

table s provides exhaustive results on simulated datasets when all queried strains are indexed in the variation graph. table s provides exhaustive results on simulated datasets when one of the queried strains (bl -de ) is not indexed and is highly similar to strain k- .

columns: #reads bl -de ; method; o :h ; iai ; k- ; sakai; se ; santai; rm
expected . . ( . )
, strainflair . .
kraken . . . . . . .
expected . . ( . )
, strainflair . .
kraken . . . . . . .
expected . . ( . )
, strainflair . .
kraken . . . . . . .
expected . . ( . )
, strainflair . . .
kraken . . . . . . .
expected . . ( . )
, strainflair . . .
kraken . . . . . . .
expected . ( . )
, strainflair . . .
kraken . . . . . . .
expected . . ( . )
, strainflair . . .
kraken . . . . . . .
table s . reference strains relative abundances expected and computed by strainflair or kraken for each simulated experiment with variable coverage of the bl -de strain, absent from the reference graph. bl -de being similar at .
% to k- strain (highest similarity compared to the other references), we expect that reads from bl -de will map to this strain; hence its expected values are given in parentheses, as they correspond to bl -de strain abundances and not k- . best results are shown in bold.

s . detailed results for validation on mock datasets

figure s . experimental relative abundance compared to relative abundance computed by strainflair and kraken .

figure s shows full results obtained on the mock dataset.

benchmarking association analyses of continuous exposures with rna-seq in observational studies

tamar sofer*, nuzulul kurniansyah*, françois aguet, kristin ardlie, peter durda, deborah a. nickerson, joshua d. smith, yongmei liu, sina a. gharib, susan redline, stephen s. rich, jerome i. rotter, kent d.
taylor

division of sleep and circadian disorders, brigham and women’s hospital, boston, ma, usa
departments of medicine and of biostatistics, harvard university, boston, ma, usa
the broad institute of mit and harvard, cambridge, ma, usa
department of pathology and laboratory medicine, larner college of medicine, university of vermont, burlington, vt, usa
department of genome sciences, university of washington, seattle, wa, usa
duke molecular physiology institute, department of medicine, division of cardiology, duke university medical center, durham, nc, usa
computational medicine core, center for lung biology, uw medicine sleep center, department of medicine, university of washington, seattle, wa, usa
center for public health genomics, university of virginia, charlottesville, va, usa
the institute for translational genomics and population sciences, department of pediatrics, the lundquist institute for biomedical innovation at harbor-ucla medical center, torrance, ca, usa

*these authors contributed equally to the work.

correspondence: tamar sofer
email: tsofer@bwh.harvard.edu
longwood ave, boston, ma

biorxiv preprint, this version posted february , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /).

abstract

large rna-seq datasets from observational studies of hundreds to thousands of individuals are becoming available. many popular software packages for analysis of rna-seq data were constructed to study differences in expression signatures in an experimental design with well-defined conditions (exposures).
in contrast, observational studies may have varying levels of confounding of the transcript-exposure associations; further, exposure measures may vary from discrete (exposed, yes/no) to continuous (levels of exposure), with non-normal distributions of exposure. we compare popular software for gene expression analysis (deseq , edger, and limma) as well as linear regression-based analyses for studying the association of continuous exposures with rna-seq. we developed a computation pipeline that includes transformation, filtering, and generation of an empirical null distribution of association p-values, and we apply the pipeline to compute empirical p-values with multiple testing correction. we employ a resampling approach that allows for assessment of false positive detection across methods, power comparison, and the computation of quantile empirical p-values. the results suggest that linear regression methods are substantially faster and control false detections better than the other methods, even when the resampling method is used to compute empirical p-values. we provide the proposed pipeline with fast algorithms in r.

introduction

many studies of phenotypes associated with gene expression from rna-seq consist of small sample sizes (tens of subjects) and are focused on comparisons of transcriptional expression patterns between well-delineated states, such as different experimental conditions, tumor versus non-tumor cells ( ; ), and disease vs non-disease groups ( ). some studies are designed to identify differential expression across hidden, discrete conditions ( ).
epidemiological cohorts have recently utilized stored samples to facilitate the use of rna-seq data in studies of association with subclinical phenotypes such as blood biomarkers, imaging, and other physiological measures, often with continuous measures being used in statistical analyses. high throughput rna sequencing enables broad assaying of a sample’s transcriptome ( ) and has been in increasing use for over a decade ( ). a large variety of analytic and statistical approaches have been developed to address scientific questions such as alternative splicing, differential expression, and more ( ; - ), often building on methods developed for analyses of expression microarrays ( - ); comprehensive reviews are available ( - ). in this work, we are specifically interested in differential expression analysis with continuous exposures, and we assume that count data are already prepared and available to the analyst.

popular software packages for differential expression analysis include the deseq r package ( ), which models the expression counts as following a negative binomial distribution, with shrinkage imposed on both the mean and the dispersion parameters, based on estimates from the entire transcriptome or on user-supplied values. edger ( ) uses a negative binomial model similar to the deseq model for transcript counts, in combination with overdispersion moderation. edger was primarily designed for differential expression analysis between two groups when at least one of the groups has replicated measurements ( ). limma ( ) uses linear models, which are very flexible and can effectively accommodate many study designs and hypotheses. similar to the deseq and edger packages, limma also uses an empirical bayes method to borrow information across transcripts to estimate a global variance parameter that is applied in the computation of the variance parameter of each single transcript. it uses log transformation and weighting, known as the “voom” transformation, in the final linear model that is used for differential expression analysis. we refer to it henceforth as limma-voom. prior to differential expression analysis, library normalization is performed ( ). popular approaches are the tmm (trimmed-means of m-values) normalization ( ), implemented in edger, and the size factors normalization ( ), implemented in deseq .

sleep disordered breathing phenotypes, such as the apnea-hypopnea index (ahi), the number of apnea and hypopnea events per hour of sleep, provide a quantitative assessment of the severity of the disorder, with no clear threshold above which different biological processes occur (although thresholds are used for clinical decision making and health insurance reimbursement). association analysis with continuous exposures poses different challenges than those traditionally encountered. the distribution of such exposures may have strong effects on the association analysis results, regardless of the underlying associations, due to the combination of skewed exposure distributions and the distribution of rna-seq read count data, which are generally over-dispersed with occasional extreme values. as observational study data analyses may include covariates, statistical methods from experimental studies (e.g., exact tests) cannot be applied.

in this manuscript, we compare the deseq , edger, and limma-voom analysis approaches for differential expression analysis with linear regression-based approaches that do not use the empirical bayes approach for estimating variance parameters across the transcriptome. we study the computation of p-values using resampling of phenotype residuals, while preserving the structure of the data. this addresses the limitation of permutation noted by others in the context of differential expression analysis of rna-seq ( ), where permutation may not be tuned to test a specific null hypothesis because in its standard form it “breaks” all relationships between the permuted variable and the rest of the dataset. finally, we study the use of empirical p-values that tune the original p-values based on the residual resampling scheme. throughout, we use a dataset with sleep disordered breathing phenotypes and rna-seq from the multi-ethnic study of atherosclerosis as a case study. we demonstrate the statistical implications of performing association analysis of rna-seq with continuous, non-normal exposures, compare analysis methods, and develop recommendations.

methods

the multi-ethnic study of atherosclerosis (mesa)

mesa is a longitudinal cohort study, established in , that prospectively collected risk factors for development of subclinical and clinical cardiovascular disease among participants in six field centers across the united states (baltimore city and baltimore county, md; chicago, il; forsyth county, nc; los angeles county, ca; northern manhattan and the bronx, ny; and st. paul, mn). the cohort has been studied every few years. the present analysis considers n = individuals who participated in a sleep ancillary study performed shortly following the participants' exam during - ( ; ), with rna-seq measured via the trans-omics in precision medicine (topmed) program. here, we used rna-seq data with rna extracted from whole blood drawn in exam ( - ). sleep data were collected using standardized full in-home level- polysomnography (compumedics somte systems, abbotsville, australia, au ), as described before ( ). of the participants in the current analysis, there were african-americans (aa), european-americans (ea) and hispanic-americans (ha). rna sequencing in mesa is briefly described in the supplementary materials.

sleep disordered breathing measures

as examples of continuous exposures from population-based studies, we took three sleep disordered breathing measures: (i) the apnea-hypopnea index (ahi), defined as the number of apnea (breathing cessation) and hypopnea (at least % reduction of breath volume, accompanied by % or higher reduction of oxyhemoglobin saturation compared to the baseline saturation) events per hour of sleep; (ii) minimum oxyhemoglobin saturation during sleep (mino ); and (iii) average oxyhemoglobin saturation during sleep (avgo ). we chose these traits because they are clinically relevant, often used in sleep research studies, and represent exposures that may alter gene expression (via hypoxemia and sympathetic activation). the ahi had the least skewed distribution of the considered phenotypes, and avgo had the longest “tail” of small values in the residual distribution. residuals were obtained by regressing the sleep measures on age, sex, body mass index (bmi), study center, and self-reported race/ethnic group.
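as a minimal sketch of the residual step described above, the python code below regresses one exposure on a set of covariates by ordinary least squares and returns the residuals. it is not the pipeline of the paper (which is provided in r); the function names `solve_linear_system` and `covariate_residuals`, the toy values, and the assumption that categorical covariates (sex, study center, race/ethnic group) have already been dummy-coded into numeric columns are illustrative assumptions only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def covariate_residuals(exposure, covariates):
    """Residuals of an exposure after OLS regression on covariates.

    exposure:   one value per individual.
    covariates: one row of numeric covariates per individual
                (hypothetical layout; an intercept column is prepended here).
    """
    design = [[1.0] + [float(v) for v in row] for row in covariates]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * y for row, y in zip(design, exposure))
           for j in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    return [y - f for y, f in zip(exposure, fitted)]


# Toy, made-up values: columns are age (years) and bmi (kg/m^2).
toy_covariates = [[52, 27.1], [61, 31.4], [58, 24.9],
                  [70, 29.3], [66, 35.2], [49, 22.8]]
toy_ahi = [4.2, 18.5, 3.1, 12.0, 35.7, 1.9]
toy_residuals = covariate_residuals(toy_ahi, toy_covariates)
print([round(r, 2) for r in toy_residuals])
```

by construction, residuals from a model with an intercept sum to zero and are uncorrelated with each covariate, so any skewness that remains in them reflects the exposure itself rather than the adjustment variables. solving the normal equations directly is a choice made for brevity; a qr-based solver would be preferable for ill-conditioned designs.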
compared tests of associations between exposure and transcripts

we compared the standard packages deseq , edger, limma, and linear regression-based approaches, in which we always applied log transformation on the transcript counts and then applied linear regression. because some of the observed transcript count values are zero, which cannot be log transformed, we compared a few approaches for replacing zero values. for a given transcript $j$ with counts $t_{1j}, \dots, t_{nj}$, denote the minimum observed transcript level that is higher than zero by $m_j = \min\{t_{ij} : t_{ij} > 0,\ i = 1, \dots, n\}$. we compare the following approaches, applied on each transcript $j$, $j = 1, \dots, J$, separately:
a1. subhalfmin: replace zero values with $m_j/2$.
a2. addhalfmin: replace all values $t_{ij}$ by $t_{ij} + m_j/2$.
a3. addhalf: replace all values $t_{ij}$ by $t_{ij} + 1/2$.

conceptual framework for studying analysis approaches

to study the performance of various analysis approaches, we performed simulation studies. the first simulation study was used to assess type i error across methods when using output p-values, and when using “empirical p-values”, which are p-values that account for the true distribution of the p-values under the null and are described later. the second simulation study was used to assess power in transcriptome-wide analysis settings, when using methods that control the type i error according to the first simulation study. in addition, we performed a third simulation study (supplementary materials) to assess power for testing individual transcripts according to various distributional characteristics of transcript counts.
the goal was to identify approaches for filtering transcripts for association analysis that will optimize power. all simulations used a “residual permutation” (below). the reported criteria for declaring differentially expressed transcripts were false discovery rate (fdr) controlling p-values < . based on the benjamini-hochberg (bh) procedure and on the local fdr procedure implemented in the qvalue r package, family-wise error rate (fwer) controlling p-values < . based on the holm procedure, and an arbitrary threshold of p-value < - .

residual permutation approach for simulations and for empirical p-value computation

to generate realistic simulation studies in which (a) the data structure, including the exposure, covariates, and outcome distributions, and (b) their relationships, aside from the exposure-outcome association, are the same as in the real data, we used a residual permutation approach. we regressed each sleep exposure of interest $y$ on the covariates $X$ and estimated their effect $\hat{\beta}$. we then obtained residuals, defined as $r = y - X\hat{\beta}$. to study type i error, we permuted these residuals at random to obtain $r^{(\mathrm{perm})}$, and generated a sleep exposure unassociated with any of the rna-seq measures by $y^{(\mathrm{perm})} = X\hat{\beta} + r^{(\mathrm{perm})}$. we repeated this procedure times for evaluating type i error control. we generated simulated data under four power simulations with a similar approach, with the difference that we forced a specific correlation value between the simulated sleep exposure and a specific transcript.
to this end, for a given transcript $j$ measured on individuals $i = 1, \dots, n$, we computed the rank of each individual: $\mathrm{rank}(t_{1j}), \dots, \mathrm{rank}(t_{nj})$. to set a correlation $\rho$ between the simulated exposure and transcript $j$, we sampled $\rho \times n$ (rounded) indices from $1, \dots, n$, corresponding to $\rho \times n$ individuals whose ranks in the permuted residual values, now denoted by $r^{(\mathrm{perm})}$, we forced to be the same as their ranks in the transcript values (note that the transcript values are never changed). for the rest of the individuals, the permuted residuals are completely random. when multiple individuals have the same transcript counts (i.e., their ranks are tied), we randomly assign their ranks. for example, if people have zero counts for a given transcript, each of these individuals will be equally likely to have the rank of , , …, or . the code for generating this residual permutation approach is provided in the supplementary information and in a dedicated github repository, https://github.com/nkurniansyah/rna-seq_continuous_exposure.

empirical p-values to account for the null distribution of p-values

we used the residual permutation approach, under the null hypothesis, to generate a null distribution of p-values and to compute empirical p-values. when the distribution of p-values under the null hypothesis is unknown, and specifically when it is not uniform, their values are not reliable for hypothesis testing. alternative approaches compute “empirical p-values” with the goal of generating an appropriate p-value distribution, i.e., one in which an empirical p-value $p$ satisfies $\Pr(p \le \alpha \mid H_0) = \alpha$ for a given level $\alpha$ (supplementary materials). for computing empirical p-values, we use a relatively small number of residual permutations (in comparison to the number of permutations used for computing permutation p-values) followed by transcriptome-wide association studies. we use the results of these transcriptome-wide tests under permutation to compute the null distribution of p-values, which is then used to compute the empirical p-values. we compare two types of empirical p-values: quantile empirical p-values, and storey empirical p-values implemented in the qvalue r package ( ). the quantile empirical p-value approach is inspired by previously proposed procedures based on permutation ( ) of phenotypes (rather than residuals). it estimates the null distribution of p-values non-parametrically, and the quantile empirical p-value is the quantile of the raw p-value in this distribution. the storey empirical p-value approach uses the null distribution of the test statistics to identify whether a transcript is likely sampled from the null or from a non-null distribution. both implementations assume that the empirical null distribution is the same for all transcripts. we used residual permutations to compute test statistics and p-values under the null and compared the empirical p-values to standard permutation p-values.

resampling approach for binary exposure phenotypes

we compared the analysis of a continuous exposure to that of a dichotomized variable.
instead of a sleep measure, we used body mass index (bmi), because it is known to have a large impact on gene expression and is therefore a powerful phenotype for such a comparison. bmi was dichotomized to “obese” if bmi ≥ kg/m² and non-obese otherwise. because obesity is binary and the residual permutation approach, as proposed for continuous variables, is therefore not appropriate, we generated a binomial obesity variable based on the probability of obesity given covariates. given a logistic model $\mathrm{logit}\,\Pr(\mathrm{obese}_i = 1 \mid x_i) = x_i^\top \beta$, we estimated the covariates’ association parameters and obtained estimated probabilities of obesity for each person $i = 1, \dots, n$ by $\hat{p}(\mathrm{obese}_i = 1 \mid x_i) = \mathrm{expit}(x_i^\top \hat{\beta})$. based on these estimated outcome probabilities, we sampled random obesity status as binomial variables.

results

mesa participant characteristics are provided in the supplementary material, table s . the distributions of the raw phenotypes ahi, mino , and avgo , and of their residuals after regression on covariates, are provided in figure , demonstrating their high non-normality. simulations were performed after normalizing the data so that each library has the same size (prior to filtering), which we set to the median observed value (i.e., median normalization) in the raw reads, or , , . results for some of the settings in simulation study under tmm and size factor normalizations are provided in the supplementary materials.

simulation study : type i error analysis

after normalization, we applied filters to remove lowly expressed transcripts. there were , transcripts.
after applying filters requiring that (a) the maximum read count is > and that (b) the proportion of individuals with zero counts for a transcript across the sample is not higher than . (see supplementary materials for more information on filters), , transcripts were available for the simulation study. we used residual permutation to generate simulated sdb phenotypes that are not associated with the transcripts, but maintain the same correlation structure with the transcripts and covariates. we generated datasets with simulated sdb phenotypes, and performed analyses. complete results showing the average number of false positive detections based on the existing packages limma, edger, and deseq , as well as the three linear regression analyses described here, are provided in supplementary figures s -s . these results include comparisons of raw p-values, the proposed quantile empirical p-values, and the empirical p-values provided in the qvalue r package ( ), for the three sdb phenotypes. we found that the number of false positives varies with the exposure phenotype, with analyses of mino (figure ) generally resulting in more false positive detections than analyses of the ahi, and intermediate numbers for avgo (figures s -s in the supplementary materials). figure compares the average number of falsely discovered transcript associations when using simulated sleep phenotypes mimicking mino , generated with the residual permutation approach, focusing on limma, edger, deseq , and linear regression applied on the log of expression counts with subhalfmin.
for each method, type i error was determined using raw p-values and storey empirical p-values, with significance thresholds based on benjamini-hochberg (bh) fdr, local fdr (qvalue package), and holm family-wise error rate (fwer). empirical p-values usually reduced the number of false detections, and the method in the qvalue package was usually more conservative than the quantile-based empirical p-values method. compared to linear regression-based approaches, deseq2, edger and limma-voom had many false detections when using the raw p-values, even after applying multiple testing corrections. the three linear regression-based methods described here were quite similar, with the addhalf approach often resulting in slightly more false detections. based on these results, we chose to move forward to the next set of simulations with linear regression, using subhalfmin for handling of zero counts.

simulation study : power analysis

we performed simulations that mimic transcriptome-wide analysis to assess power. based on simulations comparing power by transcript distributional characteristics (see supplementary materials), we only considered , transcripts for which no more than % of the sample had zero counts. we chose two transcripts, and for each of these and each of the sleep phenotypes, we performed simulations in which we used the residual permutation approach to generate an association between the sleep phenotype and the transcript with a specified correlation $\rho$. we performed transcriptome-wide association analysis using deseq2, edger, and linear regression with subhalfmin transformation (limma-voom was not used, given its high rate of false
positive detections in some of the settings in simulation study ). for power, we always used empirical p-values (both types) and determined whether the specific transcript of interest passed the significance threshold based on fdr-adjusted ( ) empirical p-value < . . power was defined as the proportion of the simulations in which the association was significant, and was consistently higher for the linear regression-based approach than for deseq2 or edger. for linear regression, the quantile empirical p-values performed essentially the same as storey’s empirical p-values, while storey’s empirical p-values resulted in substantially higher statistical power when using deseq2 and edger. we illustrate power comparisons in figure using storey’s empirical p-values. power comparisons using quantile empirical p-values are provided in the supplementary materials figure s .

proposed analysis approach

based on the above simulation studies, we developed an analytic pipeline as depicted in figure : (a) the raw read counts are normalized; (b) filters are applied to remove lowly expressed transcripts and those for which the statistical power is low, as determined by simulations; (c) the addhalfmin transformation is applied for each individual separately, then a log transformation is applied on all transcripts; (d) association analysis is performed using linear regression to compute effect sizes and p-values; (e) permutations are computed times on exposure residuals after regressing on covariates, to generate simulated traits that maintain the data structure; (f) each of the vectors of simulated traits is analyzed using the same approach as the raw trait, generating p-values; (g) p-values from the analysis of the simulated traits are combined to generate an empirical null distribution of p-values, which is used to generate
empirical p-values for the raw trait using the qvalue package; and (h) multiple testing correction is applied on the empirical p-values.

comparison of analysis of continuous bmi with analysis of dichotomous obesity status

we compared the differential expression of transcripts in analyses of bmi and obesity. the residual permutation procedure was used and quantile empirical p-values were generated for both analyses. a total of mesa individuals had a bmi measure available and, for the bmi analysis, transcripts were required to have at least % non-zero counts. for obesity, several non-zero transcript thresholds were examined: %, %, and %. the results were similar for all thresholds: many more transcript associations were identified ( vs. ) with continuous bmi than with the dichotomous trait (supplementary information figure s ).

computing time comparison

the compute time for a transcriptome-wide association study was obtained for analyses using deseq2, edger, and our linear regression implementation. using our linear regression implementation on a single core, a single transcriptome-wide association study applied on ~ k transcripts and n= individuals took less than a minute; when transcriptome-wide association studies applied to residual permutations were included to compute empirical p-values, the time reached minutes, and the maximum memory used was . gb. in comparison, deseq2 took . minutes and edger took . minutes for a single transcriptome-wide association study. the maximum memory used for deseq2 and edger was similar at . gb.

r package
code for implementing the proposed procedure and for a shiny app is provided in the github repository https://github.com/nkurniansyah/olivia. the code also provides a test of multiple exposure variables at the same time, which applies the multivariate wald test, and an efficient implementation of a permutation test for a single transcript, rather than a transcriptome-wide analysis. the repository also includes the code used for simulations.

data availability

mesa data are available through application to dbgap. phenotypes are available in mesa study accession phs .v .p , and rna-seq data have been deposited and will become available through the topmed-mesa study accession phs .v .p .

discussion

we systematically assessed approaches for studying the association of gene expression, estimated using rna sequencing, with continuous and non-normally distributed exposure phenotypes. we found that linear regression-based analysis performs well for continuous phenotype associations, and is computationally highly efficient. we used a residual permutation approach to study the distribution of p-values under the null of no association between the phenotypes and rna-seq, and used this approach to further study power and to compute empirical p-values. notably, the residual permutation approach allows the dataset to retain the same correlation structures and associations between the phenotypes and covariates, and between the transcripts and covariates, while eliminating the transcript-phenotype associations.
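the residual permutation procedure referred to above (regress the exposure on covariates, permute the residuals, and add them back to the fitted values) can be sketched as follows. this is an illustrative sketch, not the code in the olivia repository: all function names are hypothetical, and ordinary least squares with an intercept is assumed for the regression on covariates.

```python
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fitted_values(exposure, covariates):
    """Least-squares fit of the exposure on the covariates plus an intercept."""
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, exposure)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return [sum(b * v for b, v in zip(beta, r)) for r in design]


def residual_permutation(exposure, covariates, rng):
    """Return one simulated exposure: fitted mean plus shuffled residuals."""
    fitted = fitted_values(exposure, covariates)
    residuals = [y - f for y, f in zip(exposure, fitted)]
    permuted = residuals[:]
    rng.shuffle(permuted)  # breaks any link between residuals and transcripts
    return [f + r for f, r in zip(fitted, permuted)]
```

because the fitted values are left in place, the simulated exposure keeps its relationship with the covariates, while shuffling the residuals removes any remaining association with transcript expression. repeating the call with a seeded `random.Random` instance gives reproducible sets of simulated traits.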
we implemented this approach in an r package and developed an r shiny app, to make our pipeline easily accessible to the research community.

recently, van rooij et al ( ) also performed a benchmarking study comparing analysis approaches for transcriptome-wide analysis of rna-seq in population-based studies, including association testing with continuous phenotypes. while we used statistical methods similar to theirs, we took a different analytical approach. van rooij et al. applied association analysis between a phenotype and transcripts in multiple datasets, and assessed replication between analyses. we, on the other hand, leveraged simulations to generate data under a known association structure. in addition, we were motivated by a specific problem: highly non-normal sleep exposure measures, which often lead to suboptimal control of type i error. thus, it was critical to assess control of false discovery under the null hypothesis. notably, sleep phenotypes are less often available, and to our knowledge there are no other large observational study datasets with both rna-seq measures and similar sdb phenotypes. some of our findings are similar to those of van rooij et al.: they also recommend using linear regression analysis, and they also found that using a continuous phenotype is generally more powerful than dichotomizing it (in agreement with what is known from the statistical literature). similarly, they found that the normalization method had very little effect on the results.
however, they recommend testing all genes, while we recommend filtering out transcripts with at least % zero counts, based on our power simulations. future work is needed to evaluate various filtering criteria, and to develop methods that allow for flexible, non-linear modeling of the association between phenotype and gene expression while remaining computationally efficient enough to allow for permutation analysis.

we propose to compute p-values under the null hypothesis of no association between the transcript and the exposure phenotype by permuting residuals of the exposure phenotype after regressing on covariates, and reconstructing the exposure by summing the permuted residuals with the estimated mean, thus maintaining the overall data structure except for the exposure-outcome association of interest. outside the gene expression literature, others have proposed to permute residuals rather than the outcome. for example, previous permutation methods proposed to permute residuals of the outcome after regressing on covariates ( ), or to permute the residuals of the exposure phenotypes without constructing a new exposure phenotype by summing the permuted residuals with the estimated mean ( ). it will be interesting to perform a more comprehensive study of statistical permutation approaches for rna-seq association analyses, as well as to study them in the context of mixed models.

we recommend using empirical p-values, which require residual permutation and therefore performing multiple transcriptome-wide association analyses instead of one.
considering figures s - s in the supplementary information, one can see that in most settings, linear regression methods do not have many false positive detections even when raw p-values are used. however, we chose to be more conservative by strongly protecting the analysis from false positive detections. importantly, the linear regression analysis with empirical p-values had higher power than the other common approaches (deseq2, edger), indicating simultaneous improvement in controlling false positives and increasing power.

unfortunately, we cannot effectively estimate the fdr in these simulations. fdr is defined as the proportion of false discoveries out of all discovered (significant) associations. in simulation study , none of the transcripts were associated with the outcomes, so that any estimated fdr would be 100%. under the alternative, one could suggest using the number of wrongly discovered associations to estimate the fdr. however, many transcripts are highly correlated with the one simulated to be associated with the exposures, and are therefore associated with the exposure by design; thus the number of transcripts falsely detected as associated with the exposure cannot be easily determined.

the empirical p-values procedure uses p-values from the entire tested transcriptome to compute the empirical null distribution.
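the pooling idea can be sketched as follows. the function name `empirical_pvalues` is hypothetical, and the sketch assumes that the empirical p-value of a transcript is the proportion of pooled null p-values that are less than or equal to its raw p-value; the qvalue package and the quantile-based method may differ in details such as tie handling or small-sample corrections.

```python
from bisect import bisect_right


def empirical_pvalues(observed, null_pvalues):
    """Empirical p-values from a pooled null distribution of p-values.

    `observed` holds the raw p-values of the tested transcripts and
    `null_pvalues` pools the p-values from all analyses of simulated
    (permuted) traits across all transcripts.
    """
    null_sorted = sorted(null_pvalues)
    n_null = len(null_sorted)
    # bisect_right counts the null p-values <= the observed p-value.
    return [bisect_right(null_sorted, p) / n_null for p in observed]
```

sorting the pooled null once and using binary search keeps the cost close to linear in the number of transcripts, which matters when the null pools p-values from many permuted analyses. an observed p-value smaller than every null p-value receives an empirical p-value of zero under this definition, so a lower bound of one over the size of the null may be preferable in practice.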
pooling p-values across the transcriptome in this way encapsulates the assumption that the null distribution of p-values is the same for all transcripts, which is generally a limitation, but has been shown to be often acceptable since it leads to less power rather than to an increased number of false detections ( ; ). an approach that does not require this assumption estimates the null distribution of the p-value for each transcript separately, which is a standard permutation approach. we investigated this issue by comparing the quantile empirical p-values with permutation p-values that use , residual permutations to estimate the null distribution of the p-value of each transcript separately (figure s in the supplementary materials). the two p-value distributions are very similar. therefore, a computationally expensive permutation approach, as well as other approaches proposed by investigators, such as estimating null distributions across sets of transcripts with similar properties ( ; ), are likely unnecessary and not superior to the computationally efficient empirical p-values method. another approach for estimating the null distribution of p-values uses the primary results, without any permutation ( ; ). these approaches also assume that the null p-value distribution is the same across transcripts (i.e., that a shared null distribution exists). given the computationally fast implementation of the transcriptome-wide association study, we believe that using residual permutation is beneficial because it allows for a more precise quantification of the null p-value distribution.
batch effects are important to account for in studies of rna-seq. here, we did not study their effect because it was beyond the scope of our investigation. van rooij et al ( ), in their benchmarking study focusing on replication across cohorts, compared a few approaches for adjusting for technical covariates, including estimating and adjusting for latent confounders ( ). they concluded that including more technical adjustment covariates, including hidden confounders, increases the rate of replication between studies.

to summarize, we highlighted the problem of frequent false positive findings in rna-seq data when studying associations with continuous exposure phenotypes that are highly non-normal. we developed a computationally efficient pipeline to address the false positive detection problem, and studied strategies to optimize statistical power. our approach will be particularly useful for epidemiological studies with rna-seq data that were not designed as disease-focused case-control studies.

acknowledgements

this work was supported by the national heart lung and blood institute grant r hl . mesa and the mesa share projects are conducted and supported by the national heart, lung, and blood institute (nhlbi) in collaboration with mesa investigators. support for mesa is provided by contracts n d , hhsn i, n -hc- , n d , n -hc- , n d , n -hc- , n d , n -hc- , n d , n -hc- , n d , n -hc- , n d , n -hc- , n -hc- , n -hc- , n -hc- , n -hc- , ul -tr- , ul -tr- , ul -tr- . also supported in
part by the national center for advancing translational sciences, ctsi grant ul tr , and the national institute of diabetes and digestive and kidney disease diabetes research center (drc) grant dk to the southern california diabetes endocrinology research center. molecular data for the trans-omics in precision medicine (topmed) program was supported by the national heart, lung and blood institute (nhlbi). rna-seq for “nhlbi topmed: multi-ethnic study of atherosclerosis (mesa)” (phs .v .p ) was performed at the northwest genomics center (hhsn i). core support including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering, was provided by the topmed informatics research center ( r hl- - s ; contract hhsn i). core support including phenotype harmonization, data management, sample-identity qc, and general program coordination was provided by the topmed data coordinating center (r hl- ; u hl- ; contract hhsn i). we gratefully acknowledge the studies and participants who provided biological samples and data for topmed.

author contributions

ts conceptualized and drafted the manuscript and supervised the analysis. nz performed all statistical analysis and data visualization and developed the r package and r shiny app. dn and js performed rna sequencing. fa and ka processed the sequenced rna to generate the mesa rna-seq dataset. pd, yl, ssr, jir, and kdt designed the rna-seq study in mesa. sr designed and supervised the mesa sleep ancillary study. all authors critically reviewed and approved the manuscript.
references

zhai w, yao xd, xu yf, peng b, zhang hm, et al. transcriptome profiling of prostate tumor and matched normal samples by rna-seq. eur rev med pharmacol sci.
peng l, bian xw, li dk, xu c, wang gm, et al. large-scale rna-seq transcriptome analysis of cancers and normal tissue controls across tcga cancer types. sci rep.
kim wj, lim jh, lee js, lee sd, kim jh, oh ym. comprehensive analysis of transcriptome sequencing data in the lung tissues of copd subjects. int j genomics.
klambauer g, unterthiner t, hochreiter s. dexus: identifying differential expression in rna-seq studies with unknown conditions. nucleic acids res.
auer pl, doerge rw. statistical design and analysis of rna sequencing data. genetics.
mortazavi a, williams ba, mccue k, schaeffer l, wold b. mapping and quantifying mammalian transcriptomes by rna-seq. nat methods.
law cw, alhamdoosh m, su s, dong x, tian l, et al. rna-seq analysis is easy as 1-2-3 with limma, glimma and edger. f1000research.
liu r, holik az, su s, jansz n, chen k, et al. why weight? modelling sample and observational level variability improves power in rna-seq analyses. nucleic acids res.
love mi, huber w, anders s. moderated estimation of fold change and dispersion for rna-seq data with deseq2. genome biol.
pimentel h, bray nl, puente s, melsted p, pachter l. differential analysis of rna-seq incorporating quantification uncertainty. nat methods.
wolf jbw. principles of transcriptome analysis and gene expression quantification: an rna-seq tutorial. molecular ecology resources.
kerr mk, churchill ga. statistical design and the analysis of gene expression microarray data. genetical research.
durbin bp, hardin js, hawkins dm, rocke dm. a variance-stabilizing transformation for gene-expression microarray data. bioinformatics.
mostafavi s, battle a, zhu x, urban ae, levinson d, et al. normalizing rna-sequencing data by modeling hidden covariates with prior knowledge. plos one.
conesa a, madrigal p, tarazona s, gomez-cabrero d, cervera a, et al. a survey of best practices for rna-seq data analysis. genome biol.
costa-silva j, domingues d, lopes fm. rna-seq differential expression analysis: an extended review and a software tool. plos one.
ge sx, son ew, yao r. idep: an integrated web application for differential expression and pathway analysis of rna-seq data. bmc bioinformatics.
hrdlickova r, toloue m, tian b. rna-seq methods for transcriptome analysis. wiley interdisciplinary reviews: rna.
li wv, li jj. modeling and analysis of rna-seq data: a review from a statistical perspective. quantitative biology.
robinson md, mccarthy dj, smyth gk. edger: a bioconductor package for differential expression analysis of digital gene expression data. bioinformatics.
ritchie me, phipson b, wu d, hu y, law cw, et al. limma powers differential expression analyses for rna-sequencing and microarray studies. nucleic acids res.
dillies m-a, rau a, aubert j, hennequet-antier c, jeanmougin m, et al. a comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. briefings in bioinformatics.
robinson md, oshlack a. a scaling normalization method for differential expression analysis of rna-seq data. genome biol.
anders s, huber w. differential expression analysis for sequence count data. nature precedings.
bild de, bluemke da, burke gl, detrano r, diez roux av, et al. multi-ethnic study of atherosclerosis: objectives and design. am j epidemiol.
chen x, wang r, zee p, lutsey pl, javaheri s, et al. racial/ethnic differences in sleep disturbances: the multi-ethnic study of atherosclerosis (mesa). sleep.
storey j, bass a, dabney a, robinson d. qvalue: q-value estimation for false discovery rate control. r package.
van der laan mj, hubbard ae. quantile-function based null distribution in resampling based multiple testing. stat appl genet mol biol.
benjamini y, hochberg y. controlling the false discovery rate: a practical and powerful approach to multiple testing. journal of the royal statistical society: series b.
van rooij j, mandaviya pr, claringbould a, felix jf, van dongen j, et al. evaluation of commonly used analysis strategies for epigenome- and transcriptome-wide association studies through replication of large-scale population studies. genome biology.
anderson mj, legendre p. an empirical comparison of permutation methods for tests of partial regression coefficients in a linear model. journal of statistical computation and simulation.
werft w, benner a. glmperm: a permutation of regressor residuals test for inference in generalized linear models. the r journal.
yang h, churchill g. estimating p-values in small microarray experiments. bioinformatics.
storey jd, tibshirani r. sam thresholding and false discovery rates for detecting differential gene expression in dna microarrays. in the analysis of gene expression data: methods and software, ed. g parmigiani, es garrett, ra irizarry, sl zeger. new york, ny: springer new york.
fan j, chen y, chan hm, tam pkh, ren y. removing intensity effects and identifying significant genes for affymetrix arrays in macrophage migration inhibitory factor-suppressed neuroblastoma cells. proceedings of the national academy of sciences of the united states of america.
van iterson m, van zwet ew, heijmans bt, the bc. controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. genome biology.
efron b. large-scale simultaneous hypothesis testing. journal of the american statistical association.
wang j, zhao q, hastie t, owen ab. confounder adjustment in multiple hypothesis testing. annals of statistics.

figure legends

figure : distributions of the three sleep-disordered breathing exposure phenotypes used as case studies in this manuscript. the left column provides the empirical density functions of the raw phenotypes; the right column provides the empirical density functions of their residuals after regressing on age, sex, bmi, self-reported race/ethnic group, and study center. avgo : average oxyhemoglobin saturation during sleep. mino : minimum oxyhemoglobin saturation during sleep. ahi: apnea hypopnea index.
figure : average number of false positive transcript associations detected by various methods used in simulation study and computed over repetitions. we used the residual permutation approach to mimic the mesa data set with the sleep phenotype mino . the methods reported here are linear regression (applied on log -transformed transcript counts, with zero values replaced with subhalfmin), deseq2, edger, and limma-voom. the left column provides results when using raw p-values, the middle corresponds to use of quantile empirical p-values, and the right corresponds to storey empirical p-values. we report false positive detections as those with benjamini-hochberg (bh) false discovery rate (fdr) adjusted p-value < . , local fdr < . (qvalue package), and holm family-wise error rate (fwer) adjusted p-value < . . error bars reflect the mean ± standard error. in supplementary figures s -s , we provide complete results, including for additional sleep phenotypes: ahi and avgo .

figure : estimated power for detecting a transcript simulated as associated with the three sleep traits when using storey empirical p-values, where an association is determined significant if its bh fdr-adjusted p-value is < . . the transcripts were randomly selected out of available transcripts (after filtering of transcripts with % or higher zero counts across the sample).
we compared linear regression, deseq2, and edger in transcriptome-wide association analysis for each of the sleep phenotypes. for each transcript used in simulations, we show both power and the box plot of its distribution in the sample after median normalization.

figure : analysis pipeline for transcriptome-wide association analysis of continuous exposure phenotypes. the raw data are normalized using library-size normalization, followed by filtering of transcripts, transformation of transcript expression values, then single-transcript testing to obtain raw p-values. in parallel, residual permutation is applied under the null times, and the resulting p-values are used to construct an empirical p-value distribution under the null and to compute empirical p-values. finally, the quantile empirical p-values are corrected for multiple testing.

kincore: a web resource for structural classification of protein kinases and their inhibitors

vivek modi, roland dunbrack jr.
institute for cancer research, fox chase cancer center, philadelphia pa usa
abstract

protein kinases exhibit significant structural diversity, primarily in the conformation of the activation loop and other components of the active site. we previously performed a clustering of the conformation of the activation loop of all protein kinase structures in the protein data bank (modi and dunbrack, pnas, : - , ) into classes based on the location of the phe side chain of the dfg motif at the n-terminus of the activation loop. this is determined with a distance metric that measures the difference in the dihedral angles that determine the placement of the phe side chain (the φ, ψ of x, d, and f of the x-dfg motif and the χ1 of the phe side chain). the nomenclature is based on the regions of the ramachandran map occupied by the xdf residues and the χ1 rotamer of the phe residue. all active structures are “blaminus”, while common inactive dfgin conformations are “blbplus” and “abaminus”. type ii inhibitors bind almost exclusively to the dfgout “bbaminus” conformation. in this paper, we present kincore (http://dunbrack.fccc.edu/kincore), a web resource providing access to the conformational assignments based on our clustering, along with labels for the ligand types (type i, type ii, etc.) bound to each kinase chain in the pdb. the data are annotated with several properties including pdbid, uniprotid, gene, protein name, phylogenetic group, spatial and dihedral labels for the orientation of dfgmotif residues, c-helix disposition, and ligand name and type.
the user can browse and query the database using these attributes individually or perform an advanced search using a combination of them, such as a phylogenetic group with a specific conformational label and ligand type. the user can also determine the spatial and dihedral labels for a structure with unknown conformation using the web server and standalone program. the entire database can be downloaded as text files and structure files in pymol sessions and mmcif format. we believe that kincore will help in understanding conformational dynamics of these proteins and guide development of inhibitors targeting specific states.
introduction
protein kinases are catalytic molecular switches that regulate signaling pathways in cells by phosphorylating protein substrates [ ]. their catalytic activity is achieved by a remarkably flexible active site, which is observed in multiple different conformations when the enzyme is in an inactive state but adopts a unique conformation in the catalytically active state. the dysregulation of this mechanism due to a mutation or upregulation of expression can lead to a variety of diseases including cancer [ , ]. protein kinases are widely studied as drug targets, with molecules designed to inhibit the active state or stabilize a specific inactive state [ , ]. thus, understanding conformational dynamics in protein kinases is critical for the development of better drugs and novel biological insights. there are typical protein kinase genes with kinase domains in the human genome [ , ].
this number includes several pseudokinases but excludes atypical protein kinase genes, some of which are distantly related to the typical protein kinase fold [ ]. among the domains, the structures of have currently been determined experimentally, either in apo form or in complex with ligands. the protein kinase fold consists of an n-terminal lobe, formed by a five-stranded beta sheet and one alpha helix called the c-helix, and a c-terminal lobe, which consists of five or six alpha helices. the two lobes form a deep cleft in the middle region of the protein, creating the atp-binding active site. this site is surrounded by several structural elements critical for catalysis, which occupy a unique conformation in the active state and exhibit flexibility across different inactive states of the enzyme. one of the most critical elements is the activation loop, which adopts a unique extended orientation in the active state of the kinase and multiple types of folded conformations in inactive states. it begins with a conserved motif called the dfgmotif (asp-phe-gly), whose orientation is tightly coupled with the active/inactive status of the protein. in addition, the c-helix adopts an inward disposition in the active state while exhibiting a range of positions and orientations in other states. the dfgmotif conformations were previously described using a simple convention of dfgin and dfgout. the dfgin group consists of all the conformations in which dfg-asp points into the atp pocket and dfg-phe is adjacent to the c-helix. the structures solved in the active state conformation of the enzyme form a subset of this category. in dfgout conformations, the dfg-asp and dfg-phe residues swap their positions so that dfg-asp is removed from the atp binding site and replaced with dfg-phe. all the type ii inhibitors bind to dfgout conformations [ ]. the dfgin and dfgout groups, however, provide only a broad description of a more complex conformational landscape [ , ].
in our previous work, we developed a scheme for clustering and labeling different conformations of protein kinase structures [ ]. our clustering scheme is based on the spatial location and the backbone and side-chain dihedrals of the conserved dfgmotif in the activation loop. we clustered all the conformations into three spatial groups (dfgin, dfginter, dfgout) based on the proximity of the dfg-phe side chain to two different residues in the n-terminal domain. within these groups, we further clustered the structures by the dihedral angles that determine the location of the dfg-phe side chain: the backbone dihedrals of the x, d and f residues (where x is the residue before the dfgmotif) and the χ dihedral angle of the phe side chain. the kinase states are therefore named after the region of the ramachandran map occupied by the x, d, and f residues (a for alpha, b for beta, l for left-handed) and the phe χ rotamer (plus, minus, or trans for the + °, - °, or ° conformations). as a result, among the dfgin structures, we distinguished between the catalytically active kinase conformation (labeled blaminus) and five inactive conformations (blbplus, blbminus, blbtrans, abaminus, blaplus). among dfgout structures, we identified one dominant conformation labeled bbaminus, which is strongly correlated with type ii kinase inhibitors, such as imatinib. finally, among the small set of dfginter structures, where the phe side chain is intermediate between the dfgin and dfgout positions, we distinguished one cluster based on clustering the dihedral angles (babtrans).
our nomenclature strongly correlates with other structural features associated with active and inactive kinases, such as the positions of the c-helix and the activation loop and the presence or absence of the n-terminal domain salt bridge. since our clustering and nomenclature are based on backbone dihedrals, they are intuitive to structural biologists and easy to apply in a wide variety of experimental and computational studies, as demonstrated recently in identifying the conformation in the crystal structure of irak [ ], molecular dynamics simulations of abl kinase [ ] and structural analyses of pseudokinases [ ]. developing small molecule inhibitors is one of the most common therapeutic strategies against protein kinases. these inhibitors occupy the atp binding pocket and allosteric sites on the surface of the protein. two approaches have been used to classify inhibitors: a) based on the region of the protein to which the inhibitor binds; b) based on the conformation of the protein to which it binds. the first approach was used by dar and shokat [ ], who defined three types of inhibitors: type i – inhibitors which bind to the adenosine pocket but do not require a specific conformation of structural elements including the c-helix and dfgmotif; type ii – inhibitors that occupy the adenosine pocket and induce dfgout conformations because they extend into the pocket adjacent to the c-helix occupied by dfg-phe in dfgin structures; type iii – inhibitors that block kinase activity without displacing atp. this classification was extended by zuccotto and coworkers, who introduced type i½ inhibitors as molecules which bind to the atp region like type i compounds but extend into the back cavity, making additional contacts with the residues involved in type ii binding [ ]. rauh et al. defined type iv as the allosteric inhibitors which bind to a site distant from the atp binding region, inducing an inactive conformation in the active site [ , ]. van linden et al.
defined the ligand types by identifying three regions in the active site (a front cleft, the gate area, and the back cleft), which are further divided into subpockets [ ], without the use of labels like type i, ii etc. roskoski used the second approach and redefined all the inhibitors based on the conformation of the protein [ ]. according to this scheme, type i inhibitors bind only to the active conformation, type i½ inhibitors bind to dfgin inactive conformations, and type ii inhibitors bind to the dfgout conformation. each of these categories was divided into two subtypes, a and b. however, this scheme is inadequate because, as we have shown, some inhibitors such as bosutinib and sunitinib can bind to different conformations across proteins [ ]. for example, according to roskoski’s classification sunitinib will be labeled type i in nfz_a (dfgin-blaminus) and type iib in g f_a (dfgout-bbaminus), even though it binds to the kinase domain in an identical manner in both structures. in this paper, we present the kinase conformation resource, kincore, a web resource which automatically collects and curates all protein kinase structures from the protein data bank (pdb) and assigns conformational and inhibitor type labels. the website is designed so that the information for all the structures can be accessed at once using one database table, and instances of it through individual pages for kinase phylogenetic groups, genes, conformational labels, pdbids, ligands and ligand types.
the database can be searched using unique identifiers such as pdbid or gene, and queried using a combination of attributes such as phylogenetic group, conformational label and ligand type. we also provide several options to download data: database tables as tab-separated files, and the kinase structures as pymol sessions and coordinate files in mmcif format. the structures have been renumbered by uniprot and by our common numbering scheme, which is derived from our structure-based alignment of all human protein kinase domains [ ]. we have also developed a webserver and standalone program which can be used to determine the spatial and dihedral labels for a structure with unknown conformation. we automatically label ligand types based on the pockets to which an inhibitor binds, defined by specific residues in the kinase domain. thus, we use five labels for different ligand types: type i – bind to atp binding region only (both active and inactive dfgin states); type i½ – atp binding region and extending into the back pocket (both active and inactive dfgin states); type ii – atp binding region and extending to back pocket regions exposed only in dfgout structures; type iii – back pocket only without displacing atp; and allosteric – outside the active site cleft.
results
kincore provides conformational assignments and ligand type labels to protein kinase structures from the pdb. the current update contains structures from kinase genes from humans ( chains) and from genes ( chains) from seven model organisms. the protein kinase structures were identified from the pdb [ ] with psi-blast [ ], using a kinase pssm as the query (methods). the pdb files are split by chain, renumbered by uniprot numbering [ , ] and our common residue numbering scheme, and annotated with conformational and ligand type labels as described below. the conformational labels are assigned using the structural features and clusters described in our previous work [ ].
the scheme assigns two types of labels to each chain:
1) a spatial label (dfgin, dfginter, dfgout), assigned by computing the distance of the dfg-phe-cz atom from the cα atoms of two conserved residues and applying distance cutoff criteria (methods). the two residues are the strand β -lys involved in the n-terminal domain salt bridge formed in active kinase structures (and some inactive structures) and the residue four amino acids past the c-helix-glu involved in the same salt bridge.
2) a dihedral label: the dihedral angles (φ, ψ of x-dfg, asp, phe and χ for phe) of each chain in a spatial group are used to calculate the distance of the structure from the precomputed cluster centroids, and the chain is assigned a cluster label if its distance satisfies defined cutoff criteria (methods).
all the kinase conformations are represented by a set of eight labels: dfgin-blaminus, dfgin-blaplus, dfgin-abaminus, dfgin-blbminus, dfgin-blbplus, dfgin-blbtrans; dfgout-bbaminus; dfginter-babtrans. the chains that do not satisfy the dihedral distance cutoff criteria for any cluster or are missing some of the relevant coordinates are labeled as ‘unassigned’. additionally, we have labeled the c-helix disposition as c-helix-in or c-helix-out by computing the distance between the c-helix-glu-cβ atom and the b -lys-cβ atom, as a proxy for the conserved salt bridge interaction (methods).
figure : representative protein kinase structure ( eta_a) displaying the residues used to define inhibitor binding regions.
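The dihedral labeling step described above (nearest precomputed centroid, accepted only within a cutoff, otherwise ‘unassigned’) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the KinCore implementation: the periodic distance 2(1 − cos Δθ) averaged over the seven dihedrals is one common choice for angular data and is assumed here, and the centroid values, cluster names and the cutoff of 0.3 are hypothetical placeholders rather than values taken from the paper.

```python
import math
from typing import Dict, Optional, Sequence


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Periodic distance between two angles in degrees; ranges from 0 to 4."""
    return 2.0 * (1.0 - math.cos(math.radians(a_deg - b_deg)))


def dihedral_distance(angles: Sequence[float], centroid: Sequence[float]) -> float:
    """Mean periodic distance over corresponding dihedral angles."""
    if len(angles) != len(centroid):
        raise ValueError("angle vector and centroid must have the same length")
    return sum(angular_distance(a, c) for a, c in zip(angles, centroid)) / len(centroid)


def assign_dihedral_label(
    angles: Optional[Sequence[Optional[float]]],
    centroids: Dict[str, Sequence[float]],
    cutoff: float,
) -> str:
    """Return the nearest cluster name, or 'unassigned' if no cluster is close enough."""
    # Chains with missing coordinates cannot be labeled.
    if angles is None or any(a is None for a in angles):
        return "unassigned"
    name, dist = min(
        ((label, dihedral_distance(angles, c)) for label, c in centroids.items()),
        key=lambda item: item[1],
    )
    return name if dist <= cutoff else "unassigned"


# Hypothetical centroids, ordered as (phi, psi) of X, Asp, Phe, then chi of Phe.
# These numbers are placeholders for illustration only.
EXAMPLE_CENTROIDS = {
    "cluster_a": (-120.0, 150.0, 60.0, 30.0, -90.0, 20.0, -70.0),
    "cluster_b": (-120.0, 170.0, -70.0, 140.0, -65.0, -30.0, -70.0),
}

if __name__ == "__main__":
    query = (-118.0, 152.0, 58.0, 33.0, -92.0, 18.0, -68.0)
    print(assign_dihedral_label(query, EXAMPLE_CENTROIDS, cutoff=0.3))
```

In practice the centroids would be restricted to the clusters of the chain's spatial group, so the spatial label has to be determined first.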
to assign labels to ligands, we have used specific residue positions to identify regions of the binding pocket: the atp binding pocket (including the hinge residues), the back pocket and the type ii-only region (figure ). the structures are first renumbered by our common numbering scheme so that all the aligned residues have the same residue number across all the kinases. a ligand is then assigned a label based on its contacts with different binding regions. we have used the following five ligand type labels to annotate all the ligand-bound structures of protein kinases (figure ):
1. type i – bind to atp binding region only
2. type i½ – bind to atp binding region and extend into the back pocket (subdivided as type i½-front and type i½-back depending on contact with n-terminal or c-terminal residues of the c-helix, respectively)
3. type ii – bind to the atp binding region and extend into the back pocket and type ii-only region
4. type iii – bind only in the back pocket without displacing atp
5. allosteric – any pocket outside the atp-binding region
the distribution of different ligand types across kinase conformations is provided in table . it shows that type i and type i½ are the most commonly observed inhibitors. however, except type ii, all the inhibitor types are observed in complex with multiple conformational states.
table : distribution of ligand types across protein kinase conformations (number of chains). columns: spatial label; dihedral label; type i; type i½ (front+back); type ii; type iii; allosteric; total (%). rows: dfgin (blaminus (active), blbplus, abaminus, blbminus, blbtrans, blaplus, noise); dfgout (bbaminus, noise); dfginter (babtrans, noise); total (%).
many inhibitors are observed in multiple crystal structures bound to one or more different kinases. we counted the number of unique inhibitors that occur bound to kinase chains in two (or more) states across entries in the pdb. table gives the number of unique inhibitors that occur in each pair of states (excluding the unclassified spatial or dihedral labels). the numbers along the diagonal are the counts of unique inhibitors observed in at least one structure of the given state. a total of inhibitors occur in two or more kinase states.
table . counts of inhibitors that are bound to chains in two or more states. rows and columns: dfgin-blaminus, dfgin-abaminus, dfgin-blbplus, dfgin-blbminus, dfgin-blbtrans, dfgin-blaplus, dfgout-bbaminus, dfginter-babtrans. numbers along the diagonal provide the number of unique inhibitors in each state. the off-diagonal values are the number of unique inhibitors bound to chains in the two states shown in the row and column headers.
website
the web pages on kincore share a common format across the website so that the information is organized consistently. each page retrieved from the database is organized in two parts. the top part provides a summary of the number of structures in the queried groups or conformations, with representative structures from each category listed and displayed. this is followed by a table from the database with each unique pdb chain as a row, providing different kinds of information including conformational and ligand type labels and c-helix position, kinase family, gene name, uniprot id, ligand pdb id, and ligand type.
the kinase group, gene name, pdb code, conformational labels, ligand name and ligand type are hyperlinked to their specific pages. each page also contains three tabs on the top to list ‘human’, ‘non-human’ and ‘all’ structures. buttons on each page download the database table as a tab-separated file, and all of the kinase structures on the page as pymol sessions and renumbered coordinate files.
figure : snapshot of database table displaying entries for pdb chains on browse page.
the information from the database can be accessed using two main pages:
1. browse page: this page provides statistics and labels for all the kinase structures in the database (figure ). the ‘summary’ table on top of the page displays the distribution of protein kinase chains in the pdb across conformational states and phylogenetic groups. this is followed by the ‘database’ table, which contains annotation for all individual pdb chains retrieved from the database. the entire table with additional information like resolution, rfactor, activation loop residue etc. can be downloaded as a tab-separated file.
2. search page: this page offers two options to query the database:
• unique identifier: the database can be queried by pdb entry code (e.g., gs ), uniprot identifier (e.g., egfr_human), gene name (e.g., egfr), and ligand identifier (e.g., sti). the result will take the user to the page dedicated to the specific query item. for reference, the list of all genes in the database is provided through a ‘help’ button above the search box.
• advanced query: the database can be queried by selecting kinase phylogenetic group, conformational label, and ligand type using a drop-down menu. if the ‘all’ option is selected for all three categories, then the entire database table can be accessed at once. a subset of chains in the database can be retrieved by selecting a specific group name, conformational label, and ligand type; for example, selecting tyr group + dfgout-bbaminus + type ii ligand type will retrieve all the structures which have these three annotations. if all the structures in complex with a type i½ ligand are desired, then the user can select ‘all group’ + ‘all conformations’ + ‘type i½ ligand’.
the website contains several webpages which are dynamically generated and retrieve queried instances of the database. these pages can be accessed as a result of individual queries or by clicking on the hyperlinks on the browse page table. they are:
1. phylogenetic group page: typical protein kinases are divided into nine phylogenetic groups – agc, camk, cmgc, ck , nek, rgc, ste, tkl and tyr [ref]. each group is assigned a page on kincore displaying information about the structures in that group. on each page, the summary table provides the number of kinase chains in the group across different conformations with their representative structures (best resolution and least missing residues). these representative structures are also displayed on the page in d using ngl viewer.
2. gene page: a page for each kinase gene in the pdb can be accessed through the hyperlinks on the browse page or by the unique identifier search feature, and contains information for all the structures of a specific gene. the summary table on the page gives the number of structures available and their distribution across different conformations, with a representative example for each. it also provides hyperlinks to the phylogenetic group page (described above) for the gene and the corresponding protein entry on the uniprot website. in addition to the data provided on the browse page, the database table on this page also contains, for each chain, information on mutations and phosphorylation, the total length of the structure and the number of residues resolved in the activation loop.
3. pdb page: the pdb page provides information on individual pdb entries and can be accessed by the hyperlinks on the browse page or by the unique identifier search feature (figure ). each pdb entry is annotated with information on gene, protein name, phylogenetic group, uniprot id, organism, domain boundary, resolution, conformation, and ligand type labels for every chain. additionally, the page contains a sequence feature displaying the uniprot sequence of the protein in the structure. the residues which are unresolved in the structure are displayed in lower case letters to distinguish them from residues with coordinates in the entry. further, mutated and phosphorylated residues are shown in red and green, respectively.
4. ligand page: the ligand page provides access to all chains in complex with a specific ligand. for example, all the structures in complex with atp can be retrieved by querying for ‘atp’ on the search page or clicking on the hyperlinks on the browse page. the summary table provides the number of chains in complex with the ligand across different conformations. like other pages, the database table provides the list of all the pdb chains with conformational labels and ligand annotations. this page facilitates the comparison of conformations and ligand binding mode across structures from one or multiple kinases in complex with the same ligand. for example, bosutinib (pdb identifier db ), which is an fda-approved drug, is found in complex with structures from kinases in different conformations (figure ).
figure : snapshot of pdb page with the sequence feature.
alignment page
in our previous work, we developed a structure-based multiple sequence alignment (msa) for human protein kinase domains [ ]. this alignment contains blocks of aligned regions conserved across human kinases, with intermittent regions of low sequence similarity in lower case letters. the alignment is annotated with gene name, uniprot id, and protein residue numbers. on kincore, we provide access to this msa through the alignment page, which contains basic information about the alignment with a table of conserved regions across human kinases. the alignment can be visualized inside the browser window through the ‘open in browser’ button created using jalview’s biojs feature. this feature provides multiple options for quick analysis, including buttons to filter, color, or sort the sequences within the browser window. the alignment is also available to download as a jalview session as well as clustal- and fasta-formatted files.
phylogeny page
using our multiple sequence alignment, we also updated the protein kinase phylogenetic tree [ ]. this tree was used to assign a set of ten kinases previously categorized as “other” to the camk group, consisting of aurora kinases, polo-like kinases, and calcium/calmodulin-dependent kinase kinases. on our resource the tree can be accessed through the phylogeny page. it provides basic information about the tree, the number of kinase genes and domains in different phylogenetic groups, and links to visualize and download the tree.
figure : snapshot of ligand page displaying bosutinib (pdb ligand identifier db ) in complex with structures from kinase genes and in different conformations.
download options
we provide multiple data download options on kincore to assist the user in different kinds of analysis. these download options are created for all the pages and for any instance of the database retrieved by a query, e.g. structures of a specific gene, ligand etc., or structures from an advanced query like tyr kinases with dfgout state and type ii ligands. these options are:
1. coordinate files: we provide structure files in mmcif and pdb format with three different numbering systems: the original author residue numbering; renumbered by uniprot protein sequence; and a common residue numbering scheme derived from our multiple sequence alignment of kinases [ ].
2. pymol sessions: we provide pymol [ ] sessions for the structures retrieved from any query from the database.
two pymol sessions are provided for each query: all chains and representative chains (best resolution, least missing residues). across all the pymol sessions, the chains are labeled in a consistent format as phylogroup_gene_spatiallabel_dihedrallabel_pdbidchainid (e.g., tyr_egfr_dfgin_blaminus_ gs a). additionally, we provide pymol scripts (.pml format) which the user can download and run on a local machine to create the sessions.
3. database files: we provide the information retrieved from the database on every page as tab-separated files, which can be downloaded using the ‘database table as tsv’ button. when clicked on the ‘browse’ page, this button will download the information in the entire database in one file. on the other pages specific for a gene or conformation, this file will contain only the queried subset of the database. the tsv file has the following header: “organism group gene uniprotid pdb method resolution rfac freerfac spatiallabel dihedrallabel c-helix ligand ligandtype dfg_phe edia_x_o edia_asp_o edia_phe_o edia_gly_o proteinname”
4. bulk download: the ‘download’ page provides different options to download structure files and pymol sessions in bulk. the page is divided into two sections: coordinate files and pymol sessions. the user can download coordinate files for all the structures in one zip folder or in subsets by phylogenetic group, gene, and conformational label. the tab on the top of the page gives the option to download files with original author residue numbering or renumbered by uniprot protein sequence and common residue numbering from our alignment. the second part of the ‘download’ page provides pymol sessions for phylogenetic groups, genes and ligands.
we have developed a webserver to which the user can upload a kinase structure file in pdb or mmcif format to determine its conformation.
the program extracts the sequence from the structure file and identifies residue positions by aligning it with precomputed hmm profiles of kinase groups. it then determines the conformation of the protein by assigning spatial and dihedral labels (methods). on the output page, the server prints the kinase phylogenetic group which is the closest match to the sequence of the input structure, the dihedrals of the x-dfg, dfg-asp, and dfg-phe residues, the spatial group, the dihedral label and the c-helix disposition.
we have written a standalone program in python which the user can download to assign conformational labels to an unannotated structure. the program can be run in two ways: a) with flag align=true: alignment with precomputed hmm profiles is done to identify the residue numbers for b -lys, c-helix-glu and dfg-phe. the program then computes inter-residue distances and dihedral angles to label the conformation in the structure (methods); b) with flag align=false: alignment with an hmm profile is not done, and the residue numbers are provided by the user. this option is faster and more useful for identifying conformations in a large number of structures generated from molecular dynamics simulations.
discussion
experimentally determined protein kinase structures in apo form or in complex with a ligand display an extremely flexible active site. however, examining the conformational dynamics of kinases and their role in ligand binding requires combining two pieces of information: the conformational state of the protein and the type of ligand in complex.
currently, there are two main resources, kinametrix and klifs, that address protein kinase conformations and inhibitors. however, they provide either conformational assignments or ligand type information, but not both. kinametrix (http://kinametrix.com/) offers a simple scheme of dfgin and dfgout coupled with c-helix conformation [ ]. the resource does not provide information on ligands and lacks any download options for structures. it has not been updated with structures since may . klifs (https://klifs.vu-compmedchem.nl/index.php) also offers a simple dfgin and dfgout classification [ , ] and does not distinguish active and inactive dfgin structures. this resource is more focused on providing information about ligand binding to kinases. it is regularly updated and allows bulk downloads for the results of each search. kincore fills a gap by providing a sophisticated scheme for kinase conformations together with ligand type labels. the information can be accessed through individual queries, for example getting a list of all chains in complex with a type ii ligand, or through a combination of queries, such as agc group kinases + dfgin-blbplus conformation + type i½ ligand. a feature that distinguishes kincore from many structural bioinformatics resources is the ability to download coordinate files for the result of any query in one click. for example, a search for aurka produces a list of protein chains from pdb entries. these can be downloaded in mmcif format with one click, with residues in the original pdb numbering, renumbered according to the uniprot sequences, or in our common residue numbering scheme from the kinase multiple sequence alignment. each coordinate file is labeled by spatial label and dihedral angle cluster, e.g. camk_aurka_dfgin_blaminus_ ol a.cif. a user can also download a pymol session file with all of the structures for a given query.
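Because every downloaded chain follows the same naming pattern (phylogroup_gene_spatiallabel_dihedrallabel_pdbidchainid), the labels can be recovered from file names when batch-processing a download. The Python sketch below shows one way to do this. It assumes that the PDB entry code is the classic four-character identifier and that everything after it is the chain identifier; the function name, the `ChainLabel` type and the entry code `1ABC` in the example are hypothetical and used only for illustration.

```python
from typing import NamedTuple


class ChainLabel(NamedTuple):
    group: str
    gene: str
    spatial: str
    dihedral: str
    pdb_id: str
    chain_id: str


def parse_chain_label(name: str) -> ChainLabel:
    """Split a KinCore-style chain label (optionally ending in .cif) into its fields."""
    stem = name[:-4] if name.lower().endswith(".cif") else name
    parts = stem.split("_")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 underscore-separated fields: {name!r}")
    pdb_chain = parts[-1]
    # Assumption: four-character PDB entry code followed by the chain identifier.
    if len(pdb_chain) < 5:
        raise ValueError(f"cannot split PDB id and chain id in {pdb_chain!r}")
    # Fields are read from both ends so that a gene name containing
    # underscores would still be kept in one piece.
    return ChainLabel(
        group=parts[0],
        gene="_".join(parts[1:-3]),
        spatial=parts[-3],
        dihedral=parts[-2],
        pdb_id=pdb_chain[:4],
        chain_id=pdb_chain[4:],
    )


if __name__ == "__main__":
    # '1ABC' is a placeholder entry code, not a real KinCore example.
    print(parse_chain_label("TYR_EGFR_DFGin_BLAminus_1ABCA.cif"))
```

A parsed label can then be used to group local files by conformation or to cross-check them against the downloaded database table.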
in addition, an important part of our resource is the web server and standalone program, which can label the unknown conformation of a new structure. the standalone program can run on structure files with multiple chains and models. we believe it will be extremely useful for batch processing the structures generated from a molecular modeling protocol or molecular dynamics simulation.

several experimental and computational studies have reported applying the nomenclature from our previous work in structural analyses of kinases [ ]. lange and colleagues solved the crystal structure of the pseudokinase irak (pdbid ruu) and identified its conformation as blaminus, similar to the active state of a typical protein kinase [ ]. paul et al. studied the dynamics of abl kinase by various simulation techniques with markov state models and analyzed the transitions between different metastable states using our nomenclature [ ]. kirubakaran et al. identified the catalytically primed structures (blaminus) from the pdb to create a comparative modeling pipeline for the ligand-bound structures of cdk kinases [ ]. paul and srinivasan performed structural analyses of pseudokinases in arabidopsis thaliana and compared them with typical protein kinases by applying our conformational labels [ ].
therefore, we believe that the development of the kincore database and webserver will benefit a larger research community by making labeled kinase structures more accessible and by facilitating the identification of kinase conformations in a wide range of studies.

methods

identifying and renumbering protein kinase structures

the database contains protein kinase domains from homo sapiens and seven model organisms: bos taurus, danio rerio, drosophila melanogaster, mus musculus, rattus norvegicus, sus scrofa, and xenopus laevis. to identify structures from these organisms, the sequence of human aurora a kinase (residues - ) was used to construct a pssm matrix from three iterations of ncbi psi-blast on the pdb with default cutoff values [ ]. this pssm matrix was used as the query to run command line psi-blast on the pdbaa file from the pisces server (http://dunbrack.fccc.edu/pisces) [ ]. pdbaa contains the sequence of every chain in every asymmetric unit of the pdb in fasta format with resolution, r-factors, and swissprot identifiers (e.g. aurka_human). a total of pdb entries with kinase chains were identified. some poorly aligned kinases, and non-kinase proteins that were homologous to kinases but distantly related, were removed. the structure files were split into individual kinase chains in the asymmetric unit and renumbered by the uniprot protein numbering scheme. the mapping between pdb author numbering and uniprot was obtained from the structure integration with function, taxonomy and sequence (sifts) database [ ]. the sifts files were also used to extract mutation, phosphorylation, and missing residue annotations.

the structure files were also renumbered by a common residue numbering scheme using our protein kinase multiple sequence alignment. each residue in a kinase domain was renumbered by its column number in the alignment. therefore, aligned residues across different kinase sequences get the same residue number.
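the column-based renumbering described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the code used to build the database: the function name `column_numbering`, the gap symbols, and the assumption that residues are numbered consecutively without insertion codes are all hypothetical.

```python
from typing import Dict

# Symbols treated as alignment gaps (an assumption of this sketch).
GAP_CHARS = {"-", "."}


def column_numbering(aligned_seq: str, first_residue_number: int) -> Dict[int, int]:
    """Map original residue numbers to 1-based alignment column numbers.

    aligned_seq is one row of a multiple sequence alignment.
    first_residue_number is the original number of the first non-gap residue;
    residues are assumed to be numbered consecutively after it.
    """
    mapping: Dict[int, int] = {}
    residue_number = first_residue_number
    for column, symbol in enumerate(aligned_seq, start=1):
        if symbol in GAP_CHARS:
            continue  # a gap occupies a column but corresponds to no residue
        mapping[residue_number] = column
        residue_number += 1
    return mapping
```

with two toy rows such as `"--KLDFG-A"` (starting at residue 10) and `"MRK-DFGQA"` (starting at residue 200), the aspartate of the shared `DFG` stretch maps to column 5 in both, which is the property the common numbering scheme relies on.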
for example, in these renumbered structure files the residue number of the dfgmotif across all kinases is – . the conserved motifs for all the structures were identified from the same alignment.

assigning conformational labels

each kinase chain is assigned a spatial group and a dihedral label using our previous clustering scheme as a reference [ ]. our clustering scheme has three spatial groups: dfgin, dfginter, and dfgout. these are sub-divided into dihedral clusters: dfgin (blaminus, blaplus, abaminus, blbminus, blbplus, blbtrans), dfginter (babtrans), and dfgout (bbaminus). to determine the spatial group for each chain, the location of dfg-phe in the active site was identified using the following criteria:

1. d1 ≤ Å and d2 ≥ Å: dfgin
2. d1 > Å and d2 ≤ Å: dfgout
3. d1 ≤ Å and d2 ≤ Å: dfginter

where d1 = αc-glu(+ )-cα to dfg-phe-cζ and d2 = β -lys-cα to dfg-phe-cζ. any structure not satisfying the above criteria is considered an outlier and assigned the spatial label “none.”

to identify the dihedral label, the dfg-phe rotamer type in each chain was first identified (minus, plus, trans). the chains for each rotamer type were then represented with a set of backbone (Φ, Ψ) dihedrals from the x-dfg, dfg-asp, and dfg-phe residues. using these dihedrals, the distance of each kinase chain was calculated from the precomputed cluster centroid points of each cluster with the same rotamer type in the given spatial group. for example, the dihedral distance for all dfgin structures with phe-minus was computed against blaminus, abaminus, and blbminus.
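the two-step assignment (spatial group from two distances, then nearest dihedral centroid) can be sketched as follows. this is an illustration only and not the kincore implementation: all names are hypothetical, every cutoff must be supplied by the caller because no numeric values are assumed here, and the per-angle distance 2(1 − cos Δθ) averaged over the supplied dihedrals is an assumption of this sketch.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class SpatialCutoffs:
    """Distance cutoffs in angstroms (caller-supplied, hypothetical names)."""
    d1_in_max: float      # rule 1: d1 <= d1_in_max
    d2_in_min: float      # rule 1: d2 >= d2_in_min
    d1_out_min: float     # rule 2: d1 >  d1_out_min
    d2_out_max: float     # rule 2: d2 <= d2_out_max
    d1_inter_max: float   # rule 3: d1 <= d1_inter_max
    d2_inter_max: float   # rule 3: d2 <= d2_inter_max


def spatial_label(d1: float, d2: float, cut: SpatialCutoffs) -> str:
    """Apply the three rules in the order they are listed in the text."""
    if d1 <= cut.d1_in_max and d2 >= cut.d2_in_min:
        return "dfgin"
    if d1 > cut.d1_out_min and d2 <= cut.d2_out_max:
        return "dfgout"
    if d1 <= cut.d1_inter_max and d2 <= cut.d2_inter_max:
        return "dfginter"
    return "none"  # outlier


def angle_distance(theta1_deg: float, theta2_deg: float) -> float:
    """Periodic distance between two angles given in degrees (assumed form)."""
    return 2.0 * (1.0 - math.cos(math.radians(theta1_deg - theta2_deg)))


def dihedral_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Average angle_distance over matching dihedrals (phi/psi of X, D, F)."""
    if len(a) != len(b) or not a:
        raise ValueError("dihedral vectors must be non-empty and equally long")
    return sum(angle_distance(x, y) for x, y in zip(a, b)) / len(a)


def dihedral_label(chain: Sequence[float],
                   centroids: Dict[str, Sequence[float]],
                   cutoff: float) -> str:
    """Return the nearest centroid's label, or "none" if it is too far away."""
    best_label, best_distance = "none", math.inf
    for label, centroid in centroids.items():
        distance = dihedral_distance(chain, centroid)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance < cutoff else "none"
```

in use, `centroids` would hold only the clusters that share the chain's spatial group and rotamer type, so the candidate set is restricted before any distance is computed. because the angle distance is periodic, dihedrals near ±180° are compared correctly without any explicit wrapping.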
the dihedral angle distance is computed using the following formula:

$$D(i,j) = \frac{1}{6}\Big( D(\phi_i^{X},\phi_j^{X}) + D(\psi_i^{X},\psi_j^{X}) + D(\phi_i^{D},\phi_j^{D}) + D(\psi_i^{D},\psi_j^{D}) + D(\phi_i^{F},\phi_j^{F}) + D(\psi_i^{F},\psi_j^{F}) \Big)$$

where

$$D(\theta_1,\theta_2) = 2\,\big(1 - \cos(\theta_1 - \theta_2)\big)$$

a chain is assigned to a dihedral label if its distance from that cluster centroid is less than . . chains that have any motif residue missing, or that are distant from all the cluster centroids, are assigned the dihedral label “none.”

the c-helix disposition is determined using the distance between the cβ atoms of b -lys and c-helix-glu(+ ). a distance of < Å indicates that the salt bridge between the two residues is present, suggesting a c-helix-in conformation. a value of > Å suggests a c-helix-out conformation.

ligand classification

the different regions of the atp binding pocket are identified by specific residues using our common numbering scheme (supplementary figure ):

• atp binding region: hinge residues (residues - )
• back pocket: c-helix and partial regions of the b and b strands, dfgmotif backbone (residues - , - , , - , - and - )
• type ii-only pocket: exposed only in the dfgout conformation (residues , , and )

a contact between ligand atoms and protein residues is defined if the distance between any two atoms is ≤ . Å (hydrogens not included). based on these contacts we have labeled the ligand types as follows:

1. allosteric: any small molecule in the asymmetric unit whose minimum distance from the hinge region and the c-helix-glu(+ ) residue is greater than . Å.
2. 
type i½: subdivided into type i½-front (at least three contacts in the back pocket and at least one contact with the n-terminal region of the c-helix) and type i½-back (at least three contacts in the back pocket but no contact with the n-terminal region of the c-helix).
3. type ii: at least three contacts in the back pocket and at least one contact in the type ii-only pocket.
4. type iii: minimum distance from the hinge greater than Å and at least three contacts in the back pocket.
5. type i: all the ligands which do not satisfy the above criteria.

identify conformation using webserver

the program uses the structure file uploaded by the user to extract the sequence of the protein. it aligns the sequence with precomputed hmm profiles of kinase phylogenetic groups (e.g. agc.hmm, camk.hmm). the alignment with the best score is identified and used to determine the positions of the dfgmotif, b -lys, and c-helix-glu(+ ) residues. the program then computes the distances between specific atoms and the dihedrals to identify spatial and dihedral labels using the assignment method described above.

standalone program

the standalone program is written in python . . it is available to download from https://github.com/vivekmodi/kincore-standalone and can be run in a macos or linux terminal window. the user can provide an individual .pdb or .cif file (also compressed .gz) or a list of files as input. the program identifies the unknown conformation from a structure file in the same way as described for the webserver.

software and libraries used

all the scripting and analysis is done using python and depends on the pandas (https://pandas.pydata.org) and biopython [ ] libraries.

website and database

kincore is developed using the flask web framework (https://flask.palletsprojects.com/en/ . .x/). the webpages are written in html, with style elements created using bootstrap v . . (https://getbootstrap.com/). the 3d visualization is done using ngl viewer (http://nglviewer.org/ngl/api/). pymol (v . ) is used for creating download sessions [ ]. the entire application is deployed on the internet using apache webserver.

acknowledgements

the authors want to thank maxim shapovalov for his help in deploying the server. this work was funded by nih grant r gm to r.l.d.

references

adams, j.a., kinetic and catalytic mechanisms of protein kinases. chem rev, . ( ): p. - .
blume-jensen, p. and t. hunter, oncogenic kinase signalling. nature, . ( ): p. - .
lahiry, p., et al., kinase mutations in human disease: interpreting genotype-phenotype relationships. nat rev genet, . ( ): p. - .
zhang, j., p.l. yang, and n.s. gray, targeting cancer with small molecule kinase inhibitors. nat rev cancer, . ( ): p. - .
ferguson, f.m. and n.s. gray, kinase inhibitors: the road ahead. nature reviews drug discovery, . ( ): p. - .
manning, g., et al., the protein kinase complement of the human genome. science, . ( ): p. - .
modi, v. and r.l. dunbrack, jr., a structurally-validated multiple sequence alignment of human protein kinase domains. sci rep, . ( ): p. .
vijayan, r., et al., conformational analysis of the dfg-out kinase motif and biochemical profiling of structurally validated type ii inhibitors. journal of medicinal chemistry, . ( ): p. - .
möbitz, h., the abc of protein kinase conformations. biochimica et biophysica acta (bba) - proteins and proteomics, . ( ): p. - .
ung, p.m.-u., r. 
rahman, and a. schlessinger, redefining the protein kinase conformational space with machine learning. cell chemical biology, . ( ): p. - . e .
modi, v. and r.l. dunbrack, defining a new nomenclature for the structures of active and inactive kinases. proceedings of the national academy of sciences, . ( ): p. - .
lange, s.m., et al., dimeric structure of the pseudokinase irak suggests an allosteric mechanism for negative regulation. structure, .
paul, f., y. meng, and b. roux, identification of druggable kinase target conformations using markov model metastable states analysis of apo-abl. j chem theory comput, . ( ): p. - .
paul, a. and n. srinivasan, genome-wide and structural analyses of pseudokinases encoded in the genome of arabidopsis thaliana provide functional insights. proteins, . ( ): p. - .
dar, a.c. and k.m. shokat, the evolution of protein kinase inhibitors from antagonists to agonists of cellular signaling. annu rev biochem, . : p. - .
zuccotto, f., et al., through the "gatekeeper door": exploiting the active kinase conformation. j med chem, . ( ): p. - .
gavrin, l.k. and e. saiah, approaches to discover non-atp site kinase inhibitors. medchemcomm, . ( ): p. - .
fang, z., c. grutter, and d. rauh, strategies for the selective regulation of kinases with allosteric modulators: exploiting exclusive structural features. acs chem biol, . ( ): p. - .
van linden, o.p., et al., klifs: a knowledge-based structural database to navigate kinase-ligand interaction space. j med chem, .
roskoski, r., jr., classification of small molecule protein kinase inhibitors based upon the structures of their drug-enzyme complexes. pharmacol res, . : p. - .
consortium, w., protein data bank: the single global archive for 3d macromolecular structure data. nucleic acids research, . (d ): p. d -d .
altschul, s.f., et al., gapped blast and psi-blast: a new generation of database programs. nucleic acids research, . : p. - .
uniprot consortium, uniprot: a hub for protein information. nucleic acids res, . (database issue): p. d - .
velankar, s., et al., sifts: structure integration with function, taxonomy and sequences resource. nucleic acids research, . (d ): p. d -d .
delano, w.l., the pymol molecular graphics system. , schrödinger, inc.: san carlos, ca.
rahman, r., p.m.-u. ung, and a. schlessinger, kinametrix: a web resource to investigate kinase conformations and inhibitor space. nucleic acids research, . (d ): p. d -d .
kanev, g.k., et al., klifs: an overhaul after the first years of supporting kinase research. nucleic acids research, . (d ): p. d -d .
kirubakaran, p., et al., comparative modeling of cdk inhibitors to explore selectivity and structure-activity relationships. biorxiv, : p. . . . .
wang, g. and r.l. dunbrack, jr., pisces: recent improvements to a pdb sequence culling server. nucleic acids res, . (web server issue): p. w - .
cock, p.j., et al., biopython: freely available python tools for computational molecular biology and bioinformatics. bioinformatics, . ( ): p. - .

scaespy: a tool for autoencoder-based analysis of single-cell rna sequencing data

andrea tangherloni[∗], e-mail: andrea.tangherloni@unibg.it
wellcome trust-medical research council cambridge stem cell institute, cb aw, cambridge, uk
department of haematology, university of cambridge, cb aw, cambridge, uk
wellcome trust sanger institute, wellcome trust genome campus, cb sa, hinxton, uk
current address: department of human and social sciences, university of bergamo, , bergamo, italy

federico ricciuti, e-mail: f.ricciuti@campus.unimib.it
department of informatics, systems and communication, university of milano-bicocca, , milan, italy

daniela besozzi, e-mail: daniela.besozzi@unimib.it
department of informatics, systems and communication, university of milano-bicocca, , milan, italy
bicocca bioinformatics, biostatistics and bioimaging centre (b ), , milan, italy

pietro liò[†],[∗], e-mail: pl @cam.ac.uk
department of computer science and technology, university of cambridge, cb fd, cambridge, uk

ana cvejic[†],[∗], e-mail: as @cam.ac.uk
wellcome trust-medical research council cambridge stem cell institute, cb aw, cambridge, uk
department of haematology, university of cambridge, cb aw, cambridge, uk
wellcome trust sanger institute, wellcome trust genome campus, cb sa, hinxton, uk

[∗]corresponding author.
[†]these authors contributed equally.

.cc-by-nc-nd . international licenseunder a not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available the copyright holder for this preprint (which wasthis version posted february , . ; https://doi.org/ . / doi: biorxiv preprint https://doi.org/ . / http://creativecommons.org/licenses/by-nc-nd/ . /
abstract

background: single-cell rna sequencing (scrna-seq) experiments are gaining ground as a way to study the molecular processes that drive normal development as well as the onset of different pathologies. finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scrna-seq data, as it could provide a better identification of known or putatively novel cell-types. another step that still poses a challenge is the integration of different scrna-seq datasets. though standard computational pipelines to gain knowledge from scrna-seq data exist, a further improvement could be achieved by means of machine learning approaches.

results: autoencoders (aes) have been effectively used to capture the non-linearities among gene interactions of scrna-seq data, so the deployment of ae-based tools might represent the way forward in this context. we introduce here scaespy, a unifying tool that embodies: (1) four of the most advanced aes, (2) two novel aes that we developed for this purpose, and (3) different loss functions. we show that scaespy can be coupled with various batch-effect removal tools to integrate data from different scrna-seq platforms, in order to better identify the cell-types. we benchmarked scaespy against the most used batch-effect removal tools, showing that our ae-based strategies outperform the existing solutions.

conclusions: scaespy is a user-friendly tool that enables using the most recent and promising aes to analyse scrna-seq data by only setting up two user-defined parameters.
thanks to its modularity, scaespy can be easily extended to accommodate new aes to further improve the downstream analysis of scrna-seq data. considering the results we achieved, scaespy can be regarded as a starting point to build a more comprehensive toolkit designed to integrate multiple single-cell omics.

keywords: autoencoders; scrna-seq; dimensionality reduction; clustering; batch correction; data integration

background

single-cell rna sequencing (scrna-seq) was named the “method of the year” in , and it is currently used to investigate cell-to-cell heterogeneity because it measures transcriptome-wide gene expression at single-cell resolution, enabling the identification of different cell-types. scrna-seq data are mostly generated in studies that aim at understanding the molecular processes driving normal development and the onset of pathologies [ , ]. this field of research continuously poses new computational questions that have to be addressed [ ].

one of the most important steps in scrna-seq analysis is the clustering of cells into groups that correspond to known or putatively novel cell-types, by considering the expression of common sets of signature genes. however, this step remains challenging because applying clustering approaches in high-dimensional spaces can generate misleading results, as the distance between most pairs of points is similar [ ].
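the claim that pairwise distances become similar in high-dimensional spaces can be checked with a small simulation. the sketch below is an illustration on uniformly random points, not on scrna-seq data; the function name `relative_contrast` and the chosen sample sizes are hypothetical.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min over all pairwise Euclidean distances.

    Points are drawn uniformly from the unit hypercube. A small value means
    that the farthest and the nearest pair are almost equally far apart.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    distances = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    return (max(distances) - min(distances)) / min(distances)


if __name__ == "__main__":
    for dims in (2, 20, 200, 2000):
        print(dims, round(relative_contrast(50, dims), 3))
```

the contrast shrinks as the number of dimensions grows, which is why clustering is usually run on a reduced representation instead of the full gene space.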
as a consequence, finding an effective and efficient low-dimensional representation of the data is one of the most crucial steps in the downstream analysis of scrna-seq data. a common workflow of downstream analysis, depicted in figure , includes two dimensionality reduction steps: (1) principal component analysis (pca) [ ] for an initial reduction of the dimensions based on the highly variable genes (hvgs), and (2) a non-linear dimensionality reduction approach (e.g., t-distributed stochastic neighbour embedding (t-sne) [ ] or uniform manifold approximation and projection (umap) [ , ]) on the pca space for visualisation purposes (e.g., showing the labelled clusters) [ , ]. in addition, when multiple scrna-seq datasets have to be combined for further analyses, the non-negligible technical batch-effects that may exist among the datasets must be taken into account [ , , – ], making the dimensionality reduction even more complicated and fundamental. indeed, finding a salient batch-corrected and low-dimensional embedding space can help to better partition and distinguish the various cell-types.

although commonly used approaches for dimensionality reduction achieved good performance when applied to scrna-seq data [ ], novel and more robust dimensionality reduction strategies should be used to account for the sparsity, intrinsic noise, unexpected dropout, and burst effects [ , ], as well as the low amounts of rna that are typically present in single cells. ding et al. showed that low-dimensional representations of the original data learned using latent variable models preserve both the local and global neighbour structures of the original data [ ]. autoencoders (aes) showed outstanding performance in this regard due to their ability to capture the strong non-linearities among the gene interactions existing in the high-dimensional expression space.
autoencoders for denoising and dimensionality reduction

deep count ae network (dca) was one of the first ae-based approaches proposed to denoise scrna-seq datasets [ ] by considering the count distribution, overdispersion, and sparsity of the data. dca relies on a negative binomial noise model, with or without zero-inflation, to capture nonlinear gene-gene dependencies. starting from the vanilla version of the variational ae (vae) [ ], several approaches have been proposed. among them, single-cell variational inference (scvi) was the first scalable framework that allowed for a probabilistic representation and analysis of gene expression datasets [ ]. scvi was built upon deep neural networks (dnns) and stochastic optimization to consider the information across similar cells and genes to approximate the distributions underlying the analysed gene expression data. this computational tool allows for coupling a low-dimensional probabilistic representation of gene expression data with the downstream analysis, so that measurement uncertainty is taken into account through a statistical model. svensson et al. integrated a linearly decoded vae (ldvae) into scvi [ ], enabling the identification of relationships among the cell representation coordinates and gene weights via a factor model. single-cell vae (scvae) was introduced to directly model the raw counts from rna-seq data [ , ]. more importantly, the authors proposed a gaussian-mixture model to better learn biologically plausible groupings of scrna-seq data on the latent space.
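to make the negative binomial noise model with optional zero-inflation more concrete, the sketch below writes out the two log-probabilities for a single count. it is a minimal illustration, not code taken from dca or scvi: the function names are hypothetical, and the mean/dispersion parameterisation (`mu`, `theta`) with a zero-inflation probability `pi` is an assumption, chosen because it is a common way to state these likelihoods.

```python
import math


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a negative binomial.

    mu is the mean (> 0) and theta the inverse dispersion (> 0);
    the variance is mu + mu**2 / theta.
    """
    log_norm = math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
    return (
        log_norm
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x: int, mu: float, theta: float, pi: float) -> float:
    """Log-probability under a zero-inflated negative binomial.

    With probability pi (0 <= pi < 1) the count is a structural zero;
    otherwise it is drawn from the negative binomial above.
    """
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from the inflation component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log1p(-pi) + nb


def negative_log_likelihood(counts, mu, theta, pi=0.0):
    """Sum of per-count losses, the quantity a decoder would minimise."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)
```

setting `pi` to zero recovers the plain negative binomial, so one function covers both variants. in an autoencoder the decoder would output `mu`, `theta`, and `pi` per gene and cell, whereas here they are plain scalars to keep the example short.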
decomposition using hierarchical ae (scdha) is a hierarchical ae composed of two modules [ ]. the first module is a non-negative kernel ae able to provide a non-negative, part-based denoised representation of the original data. during this step, the genes and the components having an insignificant contribution to the denoised representation of the data are removed. the second module is a stacked bayesian self-learning network built upon the vae. this specific module is used to project the denoised data into a low-dimensional space used during the downstream analysis. scdha outperformed pca, t-sne, and umap in terms of silhouette index [ ] on the tested datasets.

aes coupled with disentanglement methods have been used to both improve the data representation and obtain better separation of the biological factors of variation in gene expression data [ ]. in addition, a graph ae, consisting of graph convolutional layers, was developed to predict relationships between single cells. this framework can be used to identify the cell-types in the dataset under analysis and discover the driver genes for the differentiation process. wang et al. proposed a deep vae for scrna-seq data named vasc [ ], a deep multi-layer generative model that improves the dimensionality reduction and visualisation steps in an unsupervised manner. thanks to its ability to model dropout events (which can hinder various downstream analysis steps, e.g., clustering analysis, differential expression analysis, and inference of gene-to-gene relationships, by introducing a high number of zero counts in the expression matrices) and to find nonlinear hierarchical representations of the data, vasc obtained superior performance with respect to four state-of-the-art dimensionality reduction and visualisation approaches [ ].

dimensionality reduction with adversarial vae (dr-a) has been recently proposed to fulfil the dimensionality reduction step from a data-driven point of view [ ].
compared to the previous approaches, dr-a exploits an adversarial vae-based framework, which is a recent variant of generative adversarial networks. dr-a generally obtained a more accurate low-dimensional representation of scrna-seq data compared to state-of-the-art approaches (e.g., pca, scvi, t-sne, umap), leading to better clustering performance. geddes et al. proposed an ae-based cluster ensemble framework to improve the clustering step [ ]. as a first step, random subspace projections of the data are compressed onto a low-dimensional space by exploiting an ae, obtaining different encoded spaces. then, an ensemble clustering approach is applied across all the encoded spaces to generate a more accurate clustering of the cells.

autoencoders for the imputation of missing data

autoimpute was proposed to deal with the insufficient quantities of starting rna in the individual cells, a problem that generally leads to significant dropout events. as a consequence, the resulting gene expression matrices are sparse and contain a high number of zero counts. autoimpute is an ae-based imputation method that works on sparse gene expression matrices, trying to learn the inherent distribution of the input data to assign the missing values [ ]. scsva was also proposed to identify and recover dropout events [ ], which are imputed by fitting a mixed model of each possible cell-type. in addition, it performs an efficient feature extraction step of the high-dimensional scrna-seq data, obtaining a low-dimensional embedding.
in the tests shown by the authors, scsva was able to outperform different state-of-the-art and novel approaches (e.g., pca, t-sne, umap, vasc). two other methods based on nonparametric aes were proposed to address the imputation problem [ ]. learning with autoencoder (late) relies on an ae that is directly trained on a gene expression matrix with randomly generated parameters, while transfer learning with late (translate) takes into consideration a reference gene expression dataset to estimate the parameters that are then used by late on the new gene expression matrix. late and translate were able to obtain outstanding performance on both real and simulated data by recovering nonlinear relationships in pairs of genes, allowing for a better identification and separation of the cell-types.

graphsci combines a graph convolution network and an ae to systematically integrate gene-to-gene relationships with the gene expression data. it is the first approach that integrates gene-to-gene relationships into a deep learning framework. graphsci is able to impute the dropout events by taking advantage of low-dimensional representations of similar cells and gene-gene interactions [ ].

in the existing aes, the input data are usually codified in a specific format, making their integration into the existing scrna-seq analysis toolkits (e.g., scanpy [ ] and seurat [ ]) a difficult task. in addition, the existing tools are implemented in keras[ ], tensorflow [ ], or pytorch [ ], and all three libraries are thus required to run them. finally, the currently available aes cannot be directly exploited to obtain the latent space or to generate synthetic cells. in order to overcome these limitations, we developed scaespy, a unifying, user-friendly, and standalone tool that relies only on tensorflow and allows easy access to different aes by setting up only two user-defined parameters.
scaespy can be used on high-performance computing (hpc) infrastructures to speed up its execution. it can be easily run on clusters of both central processing units (cpus) and graphics processing units (gpus). indeed, it was designed and developed to be executed on multi- and many-core infrastructures. in addition, scaespy gives access to the latent space generated by the trained ae, which can be directly used to show the cells in this embedded space or as a starting point for other dimensionality reduction approaches (e.g., t-sne and umap) as well as downstream analyses (e.g., batch-effect removal).

in this work, we show how scaespy can be used to deal with the existing batch-effects among samples. indeed, the application of batch-effect removal tools to the latent space allowed us to outperform state-of-the-art methods as well as the same batch-effect removal tools applied on the pca space. finally, scaespy implements different loss functions, which are fundamental to deal with different sequencing platforms.

[ ]https://github.com/fchollet/keras

results

we tested pca and aes to address the integration of different datasets. specifically, we used all the aes implemented in our scaespy tool: vae [ ], an ae only based on the maximum mean discrepancy (mmd) distance (called here mmdae) [ ], mmdvae, gaussian-mixture vae (gmvae), and two novel gaussian-mixture aes that we developed, called gmmmd and gmmmdvae, respectively.
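since several of the aes listed above are built on the maximum mean discrepancy, a small worked example of that quantity may help. the sketch below computes the biased estimate of the squared mmd between two samples with a gaussian kernel. it is not taken from scaespy: the function names and the default bandwidth are hypothetical, and the remark on how the term enters training (comparing latent codes with samples from a prior) describes common practice for mmd-based aes, stated here as an assumption.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(a: Vector, b: Vector, bandwidth: float) -> float:
    """Gaussian (RBF) kernel between two points of equal dimension."""
    squared = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-squared / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased estimate of the squared MMD between samples xs and ys.

    It is zero when the two samples coincide and grows as their
    distributions move apart.
    """
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )
```

in an mmd-based ae, `xs` would typically be the latent codes of a mini-batch and `ys` samples drawn from the chosen prior, with the estimate added to the reconstruction loss. the double loop is quadratic in the sample size, which is acceptable for a sketch but would be vectorised in practice.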
in all the performed tests, the constrained versions of the following loss functions were used: negative binomial (nb), poisson, zero-inflated nb (zinb), zero-inflated poisson (zip). we used a number of gaussian distributions equal to the number of datasets to integrate for gmvae, gmmmd, and gmmmdvae. in addition, we tested the following configurations of hidden layer and latent space to understand how the dimension of the aes might potentially affect the performance: ( , ), ( , ), ( , ), ( , ), ( , ), ( , ), ( , ), and ( , ); where (h,l) represents the number of neurons composing the hidden layer (h neurons) and latent space (l neurons).

in order to deal with the possible batch-effects, we applied the following approaches, which are suggested in [ , ] and are the most used batch-effect removal tools in the literature: batch balanced k-nearest neighbours (bbknn) [ , ], harmony [ ], combat [ – ], and the seurat implementation of the canonical correlation analysis (cca) [ ]. thus, we compared vanilla pca and aes, pca and aes followed by either bbknn or harmony, combat, and cca.

the proposed strategies were compared on three publicly available datasets, namely: peripheral blood mononuclear cells (pbmcs), pancreatic islet cells (pics), and mouse cell atlas (mca) by using well-known clustering metrics (i.e., adjusted rand index, adjusted mutual information index, fowlkes mallows index, homogeneity score, and v-measure). it is worth mentioning that generally the cell-types are manually identified by expert biologists starting from an over- or under-clustering of the data, possibly followed by different steps of sub-clustering of some clusters. here, we evaluate how the different strategies are able to automatically separate the cells by fixing the number of clusters equal to the number of cell-types manually identified by the authors of the papers.
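Fixing the number of clusters amounts to scanning the resolution parameter of the clustering algorithm and keeping only the values that yield the target number of clusters. The sketch below shows that scan in isolation. It assumes a hypothetical `cluster_fn` callable that maps a resolution to one label per cell; in a Scanpy workflow this could wrap `scanpy.tl.leiden(adata, resolution=res)`, but here a toy stand-in is used so that the example is self-contained.

```python
from typing import Callable, Iterable, List, Sequence

def resolutions_matching(
    cluster_fn: Callable[[float], Sequence[int]],
    resolutions: Iterable[float],
    n_target: int,
) -> List[float]:
    """Return the resolutions at which cluster_fn yields exactly n_target clusters."""
    matches = []
    for res in resolutions:
        labels = cluster_fn(res)
        if len(set(labels)) == n_target:
            matches.append(res)
    return matches

def toy_cluster(res: float) -> List[int]:
    """Hypothetical stand-in for a clustering call: splits 12 cells into
    more groups as the resolution grows."""
    n_groups = max(1, int(res * 4))
    return [i % n_groups for i in range(12)]

selected = resolutions_matching(toy_cluster, [0.25, 0.5, 0.75, 1.0, 1.25], n_target=3)
print(selected)
```

Several resolutions may give the same number of clusters, which is why the function returns a list; the metrics can then be computed for each retained resolution.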
datasets

peripheral blood mononuclear cells

pbmcs from eight patients with systemic lupus erythematosus were collected and processed using the × chromium genomics platform [ ]. the dataset is composed of a control group ( cells) and an interferon-β stimulated group ( cells). we considered the distinct cell-types identified by the authors following a standard workflow [ ]. the count matrices were downloaded from seurat’s tutorial “integrating stimulated vs. control pbmc datasets to learn cell-type specific responses” [ ].

pancreatic islet cells

pic datasets were generated independently using four different platforms: cel-seq [ ] ( cells), cel-seq [ ] ( cells), fluidigm c [ ] ( cells), and smart-seq [ ] ( cells). for our tests, we considered the different cell-types across the datasets identified in [ ] by applying pca on the scaled integrated data matrix. the count matrices were downloaded from seurat’s tutorial “integration and label transfer” [ ].

mouse cell atlas

mca is composed of two different datasets. the former was generated by han et al. [ ] using microwell-seq ( cells) [ ], while the latter by the tabula muris consortium [ ] using smart-seq ( cells). the distinct cell-types with the highest number of cells, which were present in both datasets, have been taken into account as in [ ]. the count matrices were downloaded from the public github repository related to [ ] [ ].

[ ]https://satijalab.org/seurat/v . /immune_alignment.html
metrics

adjusted rand index

the rand index (ri) is a similarity measure between the results obtained from the application of two different clustering methods. the first clustering method is used as ground truth (i.e., true clusters), while the second one has to be evaluated (i.e., predicted clusters). ri is calculated by considering all pairs of samples appearing in the clusters, namely, it counts the pairs that are assigned either to the same or different clusters in both the predicted and the true clusters. the adjusted ri (ari) [ ] is the “adjusted for chance” version of ri. its values vary in the range [− , ]: a value close to means a random assignment, independently of the number of clusters, while indicates that the clusters obtained with both clustering approaches are identical. negative values are obtained if the index is less than the expected index.

adjusted mutual information index

the mutual information index (mii) [ ] represents the mutual information of two random variables, which is a similarity measure of the mutual dependence between the two variables. specifically, it is used to quantify the amount of information that can be gained by one random variable observing the other variable. mii is strictly correlated with the entropy of a random variable, which quantifies the expected “amount of information” that is contained in a random variable. this index is used to measure the similarity between two labels of the same data. similarly to ari, the adjusted mii (amii) is “adjusted for chance” and its values vary in the range [ , ].

fowlkes mallows index

the fowlkes mallows index (fmi) [ ] measures the similarity between the clusters obtained by using two different clustering approaches. it is defined as the geometric mean between precision and recall.
assuming that the first clustering approach is the ground truth, the precision is the percentage of the results that are relevant, while the recall refers to the percentage of total relevant results correctly assigned by the second clustering approach. the index ranges from to .

[ ]https://satijalab.org/seurat/v . /integration.html
[ ]https://github.com/jinmiaochenlab/batch-effect-removal-benchmarking

homogeneity score

the result of the tested clustering approach satisfies the homogeneity score (hs) [ ] if all of its clusters contain only cells which are members of a single cell-type. its values range from to , where indicates perfectly homogeneous labelling. notice that by switching true cluster labels with the predicted cluster labels, the completeness score is obtained.

completeness score

the result of the tested clustering approach satisfies the completeness score (cs) [ ] if all the cells that are members of a given cell-type are elements of the same cluster. its values range from to , where indicates perfectly complete labelling. notice that by switching true cluster labels with the predicted cluster labels, the hs is obtained.

v-measure

the v-measure (vm) [ ] is the harmonic mean between hs and cs; it is equivalent to the normalised mii when the arithmetic mean is used as aggregation function.
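To make the definitions above concrete, the following sketch computes ARI, FMI, homogeneity, completeness, and the V-measure from two label vectors using only the Python standard library. It is an illustrative re-derivation from the textbook formulas (pair counts for ARI and FMI, conditional entropies for the other three), not the code used in this study; the function names are chosen here. Equivalent functions are available in scikit-learn (e.g., `sklearn.metrics.adjusted_rand_score` and `sklearn.metrics.v_measure_score`), and the adjusted mutual information is omitted for brevity.

```python
import math
from collections import Counter

def _pairs(m):
    """Number of unordered pairs that can be formed from m items."""
    return math.comb(m, 2)

def pair_sums(true_labels, pred_labels):
    """Pair counts over the contingency table and over its two margins."""
    joint = Counter(zip(true_labels, pred_labels))
    same_both = sum(_pairs(c) for c in joint.values())
    same_true = sum(_pairs(c) for c in Counter(true_labels).values())
    same_pred = sum(_pairs(c) for c in Counter(pred_labels).values())
    return same_both, same_true, same_pred

def adjusted_rand_index(true_labels, pred_labels):
    same_both, same_true, same_pred = pair_sums(true_labels, pred_labels)
    expected = same_true * same_pred / _pairs(len(true_labels))
    max_index = 0.5 * (same_true + same_pred)
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (same_both - expected) / (max_index - expected)

def fowlkes_mallows_index(true_labels, pred_labels):
    # Geometric mean of pairwise precision and pairwise recall.
    same_both, same_true, same_pred = pair_sums(true_labels, pred_labels)
    if same_true == 0 or same_pred == 0:
        return 0.0
    return same_both / math.sqrt(same_true * same_pred)

def _entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts)

def homogeneity_completeness_v(true_labels, pred_labels):
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    h_true = _entropy(true_counts.values(), n)
    h_pred = _entropy(pred_counts.values(), n)
    # Conditional entropies H(true | pred) and H(pred | true).
    h_true_given_pred = -sum(
        c / n * math.log(c / pred_counts[p]) for (_, p), c in joint.items()
    )
    h_pred_given_true = -sum(
        c / n * math.log(c / true_counts[t]) for (t, _), c in joint.items()
    )
    hs = 1.0 if h_true == 0 else 1.0 - h_true_given_pred / h_true
    cs = 1.0 if h_pred == 0 else 1.0 - h_pred_given_true / h_pred
    vm = 0.0 if hs + cs == 0 else 2.0 * hs * cs / (hs + cs)
    return hs, cs, vm

# Toy labels: the second cell-type is split into two predicted clusters.
true = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(adjusted_rand_index(true, pred), homogeneity_completeness_v(true, pred))
```

In the toy example every predicted cluster contains a single cell-type, so the labelling is perfectly homogeneous but not complete; swapping the two label vectors swaps hs and cs, as noted above.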
integration of multiple datasets obtained with the same sequencing platforms

various scrna-seq platforms are currently available (e.g., droplet-based and plate-based [ – ]) and their integration is often challenging due to the differences in biological sample batches as well as in the experimental platforms used. to test whether aes can be effectively applied to combine multiple datasets, generated using the same platform but under different experimental conditions, we used the pbmc datasets.

we merged the control and treated datasets by using vanilla pca and aes, pca and aes followed by either bbknn or harmony, combat, and cca. after the construction of the neighbourhood graphs, we performed a clustering step by using the leiden algorithm [ ]. since in the original paper different cell-types were manually identified [ ], we selected leiden’s resolutions that allowed us to obtain distinct clusters and calculated all the metrics described above. in what follows, the calculated values of all metrics are given in percentages. for each metric, the higher the value the better the result.

our analysis showed that the cca-based approach, proposed in the seurat library, achieved a mean ari equal to . % (with standard deviation equal to ± . %), combat reached a mean ari of . % (± . %), vanilla pca had a mean ari of . % (± . %), pca followed by bbknn was able to obtain a mean ari of . % (± . %), while pca followed by harmony reached a mean ari of . % (± . %), as shown in figure a. among all the tested aes, mmdae followed by harmony (using the nb loss function and neurons for the hidden layer and neurons for the latent space) achieved the best results, with a mean ari equal to . % (± . %). in order to assess whether the results obtained by the best ae were different from a statistical point of view, we applied the mann–whitney u test with the bonferroni correction [ – ]. in all the comparisons, mmdae followed by harmony had a p-value lower than .
, confirming that the achieved results are statistically different compared to those achieved by the other approaches.

regarding the amii, cca had a mean value of . % (± . %), combat achieved a mean value of . % (± . %), vanilla pca obtained a mean value of . % (± . %), pca followed by bbknn reached a mean value of . % (± . %), while pca followed by harmony reached a mean value of . % (± . %). mmdae followed by harmony had better results, with a mean value equal to . % (± . %). mmdae followed by harmony outperformed the other strategies also in terms of fmi, hs, cs, and vm (see additional file and figure ).

we also compared the results obtained by the best ae for each of the tested dimensions (h,l) in terms of ari (figure b). gmmmd followed by harmony (using the nb loss function) obtained the best results for the dimension ( , ), gmmmd followed by harmony (using the poisson loss function) reached the best results for the dimensions ( , ) and ( , ), and gmmmd followed by bbknn (using the nb loss function) achieved the best results for the dimension ( , ). mmdae followed by harmony (using the nb loss function) was able to reach the best results for the dimensions ( , ), ( , ), and ( , ), while mmdae followed by harmony (using the poisson loss function) obtained the best result for the dimension ( , ). notice that we used two gaussian distributions because we merged two different datasets.
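The statistical comparison used for the ARI values (a Mann–Whitney U test for each pair of strategies, followed by the Bonferroni correction) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the two-sided p-value uses the normal approximation without tie or continuity correction, which differs slightly from exact implementations such as `scipy.stats.mannwhitneyu`; the function names and the toy scores are hypothetical and are not results from this study.

```python
import math

def _average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = _average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy scores (hypothetical): one strategy compared against two others.
best = [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
others = {
    "strategy_a": [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
    "strategy_b": [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
}
raw = [mann_whitney_u(best, scores)[1] for scores in others.values()]
adjusted = dict(zip(others, bonferroni(raw)))
print(adjusted)
```

The Bonferroni step matters because the best strategy is compared against several alternatives at once; each adjusted p-value is then checked against the chosen significance threshold.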
in order to visually assess the quality of the separation of the manually annotated cell-type and the found clusters, we plotted them in the umap space generated starting from the mmdae followed by harmony space (figures c and d). finally, we also plotted the two samples in the same umap space to visually assess the quality of the alignment between the two samples themselves (figure a). this plot confirms that the batch-effects were completely removed. our analysis showed that clustering the neighbourhood graph generated from ae spaces allowed for a better identification of the existing cell-types when compared to other approaches, thus confirming the ari results.

integration of multiple datasets obtained with different sequencing platforms

combining datasets from different studies and scrna-seq platforms can be a powerful approach to obtain complete information about the biological system under investigation. however, when datasets generated with different platforms are combined, the high variability in the gene expression matrices can obscure the existing biological relationships. for example, the gene expression values are much higher in data acquired with plate-based methods (i.e., up to millions) than in those acquired with droplet-based methods (i.e., a few thousands). thus, combining gene expression data that spread across several orders of magnitude is a difficult task that cannot be tackled by using linear approaches like pca. to examine how well aes perform in resolving this task, we combined four pic datasets acquired with cel-seq [ ], cel-seq [ ], fluidigm c [ ], and smart-seq protocols [ ].

we integrated the datasets by using vanilla pca and aes, pca and aes followed by either bbknn or harmony, combat, and cca. since in the original paper cell-types were manually annotated for the pic datasets [ ], we clustered the neighbourhood graphs using the leiden algorithm considering only the resolutions that allowed us to obtain distinct clusters.
we then calculated ari, amii, fmi, hs, cs, and vm metrics. the calculated values of all metrics are given in percentages; for each metric, the higher the value the better the result.

cca had a very low mean ari, i.e., . % (± . %), combat obtained a mean ari of . % (± . %), vanilla pca achieved a mean ari of . % (± . %), pca followed by bbknn reached a mean ari of . % (± . %), while pca followed by harmony was able to obtain a mean ari of . % (± . %), see figure a. gmmmd followed by harmony (using the nb loss function and neurons for the hidden layer and neurons for the latent space) outperformed the other aes, achieving a mean ari equal to . % (± . %). in all the comparisons, except for the one against pca followed by harmony, gmmmd followed by harmony had a p-value lower than . , confirming that the achieved results are statistically different with respect to those obtained by the other approaches.

similar results were achieved for the amii metric: cca reached a mean value equal to . % (± . %), combat obtained a mean value of . % (± . %), vanilla pca reached a mean value of . % (± . %), pca followed by bbknn achieved a mean value of . % (± . %) and pca followed by harmony a mean value of . % (± . %), while gmmmd followed by harmony was able to reach a mean value equal to . % (± . %). considering the other measures, both pca and gmmmd followed by harmony obtained very similar results, outperforming the other strategies (see additional file and figure ).
considering the best ae for each of the tested dimensions (h,l) in terms of ari (see figure b), gmmmd followed by harmony (using the nb loss function) was the best choice for the dimensions ( , ) and ( , ), while it obtained the best results for the dimension ( , ) when the poisson loss function was used. gmvae followed by harmony (using the poisson loss function) reached the best results for the dimensions ( , ) and ( , ). mmdae followed by harmony achieved the best results for the dimensions ( , ) and ( , ), exploiting the nb loss function and poisson loss function, respectively. finally, vae followed by harmony obtained the best results with the poisson function for the dimension ( , ). note that we exploited four gaussian distributions because we merged four different datasets.

the quality of the separation of the manually annotated cell-type and found clusters can be visually evaluated in figures c and d. finally, we visualised the cells (coloured by platform) using the umap space generated from the gmmmd followed by harmony space (figure b) to confirm that the batch-effects among the samples sequenced with different platforms were correctly removed. taken together, our analysis shows that gmmmd followed by harmony can efficiently identify the “shared” cell-types across the different platforms due to its ability to deal with the high variability in the gene expression matrices. we would like to highlight that pca followed by harmony was capable of achieving good results because the original clusters were obtained by applying a similar pipeline [ ].

as a final test, we combined two mca datasets acquired with microwell-seq [ ] and smart-seq protocols [ ]. we integrated the datasets in the same way we did in the other two tests. we clustered the neighbourhood graphs using the leiden algorithm considering only the resolutions that allowed us to obtain distinct clusters because distinct cell-types were manually annotated for the mca datasets [ ].
we then calculated all metrics.

in this case, mmdvae followed by harmony (using the poisson loss function and neurons for the hidden layer and neurons for the latent space) outperformed the other aes as well as the other strategies, obtaining a mean ari equal to . % (± . %), as shown in figure a. combat achieved the worst mean ari, i.e., . % (± . %), cca reached a mean ari of . % (± . %), vanilla pca obtained a mean ari of . % (± . %), pca followed by bbknn had a similar mean ari, that is, . % (± . %), while pca followed by harmony achieved a mean ari of . % (± . %). mmdvae followed by harmony had a p-value lower than . in all the tested comparisons. considering the other metrics, mmdvae followed by harmony generally obtained better results compared to the other strategies (see additional file and figure ).

comparing the best ae for each of the tested dimensions (h,l) in terms of ari, the vanilla gmmmd with the nb loss function obtained the best results for the dimension ( , ), while gmmmd followed by harmony reached the best results for the dimensions ( , ) and ( , ), exploiting the poisson loss function and nb loss function, respectively. mmdae followed by bbknn (using the zip loss function) achieved the best results for the dimensions ( , ) and ( , ), exploiting the nb loss function and poisson loss function, respectively. mmdvae followed by harmony was the best choice for the dimensions ( , ) and ( , ) when coupled with the nb loss function and poisson loss function, respectively.
finally, vae followed by harmony with the poisson loss function obtained the best results for the dimension ( , ). as for the integration of the pbmc datasets, we used two gaussian distributions because we merged two different datasets.

figures c and d show the umap generated from the mmdvae followed by harmony space coloured by the manually annotated cell-type and found clusters, respectively, while figure c depicts the cells coloured by platform on the same umap space, confirming that the batch-effects between the two samples were correctly removed. in this case, the achieved results show that mmdvae followed by harmony was able to better identify the “shared” cell-types across the different platforms.

discussion

non-linear approaches for dimensionality reduction can be effectively used to capture the non-linearities among the gene interactions that may exist in the high-dimensional expression space of scrna-seq data [ ]. among the different non-linear approaches, aes showed outstanding performance, outperforming other approaches like umap and t-sne. several ae-based methods have been developed so far, but their integration with the common single-cell toolkits is a difficult task because they usually require input data codified in a specific format. in addition, three different machine learning libraries are required to use them (i.e., keras, tensorflow, and pytorch). here, we proposed scaespy, a unifying and user-friendly tool that allows the user to use the most recent and promising aes (i.e., vae, mmdae, mmdvae, and gmvae). we also designed and developed gmmmd and gmmmdvae, two novel aes that combine mmdae and mmdvae with gmvae to exploit more than one gaussian distribution. we introduced a learnable prior distribution in the latent space to model the dimensionality of the subpopulations of cells composing the data or to combine multiple samples.

we integrated aes with both harmony and bbknn to remove the existing batch-effects among different datasets. our results showed that exploiting the latent space to remove the existing batch-effects permits a better identification of the cell subpopulations. as a batch-effect removal tool, harmony allowed for achieving better results than bbknn in the majority of the cases. when different droplet-based data have to be combined, our gmmmd and the mmdae, coupled with the constrained nb and poisson loss functions, obtained the highest results compared to all the other aes. in order to combine and analyse multiple datasets, generated by using different scrna-seq platforms, both our gmmmd and mmdvae, mainly together with the nb and poisson loss functions, outperformed the other strategies. however, gmvae and the simple vae also obtained outstanding performance, highlighting that the kullback-leibler divergence function can become fundamental to handle data spanning several orders of magnitude, especially the high values (up to millions) introduced by plate-based methods. using more than one gaussian distribution allows for obtaining a better integration of the datasets and separation of the cell-types when more than two datasets have to be integrated, as shown by the results reached on the pic datasets.

considering the achieved results on the identification of the clusters, scaespy can be used as the basis of methods that aim to automatically identify the cell-types composing the scrna-seq datasets under analysis [ ].
as a matter of fact, scaespy coupled with bbknn was successfully applied to integrate different foetal human samples, enabling the identification of rare blood progenitor cells [ ].

conclusions

in this study, we proposed an ae-based and user-friendly tool, named scaespy, which allows for using the most recent and promising aes to analyse scrna-seq data. the user can select the desired ae by only setting up two user-defined parameters. once the selected ae has been trained, it can be used to generate synthetic cells to increase the amount of data for further downstream analyses (e.g., training classifiers). in scaespy, the latent space is easily accessible and thus allows the user to perform different analyses, such as the correction of possible batch-effects in a reduced non-linear space or the inference of differentiation trajectories. in the latter case, the latent space can be utilised to generate the “pseudotime” that measures the transcriptional changes that a cell undergoes during the dynamic process. thanks to its modularity, scaespy can be extended to accommodate new aes so that the user will always be able to utilise the latest and cutting-edge aes [ ], which can improve the downstream analysis of scrna-seq data.

it is worth noting that scaespy can be used on hpc infrastructures, both based on cpus and gpus, to speed up the computations. this is a crucial point when datasets composed of hundreds of thousands of cells are analysed. in such cases, the required running time drastically increases, so relying on hpc infrastructures is the best solution to substantially reduce the otherwise prohibitive running time.

future improvements

as an improvement, prior biological knowledge about genes from ontologies can be incorporated into scaespy. ontologies can introduce useful information into machine learning systems that are used to solve biological problems. they allow for integrating data from different omics (e.g., genomics, transcriptomics, proteomics, and metabolomics) as structured representations of semantic knowledge, which is commonly used for the representation of biological concepts. this approach has been successfully applied to predict the clinical targets from high-dimensional low-sample data [ ]. specifically, ontology embeddings are able to capture the semantic similarities among the genes, which can be exploited to sparsify the network connections. in addition, the gene ontology (go) [ ] can be exploited to interpret the extracted features from the latent spaces generated by the aes, allowing for bringing an explanation to the learned representations of the gene expression data. as a possible example, g:profiler [ ] focusing on go terms, kyoto encyclopedia of genes and genomes (kegg), and reactome can be used on the learned embeddings to investigate the joint effects of different gene sets within specific biological pathways. this approach can help the interpretability and explainability of the learned embeddings of the used aes.

integration of multi-omics data

since aes showed outstanding performance in the integration of multi-omics cancer data [ ], we plan to extend scaespy to analyse other single-cell omics. for instance, aes can be applied to analyse scatac-seq, where the identification of the cell-types is still more difficult due to technical challenges [ , ]. scaespy could be effectively applied to analyse disparate types of single-cell data from different points of view.
the latent representations of different or combined single-cell omics can be used for further and more in-depth analyses. for instance, the application of other machine learning techniques (e.g., deep neural networks) to the latent representations could facilitate the identification of interesting patterns on gene expression or methylation data, as well as relationships among genomic variants. in that regard, scaespy can be the starting point to build a more comprehensive toolkit designed to integrate multiple single-cell omics as an integration and extension of the work proposed in [ ].

methods

we developed scaespy so that it can be easily integrated into both scanpy and seurat pipelines, as it directly works on a gene expression matrix (see figure ). we integrated into a single tool the latest and most powerful aes designed to resolve the problems underlying scrna-seq data (e.g., sparsity, intrinsic noise, dropout events [ ]). specifically, scaespy is comprised of six aes, based on the vae [ ] and infovae [ ] architectures. the following most advanced aes are included in scaespy: vae, mmdae, mmdvae, gmvae, and two novel gaussian-mixture aes that we developed, called gmmmd and gmmmdvae. gmmmd is a modification of the mmdae where more than one gaussian distribution is used to model different modes and only the mmd function is used as divergence function. gmmmdvae is a combination of mmdvae and gmvae where both the mmd function [ ] and the kullback-leibler divergence function [ ] are used. scaespy allows the user to exploit these six different aes by setting up two user-defined parameters, α and λ, which are needed to balance the mmd and the kullback-leibler divergence functions. we designed and developed gmmmd and gmmmdvae starting from infovae [ ] and scvae [ ]. in addition, a learnable mixture distribution was used for the prior distribution in the latent space, and also the marginal conditional distribution was defined to be a learnable mixture distribution with the same number of components as the prior distribution. finally, the user can also select the following loss functions: nb, constrained nb, poisson, constrained poisson, zinb, constrained zinb, zip, constrained zip, and mean square error (mse).

the tested batch-effect removal tools

originally proposed to deal with batch-effects in microarray gene expression data [ ], combat has been successfully applied to analyse scrna-seq data [ ]. briefly, given a gene expression matrix, it is firstly standardised so that all genes have similar means and variances. then, starting from the obtained standardised matrix, standard distributions are fitted using a bayesian approach to estimate the existing batch-effects in the data. finally, the original expression matrix is corrected using the computed batch-effect estimators. in our tests, we used the default parameter settings provided by the scanpy function combat. we then applied pca on the space obtained by the top k (here, we set k = ) hvgs calculated by using the function provided by scanpy (v. . . . ), where the top hvgs are separately selected within each batch and merged to avoid the selection of batch-specific genes. we calculated the first components and applied the so-called “elbow method” to select the number of components for the downstream analysis. we used the first , , and components for pbmc, pic, and mca datasets, respectively. after that, we calculated the neighbourhood graph by using the default parameter settings proposed in scanpy.
we clustered the obtained neighbourhood graphs with the leiden algorithm by selecting the values of the resolution parameter such that the number of clusters was equal to the number of manually annotated clusters. finally, all the metrics for each found resolution have been calculated.

as another batch-effect removal tool, we used the cca-based approach proposed in the seurat package (v. . . ) [ ]. we applied both runcca and multicca seurat functions to integrate two batches and more than two batches, respectively. firstly, we normalised and log-transformed the counts. then, we calculated the top hvgs by using the function provided by scanpy (v. . . . ). we also scaled the log-transformed data to zero mean and unit variance. in both runcca and multicca seurat functions, as a first step, the cca components (here, we exploited the first ) are used to compute the linear combinations of the genes with the maximum correlation between the batches. a dynamic time warping (alignsubspace seurat function), which accounts for population density changes, is then used to align the calculated vectors and obtain a single low-dimensional subspace where the batch-effects are corrected. we calculated the neighbourhood graph, using the default parameter settings proposed in scanpy, starting from the aligned low-dimensional subspace. we clustered the built neighbourhood graphs with the leiden algorithm as explained before. finally, we calculated all the metrics for each found resolution.

we also applied harmony [ ] to remove the batch-effects.
starting from a reduced space (e.g., pca space or latent space), harmony exploits an iterative clustering-based procedure to remove the multiple-dataset-specific batch-effects. in each iteration, the following steps are applied: (i) the cells are grouped into multiple-dataset clusters by exploiting a variant of the soft k-means clustering, which is a fast and flexible method developed to cluster single-cell data; (ii) a centroid is calculated for each cluster and for each specific dataset; (iii) using the calculated centroids, a correction factor is derived for each dataset; (iv) the correction factors are then used to correct each cell with a cell-specific factor.

as a further batch-effect removal tool, we applied bbknn [ ]. polański et al. [ , ] showed that bbknn has comparable or better performance in removing batch-effects with respect to the cca-based approach proposed in the seurat package, scanorama [ ] and mnncorrect [ ]. in addition, bbknn is a lightweight graph alignment method that requires minimal changes to the classical workflow. indeed, it computes the k-nearest neighbours in a reduced space (e.g., pca or latent space), where the nearest neighbours are identified in a batch-balanced manner using a user-defined distance (in our tests, we used the euclidean distance). the neighbour information is transformed into connectivities to build a graph where all cells across batches are linked together. we used both harmony and bbknn to correct the pca and ae spaces.

as a final step, we calculated the umap spaces starting from the built neighbourhood graphs and using the default parameter settings proposed in scanpy, except for the initialisation of the low dimensional embedding (i.e., init_pos equal to random, and random_state equal to of the umap function).

the proposed pipeline

we modified the workflow shown in figure by replacing pca with aes (figure ).
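One ingredient of this pipeline, the batch-balanced neighbour search attributed to bbknn above, can be illustrated with a small sketch. It is not the bbknn implementation: the function name `batch_balanced_neighbours` is hypothetical, the search is brute force, each cell is excluded from its own neighbour list for simplicity, and the conversion of neighbours into connectivities is left out. It only shows the core idea that every cell receives k euclidean nearest neighbours from each batch rather than k neighbours overall.

```python
import math


def batch_balanced_neighbours(points, batches, k):
    """For every cell, list its k nearest cells within each batch.

    points  : coordinates in a reduced space (e.g. PCA or AE latent space)
    batches : batch label of every cell, in the same order as points
    k       : neighbours to take from each batch (hypothetical parameter name)
    """
    # Group cell indices by batch label.
    by_batch = {}
    for index, label in enumerate(batches):
        by_batch.setdefault(label, []).append(index)

    neighbours = []
    for i, point in enumerate(points):
        picked = []
        for label in sorted(by_batch):
            # Candidates are the other cells of this batch, nearest first.
            candidates = [j for j in by_batch[label] if j != i]
            candidates.sort(key=lambda j: math.dist(point, points[j]))
            picked.extend(candidates[:k])
        neighbours.append(picked)
    return neighbours
```

With an ordinary k-nearest-neighbour search a cell sitting in a dense region of its own batch may never connect to another batch; taking neighbours per batch guarantees links across batches, which is what lets the resulting graph align them.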
we merged the gene expression matrices of e different samples (e = , e = , and e = for pbmc, pic, and mca datasets, respectively). we applied both pca and aes on the space obtained by the top hvgs calculated by using the latest implementation of the scanpy function. as for pca, we first normalised and log-transformed the counts, then we applied a classic standardisation, that is, the expression of each gene was scaled to zero mean and unit variance. we calculated the first components; after that, we used the “elbow method” to select the first , , and components for pbmc, pic, and mca datasets, respectively. regarding aes, we used the original counts since aes were shown to achieve better results when applied to the raw counts [ ]. indeed, using the counts allows for exploiting discrete probability distributions, such as poisson and nb distributions, which obtained the best results in our tests. in all the tests presented here, we used a single hidden layer. in addition, we set epochs, sigmoid activation functions, and a batch size equal to samples (i.e., cells). in all tests we used the adam optimizer [ ].

after that, we applied three different strategies (figure ):
(i) we calculated the neighbourhood graph in both pca and ae spaces by using the default parameter settings proposed in scanpy. then, we clustered the obtained neighbourhood graphs with the leiden algorithm as described before. finally, we calculated all the metrics for each found resolution.
(ii) we performed a similar analysis where we first corrected the pca and ae spaces using harmony [ ] with the default parameter settings proposed in https://github.com/slowkow/harmonypy.
(iii) we performed the same analysis described in (i) by replacing the neighbourhood graphs with those generated using bbknn, using the default parameter settings.

table : setting of α, λ, and k to obtain the desired ae.

ae          α    λ    k
vae
mmdae
mmdvae
gmvae                 >
gmmmd                 >
gmmmdvae              >

the generalised formulation of scaespy

in this work, we used the notation proposed in [ ] to extend mmdvae with multiple gaussian distributions as well as to introduce a learnable prior distribution in the latent space. the idea behind the introduction of learnable coefficients is that they might be suitable to model the diversity among the subpopulations of cells composing the data or to combine multiple samples or datasets. we consider p*(x) as the unknown probability distribution in the input space over which the optimisation problem is formulated, and z as the latent representation of x with |z| ≤ |x|. the encoder is identified by a function eφ : x → z, while the decoder by a function dθ : z → x. recall that in vaes, the input x is not mapped into a single point in the latent space, but it is represented by a probability distribution over the latent space. q(z) can be any possible distribution in the latent space and y ∈ {1, . . . , k} is a categorical random variable, where k corresponds to the number of desired gaussian distributions. as a general strict divergence function, we considered the mmd(·) divergence function [ ]. the elbo term proposed in this work, which is the measure maximised during the training of aes, is:

$$\mathrm{ELBO} = \mathbb{E}\left[\log d(x \mid z, y)\right] - (\alpha + \lambda - 1)\,\mathrm{MMD}\left(p_e(z)\,\|\,q(z)\right) - (1 - \alpha)\,\mathbb{E}\left[\mathrm{KL}\left(p_e(z, y \mid x)\,\|\,q(z, y)\right)\right],$$

where kl(·) is the kullback-leibler divergence [ ] between two distributions. all the mathematical details required to derive the generalised formula shown in equation can be found in the additional file .
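To make the role of α and λ in the ELBO concrete, here is a minimal sketch of how the three terms could be combined. It assumes the two weights are (α + λ − 1) and (1 − α), as in the infovae objective on which the formulation builds. The gaussian kernel, its bandwidth, the biased estimator and all function names are illustrative assumptions, and the code is plain Python rather than the tensorflow code used by scaespy.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two latent vectors (assumed kernel)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


def generalised_elbo(reconstruction, mmd, kl, alpha, lam):
    """Weight the MMD term by (alpha + lam - 1) and the KL term by (1 - alpha)."""
    return reconstruction - (alpha + lam - 1.0) * mmd - (1.0 - alpha) * kl


# alpha = 0, lam = 1 switches the MMD term off (KL only);
# alpha = 1, lam = 1 switches the KL term off (MMD only).
print(generalised_elbo(-10.0, 0.5, 2.0, alpha=0.0, lam=1.0))  # -12.0
print(generalised_elbo(-10.0, 0.5, 2.0, alpha=1.0, lam=1.0))  # -10.5
```

In a real training loop the first argument of `mmd_squared` would hold encoded cells and the second would hold samples drawn from the prior q(z); the values passed to `generalised_elbo` here are made-up numbers used only to show the weighting.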
equation allows the user to easily exploit vae, mmdae, mmdvae, gmvae, gmmmd, and gmmmdvae (see table ).

availability and requirements

scaespy is written in the python programming language (v. . . ) and relies on tensorflow (v. . . ), an open-source and widely used machine learning library [ ]. scaespy requires the following python libraries: numpy, scikit-learn, matplotlib, and seaborn. scaespy’s open-source code is available on gitlab: https://gitlab.com/cvejic-group/scaespy under the gpl- license. the repository contains all the scripts, code and jupyter notebooks used to obtain the results shown in the paper. in the provided jupyter notebooks, we show how easy it is to integrate scaespy and scanpy, and how the data can be visualised and explored by using both scaespy’s and scanpy’s functions. we also provide a detailed description of scaespy’s parameters so that it can be used by both novice and expert researchers for downstream analyses.

competing interests

the authors declare that they have no competing interests.

authors’ contributions

at conceived the project. at and fr developed the software. ac, db, and pl supervised the project and helped to interpret and present the results. at performed all the tests and analysed the results. at wrote the manuscript. ac, db, and pl edited the manuscript. all authors read and approved the final manuscript.
acknowledgements this research was supported by cancer research uk grant number c /a (ac and at), european research council project – zf_blood (ac) and a core support grant from the wellcome trust and mrc to the wellcome trust – medical research council cambridge stem cell institute. we thank dr. leonardo rundo (department of radiology, university of cambridge) for their critical comments. author details wellcome trust-medical research council cambridge stem cell institute, cambridge, uk. department of haematology, university of cambridge, cambridge, uk. wellcome trust sanger institute, wellcome trust genome campus, hinxton, uk. department of informatics, systems and communication, university of milano-bicocca, milan, italy. department of computer science and technology, university of cambridge, cambridge, uk. current address: department of human and social sciences, university of bergamo, bergamo, italy. bicocca bioinformatics, biostatistics and bioimaging centre (b ), milan, italy. references . gladka, m.m., molenaar, b., de ruiter, h., van der elst, s., tsui, h., versteeg, d., lacraz, g.p., huibers, m.m., van oudenaarden, a., van rooij, e.: single-cell sequencing of the healthy and diseased heart reveals cytoskeleton-associated protein as a new modulator of fibroblasts activation. circulation ( ), – ( ). doi: . /circulationaha. . . keren-shaul, h., spinrad, a., weiner, a., matcovitch-natan, o., dvir-szternfeld, r., ulland, t.k., david, e., baruch, k., lara-astaiso, d., toth, b., et al.: a unique microglia type associated with restricting development of alzheimer’s disease. cell ( ), – ( ). doi: . /j.cell. . . . lähnemann, d., köster, j., szczurek, e., mccarthy, d.j., hicks, s.c., robinson, m.d., vallejos, c.a., campbell, k.r., beerenwinkel, n., mahfouz, a., et al.: eleven grand challenges in single-cell data science. genome biol. ( ), – ( ). doi: . /s - - - . steinbach, m., ertöz, l., kumar, v.: the challenges of clustering high dimensional data. 
in: new directions in statistical physics: econophysics, bioinformatics, and pattern recognition, pp. – . springer, berlin, heidelberg ( ). doi: . / - - - - _ . wold, s., esbensen, k., geladi, p.: principal component analysis. chemom intell lab syst. ( - ), – ( ). doi: . / - ( ) - . maaten, l.v.d., hinton, g.: visualizing data using t-sne. j mach learn res. (nov), – ( ) . mcinnes, l., healy, j., melville, j.: umap: uniform manifold approximation and projection for dimension reduction. arxiv preprint arxiv: . ( ) . becht, e., mcinnes, l., healy, j., dutertre, c.-a., kwok, i.w., ng, l.g., ginhoux, f., newell, e.w.: dimensionality reduction for visualizing single-cell data using umap. nat biotechnol. ( ), ( ). doi: . /nbt. . luecken, m.d., theis, f.j.: current best practices in single-cell rna-seq analysis: a tutorial. mol. syst. biol. ( ), ( ). doi: . /msb. . hwang, b., lee, j.h., bang, d.: single-cell rna sequencing technologies and bioinformatics pipelines. exp mol med. ( ), ( ). doi: . /s - - - . luecken, m.d., buttner, m., chaichoompu, k., danese, a., interlandi, m., müller, m.f., strobl, d.c., zappia, l., dugas, m., colomé-tatché, m., et al.: benchmarking atlas-level data integration in single-cell genomics. biorxiv ( ). doi: . / . . . . tran, h.t.n., ang, k.s., chevrier, m., zhang, x., lee, n.y.s., goh, m., chen, j.: a benchmark of batch-effect correction methods for single-cell rna sequencing data. genome biol. ( ), – ( ). doi: . /s - - - . leek, j.t., scharpf, r.b., bravo, h.c., simcha, d., langmead, b., johnson, w.e., geman, d., baggerly, k., irizarry, r.a.: tackling the widespread and critical impact of batch effects in high-throughput data. nat. rev. genet. ( ), – ( ). doi: . /nrg
. butler, a., hoffman, p., smibert, p., papalexi, e., satija, r.: integrating single-cell transcriptomic data across different conditions, technologies, and species. nat biotechnol. ( ), ( ). doi: . /nbt. . bacher, r., kendziorski, c.: design and computational analysis of single-cell rna-sequencing experiments. genome biol. ( ), ( ). doi: . /s - - -y . ding, j., condon, a., shah, s.p.: interpretable dimensionality reduction of single cell transcriptome data with deep generative models. nat commun. ( ), ( ). doi: . /s - - - . eraslan, g., simon, l.m., mircea, m., mueller, n.s., theis, f.j.: single-cell rna-seq denoising using a deep count autoencoder. nat commun. ( ), ( ). doi: . /s - - - . kingma, d.p., welling, m.: auto-encoding variational bayes. arxiv preprint arxiv: . ( ) . lopez, r., regier, j., cole, m.b., jordan, m.i., yosef, n.: deep generative modeling for single-cell transcriptomics. nat methods ( ), ( ). doi: . /s - - - . svensson, v., gayoso, a., yosef, n., pachter, l.: interpretable factor models of single-cell rna-seq via variational autoencoders. bioinformatics ( ), – ( ). doi: . /bioinformatics/btaa . grønbech, c.h., vording, m.f., timshel, p.n., sønderby, c.k., pers, t.h., winther, o.: scvae: variational auto-encoders for single-cell gene expression data. biorxiv, ( ). doi: . / . grønbech, c.h., vording, m.f., timshel, p.n., sønderby, c.k., pers, t.h., winther, o.: scvae: variational auto-encoders for single-cell gene expression data.
bioinformatics ( ). doi: . /bioinformatics/btaa . tran, d., nguyen, h., tran, b., nguyen, t.: fast and precise single-cell data analysis using hierarchical autoencoder. biorxiv, ( ). doi: . / . rousseeuw, j.: a graphical aid to the interpretation and validation of cluster analysis. j. comput. appl. math. , – ( ) . bica, i., andrés-terré, h., cvejic, a., liò, p.: unsupervised generative and graph representation learning for modelling cell differentiation. sci. rep. ( ), – ( ). doi: . /s - - - . wang, d., gu, j.: vasc: dimension reduction and visualization of single-cell rna-seq data by deep variational autoencoder. genom proteom bioinf. ( ), – ( ). doi: . /j.gpb. . . . lin, e., mukherjee, s., kannan, s.: a deep adversarial variational autoencoder model for dimensionality reduction in single-cell rna sequencing analysis. bmc bioinformatics ( ), – ( ). doi: . /s - - - . geddes, t.a., kim, t., nan, l., burchfield, j.g., yang, j.y., tao, d., yang, p.: autoencoder-based cluster ensembles for single-cell rna-seq data analysis. bmc bioinformatics ( ), ( ). doi: . /s - - - . talwar, d., mongia, a., sengupta, d., majumdar, a.: autoimpute: autoencoder based imputation of single-cell rna-seq data. sci. rep. ( ), ( ). doi: . /s - - -x . sun, s., liu, y., shang, x.: deep generative autoencoder for low-dimensional embeding extraction from single-cell rnaseq data. in: proceedings of the ieee international conference on bioinformatics and biomedicine (bibm), pp. – ( ). doi: . /bibm . . . ieee . badsha, m.b., li, r., liu, b., li, y.i., xian, m., banovich, n.e., fu, a.q.: imputation of single-cell gene expression with an autoencoder neural network. quant. biol., – ( ). doi: . /s - - - . rao, j., zhou, x., lu, y., zhao, h., yang, y.: imputing single-cell rna-seq data by combining graph convolution and autoencoder neural networks. biorxiv ( ). doi: . / . . . . wolf, f.a., angerer, p., theis, f.j.: scanpy: large-scale single-cell gene expression data analysis. genome biol. ( ), ( ). 
doi: . /s - - - . satija, r., farrell, j.a., gennert, d., schier, a.f., regev, a.: spatial reconstruction of single-cell gene expression data. nat biotechnol. ( ), ( ). doi: . /nbt. . abadi, m., barham, p., chen, j., chen, z., davis, a., dean, j., devin, m., ghemawat, s., irving, g., isard, m., et al.: tensorflow: a system for large-scale machine learning. in: proceedings of the symposium on operating systems design and implementation), pp. – ( ) . paszke, a., gross, s., chintala, s., chanan, g., yang, e., devito, z., lin, z., desmaison, a., antiga, l., lerer, a.: automatic differentiation in pytorch. in: proceedings of the conference on advances in neural information processing systems ( ) . zhao, s., song, j., ermon, s.: infovae: information maximizing variational autoencoders. arxiv preprint arxiv: . ( ) . park, j.-e., polański, k., meyer, k., teichmann, s.a.: fast batch alignment of single cell transcriptomes unifies multiple mouse cell atlases into an integrated landscape. biorxiv, ( ) . polański, k., young, m.d., miao, z., meyer, k.b., teichmann, s.a., park, j.-e.: bbknn: fast batch alignment of single cell transcriptomes. bioinformatics ( ), – ( ). doi: . /bioinformatics/btz . korsunsky, i., millard, n., fan, j., slowikowski, k., zhang, f., wei, k., baglaenko, y., brenner, m., loh, p.-r., raychaudhuri, s.: fast, sensitive and accurate integration of single-cell data with harmony. nat. methods , – ( ). doi: . /s - - - . johnson, w.e., li, c., rabinovic, a.: adjusting batch effects in microarray expression data using empirical bayes methods. biostatistics ( ), – ( ). doi: . /biostatistics/kxj . leek, j.t., johnson, w.e., parker, h.s., jaffe, a.e., storey, j.d.: the sva package for removing batch effects and other unwanted variation in high-throughput experiments. bioinformatics ( ), – ( ). doi: . /bioinformatics/bts . pedersen, b.: python implementation of combat. github ( ) . 
kang, h.m., subramaniam, m., targ, s., nguyen, m., maliskova, l., mccarthy, e., wan, e., wong, s., byrnes, l., lanata, c.m., et al.: multiplexed droplet single-cell rna-sequencing using natural genetic variation. nat biotechnol. ( ), ( ). doi: . /nbt. . grün, d., muraro, m.j., boisset, j.-c., wiebrands, k., lyubimova, a., dharmadhikari, g., van den born, m., van es, j., jansen, e., clevers, h., et al.: de novo prediction of stem cell identity using single-cell transcriptome data. cell stem cell ( ), – ( ). doi: . /j.stem. . . . muraro, m.j., dharmadhikari, g., grün, d., groen, n., dielen, t., jansen, e., van gurp, l., engelse, m.a., carlotti, f., de koning, e.j., et al.: a single-cell transcriptome atlas of the human pancreas. cell syst. ( ), – ( ). doi: . /j.cels. . . .
lawlor, n., george, j., bolisetty, m., kursawe, r., sun, l., sivakamasundari, v., kycia, i., robson, p., stitzel, m.l.: single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type diabetes. genome res. ( ), – ( ). doi: . /gr. . . segerstolpe, Å., palasantza, a., eliasson, p., andersson, e.-m., andréasson, a.-c., sun, x., picelli, s., sabirsh, a., clausen, m., bjursell, m.k., et al.: single-cell transcriptome profiling of human pancreatic islets in health and type diabetes. cell metab. ( ), – ( ). doi: . /j.cmet. . . . stuart, t., butler, a., hoffman, p., hafemeister, c., papalexi, e., mauck iii, w.m., hao, y., stoeckius, m., smibert, p., satija, r.: comprehensive integration of single-cell data. cell ( ). doi: . /j.cell. . . . han, x., wang, r., zhou, y., fei, l., sun, h., lai, s., saadatpour, a., zhou, z., chen, h., ye, f., et al.: mapping the mouse cell atlas by microwell-seq. cell ( ), – ( ). doi: . /j.cell. . . . consortium, t.m., et al.: single-cell transcriptomics of mouse organs creates a tabula muris. nature , – ( ). doi: . /s - - - . hubert, l., arabie, p.: comparing partitions. j classif. ( ), – ( ). doi: . /bf . strehl, a., ghosh, j.: cluster ensembles—a knowledge reuse framework for combining multiple partitions. j mach learn res. (dec), – ( ) . fowlkes, e.b., mallows, c.l.: a method for comparing two hierarchical clusterings. j am stat assoc. ( ), – ( ) . vinh, n.x., epps, j., bailey, j.: information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. j mach learn res. (oct), – ( ) . rosenberg, a., hirschberg, j.: v-measure: a conditional entropy-based external cluster evaluation measure. in: proceedings of the conference on empirical methods in natural language processing and computational natural language learning, pp. – ( ) . 
macosko, e.z., basu, a., satija, r., nemesh, j., shekhar, k., goldman, m., tirosh, i., bialas, a.r., kamitaki, n., martersteck, e.m., et al.: highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. cell ( ), – ( ). doi: . /j.cell. . . . klein, a.m., mazutis, l., akartuna, i., tallapragada, n., veres, a., li, v., peshkin, l., weitz, d.a., kirschner, m.w.: droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. cell ( ), – ( ). doi: . /j.cell. . . . hashimshony, t., wagner, f., sher, n., yanai, i.: cel-seq: single-cell rna-seq by multiplexed linear amplification. cell rep. ( ), – ( ). doi: . /j.celrep. . . . hashimshony, t., senderovich, n., avital, g., klochendler, a., de leeuw, y., anavy, l., gennert, d., li, s., livak, k.j., rozenblatt-rosen, o., et al.: cel-seq : sensitive highly-multiplexed single-cell rna-seq. genome biol. ( ), ( ). doi: . /s - - - . zheng, g.x., terry, j.m., belgrader, p., ryvkin, p., bent, z.w., wilson, r., ziraldo, s.b., wheeler, t.d., mcdermott, g.p., zhu, j., et al.: massively parallel digital transcriptional profiling of single cells. nat commun. , ( ). doi: . /ncomms . gierahn, t.m., wadsworth ii, m.h., hughes, t.k., bryson, b.d., butler, a., satija, r., fortune, s., love, j.c., shalek, a.k.: seq-well: portable, low-cost rna sequencing of single cells at high throughput. nat methods ( ), ( ). doi: . /nmeth. . islam, s., kjällquist, u., moliner, a., zajac, p., fan, j.-b., lönnerberg, p., linnarsson, s.: characterization of the single-cell transcriptional landscape by highly multiplex rna-seq. genome res. ( ), – ( ). doi: . /gr. . . ramsköld, d., luo, s., wang, y.-c., li, r., deng, q., faridani, o.r., daniels, g.a., khrebtukova, i., loring, j.f., laurent, l.c., et al.: full-length mrna-seq from single-cell levels of rna and individual circulating tumor cells. nat biotechnol. ( ), ( ). doi: . /nbt. . 
picelli, s., faridani, o.r., björklund, Å.k., winberg, g., sagasser, s., sandberg, r.: full-length rna-seq from single cells using smart-seq . nat protoc. ( ), ( ). doi: . /nprot. . . jaitin, d.a., kenigsberg, e., keren-shaul, h., elefant, n., paul, f., zaretsky, i., mildner, a., cohen, n., jung, s., tanay, a., et al.: massively parallel single-cell rna-seq for marker-free decomposition of tissues into cell types. science ( ), – ( ). doi: . /science. . traag, v.a., waltman, l., van eck, n.j.: from louvain to leiden: guaranteeing well-connected communities. sci rep. ( ). doi: . /s - - -z . mann, h.b., whitney, d.r.: on a test of whether one of two random variables is stochastically larger than the other. ann. of math. stat., – ( ) . wilcoxon, f.: individual comparisons by ranking methods. in: breakthroughs in statistics, pp. – . springer, new york, ny ( ). doi: . / - - - - _ . dunn, o.j.: multiple comparisons among means. j. am. stat. assoc. ( ), – ( ) . ma, f., pellegrini, m.: actinn: automated identification of cell types in single cell rna sequencing. bioinformatics ( ). doi: . /bioinformatics/btz . ranzoni, a.m., tangherloni, a., berest, i., riva, s.g., myers, b., strzelecka, p.m., xu, j., panada, e., mohorianu, i., zaugg, j.b., et al.: integrative single-cell rna-seq and atac-seq analysis of human foetal liver and bone marrow haematopoiesis. biorxiv ( ). doi: . / . . . . simidjievski, n., bodnar, c., tariq, i., scherer, p., andres terre, h., shams, z., jamnik, m., liò, p.: variational autoencoders for cancer data integration: design principles and computational practice. front.
genet. , ( ). doi: . /fgene. . . trębacz, m., shams, z., jamnik, m., scherer, p., simidjievski, n., terre, h.a., liò, p.: using ontology embeddings for structural inductive bias in gene expression data analysis. arxiv preprint arxiv: . ( ) . ashburner, m., ball, c.a., blake, j.a., botstein, d., butler, h., cherry, j.m., davis, a.p., dolinski, k., dwight, s.s., eppig, j.t., et al.: gene ontology: tool for the unification of biology. nat. genet. ( ), – ( ) . raudvere, u., kolberg, l., kuzmin, i., arak, t., adler, p., peterson, h., vilo, j.: g:profiler: a web server for functional enrichment analysis and conversions of gene lists ( update). nucl. acids res. (w ), – ( ). doi: . /nar/gkz . chen, x., miragaia, r.j., natarajan, k.n., teichmann, s.a.: a rapid and robust method for single cell chromatin accessibility profiling. nat commun. ( ), ( ). doi: . /s - - - . gretton, a., borgwardt, k., rasch, m., schölkopf, b., smola, a.j.: a kernel method for the two-sample-problem. in: proceedings of the conference on advances in neural information processing systems, pp. – ( ) . kullback, s., leibler, r.a.: on information and sufficiency. ann math statist. ( ), – ( ). doi: . /aoms/ .
risso, d., perraudeau, f., gribkova, s., dudoit, s., vert, j.-p.: a general and flexible method for signal extraction from single-cell rna-seq data. nat. commun. ( ), – ( ). doi: . /s - - - . hie, b.l., bryson, b., berger, b.: panoramic stitching of heterogeneous single-cell transcriptomic data. biorxiv, ( ) . haghverdi, l., lun, a.t., morgan, m.d., marioni, j.c.: batch effects in single-cell rna-sequencing data are corrected by matching mutual nearest neighbors. nat biotechnol. ( ), ( ). doi: . /nbt. . kingma, d.p., ba, j.: adam: a method for stochastic optimization. arxiv preprint arxiv: . ( )

additional files

additional file : mathematical formulation of the proposed autoencoders
we provide the mathematical derivation of gmmmd and gmmmdvae as well as the generalised formulation that we derived by following the notation proposed in [ ].

additional file : excel file of the metrics calculated for the pbmc datasets
each tab is related to a tested approach and shows the calculated metrics and used method.

additional file : excel file of the metrics calculated for the pic datasets
each tab is related to a tested approach and shows the calculated metrics and used method.

additional file : excel file of the metrics calculated for the mca datasets
each tab is related to a tested approach and shows the calculated metrics and used method.
figures

figure : a common workflow for the downstream analysis of scrna-seq data. the workflow includes the following seven steps: (i) quality control to remove low-quality cells that may add technical noise, which could obscure the real biological signals; (ii) normalisation and log-transformation; (iii) identification of the hvgs to reduce the dimensionality of the dataset by including only the most informative genes; (iv) standardisation of each gene to zero mean and unit variance; (v) dimensionality reduction generally obtained by applying pca; (vi) clustering of the cells starting from the low-dimensional representation of the data, together with the identification of the marker genes that are used to annotate the obtained clusters (i.e., identification of known and putatively novel cell-types); (vii) data visualisation on the low-dimensional space generated by applying a non-linear approach (e.g., t-sne or umap) on the reduced space calculated in step (v).
figure : the proposed workflow to integrate different samples. given e different samples, their gene expression matrices are merged. then, the top k hvgs are selected by considering the different samples. specifically, they are selected within each sample separately and then merged to avoid the selection of batch-specific genes. scaespy is used to reduce the hvg space (k dimensions), and the obtained latent space can be (i) used to calculate a t-sne space, (ii) corrected by harmony, and (iii) used to infer an uncorrected neighbourhood graph. the latent space corrected by harmony is then used to build a neighbourhood graph, which is clustered by using the leiden algorithm and used to calculate a umap space. otherwise, bbknn is applied to rebuild the uncorrected neighbourhood graph by taking into account the possible batch-effects. the neighbourhood graph corrected by bbknn is then clustered by using the leiden algorithm and used to calculate a umap space. in order to assign the correct label to the obtained clusters, the marker genes are calculated by using the mann–whitney u test. finally, the annotated clusters can be visualised in both t-sne and umap space.
Figure: Results obtained on the PBMC datasets. (a) Boxplot showing the ARI values achieved by CCA, ComBat, PCA, MMDAE followed by Harmony with dimension ( , ), PCA followed by BBKNN, and PCA followed by Harmony on the PBMC datasets. (b) Boxplot showing the ARI values achieved by the best AE for each of the tested dimensions (H, L) of the hidden layer (H neurons) and latent space (L neurons). (c) UMAP visualisation of the cell types manually annotated in the original paper. (d) UMAP visualisation of the clusters identified by the Leiden algorithm using the resolution corresponding to the best ARI achieved by MMDAE followed by Harmony. p-value≤ . (****); . . (ns)

Cell types in the legend: B cells, CD + monocytes, CD T cells, CD T cells, dendritic cells, FCGR A+ monocytes, megakaryocytes, NK cells.
Figure: Boxplots showing the values of the calculated metrics using CCA, ComBat, PCA, MMDAE followed by Harmony with dimension ( , ), PCA followed by BBKNN, and PCA followed by Harmony, as well as by the best AE for each of the tested dimensions (H, L), analysing the PBMC datasets. (a) AMII achieved by the different strategies. (b) AMII achieved by the best AE for each of the tested dimensions. (c) FMS achieved by the different strategies. (d) FMS achieved by the best AE for each of the tested dimensions. (e) HS achieved by the different strategies. (f) HS achieved by the best AE for each of the tested dimensions. (g) CS achieved by the different strategies. (h) CS achieved by the best AE for each of the tested dimensions. (i) VM achieved by the different strategies. (j) VM achieved by the best AE for each of the tested dimensions. p-value≤ . (****); . . (ns)
Figure: UMAP visualisation showing the sample alignment performed by Harmony in the latent space obtained by MMDAE with dimension ( , ) for the PBMC datasets (a), by GMMMD with dimension ( , ) for the PIC datasets (b), and by MMDVAE with dimension ( , ) for the MCA datasets (c). Legends: (a) control cells, treated cells; (b) CEL-Seq, CEL-Seq , Fluidigm C , Smart-Seq ; (c) Microwell-Seq, Smart-Seq .

Figure: Results obtained on the PIC datasets. (a) Boxplot showing the ARI values achieved by CCA, ComBat, PCA, GMMMD followed by Harmony with dimension ( , ), PCA followed by BBKNN, and PCA followed by Harmony on the PIC datasets. (b) Boxplot showing the ARI values achieved by the best AE for each of the tested dimensions (H, L) of the hidden layer (H neurons) and latent space (L neurons). (c) UMAP visualisation of the cell types manually annotated in the original paper. (d) UMAP visualisation of the clusters identified by the Leiden algorithm using the resolution corresponding to the best ARI achieved by GMMMD followed by Harmony. p-value≤ . (****); . . (ns)

Cell types in the legend: acinar cells, activated stellate, alpha cells, beta cells, delta cells, ductal cells, endothelial cells, epsilon cells, gamma cells, macrophages, mast cells, quiescent stellate, Schwann cells.
Figure: Boxplots showing the values of the calculated metrics using CCA, ComBat, PCA, GMMMD followed by Harmony with dimension ( , ), PCA followed by BBKNN, and PCA followed by Harmony, as well as by the best AE for each of the tested dimensions (H, L), analysing the PIC datasets. (a) AMII achieved by the different strategies. (b) AMII achieved by the best AE for each of the tested dimensions. (c) FMS achieved by the different strategies. (d) FMS achieved by the best AE for each of the tested dimensions. (e) HS achieved by the different strategies. (f) HS achieved by the best AE for each of the tested dimensions. (g) CS achieved by the different strategies. (h) CS achieved by the best AE for each of the tested dimensions. (i) VM achieved by the different strategies. (j) VM achieved by the best AE for each of the tested dimensions. p-value≤ . (****); . . (ns)
Figure: Results obtained on the MCA datasets. (a) Boxplot showing the ARI values achieved by CCA, ComBat, PCA, MMDVAE followed by Harmony with dimension ( , ), PCA followed by BBKNN, and PCA followed by Harmony. (b) Boxplot showing the ARI values achieved by the best AE for each of the tested dimensions (H, L) of the hidden layer (H neurons) and latent space (L neurons). (c) UMAP visualisation of the cell types manually annotated in the original paper. (d) UMAP visualisation of the clusters identified by the Leiden algorithm using the resolution corresponding to the best ARI achieved by MMDVAE followed by Harmony. p-value≤ . (****); . . (ns)

Cell types in the legend: B cells, dendritic cells, endothelial cells, epithelial cells, macrophages, monocytes, NK cells, neutrophils, smooth-muscle cells, stromal cells, T cells.
Figure: Boxplots showing the values of the calculated metrics using CCA, ComBat, PCA, MMDVAE followed by Harmony with dimension ( , ), PCA followed by BBKNN, and PCA followed by Harmony, as well as by the best AE for each of the tested dimensions (H, L), analysing the MCA datasets. (a) AMII achieved by the different strategies. (b) AMII achieved by the best AE for each of the tested dimensions. (c) FMS achieved by the different strategies. (d) FMS achieved by the best AE for each of the tested dimensions. (e) HS achieved by the different strategies. (f) HS achieved by the best AE for each of the tested dimensions. (g) CS achieved by the different strategies. (h) CS achieved by the best AE for each of the tested dimensions. (i) VM achieved by the different strategies. (j) VM achieved by the best AE for each of the tested dimensions. p-value≤ . (****); . . (ns)

Multi-class cancer classification and biomarker identification using deep learning

Fariha Muazzam

Abstract

Genetic data is important for analysing cellular functions whose disruption gives rise to various kinds of cancer. Sequencing technology captures the intricacies of gene interaction in various kinds of data for cancer detection, but diagnosis, prognosis and treatment remain hard.
The advent of machine learning has helped researchers with supervised and unsupervised learning tasks as well as gene identification, but their usefulness has not been entirely satisfactory. This research addresses multi-class cancer classification, feature extraction and relevant gene identification through deep learning methods for different types of cancer, using RNA-seq data from The Cancer Genome Atlas. The study was constrained by hardware resource availability; within those constraints, the experiments performed have shown promising results. Stacked denoising autoencoders were used for feature extraction and biomarker identification, and 1D convolutional neural networks for classification. Classification was performed with the extracted features and with the relevant genes, which gave average performance of around % and % respectively. We were able to identify generic cancer-related pathways and their associated genes through the weight matrix and features generated by the stacked denoising autoencoders. The common pathways include the Wnt signalling pathway and angiogenesis. Moreover, some recurrent genes were observed across all pathways, namely pik c g, pcdhb and wnt a, and these genes were found in the literature to be involved in multiple types of cancer. The proposed approach shows superior performance and promise against traditional techniques used by the bioinformatics community, in terms of accuracy and relevant gene identification.

Keywords: cancer detection, cancer prevention, targeted therapy, precision medicine

Department of Computer Science, National University of Computer and Emerging Sciences
Correspondence: Fariha Muazzam (l @lhr.nu.edu.pk)

Introduction

Genes play an important role in the normal functioning of the human body's processes and physiology ( ). However, there is a degree of uncertainty associated with the molecular events that occur, which can alter routine processes.
Such changes in mechanism can lead to mutations or chromosomal rearrangements, which can be harmful or benign but are heavily associated with cancer causation ( ). Identifying the genes or groups of genes that drive cancerous cell formation provides a meaningful opportunity to detect cancer at an early stage or to halt its progression at a later stage ( ). Cancer is currently one of the leading diseases, causing . million deaths each year ( ). Cancer diagnosis and treatment remain a centre of attention for medical professionals and researchers everywhere. The development of high-throughput DNA sequencing technology has led to varied discoveries in the field of genomics, as mutation profiles, RNA expression and micro-RNA profiles can now be detected easily ( ). The importance of such genetic data lies in the fact that cancer diagnosis, progression and prognosis can be statistically analysed through machine learning algorithms. Furthermore, sub-networks of genes and individual biomarkers responsible for cancer can be singled out for precision medicine ( ) ( ).

Machine learning and deep learning techniques have been used extensively in domains such as image processing, natural language processing and audio recognition, and have shown great promise. In bioinformatics, however, the focus has long been on recognising subtypes or biomarkers through clustering algorithms. In the recent past, the focus has shifted towards classification of RNA-seq expression data through supervised learning algorithms. For somatic mutations, only very basic methods have been used for classification. Multi-class classification has also not really been explored, even though cross-cancer biomarker identification has been attempted. Machine learning algorithms ease two challenges associated with the study of genetic data: extraction of meaningful genes and classification of cancer.
Techniques like principal component analysis, k-means clustering and independent component analysis have been used to reduce dimensions, while k-nearest neighbours, random forest and support vector machines have been used for classification ( ) ( ). Owing to the availability of large datasets and computational resources, researchers have moved towards using deep learning algorithms in classification problems like object detection or image classification ( ). More recently, deep learning has been applied in bioinformatics to genetic data for drug discovery, gene regulation and protein classification, as huge datasets are accessible ( ). Hence, deep learning architectures have been applied to cancer detection based on gene expression or mutation profiles to improve classification accuracy and the identification of biomarkers. For cancer detection through gene expression, generative adversarial networks (GAN) ( ), stacked denoising autoencoders (SDA) ( ), artificial neural networks (ANN) ( ), discriminative deep belief networks (DDBN) ( ) and one-dimensional convolutional neural networks (1DCNN) ( ) have been used. Deep neural networks (DNN) have been used for heterogeneous classification of different types of cancer using somatic mutation profiles ( ). This research builds on the detection of different types of cancer from RNA-seq expression data using deep neural networks with dimensionality reduction ( ). RNA-seq expression data for breast cancer has been reduced using kernel principal component analysis (KPCA) and principal component analysis (PCA) and classified using SVM with linear and radial basis function kernels and ANN ( ). In this study, heterogeneous RNA-seq expression data is analysed with SDA for feature extraction and biomarker identification, and with DNN and 1DCNN for multi-class classification.
The purpose was to achieve high classification performance and extract meaningful genes for targeted therapy by exploring deep learning architectures that have not yet been tried on RNA-seq expression-based cancer classification. The following sections cover related work, the materials and methods, the results and the conclusions of the study.

Related work

Cancer detection from genetic data has been a challenging but important task for bioinformatics researchers. Owing to cheaper DNA sequencing technology, larger datasets are available for diagnosis, treatment and prognosis. Hence, various feature extraction and machine learning algorithms have been used for dimensionality reduction and classification over the years. Moreover, the field has moved from studying the effects of individual gene functions to gene networks, and the same networks can cause various diseases. With the advancement of sequencing technology, scientists have incorporated various forms of gene expression data in their studies, ranging from microarray expression to DNA sequencing ( ). In recent years, gene expression-based cancer research has shifted from microarrays to RNA-seq datasets. However, regardless of the data type, most of the techniques used for cancer detection or relevant gene identification have remained the same. Clustering analysis has been used to group significant genes together and aid accurate classification of samples. K-nearest neighbours has been used for quantifying the correlation between gene expressions for prostate cancer ( ) and, with varied distance measures, for classification of breast cancer ( ). K-means clustering has also been used for classification based on driver genes identified using wavelet transforms for colon and leukemia samples ( ). Hierarchical clustering has been utilized to classify subtypes of breast cancer data ( ) and cancer data with reduced dimensionality ( ).
Apart from clustering, SVMs have been used extensively for classification of gene expression profiles for different kinds of cancer. Multi-category SVMs aided the subtype classification of a leukaemia dataset to a great extent ( ). Network algorithms have also been used to identify networks of genes contributing to the propagation of multiple types of cancer ( ). As numerous machine learning algorithms have been developed, researchers have explored their usefulness for cancer diagnosis and biomarker identification. With the advent of deep learning methods, there has been a clear inclination towards using them for dimensionality reduction as well as classification.

Gupta et al. ( ) used a deep architecture to learn a meaningful representation of gene expression data from the yeast cell cycle. Clusters of genes evaluated from the raw input were already labelled and were compared with the clustering of the output of SDA. PCA was also tested on the gene expression profiles and evaluated with the aforementioned clustering algorithms. The results reveal that SDA captures gene co-expression better than PCA. Danaee et al. ( ) focused their research on extracting deeply connected genes from RNA-seq expression data of breast cancer using SDA. PCA and KPCA were used as comparative techniques to measure the efficacy of SDA. Apart from reducing dimensionality, they analysed the weight matrix of the SDA to identify contributing genes, which were termed deeply connected genes (DCGs). PANTHER pathways were used to analyse the functions corresponding to different genes and tumor suppressor genes. Bhat et al. ( ) experimented with generative adversarial deep convolutional networks to accurately classify gene expression-based datasets of two types of cancer: breast cancer and prostate cancer. Karabulut et al. ( ) demonstrated the efficiency of DDBN for the classification of cancer compared with traditional models like SVM.
Experiments were performed on three different types of cancer individually: laryngeal, colorectal and bladder. For comparison, SVM, random forest and k-NN were applied to all datasets. Results revealed that DDBN outperformed all the aforementioned classifiers. Liu et al. ( ) focused their research on discriminating tumor samples from normal ones. They proposed a sample expansion method inspired by SAE and SDA to enlarge the training set, and proposed 1DCNN for tumor classification. It takes its input as a one-dimensional vector instead of the two-dimensional input traditionally used for image classification. The performance of 1DCNN was better than that of SAE on each dataset. Teixeira et al. ( ) worked on singling out the most informative genes using SDA for classification of thyroid cancer using ANN. They used traditional methods like PCA and kernel PCA for comparison with the deep learning method for feature extraction. The output of SDA was analysed by extracting the weight matrix and using the connected weights method, and three groups of genes with inter-related functions were discovered.

Hence, the effectiveness of deep learning models for feature extraction and relevant gene identification is evident, especially as the field moves towards precision medicine. Multi-class classification and biomarker identification are therefore the current focus, and researchers have been experimenting with deep learning for that reason. Deep learning has become popular for classification problems involving larger datasets and for feature extraction in a wide variety of fields, and more recently in bioinformatics too.

Material and method

Acquisition of data

Gene expression datasets have been the most widely used for anomaly classification, as mentioned before. The dataset for this study was assembled from portals supported by The Cancer Genome Atlas (TCGA).
• RNA-seq expressions
The TCGA portal provides gene expression data in the form of read counts as well as normalized expressions for different types of cancer. For multi-class classification, each dataset must contain the same genes; this is ensured because the samples are sequenced with the same technology and preprocessed with the same techniques. The Broad Institute GDAC portal provides RNA-seq expression datasets in raw form as well as in RSEM-normalized form. For this research, the Illumina HiSeq RSEM-normalized dataset was used, as seen in Table .

Table : TCGA multi-class cancer dataset

Cancer type | No. of cancerous samples
Breast invasive carcinoma (BRCA) |
Adrenocortical carcinoma (ACC) |
Cervical and endocervical cancer (CESC) |
Head and neck squamous carcinoma (HNSC) |
Kidney renal papillary cell carcinoma (KIRP) |
Brain lower grade glioma (LGG) |
Lung adenocarcinoma (LUAD) |
Pancreatic adenocarcinoma (PAAD) |
Prostate adenocarcinoma (PRAD) |
Stomach adenocarcinoma (STAD) |
Uterine carcinosarcoma (UCS) |
Bladder urothelial carcinoma (BLCA) |

Dataset split

The datasets for the types were combined into a single dataset, with each sample given a label corresponding to its type of cancer. The labels were numbers between - , where each number corresponds to a specific cancer type. The dataset contained around samples for types of cancer and was split into training, validation and test sets, with a percentage split of %, % and % respectively. Because there was an apparent class imbalance among the different types, the division was kept proportionate per class; that is, each type was divided into the three sets with the aforementioned percentage split.

Preprocessing

The gene values were normalized, and genes with zero values across all samples were removed, as they would not contribute to the results.

SDA

The experiments used the output of SDA as input to 1DCNN for classification of cancer types. SDA was trained through greedy layer-wise training, where each layer is trained for a specific number of iterations and the output of the preceding layer is used as input to the succeeding layer. The number of hidden units per layer was decreased gradually because this is known to incorporate the features better. Five experiments were performed; they produced five high-ranked gene sets and reduced feature sets. The output of SDA was used in two ways:

• Using reduced features
The output of the final layer of SDA was the reduced features of the dataset. These features were stored for each sample of the training dataset after the desired iterations were performed. The final weights of each layer were also stored so that they could be used to reduce the features of the test and validation datasets.

• Using high-ranked genes
Analysis of the final weight matrix shows that the weights of the genes are normally distributed. A small portion of genes had high weights and were regarded as high-weight genes. These genes were filtered in the training and testing datasets so as to reduce the number of features, as done in ( ). The weight matrices of the layers were multiplied to generate a (number of genes) × (number of features) matrix:

\[ M_{weight} = \prod_{i=1}^{n} W_i \]

where W_i is the weight matrix of layer i and n is the number of layers. For each node, the mean weight and standard deviation were calculated, and genes were ranked by keeping those whose weights lie outside a specific number of standard deviations from the mean:

\[ G = \{\, g : w_g < \mathrm{mean} - n_{std} \cdot \mathrm{std} \ \text{ or } \ w_g > \mathrm{mean} + n_{std} \cdot \mathrm{std} \,\} \]

where w_g is the weight of gene g for that node and n_{std} is the chosen number of standard deviations.

1DCNN

The reduced features extracted using SDA were fed to 1DCNN for classification. The overall accuracy of the system determines whether the extracted features were of any significance.

Biomarker identification

For biomarker identification, high-ranked gene sets were generated for different SDA architectures and their relevant pathways were identified from the PANTHER database.
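The high-ranked gene selection described above (multiplying the layer weight matrices, then keeping genes whose weights fall outside a band around the per-node mean) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the study's code: the weight matrices are synthetic placeholders, the function name `high_ranked_genes` is hypothetical, and a gene is kept if it is an outlier for any output node, because the text does not state how the per-node results were combined.

```python
import numpy as np


def high_ranked_genes(layer_weights, n_std):
    """Return indices of genes whose combined weight is an outlier for any node.

    layer_weights is a list [W_1, ..., W_n]; W_1 has shape (genes, h_1)
    and each following W_i has shape (h_{i-1}, h_i).
    """
    # Multiply the layer matrices to get a (genes x features) matrix.
    m_weight = layer_weights[0]
    for w in layer_weights[1:]:
        m_weight = m_weight @ w

    # Per-node (per-column) mean and standard deviation of the gene weights.
    mean = m_weight.mean(axis=0)
    std = m_weight.std(axis=0)

    # A gene is high-ranked if it lies outside mean +/- n_std * std
    # for at least one node (assumption, see lead-in).
    outside = (m_weight < mean - n_std * std) | (m_weight > mean + n_std * std)
    return np.flatnonzero(outside.any(axis=1))


# Synthetic two-layer example: 200 genes -> 16 hidden units -> 4 features.
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(200, 16))
w2 = rng.uniform(0.5, 1.5, size=(16, 4))
w1[7] = 5.0  # give gene 7 unusually large weights
genes = high_ranked_genes([w1, w2], n_std=3.0)
```

Raising `n_std` narrows the selection to more extreme genes, which matches the trade-off in the text between the number of genes kept and classification performance.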
Overlapping pathways and genes were analysed amongst all sets, and quite a few overlapped. The overlapping genes were checked against the literature to confirm whether they are cross-cancer genes, in which case they were identified as biomarkers.

Figure : Workflow

Results

This study focused on the RNA-seq expression dataset only, owing to its availability; however, it is reasonable to expect that the pipeline could be useful for other types of datasets as well.

SDA

As mentioned in the previous section, the output of SDA was used as reduced features and as high-ranked genes based on the weight matrix. SDAs with different numbers of layers and different numbers of hidden units were trained. As per the literature, if the number of hidden units is decreased gradually, SDA better incorporates the features for reconstruction. The original number of genes was ; removing the genes with zero value across all samples left a total of genes. The hidden units ranged between - for the whole architecture, but the first two layers contained fixed numbers of units, and respectively. Only the third-last and last layers were changed between experiments. Most experiments were conducted with and layers, as these gave higher accuracy. For reduced features, the best results were obtained when the reconstruction layer contained a higher number of units. The features were tested by using 1DCNNs for classification. However, the accuracy plateaued at features, at around . %. The graph in Fig shows the accuracy achieved with 1DCNNs for varied numbers of layers and reduced features. The experiments included in this graph are those with and layers, in which the first and second layers contained fixed and units respectively.
Workflow steps shown in the figure: RNA-seq expressions; normalization; dimensionality reduction/feature extraction; classification; high-ranked genes; biomarker identification.

Figure : Accuracy with linear combination of reduced features

High-ranked genes

The weight matrix of each layer of SDA was used to rank genes based on the combination of their weights. As per the literature, genes with higher weights tend to act as contributing genes towards cancer. As per ( ), the weight matrix of SDA follows an approximately normal distribution, and the genes with highly negative or highly positive weights are the significant ones. The genes far from the mean weight were therefore categorized as high-ranked, and the number of standard deviations from the mean was used to identify the relevant genes. Owing to limited resources, the experiments could only be performed within a restricted range; nevertheless, they performed well in terms of relevant gene identification. It was observed that the genes far from the mean were indeed the relevant ones. The genes that overlapped amongst different SDA architectures were considered to be cross-cancer relevant genes. Since the aim of this research was to achieve maximum performance with minimal genes, it is worth noting that architectures within the range of - features give better performance within - standard deviations. Four genes were found to be common amongst all pathways for all sets across all standard deviations, so evidence of their involvement in multiple cancer types was sought in the literature. The literature supports the relevance of the identified genes, as seen in Table .
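The overlap step described above amounts to intersecting the high-ranked gene sets produced by the different SDA architectures. A minimal sketch, assuming each architecture yields a set of gene symbols; the gene names and the helper `overlapping_genes` are hypothetical placeholders, not results from the study.

```python
from functools import reduce


def overlapping_genes(gene_sets):
    """Return the genes that appear in every high-ranked gene set."""
    if not gene_sets:
        return set()
    return reduce(set.intersection, (set(s) for s in gene_sets))


# Placeholder gene symbols, one set per (hypothetical) SDA architecture.
sets_by_architecture = [
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_B", "GENE_C", "GENE_D"},
    {"GENE_C", "GENE_B", "GENE_E"},
]
common = overlapping_genes(sets_by_architecture)
```

Genes that survive the intersection are the candidates that would then be checked against the literature.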
Table : Relevance of identified genes in literature

Genes | Cancer types
wnt a | BRCA, LUAD, BLCA, PRAD, PAAD
pik c g | BRCA, BLCA, HNSC

Apart from that, there are pathways found to be common among the overlapping genes for different standard deviations; two of them, the Wnt pathway and angiogenesis, are also found in all sets of experiment-generated genes for all standard deviations. The genes associated with these pathways are also similar to those found in the experiment-generated gene sets. The following figure shows how standard deviations between - relate to pathways and overlapping genes, and the scope for meaningful analysis.

Figure : Pathway hits against different standard deviations

The following tables show the summarized results for reduced features and high-ranked genes in comparison with other similar studies.

Table : Summarized results for reduced features

Paper | Classification | Mean per-class accuracy
Danaee et al. ( ) | Breast cancer | .
Proposed | Multi-class ( types including breast cancer) | .

Table : Summarized results for high-ranked genes

Paper | Classification | High-ranked genes | Mean per-class accuracy | Pathway hits
Danaee et al. ( ) | Breast cancer | | . |
Proposed | Multi-class ( types including breast cancer) | | . |

Conclusion

This study aimed to classify types of cancer and identify relevant genes, and the results show that the proposed approach is promising for this task. Using SDA with 1DCNN yielded an average accuracy of % for reduced features and % for high-ranked genes. This shows that relevant gene sets could help with the cancer classification task as well as with cross-cancer gene and pathway identification. Using the PANTHER database, we were able to identify cancer-relevant pathways and genes for the sets generated by the different experiments. The genes common amongst all experiments were verified in the literature as being involved in multiple cancers.
This shows that our method can be used for multi-class or single-class cancer classification and for recognizing the relevant genes as biomarkers. It also offers a way to identify genes that have not yet been explored in the literature. The PANTHER database is used by the bioinformatics community to study the origin, families and relevance of genes with respect to a single type or varied types of cancer. That involves a lot of manual analysis, but deep learning reduces the load by pointing to relevant genes and pathways or by identifying newer pathways and genes. Hardware resources constrained the study, but the reliability and significance of automating classification and identification with deep learning were still demonstrated. More experiments would reveal further avenues for cancer study through deep learning. Furthermore, using more types of cancer would aid in identifying larger sets of cross-cancer biomarkers and pathways. This study is a step towards showing the relevance of automated gene identification techniques that are reliable and can handle large amounts of variation, unknowns and ambiguities, whereas traditional statistical techniques for genes involve thresholding that depends on the samples and the genes involved. Even though the study was limited in terms of GPU hours, it still provided good results.

Additional information

Ethics approval
This is an original study performed using the open-source TCGA dataset, and there is no violation of rights and obligations in the usage of the dataset.

Data availability
The data was downloaded from the Broad Institute Firehose database (https://gdac.broadinstitute.org/).

Conflict of interest
There is no conflict of interest with regard to the authors' contributions.

Funding
The project was completed using the first author's own funds. No external funding was involved.

Author's contributions
The project was implemented and the paper was written by the first author.
the second author provided guidance in forming the workflow and methodology of the study.

acknowledgements
this study would not have been possible without the guidance and support of my supervisor dr. saira karim.

references
gupta a, wang h, ganapathiraju m. learning structure in gene expression data using deep architectures, with an application to gene clustering. in: bioinformatics and biomedicine (bibm), ieee international conference on. . p. – .
yuan y, shi y, li c, kim j, cai w, han z, et al. deepgene: an advanced cancer type classifier based on deep learning and somatic point mutations. bmc bioinformatics [internet]. ; (suppl ). available from: http://dx.doi.org/ . /s - - -
fawzy h, kamel m, al-amodi hsab. exploitation of gene expression and cancer biomarkers in paving the path to era of personalized medicine. genomics proteomics bioinformatics [internet]. ; ( ): – . available from: http://dx.doi.org/ . /j.gpb. . .
lee y, lee c-k. classification of multiple cancer types by multicategory support vector machines using gene expression data. bioinformatics. ; ( ): – .
danaee p, ghaeini r, hendrix da. a deep learning approach for cancer detection and relevant gene identification. in: pacific symposium on biocomputing . . p. – .
min s, lee b, yoon s. deep learning in bioinformatics. brief bioinform. ; ( ): – .
bhat rr, viswanath v, li x. deepcancer: detecting cancer through gene expressions via deep generative learning. (ml).
karabulut em. discriminative deep belief networks for microarray based cancer classification. ; ( ): – .
liu j, wang x, cheng y, zhang l. tumor gene expression data classification via sample expansion-based deep learning. oncotarget. ; ( ): .
wang z, gerstein m, snyder m. rna-seq: a revolutionary tool for transcriptomics. nat rev genet. ; ( ): .
singh d, febbo pg, ross k, jackson dg, manola j, ladd c, et al. gene expression correlates of clinical prostate cancer behavior. ; (march): – .
rules c, medjahed sa. breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules. ;(january).
mishra p, bhoi n, meher j. effective clustering of microarray gene expression data using signal processing and soft computing methods. in: electrical, electronics, signals, communication and optimization (eesco), international conference on. . p. – .
woodward wa, krishnamurthy s, yamauchi h, el-zein r, ogura d, kitadai e, et al. genomic and expression analysis of microdissected inflammatory breast cancer. breast cancer res treat. ; ( ): – .
khan j, wei js, ringner m, saal lh, ladanyi m, westermann f, et al. classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. nat med. ; ( ): .
martinez-ledesma e, verhaak rgw, treviño v. identification of a multi-cancer gene expression biomarker for cancer clinical outcomes using a network-based algorithm. sci rep. ; : .
teixeira v, camacho r, ferreira pg. learning influential genes on cancer gene expression data with stacked denoising autoencoders. in: bioinformatics and biomedicine (bibm), ieee international conference on. . p. – .
microbe-host ai. adage-based integration of publicly available pseudomonas aeruginosa gene interactions. ( ): – .
braune e-b, seshire a, lendahl u. notch and wnt dysregulation and its relevance for breast cancer and tumor initiation. biomedicines. ; ( ): . doi: . /biomedicines
tammela t, sanchez-rivera fj, cetinbas nm, et al. a wnt-producing niche drives proliferative potential and progression in lung adenocarcinoma. nature. ; ( ): - . doi: . /nature
zhang m, li h, zou d, gao j. ruguo key genes and tumor driving factors identification of bladder cancer based on the rna-seq profile. onco targets ther. ; : - . doi: . /ott.s
ahmad i, sansom oj. role of wnt signalling in advanced prostate cancer. j pathol. ; ( ): - . doi: . /path.
fakhar m, najumuddin, gul m, rashid s. antagonistic role of klotho-derived peptides dynamics in the pancreatic cancer treatment through obstructing wnt- and frizzled binding. biophys chem. ; (june): - . doi: . /j.bpc. . .
fidalgo f, rodrigues tc, pinilla m, et al. lymphovascular invasion and histologic grade are associated with specific genomic profiles in invasive carcinomas of the breast. tumor biol. ; ( ): - . doi: . /s - - -z
tilley sk, kim wy, fry rc. analysis of bladder cancer tumor cpg methylation and gene expression within the cancer genome atlas identifies gria as a prognostic biomarker for basal-like bladder cancer. am j cancer res. ; ( ): - .
simpson dr, mell lk, cohen eew. targeting the pi k/akt/mtor pathway in squamous cell carcinoma of the head and neck. oral oncol. ; ( ): - . doi: . /j.oraloncology. . .

creating clear and informative image-based figures for scientific publications

helena jambor*, alberto antonietti*, bradly alicea, tracy l. audisio, susann auer, vivek bhardwaj, steven j. burgess, iuliia ferling, małgorzata anna gazda, luke h. hoeppner, vinodh ilangovan, hung lo, mischa olson, salem yousef mohamed, sarvenaz sarabipour, aalok varma, kaivalya walavalkar, erin m. wissink, tracey l.
weissgerber *co-first authors mildred-scheel early career center, medical faculty, technische universität dresden, germany department of electronics, information and bioengineering, politecnico di milano, italy; department of brain and behavioral sciences, university of pavia, pavia, italy orthogonal research and education laboratory, champaign, illinois, united states evolutionary genomics unit, okinawa institute of science and technology, okinawa, japan department of plant physiology, faculty of biology, technische universität dresden, dresden, germany max plank institute of immunology and epigenetics, freiburg, germany hubrecht institute, utrecht, the netherlands carl r woese institute for genomic biology, university of illinois at urbana- champaign, urbana, illinois, united states junior research group evolution of microbial interactions, leibniz institute for natural product research and infection biology - hans knöll institute (hki), jena, germany cibio/inbio, centro de investigação em biodiversidade e recursos genéticos, campus agrário de vairão, universidade do porto, - vairão, portugal departamento de biologia, faculdade de ciências, universidade do porto, porto, portugal the hormel institute, university of minnesota, austin, mn, usa; the masonic cancer center, university of minnesota, minneapolis, mn, united states aarhus university, denmark neuroscience research center, charité - universitätsmedizin berlin, corporate member of freie universität berlin, humboldt - universität zu berlin, and berlin institute of health, berlin, germany einstein center for neurosciences berlin, berlin, germany section of plant biology, school of integrative plant science, cornell university, ithaca, ny, united states gastroenterology and hepatology unit, internal medicine department, faculty of medicine, university of zagazig, egypt institute for computational medicine and the department of biomedical engineering, johns hopkins university, united states national centre for 
biological sciences (ncbs), tata institute of fundamental research (tifr), bangalore, karnataka, india department of molecular biology and genetics, cornell university, ithaca, ny, united states quest – quality | ethics | open science | translation, charité - universitätsmedizin berlin, berlin institute of health (bih), germany .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / address for correspondence: tracey weissgerber, tracey.weissgerber@charite.de, quest – quality | ethics | open science | translation, charité – universitätsmedizin berlin, berlin institute of health, berlin, germany abstract scientists routinely use images to display data. readers often examine figures first; therefore, it is important that figures are accessible to a broad audience. many resources discuss fraudulent image manipulation and technical specifications for image acquisition; however, data on the legibility and interpretability of images are scarce. we systematically examined these factors in non-blot images published in the top journals in three fields; plant sciences, cell biology and physiology (n= papers). common problems included missing scale bars, misplaced or poorly marked insets, images or labels that were not accessible to colorblind readers, and insufficient explanations of colors, labels, annotations, or the species and tissue or object depicted in the image. papers that met all good practice criteria examined for all image-based figures were uncommon (physiology %, cell biology %, plant sciences %). we present detailed descriptions and visual examples to help scientists avoid common pitfalls when publishing images. 
our recommendations address image magnification, scale information, insets, annotation, and color and may encourage discussion about quality standards for bioimage publishing.

keywords: microscopy; imaging; images; photographs; colorblind; transparency; good bioimaging practices

introduction

images are often used to share scientific data, providing the visual evidence needed to turn concepts and hypotheses into observable findings. an analysis of million images from more than , papers deposited in pubmed central revealed that . % of figures were “photographs”, a category that included microscope images, diagnostic images, radiology images and fluorescence images. cell biology was one of the most visually intensive fields, with publications containing an average of approximately . photographs per page. plant sciences papers included approximately . photographs per page. while there are many resources on fraudulent image manipulation and technical requirements for image acquisition and publishing, data examining the quality of reporting and ease of interpretation for image-based figures are scarce. recent evidence suggests that important methodological details about image acquisition are often missing. researchers generally receive little or no training in designing figures; yet many scientists and editors report that figures and tables are one of the first elements that they examine when reading a paper. when scientists and journals share papers on social media, posts often include figures to attract interest.
the pubmed search engine caters to scientists’ desire to see the data by presenting thumbnail images of all figures in the paper just below the abstract. readers can click on each image to examine the figure, without ever accessing the paper or seeing the introduction or methods. embo’s source data tool (rrid:scr_ ) allows scientists and publishers to share or explore figures, as well as the underlying data, in a findable and machine readable fashion. image-based figures in publications are generally intended for a wide audience. this may include scientists in the same or related fields, editors, patients, educators and grants officers. general recommendations emphasize that authors should design figures for their audience rather than themselves, and that figures should be self-explanatory. despite this, figures in papers outside one’s immediate area of expertise are often difficult to interpret, marking a missed opportunity to make the research accessible to a wide audience. stringent quality standards would also make image data more reproducible. a recent study of fmri image data, for example, revealed that incomplete documentation and presentation of brain images led to non-reproducible results. , here, we examined the quality of reporting and accessibility of image-based figures among papers published in top journals in plant sciences, cell biology and physiology. factors assessed include the use of scale bars, explanations of symbols and labels, clear and accurate inset markings, and transparent reporting of the object or species and tissue shown in the figure. we also examined whether images and labels were accessible to readers with the most common form of color blindness. based on our results, we provide targeted recommendations about how scientists can create informative image-based figures that are accessible to a broad audience. these recommendations may also be used to establish quality standards for images deposited in emerging image data repositories. 
results

using a science of science approach to investigate current practices: this study was conducted as part of a participant-guided learn-by-doing course, in which elife community ambassadors from around the world worked together to design, complete, and publish a meta-research study. participants in the ambassadors program designed the study, developed screening and abstraction protocols, and screened papers to identify eligible articles (hj, ba, sjb, vb, lhh, vi, ss, emw). participants in the ambassadors program refined the data abstraction protocol, completed data abstraction and analysis, and prepared the figures and manuscript (aa, sa, tla, if, mag, hl, sym, mo, av, kw, hj, tlw). to investigate current practices in image publishing, we selected three diverse fields of biology to increase generalizability. for each field, we examined papers published in april in the top journals, which publish original research (table s , table s , table s ). all full-length original research articles that contained at least one photograph, microscope image, electron microscope image, or clinical image (mri, ultrasound, x-ray, etc.) were included in the analysis (figure s ). blots and computer-generated images were excluded, as some of the criteria assessed do not apply to these types of images. two independent reviewers assessed each paper, according to the detailed data abstraction protocol (see methods and information deposited on the open science framework (rrid:scr_ ) at https://osf.io/b /). the repository also includes data, code and figures.
image analysis: first, we confirmed that images are common in the three biology subfields analyzed. more than half of the original research articles in the sample contained images (plant science: %, cell biology: %, physiology: %). among the papers that included images, microscope images were very common in all three fields ( to %, figure a). photographs were very common in plant sciences ( %), but less widespread in cell biology ( %) and physiology ( %). electron microscope images were less common in all three fields ( to %). clinical images, such as x- rays, mri or ultrasound, and other types of images were rare ( to %). scale information is essential to interpret biological images. approximately half of papers in physiology ( %) and cell biology ( %), and % of plant science papers provided scale bars with dimensions (in the figure or legend) for all images in the paper (figure b, table s ). approximately one-third of papers in each field contained incomplete scale information, such as reporting magnification or presenting scale information for a subset of images. twenty-four percent of physiology papers, % of cell biology papers, and % of plant sciences papers contained no scale information on any image. some publications use insets to show the same image at two different scales (cell biology papers: %, physiology: %, plant sciences: %). in this case, the authors should indicate the position of the high-magnification inset in the low-magnification image. the majority of papers in all three fields clearly and accurately marked the location of all insets ( to %, figure c left panel), however one-fifth of papers appeared to have marked the location of at least one inset incorrectly ( to %). clearly visible inset markings were missing for some or all insets in to % of papers (figure c left panel). 
approximately half of papers ( to %, figure c right panel) provided legend explanations or markings on the figure to clearly show that an inset was used, whereas this information was missing for some or all insets in the remaining papers.

figure : image types and reporting of scale information and insets
a: microscope images and photographs were common, whereas other types of images were used less frequently.
b: complete scale information was missing in more than half of the papers examined. partial scale information indicates that scale information was presented in some figures, but not others, or that the authors reported magnification rather than including scale bars on the image.
c: problems with labeling and describing insets are common. totals may not be exactly % due to rounding.

many images contain information in color. we sought to determine whether color images were accessible to readers with deuteranopia, the most common form of color blindness, by using the color blindness simulator color oracle (https://colororacle.org/, rrid:scr_ ). we evaluated only images in which the authors selected the image colors (e.g. fluorescence microscopy).
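as a rough illustration of what a deuteranopia simulation does, the following python sketch transforms colors and compares how far apart two hues remain afterwards. it is not the algorithm used by color oracle. it assumes the full-severity deuteranopia matrix from the model of machado et al. (2009), applied in linear rgb; the matrix values should be checked against the published table before use, and the function names and the chromaticity-distance heuristic are hypothetical, introduced only for this example.

```python
# Illustrative sketch only: this is NOT Color Oracle's algorithm.
# ASSUMPTION: deuteranopia matrix (severity 1.0) from the Machado et al. (2009)
# model, applied to linear RGB. Verify the values against the published table.
DEUTERANOPIA = (
    (0.367322, 0.860646, -0.227968),
    (0.280085, 0.672501, 0.047413),
    (-0.011820, 0.042940, 0.968881),
)


def srgb_to_linear(c8):
    """Convert one 8-bit sRGB channel to linear light in [0, 1]."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def simulate_deuteranopia(rgb):
    """Return the simulated color as linear RGB, clipped to [0, 1]."""
    lin = [srgb_to_linear(c) for c in rgb]
    out = []
    for row in DEUTERANOPIA:
        value = sum(w * c for w, c in zip(row, lin))
        out.append(min(1.0, max(0.0, value)))
    return tuple(out)


def chromaticity_distance(rgb_a, rgb_b):
    """Hypothetical heuristic: distance between the brightness-normalized
    simulated colors, so that only the perceived hue difference counts."""
    def chroma(rgb):
        sim = simulate_deuteranopia(rgb)
        total = sum(sim) or 1.0  # avoid dividing by zero for black
        return [c / total for c in sim]

    a, b = chroma(rgb_a), chroma(rgb_b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


RED, GREEN, MAGENTA = (255, 0, 0), (0, 255, 0), (255, 0, 255)
red_green = chromaticity_distance(RED, GREEN)
green_magenta = chromaticity_distance(GREEN, MAGENTA)
print(f"red vs green:     {red_green:.3f}")
print(f"green vs magenta: {green_magenta:.3f}")
```

under these assumptions, red and green collapse to nearly the same hue after simulation, whereas green and magenta remain clearly separated. a dedicated simulator should still be used to judge real figures.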
papers without any colorblind accessible figures were uncommon ( to %); however, % of cell biology papers and - % of physiology and plant science papers contained some images that were inaccessible to readers with deuteranopia (figure a). to % of papers contained color annotations that were not visible to someone with deuteranopia.

figure legends and, less often, titles typically provide essential information needed to interpret an image. this text provides information on the specimen and details of the image, while also explaining labels and annotations used to highlight structures or colors. % of physiology papers, % of cell biology papers and % of plant papers described the species and tissue or object shown completely. - % of papers did not provide any such information (figure b). approximately half of the papers ( - %, figure c, right panel) also failed or partially failed to adequately explain that insets were used. annotations of structures were explained better. two-thirds of papers across all three fields clearly stated the meaning of all image labels, while to % of papers provided partial explanations. most papers ( to %) completely explained the image colors by stating what substance each color represented or naming the dyes or staining technique used.

finally, we examined the number of papers that used optimal image presentation practices for all criteria assessed in the study. twenty-eight ( %) physiology papers, ( %) cell biology papers and ( %) plant sciences papers met all criteria for all image-based figures in the paper (data not shown in figure). in plant sciences and physiology, the most common problems were with scale bars, insets and specifying in the legend the species and tissue or object shown. in cell biology, the most common problems were with insets, colorblind accessibility, and specifying in the legend the species and tissue or object shown.

figure : use of color and annotations in image-based figures
a: while many authors are using colors and labels that are visible to colorblind readers, the data show that improvement is needed.
b: most papers explain colors in image-based figures, however, explanations are less common for the species and tissue or object shown, and labels and annotations. totals may not be exactly % due to rounding.

designing image-based figures: how can we improve?

our results obtained by examining papers from three fields provide us with unique insights into the quality of reporting and the accessibility of image-based figures. our quantitative description of standard practices in image publication highlights opportunities to improve transparency and accessibility to readers from different backgrounds. we have therefore outlined specific actions that scientists can take when creating images, designing multipanel figures, annotating figures and preparing figure legends. throughout the paper, we provide visual examples to illustrate each stage of the figure preparation process. other elements are often omitted to focus readers’ attention on the step illustrated in the figure.
for example, a figure that highlights best practices for displaying scale bars may not include annotations designed to explain key features of the image. when preparing image-based figures in scientific publications, readers should address all relevant steps in each figure. all steps described below (image cropping and insets, adding scale bars and annotation, choosing color channel appearances, figure panel layout) can be implemented with standard image processing software such as fiji (rrid:scr_ ) and imagej (rrid:scr_ ), which are open source, free programs for bio-image analysis. a quick guide on how to do basic image processing for publications with fiji is available in a recent cheat sheet publication, and a discussion forum and wiki are available for fiji and imagej (https://imagej.net/).

choose a scale or magnification that fits your research question

scientists should select an image scale or magnification that allows readers to clearly see features needed to answer the research question. figure a shows drosophila melanogaster at three different microscopic scales. the first focuses on the ovary tissue and might be used to illustrate the appearance of the tissue or show stages of development. the second focuses on a group of cells. in this example, the “egg chamber” cells show different nucleic acid distributions. the third example focuses on subcellular details in one cell, for example, to show finer detail of rna granules or organelle shape.

figure : selecting magnification and using insets
a. magnification and display detail of images should permit readers to see features related to the main message that the image is intended to convey. this may be the organism, tissue, cell, or a subcellular level. microscope images show d. melanogaster ovary (a ), ovarian egg chamber cells (a ), and a detail in egg chamber cell nuclei (a ).
b. insets or zoomed-in areas are useful when two different scales are needed to allow readers to see essential features. it is critical to indicate the origin of the inset in the full-scale image. poor and clear examples are shown. example images were created based on problems observed by reviewers. images show b , b , b , b : protostelium aurantium amoeba fed on germlings of aspergillus fumigatus d -gfp (green) fungal hyphae, dead fungal material stained with propidium iodide (red), and acidic compartments of amoeba marked with lysotracker blue dnd- dye (blue); b : lendrum-stained human lung tissue (haraszti, public health image library); b : fossilized orobates pabsti.

when both low and high magnifications are necessary for one image, insets are used to show a small portion of the image at higher magnification (figure b). the inset location must be accurately marked in the low magnification image. we observed that the inset position in the low magnification image was missing, unclear, or incorrectly placed in approximately one third of papers. inset positions should be clearly marked by lines or regions-of-interest in a high-contrast color, usually black or white. insets may also be explained in the figure legend.
care must be taken when preparing figures outside vector graphics suites, as inset positions may move during file saving or export.

include a clearly labeled scale bar

scale information allows audiences to quickly understand the size of features shown in images. this is especially important for microscopic images where we have no intuitive understanding of scale. scale information for photographs should be considered when capturing images, as rulers are often placed in the frame. our analysis revealed that - % of papers screened failed to provide any scale information and another third only provided incomplete scale information (figure b). scientists should consider the following points when displaying scale bars:
• every image type needs a scale bar: authors usually add scale bars to microscope images, but often leave them out in photos and clinical images, possibly because these depict familiar objects such as a human or plant. missing scale bars, however, adversely affect reproducibility. a size difference of % between a published study and the reader’s lab animals, for example, could impact study results by leading to an important difference in phenotype. providing scale bars allows scientists to detect such discrepancies and may affect their interpretation of published work. scale bars may not be a standard feature of image acquisition and processing software for clinical images. authors may need to contact device manufacturers to determine the image size and add height and width labels.
• scale bars and labels should be clearly visible: short scale bars, thin scale bars and scale bars in colors that are similar to the image color can easily be overlooked (figure ). in multicolor images, it can be difficult to find a color that makes the scale bar stand out. authors can solve this problem by placing the scale bar outside the image or onto a box with a more suitable background color.
• annotate scale bar dimensions on the image: stating the dimensions along with the scale bar allows readers to interpret the image more quickly. despite this, dimensions were typically stated in the legend instead (figure b), possibly a legacy of printing processes that discouraged text in images. dimensions should be in high resolution and large enough to be legible. in our set, we came across small and/or low-resolution annotations that were illegible in electronic versions of the paper, even after zooming in. scale bars that are visible on larger figures produced by authors may be difficult to read when the size of the figure is reduced to fit onto a journal page. authors should carefully check page proofs to ensure that scale bars and dimensions are clearly visible.

figure : using scale bars to annotate image size
scale bars provide essential information about the size of objects, which orients readers and helps them to bridge the gap between the image and reality. scales may be indicated by a known size indicator such as a human next to a tree, a coin next to a rock, or a tape measure next to a smaller structure. in microscope images, a bar of known length is included. example images were created based on problems observed by reviewers. poor scale bar examples ( - bottom), clear scale bar examples ( - ). images , , : microscope images of d. melanogaster nurse cell nuclei; : microscope image of dictyostelium discoideum (see figure ); , , , : electron microscope image of mouse pancreatic beta-islet cells (andreas müller); , : microscope image of lendrum-stained human lung tissue (haraszti, public health image library); : photo of arabidopsis thaliana; : photograph of fossilized orobates pabsti.

use color wisely in images

colors in images are used to display the natural appearance of an object, or to visualize features with dyes and stains. in the scientific context, adapting colors is possible and may enhance readers’ understanding, while poor color schemes may distract or mislead. images showing the natural appearance of a subject, specimen or staining technique (e.g. images showing plant size and appearance, or histopathology images of fat tissue from mice on different diets) are generally presented in color (figure ). electron microscope images are captured in black and white (“grayscale”) by default and may be kept in grayscale to leverage the good contrast resulting from a full luminescence spectrum.

figure : image types and their accessibility in colorblind render and grayscale mode
shown are examples of the types of images that one might find in manuscripts in the biological or biomedical sciences: photograph, fluorescent microscope images with - color hues/look-up-tables (lut), electron microscope images. the relative visibility is assessed in a colorblind rendering for deuteranopia, and in grayscale. grayscale images offer the most contrast ( -color microscope image) but cannot show several structures in parallel (multicolor images, color photographs).
color combinations that are not colorblind accessible were used in rows and to illustrate the importance of colorblind simulation tests. scale bars are not included in this figure, as they could not be added in a non-distracting way that would not detract from the overall message of the figure. images show: row : darth vader being attacked, row : d. melanogaster salivary glands, row : d. melanogaster egg chambers, row : d. melanogaster nurse cell nuclei, and row : mouse pancreatic beta-islet cells.

in some instances, scientists can choose whether to show grayscale or color images. assigning colors may be optional, even though it is the default setting in imaging programs. when showing only one color channel, scientists may consider presenting this channel in grayscale to optimally display fine details. this may include variations in staining intensity or fine structures. when opting for color, authors should use grayscale visibility tests (figure ) to determine whether visibility is compromised. this can occur when dark colors, such as magenta, red, or blue, are shown on a black background.

figure : visibility of colors/hues differs and depends on the background color
the best contrast is achieved with grayscale images or dark hues on a light background (first row). dark color hues, such as red and blue, on a dark background (last row) are least visible. visibility can be tested with mock grayscale. images show actin filaments in dictyostelium discoideum (lifeact-gfp). all images have the same scale. abbreviations: gfp, green fluorescent protein.
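the grayscale visibility test described above can also be approximated numerically. the following is a minimal sketch, not the authors' method: it assumes pure 8-bit srgb hues and uses the wcag relative luminance and contrast ratio formulas as a stand-in for a visual mock-grayscale check. the hue table and all function names are illustrative choices.

```python
# Sketch: rank pure hues by their contrast against a background color.
# Assumption: colors are 8-bit sRGB triples; luminance follows the WCAG definition.

def srgb_to_linear(value_8bit):
    """Convert one 8-bit sRGB channel value to linear light (0.0 to 1.0)."""
    c = value_8bit / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance: weighted sum of the linearized R, G, B channels."""
    r, g, b = (srgb_to_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors; 1.0 means no contrast at all."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


# Illustrative pure hues (hypothetical table; real look-up tables differ).
HUES = {
    "white": (255, 255, 255),
    "green": (0, 255, 0),
    "cyan": (0, 255, 255),
    "magenta": (255, 0, 255),
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
}


def rank_on_background(background=(0, 0, 0)):
    """Return hue names ordered from most to least visible on the background."""
    return sorted(
        HUES,
        key=lambda name: contrast_ratio(HUES[name], background),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_on_background():
        print(f"{name:8s} {contrast_ratio(HUES[name], (0, 0, 0)):5.2f}")
```

under these assumptions, red and blue have the lowest contrast against a black background, in line with the last row of the figure. real dyes and look-up tables are rarely pure hues, so a visual check with a mock grayscale rendering remains advisable.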
choose a colorblind accessible color palette: fluorescent images with merged color channels visualize the co-localization of different markers. while many readers find these images to be visually appealing and informative, these images are often inaccessible to colorblind co-authors, reviewers, editors, and readers. deuteranopia, the most common form of colorblindness, affects up to % of men and . % of women of northern european ancestry. a study of articles published in top peripheral vascular disease journals revealed that % of papers with color maps and % of papers with heat maps used color palettes that were not colorblind safe. we show that approximately half of cell biology papers, and one third of physiology papers and plant science papers contained images that were inaccessible to readers with deuteranopia. scientists should consider the following points to ensure that images are accessible to colorblind readers.

• select colorblind safe colors: researchers should use colorblind safe color palettes for fluorescence and other images where color may be adjusted. figure illustrates how four different color combinations would look to viewers with different types of color blindness. green and red are indistinguishable to readers with deuteranopia, whereas green and blue are indistinguishable to readers with tritanopia, a rare form of color blindness. cyan and magenta are the best options, as these two colors look different to viewers with normal color vision, deuteranopia or tritanopia. green and magenta are also shown, as scientists often prefer to show colors close to the excitation value of the fluorescent dyes, which are often green and red.

• display separate channels in addition to the merged image: selecting a colorblind safe color palette becomes increasingly difficult as more colors are added. when the image includes three or more colors, authors are encouraged to show separate images for each channel, followed by the merged image (figure ). individual channels may be shown in grayscale to make it easier for readers to perceive variations in staining intensity.

• use simulation tools to confirm that essential features are visible to colorblind viewers: free tools, such as color oracle (rrid:scr_ ), quickly simulate different forms of color blindness by adjusting the colors on the computer screen to simulate what a colorblind person would see. scientists using fiji (rrid:scr ) can select the “simulate colorblindness” option in the “color” menu under “images”.

figure : color combinations as seen with normal vision and two types of colorblindness
the figure illustrates how four possible color combinations for multichannel microscope images would appear to someone with normal color vision, the most common form of colorblindness (deuteranopia), and a rare form of color blindness (tritanopia). some combinations that are accessible to someone with deuteranopia are not accessible to readers with tritanopia, for example green/blue combinations. microscope images show dictyostelium discoideum expressing vps -gfp (vps - green fluorescent protein shows broad signal in cells) and stained with dextran (spotted signal) after infection with conidia of aspergillus fumigatus. all images have the same scale. abbreviations: gfp, green fluorescent protein.

figure : strategies for making - or -channel microscope images colorblind safe
images in the first row are not colorblind safe. readers with the most common form of colorblindness would not be able to identify key features. possible accessible solutions are shown: changing colors/luts to colorblind friendly combinations, showing each channel in a separate image, showing colors in grayscale and inverting grayscale images to maximize contrast. solutions and (show each channel in grayscale, or in inverted grayscale) are more informative than solutions and . regions of overlap are sometimes difficult to see in merged images without split channels. when splitting channels, scientists often use colors that have low contrast, as explained in figure (e.g. red or blue on black). microscope images show d. melanogaster egg chambers ( colors) and nurse cell nuclei ( colors). all images of egg chambers and nurse cells respectively have the same scale. abbreviations: lut, look-up table.

design the figure
figures often contain more than one panel. careful planning is needed to convey a clear message, while ensuring that all panels fit together and follow a logical order. a planning table (figure a) helps scientists to determine what information is needed to answer the research question. the table outlines the objectives, types of visualizations required, and experimental groups that should appear in each panel.
a planning table template is available on osf. after completing the planning table, scientists should sketch out the position of panels, and the position of images, graphs, and titles within each panel (figure b). audiences read a page either from top to bottom and/or from left to right. selecting one reading direction and arranging panels in rows or columns helps with figure planning. using enough white space to separate rows or columns will visually guide the reader through the figure. the authors can then assemble the figure based on the draft sketch.

figure : planning multipanel figures
planning tables and layout sketches are useful tools to efficiently design figures that address the research question. a. planning tables allow scientists to select and organize elements needed to answer the research question addressed by the figure. b. layout sketches allow scientists to design a logical layout for all panels listed in the planning table and ensure that there is adequate space for all images and graphs.

annotate the figure
annotations with text, symbols or lines allow readers from many different backgrounds to rapidly see essential features, interpret images, and gain insight. unfortunately, scientists often design figures for themselves, rather than their audience. examples of annotations are shown in figure . table describes important factors to consider for each annotation type.
figure : using arrows, regions of interest, lines and letter codes to annotate structures in images
text descriptions alone are often insufficient to clearly point to a structure or region in an image. arrows and arrowheads, lines, letters, and dashed enclosures can help if overlaid on the respective part of the image. microscope images show d. melanogaster egg chambers, with the different labelling techniques in use. the table provides an overview of their applicability and common pitfalls. all images have the same scale.

table : use annotations to make figures accessible to a broad audience
feature to be explained | annotation
size | scale bar with dimensions
direction of movement | arrow with tail
draw attention to: points of interest | symbol (arrowhead, star, etc.)
draw attention to: regions of interest, black & white image | highlight in color if this does not obscure important features within the region, or outline with boxes or circles
draw attention to: regions of interest, color image | outline with boxes or circles
draw attention to: layers | labeled brackets beside the image for layers that are visually identifiable across the entire image, or a line on the image for wavy layers that may be difficult to identify
define features within an image | labels

when adding annotations to an image, scientists should consider the following steps.

• choose the right amount of labeling. figure shows three levels of annotation.
the barely annotated image ( a) is only accessible to scientists already familiar with the object and technique, whereas the heavily annotated version ( c) contains numerous annotations that obstruct the image and a legend that is time consuming to interpret. panel b is more readable; annotations of a few key features are shown, and the explanations appear right below the image for easy interpretation. explanations of labels are often placed in the figure legend. alternating between examining the figure and legend is time consuming, especially when the legend and figure are on different pages. figure d shows one option for situations where extensive annotations are required to explain a complex image. an annotated image is placed as a legend next to the original image. a semi-transparent white layer mutes the image to allow annotations to stand out.

figure : different levels of detail for image annotations
annotations help to orient the audience but may also obstruct parts of the image. authors must find the right balance between too few and too many annotations.
example with no annotations. readers cannot determine what is shown.
example with a few annotations to orient readers to key structures.
example with many annotations, which obstruct parts of the image. the long legend below the figure is confusing.
example shows a solution for situations where many annotations are needed to explain the image. an annotated version is placed next to an unannotated version of the image for comparison.
the legend below the image helps readers to interpret the image, without having to refer to the figure legend. note the different requirements for space. electron microscope images show mouse pancreatic beta-islet cells.

• use abbreviations cautiously: abbreviations are commonly used for image and figure annotation to save space, but inevitably require more effort from the reader. abbreviations are often ambiguous, especially across fields. authors should run a web search for the abbreviation. if the intended meaning is not a top result, authors should refrain from using the abbreviation or clearly define the abbreviation on the figure itself, even if it is already defined elsewhere in the manuscript. note that in figure , abbreviations have been written out below the image to reduce the number of legend entries.

• explain colors and stains: explanations of colors and stains were missing in around % of papers. figure illustrates several problematic practices observed in our dataset, as well as solutions for clearly explaining what each color represents. this figure uses fluorescence images as an example; however we also observed many histology images in which authors did not mention which stain was used. authors should describe how stains affect the tissue shown or use annotations to show staining patterns of specific structures. this allows readers who are unfamiliar with the stain to interpret the image.

figure : explain color in images
cells and their structures are almost all transparent.
every dye, stain, and fluorescent label therefore should be clearly explained to the audience. labels should be colorblind safe. large labels that stand out against the background are easy to read. authors can make figures easier to interpret by placing the color label close to the structure; color labels should only be placed in the figure legend when this is not possible. example images were created based on problems observed by reviewers. microscope images show d. melanogaster egg chambers stained with the dna dye dapi ( ′, -diamidino- -phenylindole) and probe for a specific mrna species. all images have the same scale.

• ensure that annotations are accessible to colorblind readers: confirming that labels or annotations are visible to colorblind readers is important for both color and grayscale images (figure ). up to one third of papers in our dataset contained annotations or labels that would not have been visible to someone with deuteranopia. this occurred because the annotations blended in with the background (e.g. red arrows on green plants) or because the authors used the same symbol in colors that are indistinguishable to someone with deuteranopia to mark different features. figure illustrates how to annotate a grayscale image so that it is accessible to colorblind readers. using text to describe colors is also problematic for colorblind readers. this problem can be alleviated by using colored symbols in the legend or by using distinctly shaped annotations such as open vs. closed arrows, thin vs. wide lines, or dashed vs. solid lines. color blindness simulators help in determining whether annotations are accessible to all readers.

figure : annotations should be colorblind safe
. the annotations displayed in the first image are inaccessible to colorblind individuals, as shown with the visibility test below. this example was created based on problems observed by reviewers. - . two colorblind safe alternative annotations, in color ( ) and in grayscale ( ). the bottom row shows a test rendering for deuteranopia colorblindness. note that double-encoding of different hues and different shapes (e.g. different letters, arrow shapes, or dashed/non-dashed lines) allows all audiences to interpret the annotations. electron microscope images show mouse pancreatic beta-cell islet cells. all images have the same scale.

prepare figure legends
each figure and legend are meant to be self-explanatory and should allow readers to quickly assess a paper or understand complex studies that combine different methodologies or model systems. to date, there are no guidelines for figure legends for images, as the scope and length of legends varies across journals and disciplines. some journals require legends to include details on object, size, methodology or sample size, while other journals require a minimalist approach and mandate that information should not be repeated in subsequent figure legends. our data suggest that important information needed to interpret images was regularly missing from the figure or figure legend. this includes the species and tissue type, or object shown in the figure, clear explanations of all labels, annotations and colors, and markings or legend entries denoting insets. presenting this information on the figure itself is more efficient for the reader; however, any details that are not marked in the figure should be explained in the legend.
while not reporting species and tissue information in every figure legend may be less of an issue for papers that examine a single species and tissue, this is a major problem when a study includes many species and tissues, which may be presented in different panels of the same figure. additionally, the scientific community is increasingly developing automated data mining tools, such as the source data tool, to collect and synthesize information from figures and other parts of scientific papers. unlike humans, these tools cannot piece together information scattered throughout the paper to determine what might be shown in a particular figure panel. even for human readers, this process wastes time. therefore, we recommend that authors present information in a clear and accessible manner, even if some information may be repeated for studies with simple designs.

discussion
a flood of images is published every day in scientific journals and the number is continuously increasing. of these, around % likely contain intentionally or accidentally duplicated images. our data show that, in addition, most papers show images that are not fully interpretable due to issues with scale markings, annotation, and/or color. this affects scientists’ ability to interpret, critique and build upon the work of others. images are also increasingly submitted to image archives to make image data widely accessible and permit future re-analyses. a substantial fraction of images that are neither human nor machine-readable lowers the potential impact of such archives.
based on our data examining common problems with published images, we provide a few simple recommendations, with examples illustrating good practices. we hope that these recommendations will help authors to make their published images legible and interpretable.

limitations: while most results were consistent across the three subfields of biology, findings may not be generalizable to other fields. our sample included the top journals that publish original research for each field. almost all journals were indexed in pubmed. results may not be generalizable to journals that are un-indexed, have low impact factors, or are not published in english. data abstraction was performed manually due to the complexity of the assessments. error rates were % for plant sciences, % for physiology and % for cell biology. our assessments focused on factors that affect readability of image-based figures in scientific publications. future studies may include assessments of raw images and meta-data to examine factors that affect reproducibility, such as contrast settings, background filtering and processing history.

actions journals can take to make image-based figures more transparent and easier to interpret
the role of journals in improving the quality of reporting and accessibility of image-based figures should not be overlooked. there are several actions that journals might consider.

• screen manuscripts for figures that are not colorblind safe: open source automated screening tools may help journals to efficiently identify common color maps that are not colorblind safe.

• update journal policies: we encourage journal editors to update policies regarding colorblind accessibility, scale bars, and other factors outlined in this manuscript. importantly, policy changes should be accompanied by clear plans for implementation and enforcement. meta-research suggests that changing journal policy, without enforcement or implementation plans, has limited effects on author behavior.
amending journal policies to require authors to report rrids, for example, increases the number of papers reporting rrids by %. in a study of life sciences articles published in nature journals, the percentage of animal studies reporting the landis criteria (blinding, randomization, sample size calculation, exclusions) increased from to . % after new guidelines were released. in contrast, a randomized controlled trial of animal studies submitted to plos one demonstrated that randomizing authors to complete the arrive checklist during submission did not improve reporting. some improvements in reporting of confidence intervals, sample size justification and inclusion and exclusion criteria were noted after psychological science introduced new policies, although this may have been partially due to widespread changes in the field. a joint editorial series published in the journal of physiology and british journal of pharmacology did not improve the quality of data presentation or statistical reporting.

• re-evaluate limits on the number of figures: limitations on the number of figures originally stemmed from printing cost calculations, which are becoming increasingly irrelevant as scientific publishing moves online. unintended consequences of these policies include the advent of large, multipanel figures. these figures are often especially difficult to interpret because the legend appears on a different page, or the figure combines images addressing different research questions.
• reduce or eliminate page charges for color figures: as journals move online, policies designed to offset the increased cost of color printing are no longer needed. the added costs may incentivize authors to use grayscale in cases where color would be beneficial.

• encourage authors to explain labels or annotations in the figure, rather than in the legend: this is more efficient for readers.

• encourage authors to share image data in public repositories: open data benefits authors and the scientific community.

how can the scientific community improve image-based figures?
the role of scientists in the community is multi-faceted. as authors, scientists should familiarize themselves with guidelines and recommendations, such as ours provided above. as reviewers, scientists should ask authors to improve erroneous or uninformative image-based figures. as instructors, scientists should ensure that bioimaging and image data handling is taught during undergraduate or graduate courses, and support existing initiatives such as neubias (network of european bioimage analysts) that aim to increase training opportunities in bioimage analysis. scientists are also innovators. as such they should support emerging image data archives, which may expand to automatically source images from published figures. repositories for other types of data are already widespread; however, the idea of image repositories has only recently gained traction. existing image databases, which are mainly used for raw image data and meta-data, include the allen brain atlas, the image data resource and the emerging bioimage archives. springer nature encourages authors to submit imaging data to the image data resource. while scientists have called for common quality standards for archived images and meta-data, such standards have not been defined, implemented, or taught.
examining standard practices for reporting images in scientific publications, as outlined here, is one strategy for establishing common quality standards.

in the future, it is possible that each image published electronically in a journal or submitted to an image data repository will follow good practice guidelines, and will be accompanied by expanded “meta-data” or “alt-text/attribute” files. alt-text is already published in html to provide context if an image cannot be accessed (e.g. by blind readers). similarly, images in online articles and deposited in archives could contain essential information in a standardized format. the information could include the main objective of the figure, specimen information, ideally with research resource identifier (rrid), specimen manipulation (dissection, staining, rrid for dyes and antibodies used), as well as the imaging method including essential items from meta-files of the microscope software, information about image processing and adjustments, information about scale, annotations, insets, and colors shown, and confirmation that the images are truly representative.

conclusions
our meta-research study of standard practices for presenting images in three fields highlights current shortcomings in publications. pubmed indexes approximately , new papers per year, or , papers per day (https://www.nlm.nih.gov/bsd/index_stats_comp.html). twenty-three percent, or approximately papers per day, contain images.
our survey data suggest that most of these papers will have deficiencies in image presentation, which may affect legibility and interpretability. these observations lead to targeted recommendations for improving the quality of published images. our recommendations are available as a slide set via the open science framework and can be used to teach best practices and to avoid misleading or uninformative image-based figures. our analysis underscores the need for standardized image publishing guidelines. adherence to such guidelines will allow the scientific community to unlock the full potential of image collections in the life sciences for current and future generations of researchers.

methods
systematic review: we examined original research articles that were published in april of in the top journals that publish original research for each of three different categories (physiology, plant science, cell biology). journals for each category were ranked according to impact factors listed for the specified categories in journal citation reports. journals that only publish review articles or that did not publish an april issue were excluded. we followed all relevant aspects of the prisma guidelines. items that only apply to meta-analyses or are not relevant to literature surveys were not followed. ethical approval was not required.

search strategy: articles were identified through a pubmed search, as all journals were pubmed indexed.
electronic search results were verified by comparison with the list of articles published in april issues on the journal website. the electronic search used the following terms: physiology: ("journal of pineal research"[journal] and [issue] and [volume]) or ("acta physiologica (oxford, england)"[journal] and [volume] and [issue]) or ("the journal of physiology"[journal] and [volume] and ( [issue] or [issue])) or (("american journal of physiology. lung cellular and molecular physiology"[journal] or "american journal of physiology. endocrinology and metabolism"[journal] or "american journal of physiology. renal physiology"[journal] or "american journal of physiology. cell physiology"[journal] or "american journal of physiology. gastrointestinal and liver physiology"[journal]) and [volume] and [issue]) or (“american journal of physiology. heart and circulatory physiology”[journal] and [volume] and [issue]) or ("the journal of general physiology"[journal] and [volume] and [issue]) or ("journal of cellular physiology"[journal] and [volume] and [issue]) or ("journal of biological rhythms"[journal] and [volume] and [issue]) or ("journal of applied physiology (bethesda, md. 
: )"[journal] and [volume] and [issue]) or ("frontiers in physiology"[journal] and (" / / "[date - publication] : " / / "[date - publication])) or ("the international journal of behavioral nutrition and physical activity"[journal] and (" / / "[date - publication] : " / / "[date - publication]))

plant science: ("nature plants"[journal] and [issue] and [volume]) or ("molecular plant"[journal] and [issue] and [volume]) or ("the plant cell"[journal] and [issue] and [volume]) or ("plant biotechnology journal"[journal] and [issue] and [volume]) or ("the new phytologist"[journal] and ( [issue] or [issue]) and [volume]) or ("plant physiology"[journal] and [issue] and [volume]) or ("plant, cell & environment"[journal] and [issue] and [volume]) or ("the plant journal : for cell and molecular biology"[journal] and ( [issue] or [issue]) and [volume]) or ("journal of experimental botany"[journal] and ( [issue] or [issue] or [issue]) and [volume]) or ("plant & cell physiology"[journal] and [issue] and [volume]) or ("molecular plant pathology"[journal] and [issue] and [volume]) or ("environmental and experimental botany"[journal] and [volume]) or ("molecular plant-microbe interactions : mpmi"[journal] and [issue] and [volume]) or (“frontiers in plant science”[journal] and (" / / "[date - publication] : " / / "[date - publication])) or (“the journal of ecology” (" / / "[date - publication] : " / / "[date - publication]))

cell biology: ("cell"[journal] and ( [issue] or [issue]) and [volume]) or ("nature medicine"[journal] and [volume] and [issue]) or ("cancer cell"[journal] and [volume] and [issue]) or ("cell stem cell"[journal] and [volume] and [issue]) or ("nature cell biology"[journal] and [volume] and [issue]) or ("cell metabolism"[journal] and [volume] and [issue]) or ("science translational medicine"[journal] and [volume] and ( [issue] or [issue] or [issue] or [issue])) or ("cell research"[journal] and [volume] and [issue]) or ("molecular cell"[journal] and [volume] and ( [issue] or [issue])) or ("nature structural & molecular biology"[journal] and [volume] and [issue]) or ("the embo journal"[journal] and [volume] and ( [issue] or [issue])) or ("genes & development"[journal] and [volume] and - [issue]) or ("developmental cell"[journal] and [volume] and ( [issue] or [issue])) or ("current biology : cb"[journal] and [volume] and ( [issue] or [issue])) or ("plant cell"[journal] and [volume] and [issue])

screening: screening for each article was performed by two independent reviewers (physiology: tlw, ss, emw, vi, kw, mo; plant science: tlw, sjb; cell biology: ew, ss) using rayyan software (rrid:scr_ ), and disagreements were resolved by consensus. a list of articles was uploaded into rayyan. reviewers independently examined each article and marked whether the article was included or excluded, along with the reason for exclusion. both reviewers screened all articles published in each journal between april and april , to identify full length, original research articles (table s , table s , table s , figure s ) published in the print issue of the journal. articles for online journals that do not publish print issues were included if the publication date was between april and april , .
articles were excluded if they were not original research articles, or if an accepted version of the paper was posted as an "in press" or "early release" publication but the final version did not appear in the print version of the april issue. articles were included if they contained at least one eligible image, such as a photograph, an image created using a microscope or electron microscope, or an image created using a clinical imaging technology such as ultrasound or mri. blot images were excluded, as many of the criteria in our abstraction protocol cannot easily be applied to blots. computer generated images, graphs and data figures were also excluded. papers that did not contain any eligible images were excluded.

abstraction: all abstractors completed a training set of articles before abstracting data. data abstraction for each article was performed by two independent reviewers (physiology: aa, av; plant science: mo, tla, sa, kw, mag, if; cell biology: if, aa, av, kw, mag). when disagreements could not be resolved by consensus between the two reviewers, ratings were assigned after a group review of the paper. eligible manuscripts were reviewed in detail to evaluate the following questions according to a predefined protocol (available at: https://osf.io/b /). supplemental files were not examined, as supplemental images may not be held to the same peer review standards as those in the manuscript. the following items were abstracted:
1. types of images included in the paper (photograph, microscope image, electron microscope image, image created using a clinical imaging technique such as ultrasound or mri, other types of images)
2. did the paper contain appropriately labeled scale bars for all images?
3. were all insets clearly and accurately marked?
4. were all insets clearly explained in the legend?
5. is the species and tissue, object, or cell line name clearly specified in the figure or legend for all images in the paper?
6. are any annotations, arrows or labels clearly explained for all images in the paper?
7. among images where authors can control the colors shown (e.g. fluorescence microscopy), are key features of the images visible to someone with the most common form of color blindness (deuteranopia)?
8. if the paper contains colored labels, are these labels visible to someone with the most common form of color blindness (deuteranopia)?
9. are colors in images explained either on the image or within the legend?
the two deuteranopia questions (items 7 and 8 above) were assessed by using color oracle (rrid:scr_ ) to simulate the effects of deuteranopia.

verification: ten percent of articles in each field were randomly selected for verification abstraction, to ensure that abstractors in different fields were following similar procedures. data were abstracted by a single abstractor (tlw). the question on species and tissue was excluded from verification abstraction for articles in cell biology and plant sciences, as the verification abstractor lacked the field-specific expertise needed to assess this question. results from the verification abstractor were compared with consensus results from the two independent abstractors for each paper and discrepancies were resolved through discussion. error rates were calculated as the percentage of responses for which the abstractors’ response was incorrect. error rates were % for plant sciences, % for physiology and % for cell biology.
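The error-rate calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each abstracted response is stored as a string, the answer agreed after discussion is treated as correct, and the function name `error_rate` and the example answers are hypothetical rather than taken from the study's data.

```python
def error_rate(consensus, verified):
    """Percentage of responses where the consensus answer differs from the verified one."""
    if len(consensus) != len(verified):
        raise ValueError("response lists must have the same length")
    if not consensus:
        raise ValueError("no responses to compare")
    # Count positions where the two abstraction results disagree.
    wrong = sum(c != v for c, v in zip(consensus, verified))
    return 100.0 * wrong / len(consensus)


# Hypothetical answers to four abstraction questions for one paper.
consensus = ["yes", "no", "yes", "partial"]
verified = ["yes", "no", "no", "partial"]
print(f"{error_rate(consensus, verified):.1f}%")
```

Pooling the responses of all verified papers in a field before calling the function would give one error rate per field, which is how the percentages above are reported.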
data processing and creation of figures: data are presented as n (%). summary statistics were calculated using python (rrid:scr_ , version . . , libraries numpy . . and matplotlib . . ). charts were prepared with a python-based jupyter notebook (jupyter-client, rrid:scr_ , python version . . , rrid:scr_ , libraries numpy . . and matplotlib . . ) and assembled into figures with vector graphics software. example images were previously published or generously donated by the manuscript authors as indicated in the figure legends. image acquisition was described in references (d. melanogaster images , mouse pancreatic beta islet cells: a. müller personal communication, and orobates pabsti ). images were cropped, labeled, and color-adjusted with fiji (rrid:scr_ ) and assembled with vector graphics software. color-blind and grayscale rendering of images was done using color oracle (rrid:scr_ ). all poor and clear images presented here are ‘mock examples’ prepared based on practices observed during data abstraction.

funding
tlw was funded by american heart association grant grnt and a robert w. fulk career development award (mayo clinic division of nephrology & hypertension). lhh was supported by the hormel foundation and national institutes of health grant ca .

acknowledgements
we thank the elife community ambassadors program for facilitating this work, and andreas müller and john a. nyakatura for generously sharing example images. falk hillmann and thierry soldati provided the amoeba strains used for imaging.
some of the early career researchers who participated in this research would like to thank their principal investigators and mentors for supporting their efforts to improve science.

supplemental tables

table s : number of articles examined by journal in physiology
columns: journal; articles screened (n = ); original research articles (n = , %); included articles (n = , %)
journal of pineal research ( %) ( %)
acta physiologica ( %) ( %)
journal of physiology ( %) ( %)
international journal of behavioral nutrition and physical activity ( %)
ajp: lung, cellular and molecular physiology ( %) ( %)
journal of general physiology ( %) ( %)
ajp: endocrinology and metabolism ( %) ( %)
frontiers in physiology ( %) ( %)
journal of cellular physiology ( %) ( %)
ajp: renal physiology ( %) ( %)
ajp: cell physiology ( %) ( %)
journal of biological rhythms ( %) ( %)
ajp: gastrointestinal and liver physiology ( %) ( %)
journal of applied physiology ( %) ( %)
ajp: heart and circulatory physiology ( %) ( %)
values are n, or n (% of all articles). screening was performed to exclude articles that were not full-length original research articles (e.g. reviews, editorials, perspectives, commentaries, letters to the editor, short communications, etc.), were not published in april , or did not include eligible images. abbreviations: ajp, american journal of physiology

table s : number of articles examined by journal in plant science
columns: journal; articles screened (n = ); original research articles (n = , %); included articles (n = , %)
nature plants ( %)
molecular plant ( %) ( %)
plant cell * ( %) ( %)
plant biotechnology journal ( %) ( %)
new phytologist ( %) ( %)
plant physiology ( %) ( %)
plant cell and environment ( %) ( %)
plant journal ( %) ( %)
journal of experimental botany ( %) ( %)
journal of ecology **
plant and cell physiology ( %) ( %)
molecular plant pathology ( %) ( %)
environmental and experimental botany ( %) ( %)
molecular plant-microbe interactions ( %) ( %)
frontiers in plant science ( %) ( %)
* this journal was also included on the cell biology list (table s ). ** no articles from the journal of ecology were screened as the journal did not publish an april issue. values are n, or n (% of all articles). screening was performed to exclude articles that were not full-length original research articles (e.g. reviews, editorials, perspectives, commentaries, letters to the editor, short communications, etc.), were not published in april , or did not include eligible images.

table s : number of articles examined by journal in cell biology
columns: journal; articles screened (n = ); original research articles (n = , %); included articles (n = , %)
cell ( %) ( %)
nature medicine ( %) ( %)
cancer cell ( %) ( %)
cell stem cell ( %) ( %)
nature cell biology ( %) ( %)
cell metabolism ( %) ( %)
science translational medicine ( %) ( %)
cell research ( %) ( %)
molecular cell ( %) ( %)
nature structural and molecular biology ( %) ( %)
embo journal ( %) ( %)
genes and development ( %) ( %)
developmental cell ( %) ( %)
current biology ( %) ( %)
plant cell * ( %) ( %)
* this journal was also included on the plant science list (table s ). values are n, or n (% of all articles). screening was performed to exclude articles that were not full-length original research articles (e.g. reviews, editorials, perspectives, commentaries, letters to the editor, short communications, etc.), were not published in april , or did not include eligible images.

table s : scale information in papers
columns: field; no scale information in any figure; some scale information; complete scale information; some figures, magnification in legend; all figures, magnification in legend; some figures, scale bar with dimensions in legend; some figures, scale bar with dimensions; all figures, scale bar with dimensions in legend; all figures, scale bar with dimensions
physiology . . . . . . .
cell biology . . . . . . .
plant science . . . . . . .
values are % of papers.

figure s : flow chart of study screening and selection process

references
. lee p, west jd and howe b. viziometrics: analyzing visual information in the scientific literature. ieee transactions on big data. ; : - . . cromey dw. digital images are data: and should be treated as such. methods mol biol. ; : - . . bik em, casadevall a and fang fc. the prevalence of inappropriate image duplication in biomedical research publications. mbio. ; . . laissue pp, alghamdi ra, tomancak p, reynaud eg and shroff h. assessing phototoxicity in live fluorescence imaging. nat methods. ; : - . . marques g, pengo t and sanders ma. imaging methods are vastly underreported in biomedical research. elife. ; . . pain e. how to (seriously) read a scientific paper. science. . . rolandi m, cheng k and perez-kriz s. a brief guide to designing effective figures for the scientific paper. advanced materials. ; : - . . canese k. pubmed® display enhanced with images from the new ncbi images database. nlm technical bulletin. ; :e . . liechti r, george n, gotz l, el-gebali s, chasapi a, crespo i, xenarios i and lemberger t. sourcedata: a semantic platform for curating and searching figures. nat methods. ; : - . . lindquist m.
neuroimaging results altered by varying analysis pipelines. nature. ; : - . . botvinik-nezer r, holzmeister f, camerer cf, dreber a, huber j, johannesson m, kirchler m, iwanir r, mumford ja, adcock ra, avesani p, baczkowski bm, bajracharya a, bakst l, ball s, barilari m, bault n, beaton d, beitner j, benoit rg, berkers r, bhanji jp, biswal bb, bobadilla-suarez s, bortolini t, bottenhorn kl, bowring a, braem s, brooks hr, brudner eg, calderon cb, camilleri ja, castrellon jj, cecchetti l, cieslik ec, cole zj, collignon o, cox rw, cunningham wa, czoschke s, dadi k, davis cp, luca a, delgado mr, demetriou l, dennison jb, di x, dickie ew, dobryakova e, donnat cl, dukart j, duncan nw, durnez j, eed a, eickhoff sb, erhart a, fontanesi l, fricke gm, fu s, galvan a, gau r, genon s, glatard t, glerean e, goeman jj, golowin sae, gonzalez-garcia c, gorgolewski kj, grady cl, green ma, guassi moreira jf, guest o, hakimi s, hamilton jp, hancock r, handjaras g, harry bb, hawco c, herholz p, herman g, heunis s, hoffstaedter f, hogeveen j, holmes s, hu cp, huettel sa, hughes me, iacovella v, iordan ad, isager pm, isik ai, jahn a, johnson mr, johnstone t, joseph mje, juliano ac, kable jw, kassinopoulos m, koba c, kong xz, koscik tr, kucukboyaci ne, kuhl ba, kupek s, laird ar, lamm c, langner r, lauharatanahirun n, lee h, lee s, leemans a, leo a, lesage e, li f, li myc, lim pc, lintz en, liphardt sw, losecaat vermeer ab, love bc, mack ml, malpica n, marins t, maumet c, mcdonald k, mcguire jt, melero h, mendez leal as, meyer b, meyer kn, mihai g, mitsis gd, moll j, nielson dm, nilsonne g, notter mp, olivetti e, onicas ai, papale p, patil kr, peelle je, perez a, pischedda d, poline jb, prystauka y, ray s, reuter-lorenz pa, reynolds rc, ricciardi e, rieck jr, rodriguez- thompson am, romyn a, salo t, samanez-larkin gr, sanz-morales e, schlichting ml, schultz dh, shen q, sheridan ma, silvers ja, skagerlund k, smith a, smith dv, sokol- hessner p, steinkamp sr, tashjian sm, thirion b, thorp 
jn, tinghog g, tisdall l, tompson sh, toro-serey c, torre tresols jj, tozzi l, truong v, turella l, van 't veer ae, verguts t, vettel jm, vijayarajah s, vo k, wall mb, weeda wd, weis s, white dj, wisniewski d, xifra-porxas a, yearling ea, yoon s, yuan r, yuen ksl, zhang l, zhang x, zosky je, nichols te, poldrack ra and schonberg t. variability in the analysis of a single neuroimaging dataset by many teams. nature. ; : - . . national eye institute. facts about color blindness. . https://nei.nih.gov/health/color_blindness/facts_about. accessed march , . . weissgerber tl. training early career researchers to use meta-research to improve science: a participant guided, “learn by doing” approach. plos biology. . . antonietti a, jambor h, alicea b, audisio tl, auer s, bhardwaj v, burgess s, ferling i, gazda ma, hoeppner l, ilangovan v, lo h, olson m, mohamed sy, sarabipour s, varma a, walavalkar k, wissink em and weissgerber tl. meta-research: creating clear and informative image-based figures for scientific publications. . https://osf.io/b /. accessed january , . . schindelin j, arganda-carreras i, frise e, kaynig v, longair m, pietzsch t, preibisch s, rueden c, saalfeld s, schmid b, tinevez jy, white dj, hartenstein v, eliceiri k, tomancak p and cardona a. fiji: an open-source platform for biological-image analysis. nat methods. ; : - . . rueden ct, schindelin j, hiner mc, dezonia be, walter ae, arena et and eliceiri kw. imagej : imagej for the next generation of scientific image data. bmc bioinformatics. ; : . . schmied c and jambor hk.
effective image visualization for publications - a workflow using open access tools and concepts. f research. ; : . . jambor h, surendranath v, kalinka at, mejstrik p, saalfeld s and tomancak p. systematic imaging reveals features and changing localization of mrnas in drosophila development. elife. ; . . nyakatura ja, melo k, horvat t, karakasiliotis k, allen vr, andikfar a, andrada e, arnold p, laustroer j, hutchinson jr, fischer ms and ijspeert aj. reverse- engineering the locomotion of a stem amniote. nature. ; : - . . weissgerber tl, winham sj, heinzen ep, milin-lazovic js, garcia-valencia o, bukumiric z, savic md, garovic vd and milic nm. reveal, don't conceal: transforming data visualization to improve transparency. circulation. ; : - . . jambor h. better figures for the life sciences. ecrlife. august , . https://ecrlife .wordpress.com/ / / /better-figures-for-life- sciences/. accessed september , . . saladi s. jetfighter: towards figure accuracy and accessibility. elife. . . bandrowski a, brush m, grethe js, haendel ma, kennedy dn, hill s, hof pr, martone me, pols m, tan s, washington n, zudilova-seinstra e, vasilevsky n and resource identification initiative members. the resource identification initiative: a cultural shift in publishing. f res. ; : . . the npqip collaborative group. did a change in nature journals’ editorial policy for life sciences research improve reporting? bmj open science. ; :e . . hair k, macleod mr, sena es and collaboration ii. a randomised controlled trial of an intervention to improve compliance with the arrive guidelines (iicarus). res integr peer rev. ; : . . giofre d, cumming g, fresc l, boedker i and tressoldi p. the influence of journal submission guidelines on authors' reporting of statistics and use of open research practices. plos one. ; :e . . diong j, butler aa, gandevia sc and heroux me. poor statistical reporting, inadequate data presentation and spin persist despite editorial advice. plos one. ; :e . . 
piwowar ha and vision tj. data reuse and the open data citation advantage. peerj. ; :e . . markowetz f. five selfish reasons to work reproducibly. genome biol. ; : . . colavizza g, hrynaszkiewicz i, i s, k w and b. m. the citation advantage of linking publications to research data. arxiv. . . cimini ba, norrelykke sf, louveaux m, sladoje n, paul-gilloteaux p, colombelli j and miura k. the neubias gateway: a hub for bioimage analysis methods and materials. f res. ; : . . ellenberg j, swedlow jr, barlow m, cook ce, sarkans u, patwardhan a, brazma a and birney e. a call for public archives for biological image data. nat methods. ; : - . . williams e, moore j, li sw, rustici g, tarkowska a, chessel a, leo s, antal b, ferguson rk, sarkans u, brazma a, salas rec and swedlow jr. the image data resource: a bioimage data integration and publication platform. nat methods. ; : - . . bandrowski ae and martone me. rrids: a simple step toward improving reproducibility through rigor and transparency of experimental methods. neuron. ; : - . . moher d, liberati a, tetzlaff j and altman dg. preferred reporting items for systematic reviews and meta-analyses: the prisma statement. j clin epidemiol. ; : - . . jenny b and kelso nv. color oracle. . https://colororacle.org. accessed march , . . kluyver t, ragan-kelley b, pérez f and granger b. jupyter notebooks - a publishing format for reproducible computational workflows. in: f. l. a. b. scmidt, ed. positioning and power in academic publishing: players, agents and agendas netherlands: ios press; . . harris cr, k.j. m, van der walt sj, gommers r, virtanen p, cornapeau d, wieser e, taylor j, berg s, smith nj, kern r, picus m, hoyer s, van kerkwijk mh, brett m, haldane a, fernández del río j, wiebe m, peterson p, gérard-merchant p, sheppard k, reddy t, weckesser w, abbasi h, gohlke c and oliphant te. array programming with numpy. ; : - . . hunter jd. matplotlib: a d graphics environment. computing in science & engineering. ; : - .

scpnmf: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

dongyuan song†, kexin aileen li†, zachary hemminger, roy wollman, and jingyi jessica li∗

abstract
single-cell rna sequencing (scrna-seq) captures whole transcriptome information of individual cells. while scrna-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. then a question is how to select those informative genes from scrna-seq data. moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity, and extra (e.g., spatial) information; however, they typically can only measure up to a few hundred genes.
then another challenging question is how to select genes for targeted gene profiling based on existing scrna-seq data. here we develop the single-cell projective non-negative matrix factorization (scpnmf) method to select informative genes from scrna-seq data in an unsupervised way. compared with existing gene selection methods, scpnmf has two advantages. first, its selected informative genes can better distinguish cell types. second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. technically, scpnmf modifies the pnmf algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. we demonstrate that scpnmf outperforms the state-of-the-art gene selection methods on diverse scrna-seq datasets. moreover, we show that scpnmf can guide the design of targeted gene profiling experiments and cell-type annotation on targeted gene profiling data.

bioinformatics interdepartmental ph.d. program, university of california, los angeles, ca - , department of statistics, university of california, los angeles, ca - , institute for quantitative and computational biosciences, university of california, los angeles, ca , department of integrative biology and physiology, university of california, los angeles, ca - , department of chemistry and biochemistry, university of california, los angeles, ca - , department of human genetics, university of california, los angeles, ca - , department of computational medicine, university of california, los angeles, ca - , usa. † these authors contributed equally to this work. ∗ to whom correspondence should be addressed. contact: jli@stat.ucla.edu

introduction
the recent development of single-cell rna sequencing (scrna-seq) technologies provides unprecedented opportunities to decipher transcriptome heterogeneity among individual cells [ – ]. a typical scrna-seq dataset contains thousands to tens of thousands of genes; however, a subset of genes, which we call informative genes, is usually sufficient for representing the underlying biological variations of cells in the dataset for two reasons. first, variations of many genes are not related to the biological variations of interest. for instance, fluctuations in the expression levels of housekeeping genes are irrelevant to cell types [ , ]. second, many genes have strongly correlated expression levels, suggesting that one gene may represent a group of genes without much loss of information [ ]. therefore, for scrna-seq data analysis, informative gene selection has three advantages: (1) enhancing biological signals by removing unwanted technical variations, (2) improving the interpretability of analysis results by focusing on informative genes, and (3) reducing the number of genes to save computational resources.

besides scrna-seq data analysis, informative gene selection is also crucial for designing single-cell targeted gene profiling experiments, which we define to include all technologies that measure the expression levels of only a specific set of genes in individual cells. unlike scrna-seq, targeted gene profiling requires a limited number (often no more than hundreds) of genes to be specified before sequencing. examples of targeted gene profiling include spatial technologies (e.g., smfish [ ] and merfish [ ]) and non-spatial technologies (e.g., bart-seq [ ], hypr-seq [ ] and x-genomics targeted gene expression).
compared with scrna-seq, targeted gene profiling technologies have advantages such as capturing spatial information (by smfish and merfish), having a lower cost per cell (by bart-seq), and exhibiting a higher sensitivity for detecting lowly expressed genes (by hypr-seq). however, it remains an open and challenging question to optimize the gene selection for targeted gene profiling under a gene number limitation.

given the importance of informative gene selection, researchers have developed many gene selection methods for scrna-seq data. most existing methods select genes based on the relationship between per-gene expression means and per-gene expression variances (with the mean and variance of each gene calculated across cells). popular example methods include variance stabilization transformation (vst) [ ] and mean-variance plot (mvp) in the r package seurat [ ], as well as modelgenevar in the r package scran [ ]. these methods select highly variable genes that have large expression variances in relation to their expression means. other methods use various metrics of gene importance instead of the per-gene expression variance. for example, m drop selects the genes that have zero expression levels in many cells [ ]; giniclust selects the genes with large gini indices of expression levels [ ]; scmarker selects the genes that have expression levels bi/multi-modally distributed and are co-expressed or mutually-exclusively expressed with some other genes [ ]. a common limitation of these existing methods is that they are all designed to select a relatively large number of genes; thus, their performance in selecting a small number of genes remains unclear. for instance, in seurat, the default gene number is ; scmarker selects - genes in its exemplar applications [ ]. all these gene numbers are much greater than , the maximum gene number allowed by multiple targeted gene profiling technologies. therefore, existing gene selection methods may not be suitable for selecting genes for targeted gene profiling. another drawback of these methods is that their selected genes lack functional interpretability; that is, their selected genes are not categorized as functional gene groups.

in addition to these gene selection methods, linear dimensionality reduction methods, such as principal component analysis (pca) and non-negative matrix factorization (nmf), can also be used for gene selection. specifically, genes can be selected based on their contributions to the projected low dimensions found by pca or nmf [ – ]. although many variants of pca and nmf algorithms have been developed for scrna-seq data analysis, they are not designed for gene selection [ – ].

here we propose an unsupervised method, scpnmf, to simultaneously select informative genes and project scrna-seq data onto an interpretable low-dimensional space. leveraging the projective non-negative matrix factorization (pnmf) algorithm [ ], scpnmf combines the advantages of pca and nmf by outputting a non-negative sparse weight matrix that can project cells in a high-dimensional scrna-seq dataset onto a low-dimensional space. unlike the weight matrix (a.k.a., loading matrix) found by pca, each basis (column) of the non-negative sparse weight matrix output by scpnmf corresponds to a group of co-expressed genes.
compared with the original pnmf, a unique feature of scpnmf is basis selection: scpnmf uses correlation screening and multimodality testing to remove the bases that cannot reveal potential cell clusters in the input scrna-seq dataset. scpnmf has two functionalities: (1) given a pre-specified gene number and a scrna-seq dataset, scpnmf selects informative genes based on its weight matrix; (2) given a targeted gene profiling dataset containing the informative genes, scpnmf projects this dataset onto the same low-dimensional space as a reference scrna-seq dataset containing cell type labels, thus enabling cell type annotation on the targeted gene profiling dataset. a comprehensive benchmark shows that scpnmf outperforms existing gene selection methods in two aspects. first, the informative genes selected by scpnmf lead to the most accurate cell clustering. second, the informative genes and weight matrix of scpnmf lead to the best cell type prediction accuracy for targeted gene profiling data. therefore, scpnmf is a powerful gene selection method that can guide the experimental design and data analysis of single-cell targeted gene profiling.

methods

the core of scpnmf is to learn a low-dimensional embedding of cells so that the bases of the low-dimensional space correspond to sparse and mutually exclusive gene groups, and that genes in each group are co-expressed and thus functionally related. fig. illustrates the workflow of scpnmf. the input of scpnmf is a log-transformed gene-by-cell count matrix measured by scrna-seq. there are two main steps in scpnmf: (i) it learns a low-dimensional sparse weight matrix by pnmf; (ii) it selects bases in the weight matrix based on functional annotations (optional), correlation screening, and multimodality testing to remove uninformative bases that cannot distinguish cell types. the output of scpnmf includes (1) the selected weight matrix, a sparse and mutually exclusive encoding of genes as new, low dimensions, and (2) the score matrix containing embeddings of input cells in the low dimensions. the selected weight matrix has two main applications: extracting informative genes for downstream analyses, such as cell clustering and new marker gene identification, and projecting new targeted gene profiling data for data integration and cell type annotation.

[figure: workflow schematic. panels: step i: pnmf (weight matrix $\mathbf{W}$, score matrix $\mathbf{S} = \mathbf{W}^T\mathbf{X}$); step ii: basis selection (functional annotations (optional), pearson correlation with cell library size, multimodality test); applications ($M$-truncation, informative gene selection, new data projection, cell type prediction).]

figure : an overview of scpnmf. taking a log-transformed gene-by-cell count matrix as the input, scpnmf first learns a low-dimensional sparse weight matrix $\mathbf{W}$ and a low-dimensional cell embedding matrix $\mathbf{S}$. second, it removes the bases irrelevant to cell type variations by examining the bases' functional annotations (optional), pearson correlations with cell library sizes, and multimodality. given a user-defined gene number $M$, scpnmf performs $M$-truncation to facilitate two main applications: (1) selecting the desired number of informative genes; (2) projecting new targeted gene profiling data onto the low-dimensional space defined by reference scrna-seq data. the details are in the "methods" section.

scpnmf step i: pnmf

in this section, we review the pnmf algorithm [ , ] as the foundation of scpnmf. we first compare the formulation of pnmf with that of principal component analysis (pca) and non-negative matrix factorization (nmf), and we show that pnmf has the advantages of both pca and nmf so that it can be a useful tool for scrna-seq data analysis. next, we introduce our pnmf implementation.
given a log-transformed count matrix $\mathbf{X} \in \mathbb{R}^{p \times n}_{\geq 0}$, whose $p$ rows correspond to genes and whose $n$ columns represent cells, and a positive integer $K \leq p$, pnmf aims to find a $K$-dimensional space, whose dimensions correspond to non-negative, sparse and mutually exclusive linear combinations of the $p$ genes, so that projecting the $n$ cells onto the $K$-dimensional space does not cause much information loss (i.e., projecting the $K$-dimensional embeddings of the $n$ cells back to the original $p$-dimensional space can largely restore the original $n$ cells). pnmf tackles this task by solving the optimization problem:

$$\min_{\mathbf{W} \in \mathbb{R}^{p \times K}_{\geq 0}} \lVert \mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X} \rVert , \qquad ( . )$$

where $\lVert \cdot \rVert$ denotes the frobenius matrix norm. the solution $\mathbf{W}$ is referred to as a weight matrix. each column of $\mathbf{W}$ is a basis, whose $p$ entries are the weights of the $p$ genes. pnmf requires all weights to be non-negative, leading to a sparse $\mathbf{W}$ with most weights as zeros.

pca is similar to pnmf but does not require all weights to be non-negative. we can write the optimization problem of pca as

$$\min_{\mathbf{W} \in \mathbb{R}^{p \times K},\ \mathbf{W}^T\mathbf{W} = \mathbf{I}} \lVert \mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X} \rVert , \qquad ( . )$$

whose solution $\mathbf{W}$ is also a weight matrix but not sparse, and $\mathbf{W}$ is often referred to as the loading matrix. a common property of pnmf and pca is that the transpose of their weight matrix, $\mathbf{W}^T \in \mathbb{R}^{K \times p}$, can be used to project a new cell with $p$ gene measurements, $\mathbf{x} \in \mathbb{R}^p$, onto the $K$-dimensional space as $\mathbf{W}^T\mathbf{x}$.

in contrast to pnmf and pca, nmf finds two non-negative matrices $\mathbf{W}$ and $\mathbf{H}$ so that their product approximates the original matrix $\mathbf{X}$. nmf solves the optimization problem:

$$\min_{\mathbf{W} \in \mathbb{R}^{p \times K}_{\geq 0},\ \mathbf{H} \in \mathbb{R}^{K \times n}_{\geq 0}} \lVert \mathbf{X} - \mathbf{W}\mathbf{H} \rVert , \qquad ( . )$$

whose solution $\mathbf{W}$ still has $K$ columns representing bases, and $\mathbf{H}$ has $n$ columns as $K$-dimensional embeddings of the $n$ cells. due to the non-negative constraint on $\mathbf{W}$ and $\mathbf{H}$, $\mathbf{W}$ is a sparse matrix [ ]. however, the transpose $\mathbf{W}^T$ cannot be used as a projection matrix from the original $p$-dimensional space to a $K$-dimensional space. the reason is that, if $\mathbf{W}^T$ is a projection matrix, then by the definition of $\mathbf{H}$ we have $\mathbf{W}^T\mathbf{X} = \mathbf{H}$, which would convert the objective function ( . ) of nmf to the objective function ( . ) of pnmf. in other words, pnmf is a constrained version of nmf that requires $\mathbf{W}^T$ to be a projection matrix. hence, pnmf inherits the property of nmf by having non-negative, sparse bases that are mostly mutually exclusive (i.e., different bases correspond to different gene groups). moreover, based on the similarities of the objective functions of pnmf ( . ) and pca ( . ), we can see that pnmf also resembles pca by finding a weight matrix whose transpose can serve as a projection matrix and whose bases are largely orthogonal to each other. table summarizes the properties of pnmf, pca, and nmf.

table : comparison of the properties of pnmf, pca and nmf

method | optimization problem | non-negativity | sparsity | mutual exclusiveness | new data projection
pnmf | $\min_{\mathbf{W}} \lVert \mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X} \rVert$ s.t. $\mathbf{W} \geq 0$ | yes | very high | very high | yes
pca | $\min_{\mathbf{W}} \lVert \mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X} \rVert$ s.t. $\mathbf{W}^T\mathbf{W} = \mathbf{I}$ | no | low | low | yes
nmf | $\min_{\mathbf{W},\mathbf{H}} \lVert \mathbf{X} - \mathbf{W}\mathbf{H} \rVert$ s.t. $\mathbf{W}, \mathbf{H} \geq 0$ | yes | high | high | no

in the context of scrna-seq data analysis, the above advantages of pnmf lead to an interpretable and useful weight matrix $\mathbf{W}$. first, the high sparsity of $\mathbf{W}$ makes each basis (column) depend on only a small set of genes, which has been defined as a meta-gene for nmf [ ]. second, the mutual exclusiveness of $\mathbf{W}$ makes different bases correspond to different gene sets, easing the interpretation of bases as meta-genes or functional units. third, the projection matrix $\mathbf{W}^T$ allows the alignment of new data to reference data, thus facilitating cell type annotation on the new data.
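the pca formulation above can be checked numerically. the following is a minimal numpy sketch on simulated data (the matrix sizes, the poisson simulation and the helper name `projection_loss` are illustrative assumptions, not part of scpnmf). it uses the standard fact that, without centering, the top-$K$ left singular vectors of $\mathbf{X}$ solve the pca problem as written, and it shows the projection $\mathbf{W}^T\mathbf{x}$ of a new cell.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, K = 30, 200, 3                      # genes, cells, bases (toy sizes, assumed)
X = np.log1p(rng.poisson(2.0, size=(p, n)).astype(float))  # simulated log-transformed counts

# PCA as formulated in the text (no centering): W has orthonormal columns and
# minimizes ||X - W W^T X||_F; the top-K left singular vectors of X solve it.
U, sing, _ = np.linalg.svd(X, full_matrices=False)
W_pca = U[:, :K]

def projection_loss(X, W):
    """Frobenius norm of the residual X - W W^T X (hypothetical helper)."""
    return np.linalg.norm(X - W @ (W.T @ X), ord="fro")

loss_pca = projection_loss(X, W_pca)

# Project a new cell x (p gene measurements) onto the K-dimensional space.
x_new = np.log1p(rng.poisson(2.0, size=p).astype(float))
embedding = W_pca.T @ x_new               # shape (K,)

# PCA loadings mix positive and negative weights, which the non-negativity
# constraint of PNMF rules out.
has_negative_weights = bool((W_pca < 0).any())
```

the pca loss equals the root of the sum of the discarded squared singular values, which gives a lower bound for the loss of any rank-$K$ projection, including the one found by pnmf.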
algorithm summarizes the key steps of the pnmf implementation in scpnmf. our implementation mainly follows the two papers that proposed the pnmf algorithm [ , ], and we change the initialization of $\mathbf{W}$ to the weight matrix found by pca, $\mathbf{W}_{\text{PCA}}$, with the absolute value taken on every entry. our initialization is motivated by the desired orthogonality of bases (i.e., columns of $\mathbf{W}$).

algorithm : pseudocode of pnmf implementation in scpnmf
initialize: $\mathbf{W} = \text{abs}(\mathbf{W}_{\text{PCA}}) \in \mathbb{R}^{p \times K}_{\geq 0}$
1: while not converged do
2:   for $i = 1, \cdots, p$; $k = 1, \cdots, K$ do
3:     $W_{ik} \leftarrow W_{ik} \dfrac{2(\mathbf{X}\mathbf{X}^T\mathbf{W})_{ik}}{(\mathbf{W}\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W})_{ik} + (\mathbf{X}\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{W})_{ik}}$
4:   end for
5:   $\mathbf{W} \leftarrow \mathbf{W} / \lVert \mathbf{W} \rVert$
6: end while
output: $\mathbf{W} \in \mathbb{R}^{p \times K}_{\geq 0}$, $\mathbf{S} = \mathbf{W}^T\mathbf{X} \in \mathbb{R}^{K \times n}_{\geq 0}$

with the weight matrix $\mathbf{W} \in \mathbb{R}^{p \times K}_{\geq 0}$ learned by pnmf, we obtain the score matrix $\mathbf{S} = \mathbf{W}^T\mathbf{X} \in \mathbb{R}^{K \times n}_{\geq 0}$, whose $K$ rows correspond to the bases and whose $n$ columns represent the cells. specifically, the $j$-th column of $\mathbf{S}$ is the $K$-dimensional embedding of the $j$-th cell; the $k$-th row of $\mathbf{S}$, denoted by $\mathbf{s}_k^T$, contains the scores (i.e., coordinates) of all $n$ cells in the $k$-th basis:

$$\mathbf{s}_k^T = \mathbf{w}_k^T\mathbf{X} , \qquad ( . )$$

where $\mathbf{w}_k$ is the $k$-th column of $\mathbf{W}$, $k = 1, \ldots, K$.

as in pca and nmf, the low rank $K$ needs to be pre-specified in pnmf. a larger $K$ preserves more information in $\mathbf{X}$ but also removes less noise (technical variation of cells that is not of biological interest), impedes the interpretation of $\mathbf{W}$ (more bases are more difficult to interpret), and increases the computational burden. to choose $K$ in a data-driven way, we propose an orthogonality measure, which shows that k = is a reasonable choice for multiple scrna-seq datasets (section s . ).
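the pseudocode above can be sketched in numpy as follows. this is an illustrative sketch, not the scpnmf r implementation: the function name `pnmf`, the stopping rule (relative change in $\mathbf{W}$), the small constant `eps`, the use of the spectral norm for the rescaling step, and the factor 2 in the numerator are assumptions made here, since the source does not specify them legibly.

```python
import numpy as np

def pnmf(X, K, max_iter=200, tol=1e-6):
    """Multiplicative-update PNMF sketch: returns W (p x K) and S = W^T X (K x n)."""
    # Initialize with the absolute values of the PCA loadings
    # (top-K left singular vectors of X, no centering).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    W = np.abs(U[:, :K])
    XXt = X @ X.T                                  # p x p, computed once
    eps = 1e-12                                    # assumed guard against division by zero
    for _ in range(max_iter):
        XXtW = XXt @ W
        numer = 2.0 * XXtW
        denom = W @ (W.T @ XXtW) + XXt @ (W @ (W.T @ W)) + eps
        W_new = W * numer / denom                  # element-wise multiplicative update
        W_new /= np.linalg.norm(W_new, ord=2)      # rescale by the spectral norm (assumed)
        converged = np.linalg.norm(W_new - W) <= tol * np.linalg.norm(W)
        W = W_new
        if converged:                              # assumed stopping rule
            break
    return W, W.T @ X

# Toy usage on simulated log-transformed counts (sizes are arbitrary).
rng = np.random.default_rng(0)
X = np.log1p(rng.poisson(2.0, size=(40, 150)).astype(float))
W, S = pnmf(X, K=4)
```

because every factor in the update is non-negative when $\mathbf{X}$ is non-negative, $\mathbf{W}$ stays non-negative throughout, and so does $\mathbf{S} = \mathbf{W}^T\mathbf{X}$.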
scpnmf step ii: basis selection

the second key step of scpnmf is to select informative bases among the $K$ bases found by pnmf (i.e., columns of $\mathbf{W}$ and rows of $\mathbf{S}$) to remove unwanted variations of cells (e.g., variations irrelevant to cell types). the columns of $\mathbf{W}$ enjoy high sparsity and mutual exclusiveness; that is, each column contains positive weights corresponding to a unique small set of genes, so it is expected to reflect a certain biological function. however, some biological functions may not be relevant to the cell heterogeneity of interest, e.g., cell type composition. motivated by this, we propose three strategies for selecting informative bases (columns of $\mathbf{W}$ and rows of $\mathbf{S}$): functional annotations (optional), correlations with cell library sizes, and tests of multimodality.

strategy 1: examine bases by functional annotations (optional)

the first, optional strategy is to annotate the biological function(s) of each basis in the weight matrix. for example, scpnmf may apply gene ontology (go) analysis to the top % genes with the highest weights in each basis (column of $\mathbf{W}$) and record the enriched go terms as the basis' functional annotation. then, users with prior knowledge can interpret the functional annotation on each basis and decide whether or not to remove the basis. for example, if the goal is to delineate cell types in scrna-seq data, a basis corresponding to cell-cycle genes should be removed because it would obscure the distinction of cell types. however, it is worth noting that filtering bases by biological annotations is optional in scpnmf.
conservative users can keep all $K$ bases output by pnmf and directly use data-driven basis selection (section . . ). for our results in this paper, scpnmf removes the bases corresponding to well-known housekeeping genes (section s ).

data-driven strategies

strategy 2: examine bases by correlations with cell library sizes

note that the input of scpnmf is a log-transformed unnormalized count matrix for users' convenience. hence, scpnmf does not adjust for cell library sizes in the computation of $\mathbf{W}$ and $\mathbf{S}$ in step i. given that the variance of cell library sizes contributes to unwanted variations of cells [ ], it is necessary to remove the bases whose corresponding rows in $\mathbf{S}$ are strongly correlated with cell library sizes. we use the total log-transformed counts to approximate the library size of each cell, and we calculate the pearson correlation between each $\mathbf{s}_k$ and the library sizes of the $n$ cells. the strategy is to retain the bases whose pearson correlations are under a pre-defined threshold, which we set to . based on empirical observations (section s . ).

strategy 3: examine bases by multimodality tests

another data-driven strategy is to retain the bases whose corresponding scores are multi-modally distributed. if a basis' score vector (row in $\mathbf{S}$) contains $n$ scores with a multimodality pattern, then it is likely to distinguish cell types and should be retained. to implement this strategy, we use the acr test [ ] to check the multimodality of each basis' score vector. the null hypothesis is that the score vector contains $n$ scores sampled from a unimodal distribution, and the alternative hypothesis is that the distribution has more than one mode. after performing multiple multimodality tests, one per basis, we use the benjamini-hochberg procedure to set a p-value threshold by controlling the false discovery rate under %. the bases whose p-values are under this threshold will be retained.
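the two data-driven strategies can be sketched in numpy as follows. this is an illustrative sketch under stated assumptions: the function names are hypothetical, the multimodality p-values are taken as an input (the acr test itself is not implemented here), and the default values `cor_threshold=0.7` and `fdr=0.05` are placeholders chosen for illustration, not the values used by scpnmf.

```python
import numpy as np

def library_size_correlations(X, S):
    """Pearson correlation between each basis' scores (row of S) and per-cell library size."""
    lib_size = X.sum(axis=0)                       # total log-transformed counts per cell
    return np.array([np.corrcoef(s_k, lib_size)[0, 1] for s_k in S])

def benjamini_hochberg(pvals, fdr):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    passed = pvals[order] <= fdr * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        last = np.nonzero(passed)[0].max()         # largest rank meeting its threshold
        reject[order[: last + 1]] = True
    return reject

def select_bases(X, S, multimodality_pvals, cor_threshold=0.7, fdr=0.05):
    """Indices of bases that pass both screens (threshold values are placeholders)."""
    keep_cor = library_size_correlations(X, S) < cor_threshold
    keep_multimodal = benjamini_hochberg(multimodality_pvals, fdr)
    return np.nonzero(keep_cor & keep_multimodal)[0]

# Toy usage: basis 0 tracks library size exactly, basis 1 is uncorrelated with it.
X_toy = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.0, 1.0, 1.0, 1.0]])
S_toy = np.array([[2.0, 3.0, 4.0, 5.0],
                  [1.0, -1.0, -1.0, 1.0]])
cors = library_size_correlations(X_toy, S_toy)
selected = select_bases(X_toy, S_toy, multimodality_pvals=[0.001, 0.001])
rejected = benjamini_hochberg([0.001, 0.8, 0.015, 0.04, 0.5], fdr=0.05)
```

a basis is kept only if it passes both screens, so a basis is dropped when it is either strongly correlated with library size or not significantly multimodal.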
in summary, scpnmf step ii allows users to use strategy 1 to filter out uninformative bases based on functional annotations if available; then it implements data-driven strategies 2 and 3 to further remove bases that have strong correlations with cell library sizes or exhibit unimodality patterns. the retained bases will have their corresponding columns in $\mathbf{W}$ selected and stacked into the selected weight matrix $\mathbf{W}_S \in \mathbb{R}^{p \times K_0}_{\geq 0}$, where $K_0$ is the number of selected bases.

applications of scpnmf output: informative gene selection and new data projection

the selected weight matrix $\mathbf{W}_S$ output by scpnmf has two main applications: selection of a desired number of informative genes and projection of new targeted gene profiling data onto the low-dimensional space defined by $\mathbf{W}_S$. given a gene number $M$ (e.g., ), scpnmf uses $M$-truncation, a step to select $M$ rows in $\mathbf{W}_S$, resulting in $M$ informative genes and a truncated, selected weight matrix $\mathbf{W}_{S,(M)} \in \mathbb{R}^{M \times K_0}_{\geq 0}$ for new data projection.

$M$-truncation and informative gene selection

we denote the desired number of informative genes by $M \in \mathbb{N}$, with $M \leq$ the number of non-zero rows in $\mathbf{W}_S$. $M$-truncation has three steps.

1. for each gene $i$, calculate its largest weight $w_i$ across bases in $\mathbf{W}_S$:

$$w_i = \max_{k = 1, \ldots, K_0} (\mathbf{W}_S)_{ik}, \quad i = 1, 2, \ldots, p . \qquad ( . )$$

2. order genes by their maximum weights $w_{(1)} \geq w_{(2)} \geq \cdots \geq w_{(p)}$ and set the truncation threshold as $w_{(M)}$. identify the first $M$ genes as informative genes.

3. construct the truncated, selected weight matrix $\mathbf{W}_{S,(M)}$: (1) truncate the selected weight matrix $\mathbf{W}_S$ by setting all $(\mathbf{W}_S)_{ik} < w_{(M)}$ to be 0; (2) keep the $M$ rows with non-zero entries; stack them by row into $\mathbf{W}_{S,(M)}$ based on the order of the informative genes.

in short, scpnmf selects informative genes based on their maximum weights in the selected bases. the rationale is that a gene's maximum weight reflects the gene's contribution to the establishment of the $K_0$-dimensional space, which preserves the $n$ cells' biological variations of interest. hence, genes with larger maximum weights are more informative in the sense of encoding cells' biological variations. an important application of informative gene selection is to guide the design of targeted gene profiling experiments.

new data projection

given the selected $M$ informative genes, once new cells are measured by targeted gene profiling on these genes, $\mathbf{W}_{S,(M)}$ can be used to project the new cells onto the $K_0$-dimensional space in which the cells in the input scrna-seq data are embedded. if the input data has cell type annotations, we refer to the input data as reference data, and we can predict the new cells' types from the types of the cells in the reference data. in detail, new data projection has the following steps:

1. apply scpnmf with $M$-truncation to the input reference data $\mathbf{X} \in \mathbb{R}^{p \times n}_{\geq 0}$ with $n$ cells to obtain the truncated, selected weight matrix $\mathbf{W}_{S,(M)}$. construct $\mathbf{X}_{(M)} \in \mathbb{R}^{M \times n}_{\geq 0}$ as a submatrix of $\mathbf{X}$, with rows corresponding to the rows of $\mathbf{W}_{S,(M)}$, i.e., the $M$ informative genes. hence, the $K_0$-dimensional embeddings of the $n$ cells in the reference data are the columns of

$$\mathbf{S}^{\text{ref}}_{(M)} = \mathbf{W}_{S,(M)}^T \times \mathbf{X}_{(M)} \in \mathbb{R}^{K_0 \times n} . \qquad ( . )$$

2. denote the targeted gene profiling data of $n'$ new cells with $M$ informative genes measured by $\mathbf{X}^{\text{new}}_{(M)} \in \mathbb{R}^{M \times n'}_{\geq 0}$. note that $\mathbf{X}^{\text{new}}_{(M)}$ contains log-transformed counts and has rows (genes) corresponding to the rows of $\mathbf{X}_{(M)}$. project the $n'$ cells to the $K_0$-dimensional space by

$$\mathbf{S}^{\text{new}}_{(M)} = \mathbf{W}_{S,(M)}^T \times \mathbf{X}^{\text{new}}_{(M)} \in \mathbb{R}^{K_0 \times n'} . \qquad ( . )$$

3. (optional) normalize $\mathbf{S}^{\text{new}}_{(M)}$ and $\mathbf{S}^{\text{ref}}_{(M)}$ to remove batch effects, if existent, by using a single-cell integration method such as harmony [ ].

now the $n$ reference cells and the $n'$ new cells are in the same $K_0$-dimensional space with biological variations preserved. then a classifier can be trained on the $n$ reference cells' types and $\mathbf{S}^{\text{ref}}_{(M)}$ for cell type prediction, and it can be used to predict the $n'$ cells' types from $\mathbf{S}^{\text{new}}_{(M)}$.

results

scpnmf outputs a sparse and functionally interpretable representation of scrna-seq data

we first demonstrate that scpnmf step i, pnmf, outputs a sparse and functionally interpretable gene encoding of cells. we use the freggold dataset [ ], which consists of three cell types (three human lung adenocarcinoma cell lines), and set the basis number k = for demonstration purposes. both pca and pnmf learn a weight matrix that can project the original scrna-seq data onto a -dimensional space. unlike the weight matrix of pca that has no zero entries, the weight matrix of pnmf is non-negative, highly sparse, containing . % of entries as zeros, and has bases that are largely mutually exclusive (i.e., non-zero entries in different columns correspond to different rows/genes) (fig. a). go enrichment analysis shows that high weight genes in each pnmf basis are enriched with conceptually similar go terms, and high weight genes in different pnmf bases are enriched with conceptually different go terms (fig. b).
this result indicates that pnmf bases correspond to gene groups with distinct functions. on the contrary, the pca bases do not have good functional interpretations: the high weight genes in each pca basis are not enriched with conceptually similar go terms, and different pca bases share many high weight genes (fig. s ).

figure : illustration of the sparse and interpretable projection found by scpnmf. we use the freggold dataset as an example. (a) comparison of the weight matrices of pca and pnmf. heatmaps visualize the learned weight matrices of pca (top) and pnmf (bottom), where rows are genes and columns are bases. red represents positive weights while blue represents negative weights. the rows are ordered by gene-wise hierarchical clustering. compared to pca, the weight matrix of pnmf is strictly non-negative, much sparser, and mutually exclusive between bases. (b) go analysis result of each basis in the weight matrix of pnmf. texts in black boxes summarize the functions of genes in each basis. the enriched go terms are almost mutually exclusive, implying that each basis represents a unique gene functional cluster. (c) statistical tests on each basis in the score matrix of pnmf. top row: scatter plots of scores and total log-counts (cell library sizes). each dot represents a cell. cell scores in bases and are highly correlated with cell library sizes. bottom row: histograms of cell scores in each basis. scores in bases and show strong multimodality patterns (adjusted p-value ≤ . ). (d) umap visualizations of cells based on high weight genes in the unselected bases and and those in the selected bases , , and . genes in the unselected bases completely fail to distinguish the three cell types, while genes in the selected bases lead to a clear separation of the three cell types.

to further analyze the pnmf bases, we list the top high weight genes in each basis (table s ), from which we identify many well-known genes with important functions. for instance, basis contains classic housekeeping genes, such as gapdh [ ] and ribosomal protein genes (rps-) [ ]; basis contains well-known tumor-related genes, including egfr [ ] and cdk [ ]. in particular, the cells of the hcc cell line (one of the three cell types) have overall high scores in basis (fig. s ), a reasonable result because the hcc cell line contains an egfr activating mutation [ ]. in summary, scpnmf step i outputs bases representing sparse and functionally interpretable gene sets.

basis selection is an essential step in scpnmf

here we explain why basis selection is an essential step in scpnmf. in the last section, we showed that each pnmf basis of the freggold dataset approximately represents one functional gene group. it is well known that housekeeping genes (basis ) and cell-cycle genes (basis ) are usually irrelevant to cell type distinctions. however, such biological knowledge is not always available or certain. therefore, scpnmf mainly relies on the two data-driven strategies, correlations with cell library sizes and multimodality tests (section . . ), for selecting informative bases. fig.
c visualizes the two strategies: cell scores in bases and are highly correlated with cell library sizes (pearson correlations > . ); cell scores in bases and show strong evidence of being multi-modally distributed (adjusted p-value < . ). hence, strategy 2 will not retain bases and , and strategy 3 will not retain bases , , and ; together, bases and will be removed, and bases , , and will be selected.

to verify the effectiveness of basis selection, we use umap to visualize cells based on the top high weight genes in the unselected bases and vs. those in the selected bases , , and (fig. d). we observe that the top genes in the unselected bases completely fail to separate the three cell types, while the top genes in the selected bases perfectly distinguish the three cell types. this result strongly supports that basis selection is a necessary step of scpnmf.

scpnmf outperforms state-of-the-art gene-selection methods on diverse scrna-seq datasets

in this section, we demonstrate scpnmf's capacity for informative gene selection. we comprehensively benchmark scpnmf against other single-cell informative gene selection methods (table s ) on seven scrna-seq datasets (table s ) using three clustering methods (louvain clustering, k-means clustering, and hierarchical clustering). for fair benchmarking, the seven scrna-seq datasets cover both unique molecule identifier (umi) and non-umi protocols and include various biological samples. using the adjusted rand index (ari) as the metric of clustering accuracy, we calculate the ari values of the three clustering methods on each dataset using informative genes selected by each gene selection method, as genes are commonly used in targeted gene profiling. fig. a shows that scpnmf has overall the highest ari values across datasets and clustering methods. in particular, scpnmf has the highest average ari value with each clustering method (louvain: . ; k-means: . ; hierarchical clustering: . ) and the highest overall average ari ( . ) across datasets and clustering methods. note that the mean of the overall average ari values of all methods except scpnmf is only . .

figure : benchmarking scpnmf against informative gene selection methods on seven scrna-seq datasets. (a) clustering accuracies (ari values) of three clustering methods based on the informative genes selected. gene selection methods are ordered from left to right by their average ari across the three clustering methods and the seven datasets. (b) umap visualization of cells in the zheng dataset based on informative genes selected by each method. genes selected by scpnmf lead to a clear separation between naive cytotoxic t cells and regulatory t cells, while the genes selected by other methods do not.

we further show the umap visualization of cells in the zheng dataset based on the informative genes selected by each of the gene selection methods (fig. b). only scpnmf leads to a clear separation of naive cytotoxic t cells and regulatory t cells, while the informative genes selected by other methods except corfs and irlbapcafs cannot distinguish the two cell types at all. we also compare the methods under a varying number of informative genes: , , , and , the commonly used gene numbers in targeted gene profiling. we observe that the overall average ari values of scpnmf are consistently higher than those of other methods, across all informative gene numbers (fig. s ).
moreover, compared with other methods, scpnmf leads to more stable overall average ari values under varying numbers of informative genes, indicating its stronger robustness to the gene number constraint of targeted gene profiling. these results strongly support the superior performance of scpnmf as an informative gene selection method.

scpnmf guides targeted gene profiling experimental design and cell-type prediction

in this section, we demonstrate how scpnmf can guide the selection of genes to be measured in a targeted gene profiling experiment, and how scpnmf enables subsequent cell type annotation on the targeted gene profiling data. we design two case studies with paired scrna-seq reference data and "pseudo" targeted gene profiling data, whose per-cell sequencing depth is higher than that of the corresponding scrna-seq data.

in the first case study, we use the zheng dataset (measured by the x protocol) as the reference dataset. to generate the pseudo targeted gene profiling data, we use a new single-cell gene expression simulator that captures gene correlations, scdesign [ ], to generate data with a - time higher per-cell sequencing depth. in the second case study, we use the pbmc x dataset (measured by the x protocol) as the reference dataset, and we use pbmcsmartseq (measured by smart-seq ) as the pseudo targeted gene profiling data because smart-seq has a higher per-gene sequencing depth than x does.
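the workflow behind these case studies, $M$-truncation of the selected weight matrix followed by projection of reference and new cells into the same space, can be sketched in numpy as follows. this is a toy sketch under stated assumptions: the function names and the tiny matrices are hypothetical, ties at the truncation threshold are broken by gene order, and a nearest-centroid rule stands in for the classifiers used in the paper (random forest, knn and svm), purely to keep the example self-contained.

```python
import numpy as np

def m_truncation(W_s, M):
    """Pick M informative genes by maximum weight and truncate W_s (p x K0) to M x K0."""
    max_w = W_s.max(axis=1)                      # largest weight of each gene across bases
    order = np.argsort(-max_w, kind="stable")    # genes by decreasing maximum weight
    genes = order[:M]
    threshold = max_w[genes[-1]]                 # w_(M): the M-th largest maximum weight
    if threshold <= 0:
        raise ValueError("M exceeds the number of non-zero rows in W_s")
    W_trunc = np.where(W_s >= threshold, W_s, 0.0)[genes]
    return genes, W_trunc

def nearest_centroid_predict(S_ref, labels_ref, S_new):
    """Assign each new cell (column of S_new) the label of the closest reference centroid."""
    labels_ref = np.asarray(labels_ref)
    classes = np.unique(labels_ref)
    centroids = np.stack([S_ref[:, labels_ref == c].mean(axis=1) for c in classes])
    dists = np.linalg.norm(S_new.T[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Toy selected weight matrix: 5 genes, K0 = 2 selected bases.
W_s = np.array([[0.9, 0.0], [0.0, 0.8], [0.1, 0.05], [0.0, 0.0], [0.3, 0.6]])
genes, W_trunc = m_truncation(W_s, M=3)

# Reference data (5 genes x 4 cells) with known cell types.
X_ref = np.array([[5.0, 4.0, 0.0, 0.0],
                  [0.0, 0.0, 5.0, 4.0],
                  [1.0, 2.0, 1.0, 2.0],
                  [3.0, 3.0, 3.0, 3.0],
                  [0.0, 1.0, 4.0, 5.0]])
labels_ref = np.array(["a", "a", "b", "b"])
S_ref = W_trunc.T @ X_ref[genes]                 # K0 x n reference embeddings

# New cells measured only on the M informative genes (rows follow `genes`).
X_new_m = np.array([[4.0, 0.0], [0.0, 5.0], [0.0, 5.0]])
S_new = W_trunc.T @ X_new_m                      # K0 x n' new embeddings
pred = nearest_centroid_predict(S_ref, labels_ref, S_new)
```

the optional batch-effect normalization step is omitted here; in practice any classifier trained on the reference embeddings could replace the nearest-centroid rule.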
in both case studies, for each gene selection method, the corresponding pseudo targeted gene profiling datasets only contain the $M$ informative genes selected by the method. we benchmark scpnmf against the gene selection methods in terms of cell type prediction on the pseudo targeted gene profiling data. to avoid the bias for a specific classification algorithm, we apply three popular algorithms for cell type prediction: random forest (rf) [ ], k-nearest neighbors (knn) [ ], and support vector machine (svm) [ ]. in each case study, we first train each classification algorithm on the low-dimensional embeddings of the reference cells $\mathbf{S}^{\text{ref}}_{(M)}$ given the m = informative genes selected by each gene selection method. then we apply the trained classifier to the low-dimensional embeddings of the cells in the pseudo targeted gene profiling data $\mathbf{S}^{\text{new}}_{(M)}$. table shows that scpnmf leads to the highest average prediction accuracy ( . ) across six combinations (two case studies × three classification algorithms). moreover, scpnmf achieves the highest accuracy in each combination except zheng + random forest, where it is the second best. these results confirm that scpnmf effectively guides the selection of genes to measure in targeted gene profiling experiments, and it enables accurate cell type annotation on newly generated targeted gene profiling datasets.

table : prediction accuracy of cell types based on informative genes selected by gene selection methods in the two case studies with paired reference scrna-seq data and targeted gene profiling data

method | zheng rf | zheng knn | zheng svm | pbmc rf | pbmc knn | pbmc svm | average accuracy
scpnmf | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
m drop | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
seuratdisp | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
corfs | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
giniclust | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
scran | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
seuratmvp | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
scanpy | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
scmarker | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
seuratvst | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
danb | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .
irlbapcafs | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | . ( . , . ) | .

values in parentheses are % confidence intervals. the highest number within each column is labeled by underline.

discussion

we propose scpnmf, an unsupervised gene selection and data projection method for scrna-seq data. the major goal of scpnmf is to select a fixed number of informative genes to distinguish cell types and guide gene selection for targeted gene profiling experiments. moreover, scpnmf can project a new targeted gene profiling dataset with the selected genes to the low-dimensional space that embeds a reference scrna-seq dataset. we perform a comprehensive benchmark to evaluate scpnmf in terms of informative gene selection against the state-of-the-art gene selection methods. our results show that scpnmf consistently outperforms existing methods for a wide range of informative gene numbers (from to ) on diverse scrna-seq datasets.
we also demonstrate that the informative genes selected by scpnmf can effectively guide gene selection for targeted gene profiling and lead to accurate cell type annotation on targeted gene profiling data based on reference scrna-seq data.

besides gene selection and data projection, scpnmf also works as a dimensionality reduction method with good interpretability. each dimension in the low-dimensional space found by scpnmf can be considered as a new functional “feature” (a linear combination of correlated and thus functionally related genes). moreover, the mutual exclusiveness makes the pnmf bases used in scpnmf advantageous over the pca bases in terms of removing confounding effects. for example, cell-cycle genes obscure the identification of cell types and should be removed from low-dimensional embeddings of cells. for pca, cell-cycle genes affect many pca bases, so the popular scrna-seq pipeline seurat implements a complicated approach that first calculates “cell-cycle scores” and then regresses each basis (principal component) on these scores to remove the effects of cell-cycle genes [ ]. in contrast, cell-cycle genes are concentrated in only one pnmf basis, so it is easy to remove that basis to clear the effects of cell-cycle genes. therefore, scpnmf has great potential in deciphering cell heterogeneity in single-cell data by working as an interpretable dimensionality reduction method.

the current implementation of scpnmf focuses on single-cell gene expression data. considering the rapid development of single-cell multi-omics technologies, we plan to extend scpnmf to accommodate other technologies that measure other genomics features, such as chromatin accessibility landscapes measured by single-cell atac-seq [ ], or even to integrate data across multi-omics datasets. another note is that the multimodality test for basis selection in scpnmf only accounts for discrete cell types but not continuous cell trajectories. therefore, other tests or strategies are needed to select informative bases to capture biological variations along continuous cell trajectories.

an important question for gene selection is: how many genes should be selected as informative genes to fully capture the biological variations of interest? in our studies, we observe that, after the informative gene number reaches , the clustering accuracies based on the selected informative genes plateau for most gene selection methods including scpnmf. therefore, genes may be sufficient for capturing biological variations in scrna-seq data. however, it remains challenging to decide the minimum number of informative genes, given that the underlying cell sub-population structure is data-specific and might be complex. we plan to explore this problem in the future with the possible use of information theory.

software and code

the r package scpnmf is available at https://github.com/jsb-ucla/scpnmf.

acknowledgements

we acknowledge the comments and feedback from the members of the junction of statistics and biology at ucla (http://jsb.ucla.edu).

funding

this work was supported by the following grants: nsf dms- and dbi- , nih/nigms r gm , phrma foundation research starter grant in informatics, johnson and johnson wistem d award, and sloan research fellowship (to j.j.l.); nih/ninds r ns (to r.w).

competing interests

none.

references

[ ] s steven potter. single-cell rna sequencing for the study of development, physiology and disease. nature reviews nephrology.
[ ] kenneth d birnbaum. power in numbers: single-cell rna-seq strategies to dissect complex tissues. annual review of genetics.
[ ] chenxu zhu, sebastian preissl, and bing ren. single-cell multimodal omics: the power of many. nature methods.
[ ] olivier thellin, willy zorzi, bernard lakaye, b de borman, bernard coumans, georges hennen, thierry grisar, ahmed igout, and ernst heinen. housekeeping genes as internal standards: use and limits. journal of biotechnology.
[ ] eli eisenberg and erez y levanon. human housekeeping genes, revisited. trends in genetics.
[ ] aravind subramanian, rajiv narayan, steven m corsello, david d peck, ted e natoli, xiaodong lu, joshua gould, john f davis, andrew a tubelli, jacob k asiedu, et al. a next generation connectivity map: l platform and the first , , profiles. cell.
[ ] arjun raj, patrick van den bogaard, scott a rifkin, alexander van oudenaarden, and sanjay tyagi. imaging individual mrna molecules using multiple singly labeled probes. nature methods.
[ ] jeffrey r moffitt, junjie hao, guiping wang, kok hao chen, hazen p babcock, and xiaowei zhuang. high-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization. proceedings of the national academy of sciences.
[ ] fatma uzbas, florian opperer, can sönmezer, dmitry shaposhnikov, steffen sass, christian krendl, philipp angerer, fabian j theis, nikola s mueller, and micha drukker. bart-seq: cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and single-cell analysis. genome biology.
[ ] jamie l marshall, benjamin r doughty, vidya subramanian, philine guckelberger, qingbo wang, linlin m chen, samuel g rodriques, kaite zhang, charles p fulco, joseph nasser, et al. hypr-seq: single-cell quantification of chosen rnas via hybridization and sequencing of dna probes. proceedings of the national academy of sciences.
[ ] christoph hafemeister and rahul satija. normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. genome biology.
[ ] tim stuart, andrew butler, paul hoffman, christoph hafemeister, efthymia papalexi, william m mauck iii, yuhan hao, marlon stoeckius, peter smibert, and rahul satija. comprehensive integration of single-cell data. cell.
[ ] aaron tl lun, karsten bach, and john c marioni. pooling across cells to normalize single-cell rna sequencing data with many zero counts. genome biology.
[ ] tallulah s andrews and martin hemberg. m drop: dropout-based feature selection for scrnaseq. bioinformatics.
[ ] lan jiang, huidong chen, luca pinello, and guo-cheng yuan. giniclust: detecting rare cell types from single-cell gene expression data with gini index. genome biology.
[ ] fang wang, shaoheng liang, tapsi kumar, nicholas navin, and ken chen. scmarker: ab initio marker selection for single cell transcriptome profiling. plos computational biology.
[ ] evan z macosko, anindita basu, rahul satija, james nemesh, karthik shekhar, melissa goldman, itay tirosh, allison r bialas, nolan kamitaki, emily m martersteck, et al. highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. cell.
[ ] maayan baron, adrian veres, samuel l wolock, aubrey l faust, renaud gaujoux, amedeo vetere, jennifer hyoje ryu, bridget k wagner, shai s shen-orr, allon m klein, et al. a single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. cell systems.
[ ] xun zhu, travers ching, xinghua pan, sherman m weissman, and lana garmire. detecting heterogeneity in single-cell rna-seq data by non-negative matrix factorization. peerj.
[ ] philippe boileau, nima s hejazi, and sandrine dudoit. exploring high-dimensional biological data with sparse contrastive principal component analysis. bioinformatics.
[ ] zhana duren, xi chen, mahdi zamanighomi, wanwen zeng, ansuman t satpathy, howard y chang, yong wang, and wing hung wong. integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. proceedings of the national academy of sciences.
[ ] ghislain durif, laurent modolo, jeff e mold, sophie lambert-lacroix, and franck picard. probabilistic count matrix factorization for single cell expression data analysis. bioinformatics.
[ ] shuqin zhang, liu yang, jinwen yang, zhixiang lin, and michael k ng. dimensionality reduction for single cell rna sequencing data using constrained robust non-negative matrix factorization. nar genomics and bioinformatics.
[ ] chao gao and joshua d welch. iterative refinement of cellular identity from single-cell data using online learning. in international conference on research in computational molecular biology. springer.
[ ] zi yang and george michailidis. a non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. bioinformatics.
[ ] joshua d welch, velina kozareva, ashley ferreira, charles vanderburg, carly martin, and evan z macosko. single-cell multi-omic integration compares and contrasts features of brain cell identity. cell.
[ ] zhijian yuan, zhirong yang, and erkki oja. projective nonnegative matrix factorization: sparseness, orthogonality, and clustering. neural process. lett.
[ ] zhirong yang and erkki oja. linear and nonlinear projective nonnegative matrix factorization. ieee transactions on neural networks.
[ ] daniel d lee and h sebastian seung. learning the parts of objects by non-negative matrix factorization. nature.
[ ] jean-philippe brunet, pablo tamayo, todd r golub, and jill p mesirov. metagenes and molecular pattern discovery using matrix factorization. proceedings of the national academy of sciences.
[ ] jose ameijeiras-alonso, rosa m crujeiras, and alberto rodríguez-casal. mode testing, critical bandwidth and excess mass. test.
[ ] ilya korsunsky, nghia millard, jean fan, kamil slowikowski, fan zhang, kevin wei, yuriy baglaenko, michael brenner, po-ru loh, and soumya raychaudhuri. fast, sensitive and accurate integration of single-cell data with harmony. nature methods.
[ ] saskia freytag, luyi tian, ingrid lönnstedt, milica ng, and melanie bahlo. comparison of clustering tools in r for medium-sized x genomics single-cell rna-sequencing data. f research.
[ ] robert d barber, dan w harmer, robert a coleman, and brian j clark. gapdh as a housekeeping gene: analysis of gapdh mrna expression in a panel of human tissues. physiological genomics.
[ ] nicholas silver, steve best, jie jiang, and swee lay thein. selection of housekeeping genes for gene expression studies in human reticulocytes using real-time pcr. bmc molecular biology.
[ ] collin m blakely, thomas bk watkins, wei wu, beatrice gini, jacob j chabon, caroline e mccoach, nicholas mcgranahan, gareth a wilson, nicolai j birkbak, victor r olivas, et al. evolution and clinical impact of co-occurring genetic alterations in advanced-stage egfr-mutant lung cancers. nature genetics.
[ ] ben o’leary, richard s finn, and nicholas c turner. treating cancer with selective cdk / inhibitors. nature reviews clinical oncology.
[ ] carminia maria della corte, umberto malapelle, elena vigliar, francesco pepe, giancarlo troncone, vincenza ciaramella, teresa troiani, erika martinelli, valentina belli, fortunato ciardiello, et al. efficacy of continuous egfr-inhibition and role of hedgehog in egfr acquired resistance in human lung cancer cells with activating mutation of egfr. oncotarget.
[ ] tianyi sun, dongyuan song, wei vivian li, and jingyi jessica li. scdesign : an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. biorxiv.
[ ] leo breiman. random forests. machine learning.
[ ] bernhard e boser, isabelle m guyon, and vladimir n vapnik. a training algorithm for optimal margin classifiers. in proceedings of the fifth annual workshop on computational learning theory.
[ ] sebastian pott and jason d lieb. single-cell atac-seq: strength in numbers. genome biology.
[ ] angelo duò, mark d robinson, and charlotte soneson. a systematic performance evaluation of clustering methods for single-cell rna-seq data. f research.
[ ] jiarui ding, xian adiconis, sean k simmons, monika s kowalczyk, cynthia c hession, nemanja d marjanovic, travis k hughes, marc h wadsworth, tyler burks, lan t nguyen, et al. systematic comparison of single-cell and single-nucleus rna-sequencing methods. nature biotechnology.
[ ] jose alquicira-hernandez, anuja sathe, hanlee p ji, quan nguyen, and joseph e powell. scpred: accurate supervised method for cell-type classification from single-cell rna-seq data. genome biology.
[ ] spyros darmanis, steven a sloan, ye zhang, martin enge, christine caneda, lawrence m shuer, melanie g hayden gephart, ben a barres, and stephen r quake. a survey of human brain transcriptome diversity at the single cell level. proceedings of the national academy of sciences.
[ ] itay tirosh, benjamin izar, sanjay m prakadan, marc h wadsworth, daniel treacy, john j trombetta, asaf rotem, christopher rodman, christine lian, george murphy, et al. dissecting the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. science.

supplementary materials

s1 choice of parameters and robustness analysis

s1.1 low rank k

in the development of scpnmf, we were motivated by the objective function of the pnmf method,

$$\min_{W \in \mathbb{R}^{p \times k}_{\geq 0}} \|X - WW^{T}X\|. \tag{S1}$$

pnmf aims to inherit the advantages of pca, such as basis orthogonality and the ability to project new data. however, a key constraint in pca, $W^{T}W = I$, is sacrificed in order to meet the condition $W \geq 0$ in pnmf. to get closer to pca and thus attain its nice properties, we propose to use the normalized difference between $W^{T}W$ and $I$ to measure the orthogonality of $W$:

$$\text{dev.ortho} = \|I - W^{T}W\| / k, \tag{S2}$$

which is also indicative of the performance in the downstream analysis. this naturally yields a method for determining the number of bases, k: we perform pnmf for a sequence of k's, calculate the dev.ortho measure for the $W \in \mathbb{R}^{p \times k}_{\geq 0}$ optimized by pnmf for each k, and then plot dev.ortho against k. users can choose the cutoff where the curve stabilizes or shows a clear elbow. in fig. s1, with the zheng [ ] dataset, we demonstrate that (1) the dev.ortho measure is highly correlated with the performance of $W$ in the downstream analysis; (2) in real data applications, the dev.ortho measure shows a clear elbow pattern, which is helpful for users to determine k.
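as a rough illustration of the dev.ortho measure, the python sketch below computes it for a small weight matrix. it is not the scpnmf implementation (the package is written in r): the function name `dev_ortho` is hypothetical, the norm is assumed to be the frobenius norm, and the normalizer is taken to be k as the formula reads in this text.

```python
import math


def dev_ortho(W):
    """Normalized deviation of W^T W from the identity matrix.

    W is a p x k weight matrix given as a list of p rows with k entries each.
    Assumptions (not confirmed by the source): Frobenius norm, normalizer k.
    """
    p, k = len(W), len(W[0])
    squared_sum = 0.0
    for a in range(k):
        for b in range(k):
            # Entry (a, b) of the Gram matrix W^T W.
            gram = sum(W[i][a] * W[i][b] for i in range(p))
            target = 1.0 if a == b else 0.0
            squared_sum += (target - gram) ** 2
    return math.sqrt(squared_sum) / k


# Orthonormal columns: W^T W equals the identity, so the deviation is zero.
W_orthonormal = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Two identical columns: far from orthogonal, so the deviation is positive.
W_overlapping = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

print(dev_ortho(W_orthonormal))
print(dev_ortho(W_overlapping))
```

in practice one would evaluate this quantity on the weight matrix returned by pnmf for each candidate k and inspect the resulting curve for a plateau or elbow.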
empirically, we see that dev.ortho reaches stability at k = for most scrna-seq data. to provide a suggestion for users and to save computation, we set the default number of bases in scpnmf to be k = .

s1.2 r : threshold for correlations between score vectors and cell library sizes in scpnmf step ii: basis selection

in real data applications, the threshold r for correlations between score vectors and cell library sizes in scpnmf step ii (basis selection) needs to be pre-defined. in the field, researchers often specify thresholds to one decimal digit, such as . . by empirically running k-means clustering on the seven datasets (see table s3) with different thresholds { . , . , . , . , . }, as shown in fig. s2, we suggest setting r = . for k ≥ , and more conservatively, r = . when the basis number k is small (k < ).

s2 functional annotation

we use the r package clusterprofiler [y] to perform go analysis. we set the gene ontology as “bp” and the adjusted p-value cutoff as . . the output go terms are simplified by clusterprofiler. in this paper, we only perform a very conservative filtering based on functionality. we define the common housekeeping gene list as actb, actg , b m, gapdh, malat . if the top high weight genes from one basis contain any of these genes, this basis will be filtered out.

s3 data preprocessing

scpnmf only performs minimum data preprocessing to avoid information loss. denote the scrna-seq count matrix that scpnmf works on as $X^{c} \in \mathbb{N}^{p \times n}$, with rows representing p genes and columns representing n cells.
users make the log count matrix $X \in \mathbb{R}^{p \times n}_{\geq 0}$ by taking the log transformation with a pseudo-count of 1:

$$X_{ij} = \log\left(X^{c}_{ij} + 1\right), \quad i = 1, \cdots, p, \; j = 1, \cdots, n. \tag{S3}$$

scpnmf takes the log count matrix $X \in \mathbb{R}^{p \times n}_{\geq 0}$ as the input. with log transformation, the effect of a few extremely large counts is alleviated, and the transformed continuous values are more flexible to model. we introduce the pseudo-count to avoid negative and infinite values in the later pnmf optimization step.

for scrna-seq data used in this paper (table s3), we filtered out genes that are expressed in fewer than % of the cells, and then filtered out cells that express fewer than % of the remaining genes. additionally, malat , mitochondrial and ribosomal genes are filtered for datasets pbmc x and pbmcsmartseq according to the reference paper [ ]. users are able to adjust the filtering process before they input the log count matrix into scpnmf.

s4 details in informative gene selection and clustering

in this paper, we compare scpnmf with other informative gene selection methods (table s2). some gene selection methods do not let users pre-define an arbitrary gene number; for those methods (e.g., scmarker [ ]), we shift the tuning parameters until their output gene numbers equal the desired gene number. therefore, their outputs might not achieve their optimal results. we apply three clustering algorithms: louvain clustering (by seurat), k-means clustering (by r function kmeans), and hierarchical clustering (by r function hclust). we perform pca on informative genes and use the top pcs for clustering. the adjusted rand index (ari) is used as the metric of clustering accuracy. ari is defined as:

$$\mathrm{ARI}(P, T) = \frac{\sum_{l,s} \binom{n_{ls}}{2} - \left[\sum_{l} \binom{a_{l}}{2} \sum_{s} \binom{b_{s}}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{l} \binom{a_{l}}{2} + \sum_{s} \binom{b_{s}}{2}\right] - \left[\sum_{l} \binom{a_{l}}{2} \sum_{s} \binom{b_{s}}{2}\right] / \binom{n}{2}}, \tag{S4}$$

where $P$ denotes the inferred cluster labels, taking $L$ distinct values, and $T$ denotes the true cluster labels, taking $S$ distinct values; $L$ and $S$ need not be equal. $n_{ls} = \sum_{i} I(p_{i} = l)\, I(t_{i} = s)$ counts the cells $i$ whose inferred label $p_{i}$ is $l$ and whose true label $t_{i}$ is $s$, $a_{l} = \sum_{s} n_{ls}$, and $b_{s} = \sum_{l} n_{ls}$. ari is at most 1, and an ari value close to 1 means more accurate inferred clusters.

to minimize the effects caused by parameters (resolution r in louvain and number of clusters k in k-means and hierarchical clustering), we try a sequence of parameters:

r ∈ { . , . , . , . , . , . , . , . , . , . , . , . , . , . }, k ∈ { , , , · · · , }, (s5)

and use the average of the top three ari values across different parameters as the final output.

s5 details in new data projection and cell type prediction

we use two datasets, zheng and pbmc x, as the reference scrna-seq datasets. for the zheng dataset, we first use scdesign [ ] to learn the underlying parameters, and then simulate a new dataset with the same genes and cell types but times higher sequencing depth compared to the zheng dataset. for the pbmc x dataset, we use the pbmcsmartseq dataset, which measures the exact same sample and contains all genes measured in pbmc x. given m selected genes, the simulated zheng and pbmc x are subset to those genes, and serve as the “pseudo” targeted gene profiling data measuring only m genes.

for cell type prediction, we project every targeted gene profiling dataset and its scrna-seq reference onto the same low-dimensional space, which mainly follows the idea from scpred [ ]. when applying scpnmf, we use the weight matrix $W_{S,(m)}$ to project both the reference dataset and the targeted gene profiling dataset.
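the ari used in the clustering benchmark, together with the averaging of the three highest values over a parameter sweep, can be sketched in python as follows. this is an illustrative sketch rather than the code behind the reported results: the function names `adjusted_rand_index` and `mean_top_three` are hypothetical, and returning 1.0 in the degenerate case where the denominator vanishes is a convention assumed here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(inferred, truth):
    """ARI between two labelings of the same n cells."""
    if len(inferred) != len(truth):
        raise ValueError("both labelings must cover the same cells")
    n = len(inferred)
    if n < 2:
        raise ValueError("at least two cells are required")
    n_ls = Counter(zip(inferred, truth))  # contingency counts n_ls
    a = Counter(inferred)                 # sizes a_l of inferred clusters
    b = Counter(truth)                    # sizes b_s of true clusters
    sum_nls = sum(comb(c, 2) for c in n_ls.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case, e.g. a single cluster in both labelings
        # (assumed convention: treat as perfect agreement).
        return 1.0
    return (sum_nls - expected) / (max_index - expected)


def mean_top_three(ari_values):
    """Average of the three highest ARI values from a parameter sweep."""
    top = sorted(ari_values, reverse=True)[:3]
    return sum(top) / len(top)


# Same partition with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Partitions that cut across each other: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
# Summary over a hypothetical sweep of clustering parameters.
print(mean_top_three([0.2, 0.9, 0.5, 0.7]))
```

because the ari only depends on the contingency counts, the label names themselves do not matter, which is why the first example scores 1 even though the two labelings use different names for the same groups.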
for other gene selection methods, we first subset the reference dataset with only m selected genes, run pca to get a weight matrix wpca, and use it to project both the reference dataset (with only m genes) and targeted gene profiling dataset. after getting two low-dimensional embeddings of reference and targeted gene profiling data, we run the harmony algorithm [ ] to remove the technical variations between two low-dimensional em- beddings. then we apply three classification algorithms, random forest (rf), k-nearest neighbors (knn) and support vector machine with radial kernel (svmradial) in r package caret [k]. when fitting the training model, we use -fold cross-validation with three repeats. .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / table s : top high weight genes in each pnmf basis of fretaggold dataset basis gene symbol description rps , tmsb x, gapdh, rpl , rpl , fth , malat , cox , rpl , rps highly expressed housekeeping genes cd , ptgr , hla-b, aldh a , c orf , lcn , igfbp , saa , cxcl , hla-dra immune-related genes sec g, cdk , ccn , g s , eloc, vopp , egfr, f , cdkn a, epcam tumor-related genes (oncogenes, tumor suppressor genes) h c , cks b, hmgb , smc , pttg , kpna , ccnb , cdkn , cks , cdc genes related to mitotic cell cycle hspb , ube s, cald , tmem , fis , isoc , zn- hit , c orf , ndufa , ppp r a genes related to mitochondrion .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . 
; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / table s : overview of informative gene selection used in this study method user-defined gene # language package reference corfs yes r m drop (version . . ) [ ] danb yes r m drop (version . . ) [ ] giniclust yes r m drop (version . . ) [ ] irlbapcafs yes r m drop (version . . ) [ ] m drop yes r m drop (version . . ) [ , ] scanpy yes python scanpy (version . . ) [w] scmarker no r scmarker [ ] scran yes r scran (version . . ) [ ] seuratdisp yes r seurat (version . . ) [ , ] seuratmvp no r seurat (version . . ) [ ] seuratvst yes r seurat (version . . ) [ ] : due to failure in scmarker r package installation, we run the r script downloaded from https://github.com/kchen-lab/scmarker on september , . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://github.com/kchen-lab/scmarker https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / table s : overview of datasets used in this study dataset sequencing proto- col gene # cell # cell type # true label description ref darmanis smart-seq no human adult corti- cal samples [ ] freytaggold xgenomics chromium yes mixture of human lung adenocarcinoma cell lines [ ] tirosh smart-seq no human melanoma tumors [ ] pbmc x xgenomics chromium no human peripheral blood mononuclear cells. x-v for sample in the original paper. [ ] pbmcsmartseq smart-seq no human peripheral blood mononuclear cells. smart-seq for sample in the original paper. 
[ ] zheng xgenomics gemcode yes mixture of human peripheral blood mononuclear cells [ , z] zheng xgenomics gemcode yes mixture of human peripheral blood mononuclear cells [ , z] .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / . . . . . . . . . k d e v. o rt h o a r i choice of k figure s : comparison of dev.ortho and k-means ari against low rank k on zheng [ ] dataset. . . . . . . . . . . . r a r i choice of r figure s : comparison of k-means ari against r , the threshold for correlations between score vectors and cell library sizes in scpnmf step ii: basis selection. the mean ari and the error bars are calculated across seven datasets (see table s ). .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / figure s : go annotation on weight matrix of pca. the enriched go terms between basis are largely overlapped. .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . 
figure s : scpnmf scores versus total log-counts of freggold dataset colored by cell types. basis distinguishes h from the other two cell types and basis distinguishes hcc from the other two cell types.

figure s : benchmarking scpnmf and other informative gene selection methods using , , , genes.

figure s : comparison of overall average ari of different methods versus gene numbers. the y-axis indicates the average ari values across seven datasets and three clustering methods for each gene selection method.

references
[ ] s steven potter. single-cell rna sequencing for the study of development, physiology and disease. nature reviews nephrology, ( ): – , .
[ ] kenneth d birnbaum. power in numbers: single-cell rna-seq strategies to dissect complex tissues. annual review of genetics, : – , .
[ ] chenxu zhu, sebastian preissl, and bing ren. single-cell multimodal omics: the power of many. nature methods, ( ): – , .
[ ] olivier thellin, willy zorzi, bernard lakaye, b de borman, bernard coumans, georges hennen, thierry grisar, ahmed igout, and ernst heinen. housekeeping genes as internal standards: use and limits. journal of biotechnology, ( - ): – , .
[ ] eli eisenberg and erez y levanon. human housekeeping genes, revisited. trends in genetics, ( ): – , .
[ ] aravind subramanian, rajiv narayan, steven m corsello, david d peck, ted e natoli, xiaodong lu, joshua gould, john f davis, andrew a tubelli, jacob k asiedu, et al. a next generation connectivity map: l platform and the first , , profiles. cell, ( ): – , .
[ ] arjun raj, patrick van den bogaard, scott a rifkin, alexander van oudenaarden, and sanjay tyagi. imaging individual mrna molecules using multiple singly labeled probes. nature methods, ( ): – , .
[ ] jeffrey r moffitt, junjie hao, guiping wang, kok hao chen, hazen p babcock, and xiaowei zhuang. high-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization. proceedings of the national academy of sciences, ( ): – , .
[ ] fatma uzbas, florian opperer, can sönmezer, dmitry shaposhnikov, steffen sass, christian krendl, philipp angerer, fabian j theis, nikola s mueller, and micha drukker. bart-seq: cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and single-cell analysis. genome biology, ( ): – , .
[ ] jamie l marshall, benjamin r doughty, vidya subramanian, philine guckelberger, qingbo wang, linlin m chen, samuel g rodriques, kaite zhang, charles p fulco, joseph nasser, et al. hypr-seq: single-cell quantification of chosen rnas via hybridization and sequencing of dna probes. proceedings of the national academy of sciences, ( ): – , .
[ ] christoph hafemeister and rahul satija. normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. genome biology, ( ): – , .
[ ] tim stuart, andrew butler, paul hoffman, christoph hafemeister, efthymia papalexi, william m mauck iii, yuhan hao, marlon stoeckius, peter smibert, and rahul satija. comprehensive integration of single-cell data. cell, : – , . doi: . /j.cell. . . . url https://doi.org/ . /j.cell. . . .
[ ] aaron tl lun, karsten bach, and john c marioni. pooling across cells to normalize single-cell rna sequencing data with many zero counts. genome biology, ( ): , .
[ ] tallulah s andrews and martin hemberg. m drop: dropout-based feature selection for scrnaseq. bioinformatics, ( ): – , .
[ ] lan jiang, huidong chen, luca pinello, and guo-cheng yuan. giniclust: detecting rare cell types from single-cell gene expression data with gini index. genome biology, ( ): , .
[ ] fang wang, shaoheng liang, tapsi kumar, nicholas navin, and ken chen. scmarker: ab initio marker selection for single cell transcriptome profiling. plos computational biology, ( ):e , .
[ ] evan z macosko, anindita basu, rahul satija, james nemesh, karthik shekhar, melissa goldman, itay tirosh, allison r bialas, nolan kamitaki, emily m martersteck, et al. highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. cell, ( ): – , .
[ ] maayan baron, adrian veres, samuel l wolock, aubrey l faust, renaud gaujoux, amedeo vetere, jennifer hyoje ryu, bridget k wagner, shai s shen-orr, allon m klein, et al. a single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. cell systems, ( ): – , .
[ ] xun zhu, travers ching, xinghua pan, sherman m weissman, and lana garmire. detecting heterogeneity in single-cell rna-seq data by non-negative matrix factorization. peerj, :e , .
[ ] philippe boileau, nima s hejazi, and sandrine dudoit. exploring high-dimensional biological data with sparse contrastive principal component analysis. bioinformatics, ( ): – , .
[ ] zhana duren, xi chen, mahdi zamanighomi, wanwen zeng, ansuman t satpathy, howard y chang, yong wang, and wing hung wong. integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. proceedings of the national academy of sciences, ( ): – , .
[ ] ghislain durif, laurent modolo, jeff e mold, sophie lambert-lacroix, and franck picard. probabilistic count matrix factorization for single cell expression data analysis. bioinformatics, ( ): – , .
[ ] shuqin zhang, liu yang, jinwen yang, zhixiang lin, and michael k ng. dimensionality reduction for single cell rna sequencing data using constrained robust non-negative matrix factorization. nar genomics and bioinformatics, ( ):lqaa , .
[ ] chao gao and joshua d welch. iterative refinement of cellular identity from single-cell data using online learning. in international conference on research in computational molecular biology, pages – . springer, .
[ ] zi yang and george michailidis. a non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. bioinformatics, ( ): – , .
[ ] joshua d welch, velina kozareva, ashley ferreira, charles vanderburg, carly martin, and evan z macosko. single-cell multi-omic integration compares and contrasts features of brain cell identity. cell, ( ): – , .
[ ] zhijian yuan, zhirong yang, and erkki oja. projective nonnegative matrix factorization: sparseness, orthogonality, and clustering. neural process. lett, pages – , .
[ ] zhirong yang and erkki oja. linear and nonlinear projective nonnegative matrix factorization. ieee transactions on neural networks, ( ): – , .
[ ] daniel d lee and h sebastian seung. learning the parts of objects by non-negative matrix factorization. nature, ( ): – , .
[ ] jean-philippe brunet, pablo tamayo, todd r golub, and jill p mesirov. metagenes and molecular pattern discovery using matrix factorization. proceedings of the national academy of sciences, ( ): – , .
[ ] jose ameijeiras-alonso, rosa m crujeiras, and alberto rodríguez-casal. mode testing, critical bandwidth and excess mass. test, ( ): – , .
[ ] ilya korsunsky, nghia millard, jean fan, kamil slowikowski, fan zhang, kevin wei, yuriy baglaenko, michael brenner, po-ru loh, and soumya raychaudhuri. fast, sensitive and accurate integration of single-cell data with harmony. nature methods, ( ): – , .
[ ] saskia freytag, luyi tian, ingrid lönnstedt, milica ng, and melanie bahlo. comparison of clustering tools in r for medium-sized x genomics single-cell rna-sequencing data. f research, , .
[ ] robert d barber, dan w harmer, robert a coleman, and brian j clark. gapdh as a housekeeping gene: analysis of gapdh mrna expression in a panel of human tissues. physiological genomics, ( ): – , .
[ ] nicholas silver, steve best, jie jiang, and swee lay thein. selection of housekeeping genes for gene expression studies in human reticulocytes using real-time pcr. bmc molecular biology, ( ): , .
[ ] collin m blakely, thomas bk watkins, wei wu, beatrice gini, jacob j chabon, caroline e mccoach, nicholas mcgranahan, gareth a wilson, nicolai j birkbak, victor r olivas, et al. evolution and clinical impact of co-occurring genetic alterations in advanced-stage egfr-mutant lung cancers. nature genetics, ( ): – , .
[ ] ben o’leary, richard s finn, and nicholas c turner. treating cancer with selective cdk / inhibitors. nature reviews clinical oncology, ( ): – , .
[ ] carminia maria della corte, umberto malapelle, elena vigliar, francesco pepe, giancarlo troncone, vincenza ciaramella, teresa troiani, erika martinelli, valentina belli, fortunato ciardiello, et al. efficacy of continuous egfr-inhibition and role of hedgehog in egfr acquired resistance in human lung cancer cells with activating mutation of egfr. oncotarget, ( ): , .
[ ] tianyi sun, dongyuan song, wei vivian li, and jingyi jessica li. scdesign : an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. biorxiv, .
[ ] leo breiman. random forests. machine learning, ( ): – , .
[ ] bernhard e boser, isabelle m guyon, and vladimir n vapnik. a training algorithm for optimal margin classifiers. in proceedings of the fifth annual workshop on computational learning theory, pages – , .
[ ] sebastian pott and jason d lieb. single-cell atac-seq: strength in numbers. genome biology, ( ): – , .
[ ] angelo duò, mark d robinson, and charlotte soneson. a systematic performance evaluation of clustering methods for single-cell rna-seq data. f research, , .
[ ] jiarui ding, xian adiconis, sean k simmons, monika s kowalczyk, cynthia c hession, nemanja d marjanovic, travis k hughes, marc h wadsworth, tyler burks, lan t nguyen, et al. systematic comparison of single-cell and single-nucleus rna-sequencing methods. nature biotechnology, pages – , .
[ ] jose alquicira-hernandez, anuja sathe, hanlee p ji, quan nguyen, and joseph e powell. scpred: accurate supervised method for cell-type classification from single-cell rna-seq data. genome biology, ( ): – , .
[ ] spyros darmanis, steven a sloan, ye zhang, martin enge, christine caneda, lawrence m shuer, melanie g hayden gephart, ben a barres, and stephen r quake. a survey of human brain transcriptome diversity at the single cell level. proceedings of the national academy of sciences, ( ): – , .
[ ] itay tirosh, benjamin izar, sanjay m prakadan, marc h wadsworth, daniel treacy, john j trombetta, asaf rotem, christopher rodman, christine lian, george murphy, et al. dissecting the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. science, ( ): – , .
a like-for-like comparison of lightweight-mapping pipelines for single-cell rna-seq data pre-processing

mohsen zakeri , avi srivastava , hirak sarkar , and rob patro ,

department of computer science and center for bioinformatics and computational biology, university of maryland, college park, md, usa
new york genome center and nyu center for genomics and systems biology, new york city, ny, usa
harvard medical school, boston, massachusetts, usa

abstract: recently, booeshaghi and pachter ( ) published a benchmark comparing the kallisto-bustools pipeline ( ) for single-cell data pre-processing to the alevin-fry pipeline ( ). their benchmarking adopted drastically dissimilar configurations for these two tools, and overlooked the time- and space-frugal configurations of alevin-fry previously benchmarked by sarkar et al. ( ).
in this manuscript, we provide a small set of modifications to the benchmarking scripts of booeshaghi and pachter that are necessary to perform a like-for-like comparison between kallisto-bustools and alevin-fry. we also address some misuses of the alevin-fry commands and include important data on the exact reference transcriptomes used for processing. using the same benchmarking scripts of booeshaghi and pachter ( ), we demonstrate that, when configured to match the computational complexity of kallisto-bustools as closely as possible, alevin-fry processes data faster (∼ . times as fast on average) and uses less peak memory (∼ . times as much on average) compared to kallisto-bustools, while producing results that are similar when assessed in the manner done by booeshaghi and pachter ( ). this is a notable inversion of the performance characteristics presented in the previous benchmark.

rna-seq, single-cell rna-seq, quantification
correspondence: rob@cs.umd.edu

introduction

alevin-fry ( ) is a new pipeline for single-cell rna-seq pre-processing, which is currently being developed. while there are many relevant design decisions and performance implications we hope to discuss in detail in the preprint describing alevin-fry, one crucial aspect motivating the development of the alevin-fry pipeline is to allow testing the effect of different algorithmic choices on the gene expression estimates eventually produced by the pipeline. for example, alevin-fry exposes both a selective-alignment ( , ) mode and pseudoalignment ( ) with structural constraints mode for mapping reads. further, after read mapping, the alevin-fry tool exposes multiple algorithms for generating a permit list (sometimes called a “whitelist”) of barcodes corresponding to what are believed to be high-confidence cells, and for resolving umis into counts.
when applying any of the probabilistic methods it implements for umi resolution, alevin-fry also allows assessing quantification uncertainty in the estimated counts via a bootstrapping procedure that can output either the bootstrap samples, or their summary statistics. exploring these different algorithms in a unified framework is an important task to optimize the pre-processing of single-cell sequencing data, and there may not be a single algorithm that is best suited to all different single-cell technologies. for example, while the benefits of selective-alignment and the use of an expanded index in the processing of bulk rna-seq data have been highlighted in a growing number of scenarios ( , , ), these tradeoffs have not been thoroughly explored in the context of single-cell (and particularly tagged-end) data. given that the majority of common tagged-end single-cell analyses are performed at the gene rather than transcript level, and in light of the extensive use of techniques like unique molecular identifier (umi) tagging, it may be the case that different tradeoffs in mapping specificity versus speed are appropriate or desirable; indeed, an argument for simpler but faster methods in this space has been made by melsted et al. ( ) and in subsequent work by the same authors. likewise, the effect of different approaches for umi error correction and umi resolution (and how they may interact with different read mapping strategies) has not been thoroughly evaluated across many different single-cell technologies, to understand if, and when, different approaches may lead to different results in downstream analysis.

note (on the reference data of booeshaghi and pachter): the authors later updated their repository to contain a link to a deposition with the reference data they used, but that information was not available in the original repository commit d e c b c f eed fa eb at the time the preprint was published, and our framework was in place by the time this update was made. this is further described in "methods."
in the alevin-fry poster ( ), we described the results of benchmarking starsolo ( ), kallisto-bustools ( ) and alevin-fry ( ), running the latter tool with a number of different configurations of read mapping algorithm and umi resolution algorithm. we observed that the “fast” configurations of alevin-fry tested in ( ), which adopt some of the major simplifications argued for by melsted et al. ( ), are faster than kallisto-bustools, and that all of the configurations tested there use less peak memory. the recent preprint of booeshaghi and pachter ( ) omits all of the fast and memory-frugal configurations tested in sarkar et al. ( ), and instead compares the time and memory requirements of only the most computationally- and memory-intensive configuration of the alevin-fry pipeline to the kallisto-bustools pipeline.

we are encouraged that others in the community are eager to try out new tools like alevin-fry for the pre-processing of single-cell data, and we recognize that fairly comparing new pipelines to existing ones can be a difficult task in the absence of sufficient documentation and tutorials. admittedly, we have not yet produced sufficient tutorials or documentation for alevin-fry given that our efforts have been in continuing to develop the tool itself. at the same time, it is not possible to “faithfully” follow recommended practice ( ) when the best practices have not yet been established for a fledgling method; in such a case, benchmarking multiple configurations (especially those that have already been tested in previous benchmarks ( )) may be a reasonable approach. spurred by booeshaghi and pachter ( ), we have now created a simple-to-follow tutorial for speed-optimized single-cell pre-processing using alevin-fry (https://combine-lab.github.io/alevin-fry-tutorials/ /running-alevin-fry-fast/). here, we benchmark this workflow for alevin-fry (using the same versions of salmon ( . . ) and alevin-fry ( . . ), and kallisto ( . . )-bustools ( . . ) adopted in ( )).

zakeri et al. | biorχiv | february , |

methods

in order to assess how the results of the benchmark proposed by booeshaghi and pachter ( ) change when a like-for-like comparison between alevin-fry and kallisto-bustools is carried out, we start with the experimental framework introduced in that preprint and describe here the necessary modifications to the benchmarking scripts that were made.

the first difference we note is the versions of the references used for quantification. the repository provided by booeshaghi and pachter ( ) (https://github.com/pachterlab/bp_ / at commit d e c b c f eed fa eb, which was the version available when the preprint was published) lacked both the specific reference sequences used and the urls from which these reference sequences were obtained. the paper refers the reader to ( ), wherein the relevant metadata is contained in an excel file, which still lacks adequate specificity (e.g. it lists the caenorhabditis elegans transcriptome used only as “modified ws ”). thus, in this manuscript, we have adopted the following procedure for normalizing the reference transcriptomes.
for human, mouse, and combined human/mouse data, we have used the latest reference bundles provided by x genomics as of jan , (named as “ -a”), and extracted the transcriptomes from the provided genomes and respective gtf files using gffread ( ). for all other organisms, we have adopted the latest ensembl ( ) reference transcriptomes for each organism. for danio rerio, c. elegans, drosophila melanogaster and rattus norvegicus this is from release ; for arabidopsis thaliana it is from release .

these updated reference transcriptomes lead, in some cases, to quite different memory usages from those reported in ( ). for the alevin-fry pipeline, this is largely explained by the fact that we index the same reference sequences as used for kallisto ( ) (that is, we do not compare indexing the transcriptome in kallisto to indexing the transcriptome and genome in salmon ( )). however, the increased memory usage of kallisto-bustools likely stems from variation in the specific reference transcriptomes used. for example, using the current x reference transcriptome for grch ( -a, from https://cf. xgenomics.com/supp/cell-exp/refdata-gex-grch - -a.tar.gz), the peak memory usage of kallisto becomes ∼ gb during mapping (fig. ), rather than the ∼ gb reported in ( ) (while the peak memory usage of salmon during mapping reaches ∼ . gb). our version of the repository contains a file called gather_refs.sh with the commands used to obtain these reference transcriptomes.

furthermore, the following additional modifications have been made to the benchmarking, which otherwise remains the same.

we do not use the modified version of alevin-fry . . that booeshaghi and pachter ( ) altered to convert encoded barcode identifiers in the generate-permit-list step into character strings, but instead use the tagged . . release, with the rand crate additionally pinned at . . to enable compilation.
note: a subsequent commit included a link to a deposition of the references they used, but our framework was in place and benchmarking underway by the time that commit was made. further, it is informative to see how even modest changes in the specific reference used can lead to large changes in the memory requirements of a tool.

note: as was performed in ( ).

we run alevin with the --sketch flag when producing the mapping file (called a rad file); this uses pseudoalignment ( ) with structural constraints rather than selective-alignment.

we do not consider a configuration of either kallisto-bustools or alevin-fry that corrects to or uses the full x permit list. in the original benchmark, the authors used alevin-fry with the -b flag, treating the full list of x barcodes as a filtered permit list; however, the -b flag is meant to accept a list of barcodes corresponding to high-confidence cells that have passed external filtering. passing the full x barcode list to the -b flag is neither intended nor currently supported in alevin-fry (though we are planning to add this functionality), which we have now clarified in the documentation, as we had previously clarified this same point in the alevin ( ) documentation. we have both methods generate their own permit list, and perform quantification on their corresponding filtered cells.

we pass the -d fw flag to alevin-fry’s generate-permit-list step rather than -d either, as the rad file records the orientation of each read with respect to the target transcript, and all the technologies evaluated here expect the second read to map to the transcriptome in the forward orientation. mappings in an unexpected orientation should be filtered.

we have used the cr-like resolution strategy when invoking the alevin-fry quant command; this implements a simple but fast umi resolution algorithm that breaks ties by umi frequency alone and discards reads for which a most frequent unique gene cannot be determined.
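the cr-like rule described above can be illustrated with a minimal python sketch. this is not alevin-fry's implementation (which is written in rust and operates on rad files); the record layout `(barcode, umi, gene)`, the toy barcodes and gene names, and the function name `cr_like_resolve` are all assumptions made for illustration. the sketch assumes each read maps to a single gene, counts reads per gene for every (barcode, umi) pair, assigns the umi to a gene only when that gene is strictly the most frequent, and otherwise discards the reads of that umi.

```python
from collections import Counter, defaultdict


def cr_like_resolve(records):
    """Toy cr-like UMI resolution (illustrative only).

    `records` is an iterable of (barcode, umi, gene) tuples, one per mapped
    read. This layout is hypothetical and is not the RAD file format.
    Returns (counts, discarded), where `counts` maps (barcode, gene) to the
    number of distinct UMIs assigned to that gene, and `discarded` is the
    number of reads dropped because no unique most frequent gene existed.
    """
    # Count how many reads support each gene for every (barcode, UMI) pair.
    reads_per_gene = defaultdict(Counter)
    for barcode, umi, gene in records:
        reads_per_gene[(barcode, umi)][gene] += 1

    counts = Counter()
    discarded = 0
    for (barcode, _umi), gene_counts in reads_per_gene.items():
        ranked = gene_counts.most_common(2)
        top_gene, top_reads = ranked[0]
        # A tie for the most frequent gene cannot be broken by frequency
        # alone, so every read of this UMI is discarded.
        if len(ranked) > 1 and ranked[1][1] == top_reads:
            discarded += sum(gene_counts.values())
            continue
        counts[(barcode, top_gene)] += 1
    return counts, discarded


if __name__ == "__main__":
    reads = [
        ("AAAC", "UMI1", "geneA"),
        ("AAAC", "UMI1", "geneA"),
        ("AAAC", "UMI1", "geneB"),  # geneA wins two reads to one
        ("AAAC", "UMI2", "geneA"),
        ("AAAC", "UMI2", "geneB"),  # tie, so both reads are discarded
        ("TTTG", "UMI1", "geneB"),
    ]
    counts, discarded = cr_like_resolve(reads)
    print(dict(counts), discarded)
    # prints: {('AAAC', 'geneA'): 1, ('TTTG', 'geneB'): 1} 2
```

the design choice this illustrates is the trade-off named in the text: no probabilistic model or graph over umis is built, so the rule is cheap to evaluate, at the cost of dropping ambiguous reads.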
we have also removed the step of the pipeline that converts the rad (respectively bus) file into a text format. the binary to text conversion may be useful for debugging purposes, but is not a standard or necessary part of these pre-processing pipelines, as the bus and rad files are primarily intended for the storage and processing of data rather than human inspection. further, contrary to the supposition of booeshaghi and pachter ( ), this conversion is likely a case where language choice, and usage of standard language idioms, leads to different performance characteristics. unlike c++, rust places the standard output stream behind a lock to ensure threadsafe access, a decision that imposes a cost for programs that are heavy on writing to the standard output stream in a line-oriented manner when standard idioms are used. while we do not view the optimization of the command that dumps a rad file to text as particularly high-priority, we will nonetheless explore making use of unsafe c system calls in this command until a comparable solution is exposed natively in rust.

the benchmarking scripts used to produce the results described here can be found at https://github.com/combine-lab/bp_ -lfl (these are the same as the benchmarking scripts of https://github.com/pachterlab/bp_ at commit d e c b c f eed fa eb with the modifications described above). we encourage users to run these benchmarks for themselves, and welcome feedback and suggestions.

despite the additions and modifications we describe here, neither our repository nor the original repository of booeshaghi and pachter ( ) enables full reproducibility without non-trivial effort or investigation. one complication is that there existed multiple candidate scripts for performing specific steps of the data analysis within different directories of the repository, and none had complete or adequate instructions for generating the plots. for example, there exist multiple versions of the run_gsea_bar_full.r script for performing gene set enrichment analysis, which each required building certain sub-directories in the main directory of the repository in order to be executed without any errors. eventually, we used the run_gsea_bar_full.r script located within analysis/notebooks rather than the one located in analysis/scripts/code, since the latter version had hard-coded paths and no central way to uniformly and globally change the working directory (e.g. https://github.com/pachterlab/bp_ /blob/e e bf d fa dbbebd c dd /analysis/scripts/code/gsea_bar_full.r#l ).

note: the sketch mode was evaluated in detail in the poster of sarkar et al. ( ), where its scalability was assessed and its mappings were paired with a number of different umi resolution strategies.

fig. . the time and memory used by the relevant steps of the alevin-fry and kallisto-bustools pipelines for pre-processing the diverse tagged-end single-cell rna-seq datasets used in ( ). the plots are generated using the analysis/notebooks/memtime.ipynb notebook. panels, by purpose (kallisto-bustools command versus alevin-fry command): build mapping index ($ kallisto index versus $ salmon index); perform mapping ($ kallisto bus versus $ salmon alevin --rad --sketch); store result of mapping (bus file versus rad file, size [gb]); error correct of barcodes ($ bustools sort + bustools whitelist versus $ alevin-fry generate-permit-list --knee-distance); enable streaming ($ bustools correct + bustools sort versus $ alevin-fry collate); generate count matrix ($ bustools count versus $ alevin-fry quant); process single cell data ($ kallisto + bustools pipeline versus $ salmon + alevin pipeline, plotted against number of reads). each command panel reports time [min] and memory [gb] for kallisto and alevin.
after providing the required data, we ran mkdata.py and mkplot.py within the analysis/scripts/code directory to prepare the plots for comparing the gene count estimates provided by both tools. furthermore, since we benchmarked an unmodified version of alevin-fry, we had to modify the mkdata.py script to load a single column file as alevin’s permit list (which we took from the quants_mat_rows.txt file accompanying each cell by gene count matrix), and also to remove the lines which were intended for dealing with decoy aware results. for producing the time and memory plots, we used the memtime.ipynb notebook located in analysis/notebooks after making the required modifications to compare the time and memory of the relevant steps in both tools. while we have addressed any issues as we encountered them, and have documented how we have run this pipeline, we have not undertaken the effort of fully removing all barriers to “trivial” reproducibility, as it is outside the scope of the current work.

fig. . a comparison of the resulting count matrices obtained from alevin-fry and kallisto-bustools, as run in this manuscript, for the pbmc_ k_v dataset. panels a-h have the same interpretation as in fig. of booeshaghi and pachter ( ), and compare the count matrices at the gene and cell levels. the plots are generated using the analysis/scripts/mkplots.py, analysis/scripts/mkdata.py and analysis/notebooks/run_gsea_bar_full.r scripts.

finally, we also note that the benchmark of booeshaghi and pachter ( ) focuses only on comparing kallisto-bustools to a single configuration of alevin-fry, excluding other relevant tools like starsolo ( ), which is a fast, flexible, and popular tool for the pre-processing of tagged-end single-cell data. the benchmark also omits another recently-published, lightweight-mapping based tool, raindrop ( ) (though, seemingly, this would currently have to be restricted to x chromium v data). a more extensive benchmark, including other tools, is likely to provide greater value to the broader community. however, the primary focus in this manuscript is to highlight the effect on the original benchmark that results from running the tools considered therein in a like-for-like configuration. thus, we have not added starsolo or raindrop to the current benchmark, though doing so may provide a useful perspective on these tools to the broader community.

all experiments were performed on a server with dual intel xeon cpus (e - v ), each with cores clocked at . ghz, gb of . ghz ddr memory, and an array of .
tb toshiba mg aca hdds configured as independent disks.

results

fig. shows the overall time and peak memory taken by both alevin-fry and kallisto-bustools when pre-processing the diverse tagged-end single-cell x chromium datasets evaluated in ( ). alevin-fry is faster than kallisto-bustools on all datasets (between ∼ . and ∼ . times as fast, and ∼ . times as fast on average). also, alevin-fry uses less peak memory than kallisto-bustools on of the datasets tested, with the peak memory of kallisto-bustools ranging from ∼ % of that used by alevin-fry to ∼ times that used by alevin-fry (kallisto-bustools used ∼ . times as much peak memory on average).

in addition to the overall runtime and peak memory usage (bottom row of fig. ), the figure also shows the time and memory required for the main steps of the pipelines. while there is not a perfect correspondence between the specific set of commands used by the two tools, the fundamental steps include mapping the reads, generating a permit-list of valid filtered barcodes (with each method using its own algorithm to infer the filtered set of corrected barcodes), rearranging the mapping information for all records having the same corrected barcode so that they are adjacent in the resulting file, and applying a umi resolution algorithm to obtain a gene-by-cell count matrix.

looking across the datasets, some general characteristics emerge. if one evaluates the ratio of the total runtime of kallisto-bustools to the total runtime of alevin-fry, one observes that alevin-fry is faster in the processing of every dataset, with a speedup (i.e. the ratio of the runtime of kallisto-bustools to the runtime of alevin-fry) ranging from ∼ . up to ∼ . (with an average runtime speedup of ∼ . ). if one evaluates the same ratio in terms of peak memory usage instead of total runtime, a similar trend emerges. in of the datasets tested here, the kallisto-bustools pipeline exhibits a higher peak memory usage than alevin-fry.
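The two ratios used above can be written down as a short sketch. The dataset names and measurements below are hypothetical placeholders, not values from the benchmark, and the sketch assumes that the reported averages are arithmetic means of the per-dataset ratios.

```python
from statistics import mean

# Hypothetical measurements (minutes, GB); NOT values from the benchmark.
runs = {
    "dataset_a": {"kb_time": 12.0, "af_time": 6.0, "kb_mem": 8.0, "af_mem": 4.0},
    "dataset_b": {"kb_time": 30.0, "af_time": 10.0, "kb_mem": 3.0, "af_mem": 6.0},
}


def ratios(run):
    """Return (speedup, memory_ratio) of kallisto-bustools relative to alevin-fry."""
    speedup = run["kb_time"] / run["af_time"]   # > 1 means alevin-fry is faster
    mem_ratio = run["kb_mem"] / run["af_mem"]   # > 1 means alevin-fry uses less memory
    return speedup, mem_ratio


per_dataset = {name: ratios(run) for name, run in runs.items()}

# Assumption: the average is the arithmetic mean of the per-dataset ratios.
mean_speedup = mean(s for s, _ in per_dataset.values())
mean_mem_ratio = mean(m for _, m in per_dataset.values())
```

A memory ratio below 1 for a dataset corresponds to the case in which kallisto-bustools used less peak memory than alevin-fry.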
in the mouse-srr _v dataset, kallisto-bustools’ peak memory usage reached % of that of alevin-fry (with alevin-fry requiring a maximum of ∼ . gb of memory and kallisto-bustools requiring ∼ . gb of memory). in every other dataset, kallisto-bustools used more peak memory than alevin-fry, with the kallisto-bustools pipeline using at most ∼ times as much peak memory and, on average, ∼ . times as much peak memory as alevin-fry. the peak memory usage of both tools reached their respective maxima on the hybrid human-mouse dataset, where the peak memory usage of kallisto-bustools is ∼ . gb (which occurs during pseudoalignment) and the peak memory usage of alevin-fry is ∼ gb (which occurs during mapping record collation).

while the step of indexing only has to be done once per reference sequence (i.e. with each new organism, or when the reference annotation is updated), we also evaluate the time and memory required to build all indices used in these experiments. this is important, since the peak memory usage during indexing may dictate whether the index can be built on the same machine used for subsequent quantification, or if it must be constructed on a machine with more memory. fig. shows that, as with the pre-processing of reads, alevin-fry is faster and uses less memory for index construction for each reference considered. the slowest index construction for both tools was for the human-mouse combined transcriptome, where the kallisto-bustools pipeline took ∼ . minutes and required ∼ . gb of memory, while indexing this transcriptome with the alevin-fry pipeline took ∼ minutes (a ∼ . times speedup compared to kallisto) and ∼ . gb of memory (∼ % of the memory usage of kallisto). when evaluating the time differences, it is important to note that the alevin-fry pipeline can make use of multiple threads when indexing (here we used as in ( )), while the indexing in the kallisto-bustools pipeline is currently restricted to a single thread.
the memory usage in the alevin-fry pipeline does not vary considerably with the number of threads used during indexing. the peak memory reduction during indexing and mapping in alevin-fry arises primarily due to alevin-fry’s use of the pufferfish ( ) index, while a number of different factors at both the implementation and design level contribute to the runtime improvements.

when assessing the same summary statistics and count comparisons considered by booeshaghi and pachter ( ) to evaluate the similarity of the resulting quantifications, we find that the cell by gene count matrices produced by both tools are similar under these metrics (fig. ). as is expected, these evaluations show that the data summaries are more similar than in the configuration tested in ( ). in that comparison, booeshaghi and pachter ( ) claim that differences in resulting gene expressions between the configurations of the tools tested therein are “irrelevant for downstream analysis” (presumably implying all possible downstream analyses). it is not clear how these comparisons justify such a sweeping claim. yet, while these comparisons do not necessarily imply that no differences will manifest in downstream processing of the alevin-fry quantified data compared to the kallisto-bustools quantified data, they do suggest that the differences that may arise under this configuration of alevin-fry are likely to be less extreme than differences that may arise in the configuration tested in ( ). we also note that, while booeshaghi and pachter ( ) find no significant gene sets when comparing the quantifications of kallisto-bustools and the configuration of alevin-fry that they tested on the pbmc_ k_v data, we do observe some genes being detected. this appears to stem from our use of the most recent release of seurat ( ) (currently version . . ), which modified the default behavior of the findmarkers() function to “prefilter genes and report fold change using base , as is commonly done in other differential expression packages, instead of natural log” (https://satijalab.org/seurat/articles/v _changes.html).

footnote: the alevin-fry peak memory usage in this dataset happens during the collate step, which can easily be made to operate within a strict desired ram budget; a feature on which we are currently working.

for the purposes of keeping this benchmark like-for-like, we have run both tools in configurations where they must generate their own permit list without the input of an external set of valid but unfiltered (i.e. possible) barcodes. we choose this configuration for two reasons. first, unlike x chromium v and v experiments, many single-cell technologies supported by both pipelines do not provide an external list of barcodes, and so, this is indicative of the general case where pipelines must provide their own method for generating a permit list of barcodes. second, since alevin-fry does not yet support unfiltered external permit lists, there is no way to fairly compare it against a method that can take advantage of this information. we consider it a development priority to support this feature in alevin-fry for single-cell technologies where this information is available.
nonetheless, we tested the effect that requiring a second sort of a (sorted and filtered permit list corrected) bus file had on the overall runtime, compared to correcting the initial unsorted file with an unfiltered permit list, and processing the unfiltered file in the remainder of the pipeline. to do this, we also ran a configuration of kallisto-bustools where the raw bus file was first corrected with the external, unfiltered x permit list, then the file was sorted, then a permit list was extracted from this sorted file (to allow subsequent filtering of an unfiltered count matrix), and finally, the count step was performed. this process results in an unfiltered matrix which may then be filtered using the generated permit list. alevin-fry was, on average, ∼ . times as fast as kallisto-bustools under this configuration, rather than ∼ . times as fast; in other words, the runtime costs of the different kallisto-bustools configurations were very similar for these data.

finally, though we have retained the parallelism settings used in the original benchmark for the purposes of the main results reported in this manuscript, we also evaluated, on one of the larger datasets (pbmc_ k_v ), how both tools scale to a higher thread count of . in this case, we found that the total runtime for alevin-fry dropped from . minutes with threads to . minutes with threads, and the total runtime for kallisto-bustools went from . minutes with threads to . minutes with threads. so, in this case, increasing the thread count by led to a ∼ . times increase in the speed of alevin-fry and a ∼ . times increase in the speed of kallisto-bustools.

conclusions

we find that when alevin-fry is benchmarked in a like-for-like comparison with kallisto-bustools, it is both faster and uses less memory while producing similar results.
of course, in this configuration, alevin-fry ( ), unlike the original alevin ( ) or other configurations of alevin-fry, is adopting some of the computational simplifications for which melsted et al. ( ) argue, and the similarity of these results is fully expected.

in their manuscript, booeshaghi and pachter ( ) repeatedly refer to alevin-fry as a “reimplementation” of bustools. this characterization is untrue both in detail and in spirit. the alevin-fry tool has not been designed to reimplement the bustools commands or interface, or specifically to match the implementation of bustools. it has been designed as a way to allow the exploration and configuration, in a unified framework, of a variety of different algorithms for single-cell data pre-processing, many of which don’t currently exist in kallisto-bustools. for example, it implements multiple different methods for generating permit lists, and multiple different algorithms for umi resolution, including some that correct for umi sequencing errors, resolve multi-gene umis by parsimony, probabilistically (which kallisto-bustools subsequently implemented after it was introduced in alevin ( )), or both, as well as the functionality to quantify the uncertainty of probabilistic resolution through bootstrapping.

of course it is the case that, in designing such a tool after the work of melsted et al. ( ) was published and in widespread use, one should learn from the design decisions of that work that proved to be effective and useful. the main such design decisions in this case are first, the separation of the read mapping from the subsequent processing of barcodes and umis via intermediate files (as is also done internally by starsolo), and second, the arrangement of mapping records relevant to a given corrected barcode subsequently so that cells can be processed in an effectively independent manner. the alevin-fry tool adopts these choices described by melsted et al.
( ), as we see no reason to avoid relevant design decisions demonstrated by prior tools, that seem to work, when building new tools. we look forward to discussing these design decisions, as well as some novel design choices we have made, when we publish the alevin-fry preprint. we have not completed, to our satisfaction, a thorough investigation of the effect of different mapping approaches, permit list generating methods and umi correction and resolution strategies provided by alevin-fry across a wide range of tagged-end single-cell rna-seq data and technologies (which have, in general, distinct characteristics compared to both bulk rna-seq data and full-length single-cell rna-seq data). once we have adequately explored this algorithmic parameter space, we plan to publish a full manuscript describing the design and implementation of the alevin-fry pipeline, highlighting where it derives design decisions from kallisto-bustools and other tools, and where it differs, as well as the effect that different configurations have on runtime and memory performance, the raw count matrices and common downstream analyses, and how those effects may vary in different single-cell technologies.

we have described in this manuscript, and demonstrated in the associated code repository and tutorial, how alevin-fry can optionally be configured so as to match the computational complexity of kallisto-bustools as closely as possible. in this like-for-like comparison of these two pipelines, we have shown that, while the estimated gene expressions are similar (at least when assessed in the manner done by booeshaghi and pachter ( )), the runtime and memory characteristics are not.
rather, while using the same benchmarking framework as booeshaghi and pachter ( ), instead of alevin-fry taking ∼ times as long to pre-process data (on average) as kallisto-bustools and using many times as much memory in the worst case ( ), we find that alevin-fry is both faster and uses less memory than kallisto-bustools. specifically, alevin-fry is on average ∼ . times as fast as kallisto-bustools and consumes, on average, only ∼ . times as much peak memory. according to the formulae used in the jupyter ( ) notebooks of booeshaghi and pachter ( ) to estimate costs for performing processing on amazon web services compute instances, pre-processing the pbmc_ k_v dataset using the configuration of the alevin-fry pipeline we have tested in this manuscript costs $ . , which is half of the cost of running the kallisto-bustools pipeline ($ . ). furthermore, if one needs to first build the reference index, the peak runtime memory for kallisto-bustools exceeds gb and so a more expensive instance would be necessary. in that case, building the reference index and processing the pbmc_ k_v dataset would cost $ . using kallisto-bustools (while the cost would remain at $ . using alevin-fry, even if index construction is included).
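The cost reasoning above depends on two quantities: the runtime, and the peak memory, which decides which instance type is large enough. The exact formulae of the original notebooks are not reproduced in this text, so the following is only a minimal sketch of one plausible form. The instance catalogue, its names and its prices are hypothetical.

```python
# Hypothetical instance catalogue: (name, memory in GB, price in USD per hour).
# These are illustrative values, not real cloud prices.
INSTANCES = [
    ("small", 8, 0.10),
    ("medium", 16, 0.20),
    ("large", 32, 0.40),
]


def cheapest_instance(peak_mem_gb):
    """Pick the lowest-priced instance whose memory covers the peak usage."""
    fitting = [inst for inst in INSTANCES if inst[1] >= peak_mem_gb]
    if not fitting:
        raise ValueError("no instance has enough memory")
    return min(fitting, key=lambda inst: inst[2])


def run_cost(runtime_min, peak_mem_gb):
    """Assumed cost model: hourly price of the chosen instance times runtime in hours."""
    _, _, price_per_hour = cheapest_instance(peak_mem_gb)
    return price_per_hour * runtime_min / 60.0
```

Under this assumed model, a pipeline whose peak memory crosses an instance boundary (for example during index construction) is billed at the higher hourly rate for its runtime, which is the effect described for kallisto-bustools above.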
the cost of the alevin-fry pipeline we have benchmarked in this manuscript is times smaller than what was reported in ( ) for this dataset, while the cost of the kallisto-bustools pipeline is twice as large due to the increased memory requirements when using the newer human transcriptome annotation.

if one is comfortable with the simplifying assumptions being made, the performance profiles observed in this manuscript provide a compelling case for the use of this configuration of alevin-fry for the rapid and lightweight pre-processing of single-cell rna-seq data. finally, it is important to note that alevin-fry is still undergoing active development and improvement, which is, in part, why no full preprint has yet been published describing the tool and underlying methods and implementation in detail. of course, one can use the tool today to obtain gene expression counts for single-cell data, but we expect that alevin-fry will continue to advance and expand to offer more capabilities and to be further optimized.

disclosure. rp is a co-founder of ocean genomics inc.

funding. this work is supported by the us national institutes of health [r hg ], and the national science foundation [ccf- , cns- ]. the funders had no role in this research, or the decision to publish.

references

a. sina booeshaghi and lior pachter. benchmarking of lightweight-mapping based single-cell rna-seq pre-processing. biorxiv, . doi: . / . . . .
páll melsted, a. sina booeshaghi, fan gao, eduardo beltrame, lambda lu, kristján eldjárn hjorleifsson, jase gehring, and lior pachter. modular and efficient pre-processing of single-cell rna-seq. biorxiv, . doi: . / .
hirak sarkar, avi srivastava, mohsen zakeri, scott van buren, naim u rashid, michael love, and rob patro. accurate, efficient, and uncertainty-aware expression quantification of single-cell rna-seq data. . doi: . /m .figshare. .v .
avi srivastava, laraib malik, hirak sarkar, mohsen zakeri, fatemeh almodaresi, charlotte soneson, michael i love, carl kingsford, and rob patro. alignment and mapping methodology influence transcript abundance estimation. genome biology, ( ): – , .
hirak sarkar, mohsen zakeri, laraib malik, and rob patro. towards selective-alignment: bridging the accuracy gap between alignment-based and alignment-free transcript quantification. in proceedings of the acm international conference on bioinformatics, computational biology, and health informatics, pages – , .
nicolas l bray, harold pimentel, páll melsted, and lior pachter. near-optimal probabilistic rna-seq quantification. nature biotechnology, ( ): – , .
matthew d. shirley, viveksagar k. radhakrishna, javad golji, and joshua m. korn. pisces: a package for rapid quantitation and quality control of large scale mrna-seq datasets. biorxiv, . doi: . / . . . .
avi srivastava, mohsen zakeri, hirak sarkar, charlotte soneson, carl kingsford, and rob patro. accounting for fragments of unexpected origin improves transcript quantification in rna-seq simulations focused on increased realism. biorxiv, . doi: . / . . . .
ash blibaum, jonathan werner, and alexander dobin. starsolo: single-cell rna-seq analyses beyond gene expression. . doi: . /f research. . .
geo pertea and mihaela pertea. gff utilities: gffread and gffcompare. f research, : , september . doi: . /f research. . .
andrew d yates, premanand achuthan, wasiu akanni, james allen, jamie allen, jorge alvarez-jarreta, m ridwan amode, irina m armean, andrey g azov, ruth bennett, et al. ensembl . nucleic acids research, (d ): d –d , .
rob patro, geet duggal, michael i love, rafael a irizarry, and carl kingsford. salmon provides fast and bias-aware quantification of transcript expression. nature methods, ( ): – , .
avi srivastava, laraib malik, tom smith, ian sudbery, and rob patro. alevin efficiently estimates accurate gene abundances from dscrna-seq data.
genome biology, ( ): – , .
stefan niebler, andré müller, thomas hankeln, and bertil schmidt. raindrop: rapid activation matrix computation for droplet-based single-cell rna-seq reads. bmc bioinformatics, ( ): – , .
fatemeh almodaresi, hirak sarkar, avi srivastava, and rob patro. a space and time-efficient index for the compacted colored de bruijn graph. bioinformatics, ( ):i –i , .
yuhan hao, stephanie hao, erica andersen-nissen, william m. mauck, shiwei zheng, andrew butler, maddie j. lee, aaron j. wilk, charlotte darby, michael zagar, paul hoffman, marlon stoeckius, efthymia papalexi, eleni p. mimitou, jaison jain, avi srivastava, tim stuart, lamar b. fleming, bertrand yeung, angela j. rogers, juliana m. mcelrath, catherine a. blish, raphael gottardo, peter smibert, and rahul satija. integrated analysis of multimodal single-cell data. biorxiv, . doi: . / . . . .
thomas kluyver, benjamin ragan-kelley, fernando pérez, brian granger, matthias bussonnier, jonathan frederic, kyle kelley, jessica hamrick, jason grout, sylvain corlay, paul ivanov, damián avila, safia abdalla, carol willing, and jupyter development team. jupyter notebooks - a publishing format for reproducible computational workflows. in fernando loizides and birgit schmidt, editors, positioning and power in academic publishing: players, agents and agendas, pages – , netherlands, . ios press.

cutevariant: a gui-based desktop application to explore genetics variations

sacha schutz, , tristan montier and emmanuelle genin

univ brest, chru brest, inserm, efs, umr , ggb, , brest, france and inserm, univ brest, efs, umr , ggb, , brest, france

∗corresponding author. sacha@labsquare.org

abstract

cutevariant is a user-friendly gui-based desktop application for genomic research designed to search for variations in dna samples collected in annotated files and encoded in the variant calling format. the application imports data into a local relational database from which complex filter-queries can be built either from the intuitive gui or using a domain specific language (dsl). cutevariant provides more features than any existing application without compromising on performance. the plugin based architecture provides highly customizable features. cutevariant is distributed as a multiplatform client-side software under an open source licence and is available at https://github.com/labsquare/cutevariant. it has been designed from the beginning to be easily adopted by it-agnostic end-users.

key words: genomics, dna variant, desktop application, domain specific language, graphic user interface

introduction

next-generation sequencing (ngs) has opened new opportunities in genomic research such as identification of dna variations from genome, exome or panel experiments. these data are delivered as files encoded in the standard variant calling format (vcf version . ) [ ] where the variations are listed together with the genotype information of different samples. tools such as vep [ ] or snpsift [ ] can be used to add annotations such as genes or functional impact.
biologists can then filter out variants by applying customized criteria on these annotations. in medicine, the identification of mutations in rare diseases would be a typical use case. this filtering procedure calls for sophisticated software tools that can nonetheless be easily adopted by end-users who are not necessarily it-aware.

several management systems have been developed to ease the usage of the filtering step. gemini [ ] and varianttools [ ] are command line applications where data from the vcf files are loaded into a relational database managed by sqlite [ ]. filtering can thus be made very efficient using the sql query syntax. other tools such as snpsift [ ] or bcftools [ ] apply filters directly while reading the vcf files line by line, thus avoiding the need to create an intermediate data structure. this comes at the cost of poor time efficiency, especially when it is necessary to sort or group variants. while these tools are quite flexible, allowing any kind of filtering, the command line interface is not very intuitive, which discourages non-it-specialists from using it.

this called for the development of applications steered by user-friendly graphical user interfaces (gui). some, specializing in diagnostics, offer online solutions with a complete set of patient management features but require uploading the vcf files. the most popular of this kind are either proprietary software such as seqone [ ] or distributed under an open source licence, such as the recently published varfish [ ]. a major drawback of this scheme is the transit of a large amount of genetic data through public networks, which raises confidentiality and performance issues on the one hand, and requires a dedicated server, which might not be available to every end-user, on the other. moreover, these solutions are tailored to human data and therefore cannot be adopted by all end-users.
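To make the earlier contrast between database-backed tools and line-by-line tools concrete, here is a minimal sketch using Python's standard sqlite3 module. The rows, the table layout and the function names are hypothetical and are not taken from gemini, varianttools, snpsift or bcftools.

```python
import sqlite3

# Hypothetical, hand-written variant rows: (chr, pos, ref, alt, gene, impact).
ROWS = [
    ("chr7", 100, "A", "G", "cftr", "high"),
    ("chr7", 200, "C", "T", "cftr", "low"),
    ("chr1", 300, "G", "A", "gene_x", "high"),
]


def filter_streaming(rows, gene, impact):
    """Line-by-line style: every record is inspected once, no index is built."""
    return [r for r in rows if r[4] == gene and r[5] == impact]


def filter_sqlite(rows, gene, impact):
    """Database style: load once, then filter (and sort or group) with SQL."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE variants "
        "(chr TEXT, pos INTEGER, ref TEXT, alt TEXT, gene TEXT, impact TEXT)"
    )
    con.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)
    # An index makes repeated lookups by gene cheap after the one-off import.
    con.execute("CREATE INDEX idx_gene ON variants (gene)")
    result = con.execute(
        "SELECT chr, pos, ref, alt, gene, impact FROM variants "
        "WHERE gene = ? AND impact = ? ORDER BY chr, pos",
        (gene, impact),
    ).fetchall()
    con.close()
    return result
```

Both functions return the same rows here; the difference lies in where the cost is paid. The streaming filter rescans all records for every query, while the database pays an import cost once and can then sort or group through SQL.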
gui applications that do not require a server and offer an out-of-the-box solution are therefore preferable. the web-based applications vcfminer [ ], browsevcf [ ] and vcf.filter [ ] implement such a solution. vcfminer is distributed as a package container running with docker [ ] and thus requires a customized desktop configuration. browsevcf provides its own launcher, making it quite user friendly, but the application is not supported anymore. both applications import the data from vcf files into an indexed database and provide different gui forms to create filters. their main drawback resides in the limited filter settings available through the gui, complex filters requiring a domain specific language. in addition, web applications offer poorer time performance than native desktop applications. despite the availability of these tools, many biologists still use microsoft excel to filter their variants and are facing severe problems [ ].

biorxiv preprint. the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. this version posted february , . ; https://doi.org/ . / . . .

fig. : cutevariant database schema. only mandatory fields are displayed. fields n are dynamically created during the import step based on the content of the vcf file.

to address the shortcomings of the existing applications, we have developed cutevariant, a user-friendly and ergonomic desktop application implemented in python within the qt framework. it offers both a gui and a command line interface, together with a domain specific language called vql that allows the user to build complex filter expressions. it is distributed as a multi-platform client-side software under an open source licence.
thanks to an architecture based on plugins, cutevariant is fully customizable and can easily be extended with additional features.

materials and methods

vcf file importation and preprocessing

cutevariant imports data from vcf files into a normalized sqlite database (figure ) stored as a *.db file, optionally together with a ped file describing affected samples and their relationships. fields from variants and annotations tables are dynamically created according to the content of the vcf file. this importation step proceeds using a vcf parser to produce json-like arrays tailored for populating the sqlite database. it is based on a strategy design pattern so that any format can be supported by subclassing an abstract reader object. the available distribution supports raw vcf files and vcf files annotated with vep or snpeff following the ann specifications [ ].

before importation into the database, data are cleaned and normalized following the same procedure as the vt norm [ ] application: single lines of multi-allelic variants are split into multiple lines. computed annotations, not present in the original file, are automatically created. for example, the count_var field contains the number of samples that carry the variant. it is thus possible to filter variants present in more than n samples by filtering on this column. this feature is similar to countvar() from the snpsift [ ] filter command.

from the cutevariant main window, the new project button starts a wizard and triggers the importation process. depending on the size of the input, the importation and indexation process might take some time, but this has only minimal impact on the performance since this step is performed only once. alternatively, vcf file import can be triggered from the command line using the cutevariant-cli command. this lets expert users integrate the import step at the end of a pipeline.
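The import steps described above (a reader selected by subclassing, splitting of multi-allelic lines, and a computed sample count) can be illustrated with a short sketch. Every class and function name below is hypothetical and is not claimed to match cutevariant's own code. The sketch only splits alleles, without left-aligning or trimming them, and it assumes that a genotype code greater than 0 marks a carrier.

```python
import io
from abc import ABC, abstractmethod


class AbstractReader(ABC):
    """Hypothetical base class: one subclass per supported input format."""

    def __init__(self, stream):
        self.stream = stream

    @abstractmethod
    def get_variants(self):
        """Yield variants as dictionaries."""


class SimpleVcfReader(AbstractReader):
    """Toy reader for the fixed VCF columns only (no INFO or sample parsing)."""

    def get_variants(self):
        for line in self.stream:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
            # Normalisation step: one output record per ALT allele.
            for alt in alts.split(","):
                yield {"chr": chrom, "pos": int(pos), "ref": ref, "alt": alt}


def count_var(genotypes):
    """Number of samples carrying the variant; assumes gt > 0 means carrier."""
    return sum(1 for gt in genotypes.values() if gt > 0)


vcf_text = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG,T\n"
variants = list(SimpleVcfReader(io.StringIO(vcf_text)).get_variants())
```

Supporting another format would mean adding another subclass with its own get_variants method, while the code that populates the database stays unchanged.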
user interface layout

the main view (figure ) of the cutevariant gui displays the list of variants together with their annotations. several gui controllers allow the user to update the view and display the list in different formats.

• fields editor: to show or hide selected annotations.
• filter editor: to build a nested list of conditional rules with or/and binary operators.
• variant info: to display in an organised way all annotations related to the currently selected variant.
• source editor: to manage different views and perform set operations (union, intersection, difference) and bed file intersections.
• word set: to manage lists of words used to generate simple filters, e.g., filter all variants belonging to a given gene list or a dbsnp list.

most of these actions end up building a vql query that can be checked in the vql-editor sub-window. the variants list can then be updated either with the controllers or by editing the vql query directly.

fig. : the cutevariant main view showing the variants list sub-window (middle), several, but not all, of the controller sub-windows (left) and the vql editor sub-window (bottom).

variant query language (vql)

to facilitate the composition of complex query-filters, the application integrates a domain specific language (dsl) named variant query language (vql). the syntax of vql has been designed to look like a subset of the sql language working on a virtual database schema. it makes use of the python module textx [ ] which provides several tools to define a grammar and create parsers with an abstract syntax tree. vql queries can be composed in the vql editor sub-window.
however, to avoid forcing users to learn the vql language, a query can also be defined from the gui using the controller sub-windows listed above. the vql query is translated, through an intermediate json object, into a well-formatted sql query and processed by the sqlite database manager. as an example, the following vql query:
    select chr,pos,consequence,sample['na '].gt from variants where gene = 'cftr' and impact = 'high'
is translated into the following sql query:
    select distinct `variants`.`id`, `variants`.`chr`, `variants`.`pos`, `annotations`.`consequence`, `sample_na `.`gt` as "sample('na ').gt"
    from variants
    left join annotations on annotations.variant_id = variants.id
    inner join sample_has_variant `sample_na ` on `sample_na `.variant_id = variants.id and `sample_na `.sample_id =
    where ( `annotations`.`gene` = 'cftr' and `annotations`.`impact` = 'high')
    limit offset

filter expressions

filter expressions are defined from the vql where clause. in the filter editor, they are displayed as a nested set of editable condition rules. logical (and/or) and comparison (=, <, >, ≤, ≥, ≠, in, not in, is null) operators are supported. regular expression matching with the ∼ operator and a special wordset keyword are included as well. this keyword allows the user to test whether a field belongs to a set of words defined a priori. for instance, in vql, to select all variants from a user-defined list of genes:
    create set genes ('gene.txt')
    select * from variants where gene in wordset['genes']

fig. : abstract syntax tree (ast) of the vql query select chr,pos,consequence from variants where gene='cftr' and impact='high'.
the ast is parsed into a python object.

group variants

the group by keyword allows the user to split the view into two panels: the list of groups on the left and, on the right, the list of all variants belonging to the selected group. this feature makes exploration easier; for instance, grouping variants by gene helps to detect compound heterozygous variants.

set operation

just like variant tools, cutevariant supports operations between variant sets. each query result can be stored in a view using the create vql keyword or by clicking the corresponding gui button. for instance, the following query creates a new view called new_view:
    create new_view from variants where gene='cftr'
it is then possible to build a query directly from this view. the following query returns the same output as the previous one:
    select chr, pos from new_view
each view behaves as a set with three operations available (difference, intersection, union), which compare variants on the chr, pos, ref and alt fields. the following queries show how to create a new view based on the different set operations:
    # difference
    create second_view = variants - new_view
    # union
    create second_view = variants + new_view
    # intersection
    create second_view = variants & new_view

plugins architectures

the cutevariant gui architecture relies entirely on plugins, whose source is available in the plugins directory. a plugin consists of a module containing different python files that implement the creation of a plugin class instance with several overloaded virtual methods. adding or removing gui controllers therefore becomes straightforward. in addition, similarly to excel, cells of the variant view can be formatted conditionally. by subclassing the formatter class, one can change the style of a cell with different colors, text or icons according to the value of the cell. for instance, impact fields with high as value can be displayed with a red background to catch the user’s attention.
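conditional formatting of this kind can be sketched in a few lines of python. the snippet below does not use cutevariant's formatter class or any qt call; `HIGHLIGHT_RULES` and `style_for_cell` are hypothetical names, and only the red background for high impact comes from the text above.

```python
from typing import Dict, Tuple

# hypothetical rule table: (field name, cell value) -> style attributes
HIGHLIGHT_RULES: Dict[Tuple[str, str], Dict[str, str]] = {
    # rule taken from the text: high impact shown on a red background
    ("impact", "HIGH"): {"background": "red"},
    # illustrative extra rule, not from the source
    ("impact", "MODERATE"): {"background": "orange"},
}


def style_for_cell(field: str, value: str) -> Dict[str, str]:
    """Return style attributes for one cell; an empty dict means default style."""
    # values are compared case-insensitively; a copy is returned so that
    # callers cannot modify the shared rule table by accident
    return dict(HIGHLIGHT_RULES.get((field, value.upper()), {}))


print(style_for_cell("impact", "high"))
print(style_for_cell("gene", "cftr"))
```

keeping the rules in a lookup table rather than in nested conditions means that a new highlighting rule is a one-line addition, which is the kind of change a formatter plugin is meant to make easy.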
currently, cutevariant supports only one formatter: cutestyle. cutevariant allows the user to build a custom url from a variant and open it in an external application. this is used, for example, to open a web link to a dbsnp database or to show a bam alignment in the igv software at the corresponding variant location. with plugins, experienced users can customize cutevariant with dedicated features or create new ones and share them with the user community.

technical details and continuous integration

cutevariant is a cross platform application implemented in python . using the qt framework for the user interface (pyside ≥ . ). the vcf parser uses the pyvcf ≥ . . library. the syntax and parser of the vql language rely on the textx ≥ . . library. sqlite is the database manager, interfaced through the python standard library. the source code and documentation are available on github [ ]. continuous integration is run on github-ci and unit tests are written with the pytest framework [ ]. the application is distributed as windows bits and bits packages. cutevariant is also available as a python package from the python package index pypi [ ].

results

in table we list the features available in cutevariant compared to other applications available on the market. the timing performance of cutevariant for different actions is reported in table and compared to the timing performance of vcf-miner, the fastest application we have evaluated. cutevariant outperforms vcf-miner except for kg.chr .anno.vcf. the reason is the large number of samples involved in computing the join table between samples and variants.

table . features available in various applications on the market.
gui applications: cutevariant, browsevcf, vcf-miner, vcf-explorer, vcf-server, vcf-filters. command line applications: gemini, variant tools, snpsift.
features | cutevariant | browsevcf | vcf-miner | vcf-explorer | vcf-server | vcf-filters | gemini | variant tools | snpsift
process annotations | no | no | no | no | yes | no | yes | no | no
vep parser | yes | yes | no | no | no | no | yes | no | no
snpeff parser | yes | yes | no | no | no | no | yes | yes | yes
sql like query | yes | no | no | no | no | no | yes | yes | yes
regular expressions | yes | no | no | no | no | no | no∗ | no∗ | yes
bed file intersection | yes | no | yes | no | no | yes | no | no | yes
set operations | yes | no | no | no | no | no | no | yes | yes
sorting | yes | yes | yes | no | yes | no | yes | yes | yes
intersect with wordset | yes | yes | no | no | no | no | no | no | yes
plugins extension | yes | no | no | no | no | no | no | no | no
indexed database | sqlite | berkeley db | mongodb | raw file | mongodb | raw file | sqlite | sqlite | raw file
data encryption | no∗∗ | no | no | no | yes | no | no | no | no
language | py /qt | py /html | js/html | c++/qt | node.js | java | py | py | java
pedigree file | yes | no | no | no | no | no | yes | yes | yes
application type | desktop | web | web | web | web | desktop | console | console | console
csv/excel export | yes | yes | yes | yes | yes | no | yes | yes | yes
multi-users support (only eight of the nine values survive in this copy, so they are not assigned to columns): no no no no yes no no no
∗ supports sql like expressions
∗∗ possible with the sqlite encryption extension

table . comparison of time performance between cutevariant and vcf-miner for importation and query execution. the query used filters the variants with qual ≥ and depth ≥ . executed on intel(r) core(tm) i - k cpu @ . ghz with gb ram.
input file: kg.chr .anno.vcf, corpas.quartlet.vcf, na .vcf
variant count: ’ ’ ’ ’
sample count:
software (for each input file): cutevariant, vcf-miner
importation time: s s s s s s
query execution time*: ≈ s, ≈ s, . s, ≈ s, . s, ≈ s

use case : sars-cov- -analysis

in the context of the covid- pandemic, we have tested cutevariant to identify mutations along the genome of the sars-cov- virus. for this, we have downloaded from the ena database a dataset (prjna ) with samples stored in a fastq file produced by the illumina sequencing platform using an amplicon library.
the pipeline is available on github [ ]. the data originate from the us delaware public health laboratory. fastq files have been aligned on the nc . genome of sars-cov- with the bwa software [ ]. variants have been called with the freebayes application [ ] and all samples have been merged into one single vcf file annotated with snpeff [ ]. this file has been imported into cutevariant for exploration. we executed a vql statement (fig. ) to extract variants within the gene s and sorted the result by the count_var annotation, which gives the total number of samples carrying the variant. sorting is easily done by clicking on the corresponding header of the view. the mutation p.asp gly (highlighted in fig. ) is found in samples out of . this variant has already been described [ ] as a dominant one emerging at the beginning of the pandemic. in the same way, by scrutinizing all the genes, we have identified two other mutations: (orf ab)p.thr ile and (orf a)p.gln his, which are exclusive to the north american population [ ].

fig. : mutation found in gene s of sars-cov- by a cutevariant analysis of samples.

use case : cohort analysis

we have repeated with cutevariant the analysis given as an example by snpsift [ ]. it is a cohort analysis of individuals among which are affected by a nonsense mutation in the cftr gene (g *). this analysis cannot be performed with any of the graphical applications listed previously (table ). after importing the annotated vcf file and the corresponding ped file, the following vql query was processed by cutevariant, selecting variants with high impact which are homozygous in case samples but not in control samples.
snpsift uses the following query:
    cat protocols/ex .ann.cc.vcf \
    | java -jar snpsift.jar filter \
    (cases[ ]= ) & (controls[ ]= ) ((ann[*].impact='high')|\
    (ann[*].impact='moderate')) \
    > protocols/ex .filtered.vcf
the equivalent cutevariant vql query, which provides the same results, reads:
    select chr, pos from variants where case_count_hom= control_count_hom= and impact in ('high', 'moderate')

discussion

performance

cutevariant is implemented with the open-source qt for python [ ], which provides a set of python bindings to build modern user interfaces. instead of using native qt/c++ as the coding language, we have opted for python because it is by far the most frequently used coding language in the bioinformatics community. this choice does not cause any significant performance degradation of the cutevariant gui. execution time for queries performed on a complete genome with many filters can become particularly long. this is primarily due to the sql count statement, which browses through all the variants to calculate the total number of variants. the table join statement is also time consuming. this is the consequence of the choice made for cutevariant, unlike gemini, to store samples and a few annotations in separate tables to avoid table denormalization and to minimize disk space occupation. this time penalty has been reduced, on the one hand, by using a memory cache so that identical vql queries do not need to recalculate the count of variants and, on the other hand, by running asynchronous queries in dedicated threads, which avoids freezing the gui while a progress bar shows the loading status.

web app vs desktop app

cutevariant is a serverless desktop application and therefore does not provide annotation or multi-user features. the annotation step must be carried out upstream, at the end of an analysis pipeline, by using dedicated tools such as snpsift or vep. multi-user capabilities allow users to share custom annotations and comments.
for instance, a user marks a variant as pathogenic and this information is shared among all users. although this feature is not supported by cutevariant, it can be delegated to other tools such as myvariant.info [ ], which provides a database of variants with which cutevariant can communicate through a rest api. these data can then be used as a source of annotation in the annotation step of the pipeline.

a general purpose and customizable tool

cutevariant is a general purpose tool to filter variants and is fully customizable thanks to its plugin-based implementation; it thus offers features and modularity that are not available in existing applications. since cutevariant is not specific to the analysis of the human genome, it can be used with any vcf file, as we demonstrated here with the sars-cov- example. gui options dedicated to specific tasks are not hard coded in the application but can easily be added to cutevariant by creating new plugins. as an example of such added gui options, the trio analysis plugin, selected from the tools menu, allows users to build from the gui a vql filter including the transmission mode and the family tree.

conclusion

cutevariant is a new desktop application devoted to exploring genetic variations in vcf data provided by next generation sequencing. it is the first gui software of its kind that integrates both a user friendly graphical user interface and a domain specific language. starting from a low learning threshold, end-users can easily perform complex filtering to identify variants of interest. cutevariant is a standalone application that runs on standard desktop computers under linux, macos or windows operating systems. the python-based plugin architecture makes the application easily expandable with new features, thus offering the possibility to involve the bioinformatics community at large in the development of new features.
acknowledgments

we would like to thank lucas bourneuf and pierre vignet for their contribution to the development.

funding

this work has been supported by ubo, université de bretagne occidentale, france.

conflict of interest: none declared

references

. petr danecek, adam auton, goncalo abecasis, cornelis a. albers, eric banks, mark a. depristo, robert e. handsaker, gerton lunter, gabor t. marth, stephen t. sherry, gilean mcvean, and richard durbin. the variant call format and vcftools. bioinformatics, : – , . . william mclaren, laurent gil, sarah e. hunt, harpreet singh riat, graham r.s. ritchie, anja thormann, paul flicek, and fiona cunningham. the ensembl variant effect predictor. genome biology, : – , . . pablo cingolani, adrian platts, le lily wang, melissa coon, tung nguyen, luan wang, susan j. land, xiangyi lu, and douglas m. ruden. a program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: snps in the genome of drosophila melanogaster strain w ; iso- ; iso- . fly, : – , . . umadevi paila, brad a. chapman, rory kirchner, and aaron r. quinlan. gemini: integrative exploration of genetic variation and genome annotations. plos computational biology, , . . gao t. wang, bo peng, and suzanne m. leal. variant association tools for quality control and analysis of large-scale sequence and genotyping array data. american journal of human genetics, : – , . . richard d hipp. sqlite. https://www.sqlite.org/index.html, . . heng li. a statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. bioinformatics, ( ): – , . . anne-sophie lebre and jean-marc rey. seqone.
https://seq.one/, jan . . manuel holtgrewe, oliver stolpe, mikko nieminen, stefan mundlos, alexej knaus, uwe kornak, dominik seelow, lara segebrecht, malte spielmann, björn fischer-zirnsak, felix boschann, ute scholl, nadja ehmke, and dieter beule. varfish: comprehensive dna variant analysis for diagnostics and research. nucleic acids research, (w ):w –w , . . steven n. hart, patrick duffy, daniel j. quest, asif hossain, mike a meiners, and jean-pierre kocher. vcf-miner: gui-based application for mining variants and annotations stored in vcf files. briefings in bioinformatics, ( ): – , . . w james kent et al. the human genome browser at ucsc. genome res., ( ): – , . . heiko müller, raul jimenez-heredia, ana krolo, tatjana hirschmugl, jasmin dmytrus, kaan boztug, and christoph bock. vcf.filter: interactive prioritization of disease-linked genetic variants from sequencing data. nucleic acids research, (w ):w –w , . . empowering app development for developers. https://www.docker.com/. . mark ziemann, yotam eren, and assam el-osta. gene name errors are widespread in the scientific literature. genome biology, , . . pablo cingolani, fiona cunningham, will mclaren, and kai wang. variant annotations in vcf format. http://www.ensembl.org/help/glossary?id= . . adrian tan, gonçalo r. abecasis, and hyun min kang. unified representation of genetic variants. bioinformatics, ( ): – , . . i. dejanović, r. vaderna, g. milosavljević, and vuković. textx: a python tool for domain-specific languages implementation. knowledge-based systems, : – , . . cutevariant. https://github.com/labsquare/cutevariant. . pytest. https://docs.pytest.org/en/stable. . python package index. https://pypi.org/. . github covid. https://github.com/dridk/sars-cov- -ngs-pipeline. . heng li and richard durbin. fast and accurate short read alignment with burrows-wheeler transform. bioinformatics, : – , . . erik garrison and gabor marth. haplotype-based variant detection from short-read sequencing.
http://arxiv.org/abs/ . , . . p. cingolani, a. platts, m. coon, t. nguyen, l. wang, s.j. land, x. lu, and d.m. ruden. a program for annotating and predicting the effects of single nucleotide polymorphisms, snpeff: snps in the genome of drosophila melanogaster strain w ; iso- ; iso- . fly, ( ): – , . . bette korber, will m. fischer, sandrasegaram gnanakaran, hyejin yoon, james theiler, werner abfalterer, nick hengartner, elena e. giorgi, tanmoy bhattacharya, brian foley, kathryn m. hastie, matthew d. parker, david g. partridge, cariad m. evans, timothy m. freeman, thushan i. de silva, adrienne angyal, rebecca l. brown, laura carrilero, luke r. green, danielle c. groves, katie j. johnson, alexander j. keeley, benjamin b. lindsey, paul j. parsons, mohammad raza, sarah rowland-jones, nikki smith, rachel m. tucker, dennis wang, matthew d. wyles, charlene mcdanal, lautaro g. perez, haili tang, alex moon-walker, sean p. whelan, celia c. labranche, erica o. saphire, and david c. montefiori. tracking changes in sars-cov- spike: evidence that d g increases infectivity of the covid- virus. cell, : – .e , . . xumin ou, zhishuang yang, dekang zhu, sai mao, mingshu wang, renyong jia, shun chen, mafeng liu, qiao yang, ying wu, xinxin zhao, shaqiu zhang, juan huang, qun gao, yunya liu, ling zhang, maikel peopplenbosch, qiuwei pan, and anchun cheng. tracing two causative snps reveals sars- cov- transmission in north america population. biorxiv, page . . . , . . snpeff usage example. https://pcingola.github.io/snpeff/examples/. . the qt company. qt for python: the official python bindings for qt. https://www.qt.io/qt-for-python. . variant annotation as a service. https://myvariant.info/. . silvia salatino and varun ramraj. browsevcf: a web-based application and workflow to quickly prioritize disease-causative variants in vcf files. briefings in bioinformatics, : – , . . steven n. hart, patrick duffy, daniel j. quest, asif hossain, mike a. meiners, and jean pierre kocher. 
vcf-miner: gui-based application for mining variants and annotations stored in vcf files. briefings in bioinformatics, : – , . . jianping jiang, jianlei gu, tingting zhao, and hui lu. vcf-server: a web-based visualization tool for high-throughput variant data mining and management. molecular genetics and genomic medicine, , . . f. anthony san lucas, gao wang, paul scheet, and bo peng. integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools. bioinformatics, : – , . . the qt company. cross-platform software development for embedded and desktop. https://www.qt.io/. . manuel holtgrewe, oliver stolpe, mikko nieminen, stefan mundlos, alexej knaus, uwe kornak, dominik seelow, lara segebrecht, malte spielmann, björn fischer-zirnsak, felix boschann, ute scholl, nadja ehmke, and dieter beule. varfish: comprehensive dna variant analysis for diagnostics and research. nucleic acids research, :w –w , . . damian smedley, julius o b jacobsen, marten jager, sebastian köhler, manuel holtgrewe, max schubach, enrico siragusa, tomasz zemojtel, orion j buske, nicole l washington, william p bone, melissa a haendel, and peter n robinson. next-generation diagnostics and disease-gene discovery with the exomiser. nature protocols, : , . . dna sequencing. https://www.integragen.com/service-solutions/dna-sequencing, oct . . adrian tan, gonçalo r. abecasis, and hyun min kang. unified representation of genetic variants. bioinformatics, : – , .

accelerating covid- research with graph mining and transformer-based learning

ilya tyagin, center for bioinformatics and computational biology, university of delaware, newark, de, tyagin@udel.edu
ankit kulshrestha, computer and information sciences, university of delaware, newark, de, akulshr@udel.edu
justin sybrandt∗, school of computing, clemson university, clemson, sc, jsybran@clemson.edu
krish matta, charter school of wilmington, wilmington, de, matta.krish@charterschool.org
michael shtutman, drug discovery and biomedical sciences, university of s. carolina, columbia, sc, shtutmanm@sccp.sc.edu
ilya safro, computer and information sciences, university of delaware, newark, de, isafro@udel.edu

abstract

in , the white house released the “call to action to the tech community on new machine readable covid- dataset,” wherein artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to covid- . the allen institute for ai and collaborators announced the availability of a rapidly growing open dataset of publications, the covid- open research dataset (cord- ). as the pace of research accelerates, biomedical scientists struggle to stay current.
to expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. we present two automated general-purpose hypothesis generation systems, agatha-c and agatha-gp, for covid- research. the systems are based on graph mining and the transformer model. they are validated at scale using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis. both systems achieve high-quality predictions across domains (in some domains up to . % roc auc) in fast computational time and are released to the broad scientific community to accelerate biomedical research. in addition, by performing a study curated by domain experts, we show that the systems are able to discover on-going research findings such as the relationship between covid- and the hormone oxytocin.

reproducibility: all code, details, and pre-trained models are available at https://github.com/ilyatyagin/agatha-c-gp

ccs concepts
• applied computing → bioinformatics; document management and text processing; • computing methodologies → learning latent representations; neural networks; information extraction; semantic networks.

∗now with google brain. contact: jsybrandt@google.com.

keywords: hypothesis generation, literature-based discovery, transformer models, semantic networks, biomedical recommendation

introduction

the development of vaccines for covid- is a major triumph of modern medicine and of humankind’s ability to accelerate scientific research. while we are all hoping to see large-scale positive changes from fast mass adoption of the existing vaccines, there remain significant open research questions around covid- . the scientific community has a responsibility to do everything possible to block the ongoing transmission of the dangerous virus and to accelerate research to mitigate its consequences.
we present the following automated knowledge discovery system in order to propose new tools that could complement the existing arsenal of techniques to accelerate biomedical and drug discovery research for events like covid- .

the covid- pandemic became one of the most important events in the information space since the end of . the pace of published scientific information is unprecedented and spans all resolutions, from news and pop-science articles to drug design at the molecular level. the pace of scientific research has already been a significant problem in science for years [ ], and under current circumstances this factor becomes even more pronounced. several thousand papers are being added weekly to cord- [ ] (the dataset of publications related to covid- ) and even more to medline [ ]. as a result, groups working on similar problems may not be immediately aware of each other’s findings, which can lead to inefficient investments and production delays.

under normal circumstances, the medline database of biomedical citations receives approximately , new papers per year. currently this database indexes million total citations. this pace challenges traditional research methods, which often rely on human intuition when searching for relevant information. as a result, the demand for modern ai solutions to help with the automated analysis of scientific information is incredibly high. for instance, the field of drug discovery has explored a range of ai analytical tools to expedite new treatments [ ]. designing lab experiments and finding candidate chemical compounds is a costly and long-lasting procedure, often taking years.

figure : number of new citations per week in cord- dataset.

to accelerate scientific discovery, researchers came up with a family of strategies to utilize public knowledge from databases like medline that are available through the national institute of health (nih), which facilitate automated hypothesis generation (hg), also known as literature-based discovery. undiscovered public knowledge, that is, information that is implicitly present within available literature but is not yet explicitly known by an individual who can act on it, represents the target of our work.

although there are quite a few automated hg systems [ ], including those we have previously proposed [ , ], none of them is currently customized and available in the open domain to process covid- related queries at scale. in addition to the traditional general requirements for hg systems, such as high-quality hypotheses, interpretability and availability for the broad scientific community, the specific demands of covid- data analysis require: ( ) customization of the vocabulary and other logical units such as subject-verb-object predicates; ( ) customization of the training data, which in the reality of urgent research contains a lot of controversial and incorrect information; ( ) models for different information resolutions; and ( ) validation on the on-going domain-specific discovery.

our contribution: in this work we bridge this gap by releasing agatha-c and agatha-gp, reliable and easy to use hg systems that demonstrate state-of-the-art performance and validate their inference capabilities on both covid- related and general biomedical data.
to make them closely related to different goals of covid- research, they correspond to micro- (agatha-c, for covid- ) and macroscopic (agatha-gp, for general purpose) scales of knowledge discovery. both systems are able to process any query that connects biomedical concepts, but agatha-c exhibits better results on molecular-scale queries, e.g., those that are relevant to drug design, and agatha-gp works better for general queries, e.g., establishing connections between a certain profession and covid- transmission.

both systems are the next generation of the agatha knowledge network mining transformer model [ ]. they substantially improve the quality of the previous agatha by introducing a new information layer into the multi-layered semantic knowledge network pipeline and by expanding the information retrieval techniques that facilitate inference. we deploy the deep learning transfer model trained with up-to-date datasets and provide an easy to use interface for the broad scientific community to conduct covid- research. we validate the system via candidate ranking [ , ] using very recent scientific publications containing findings absent from the training set. while the original agatha demonstrated state-of-the-art performance at the time of its release, agatha and other systems were found to perform with notably lower quality on extremely rapidly changing covid- research. we demonstrate a remarkable improvement of approximately - % (in roc-auc) on average on different types of queries, with a very fast query process that allows validation at scale. in addition, we demonstrate that the proposed system can identify recently uncovered gene (bst ) and hormone (oxytocin and melatonin) relationships to covid- , using only papers published before these connections were discovered.
reproducibility: all code, details, and pre-trained models are available at https://github.com/ilyatyagin/agatha-c-gp

background

the cord- dataset [ ] was released as a response to the world’s covid- pandemic to help data science experts and researchers tackle the challenge of answering high-priority scientific questions. it is updated daily and was created by the allen institute for ai in collaboration with microsoft research, nlm, ibm and other organizations. at the time of this publication it contains over . scientific abstracts and over . full-text papers about coronaviruses, primarily covid- .

medline is a database of the nih that includes almost million citations (as of ) of scientific papers related to the biomedical and related fields. some of the citations are provided with mesh (medical subject headings) terms and other metadata. medline is one of the largest and best-known resources for biomedical text mining.

hypothesis generation systems. the hg field has been present in information sciences for several decades. the first notable approach, called the a-b-c model, was proposed by swanson et al. in [ ]. the concept of the a-b-c model is to discover intermediate (b) terms which occur in titles of publications for both terms a (source) and c (target). in their experiments, swanson et al. discovered an implicit connection between raynaud’s syndrome (term a) and fish oil (term c) through blood viscosity (term b), which was mentioned in both sets. the hypothesis that fish oil can be used for patients with raynaud’s disease was experimentally confirmed several years later [ ]. the key idea of the proposed method is that all fragmented bits of information are explicitly known, while their implicit relationships are what hg systems aim to uncover. we note the difference between hg and traditional information retrieval.
information retrieval techniques, which represent the vast majority of biomedical literature-based discovery systems, are trained and (even more importantly) validated to retrieve existing information, whereas hg techniques predict undiscovered knowledge and thus must be massively validated on it. hg validation requires training the system strictly on historical data rather than sampling it over the entire time span. advances in machine and deep learning have transformed the algorithmics of hg systems (see sec. ), which are now able to process much larger volumes of information and deliver much higher quality predictions.
however, the lack of broader applicability of hg systems during the covid-19 pandemic demonstrates that several major issues exist and require immediate attention:
(1) most of the existing hg systems are domain-specific (e.g., gene-disease interactions), which usually shows up as limits on the processed information (e.g., heavy filtering of vocabulary and papers to a specific domain in probabilistic topic modeling [ ]);
(2) proper validation of an hg system remains a technical problem, because multiple large-scale models have to be trained with all heterogeneous data after a cut-off several years back carefully eliminated;
(3) moreover, a large number of hg systems are not massively validated at all, except for rediscovery of very old findings [ ] or demonstration of just a few proactive examples in a human-curated investigation; and
(4) interpretability and explainability of generated hypotheses remain a major issue.

the umls metathesaurus [ ] is the nih database containing information about millions of concepts (both medical and general) and their synonyms. the metathesaurus accumulates information about its entries from more than different vocabularies, which makes it possible to map and connect concepts from different terminologies. the metathesaurus also keeps metadata about the concepts, such as semantic types and their hierarchy. the core unit of information in umls is the concept unique identifier, or cui. a cui is a codified representation of a specific term, which includes its different atoms (spelling variants or translations of the term into other languages), vocabulary entries, definitions and other metadata.

semrep [ ] is a software kit developed by nih for extracting semantic predicates (subject-verb-object triples) from a provided corpus. it can also extract entities that are not involved in any semantic predicate, if the corresponding option is selected.
the official example of possible semrep output is: input = “we used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.”, output = “hemofiltration-treats-patients; digoxin overdose-process_of-patients; hyperkalemia-complicates-digoxin overdose; hemofiltration-treats(infer)-digoxin overdose”. semrep handles word sense disambiguation and maps terms to the corresponding cuis from the umls metathesaurus.

scispacy [ ] is a special version of spacy maintained by allenai, containing spacy models for processing scientific and biomedical texts. scispacy models are trained on different sources, such as pmc-pretrained word2vec representations, the medmentions entity linking dataset and so on. scispacy can handle various nlp tasks, such as ner, dependency parsing and pos-tagging, where it achieves state-of-the-art performance.

scibert [ ] is a bert-like pretrained transformer language model for which full-text scientific papers were used as the training dataset. embeddings are learned in a word-piece fashion, so they capture relationships not only between words in a sentence but also between word parts within each word.

faiss [ ] is a library for fast approximate clustering and similarity search between dense vectors. it scales to huge datasets that do not fit in ram and can be used in a distributed fashion. faiss is used in our pipeline to perform 𝑘-means clustering of pq-quantized sentence vectors in order to generate 𝑘-nearest neighbor edges between similar sentences (nodes) in the knowledge network.

figure : agatha multi-layered graph schema.

ptbg [ ] (pytorch biggraph) is a high-performance graph embedding system that supports distributed training. it was designed to handle large heterogeneous networks containing hundreds of millions of nodes of different types and billions of typed edges. distributed training is achieved by computing embeddings on disjoint node sets.
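the semrep example output quoted above can be turned into structured subject-verb-object triples with a small parser. this is only a sketch: it assumes the flattened "subject-relation-object; ..." display used in that example (not semrep's native full-fielded output format), and it assumes that subjects and relations contain no hyphens. the names `Predicate` and `parse_predicates` are hypothetical.

```python
from typing import List, NamedTuple


class Predicate(NamedTuple):
    subject: str
    relation: str
    obj: str
    inferred: bool  # True when the relation carries the "(infer)" marker


def parse_predicates(output: str) -> List[Predicate]:
    """Parse 'subj-rel-obj; subj-rel-obj; ...' into Predicate records."""
    predicates = []
    for chunk in output.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Assumption: subject and relation contain no hyphen, so the first
        # two hyphens delimit the triple; the object keeps any later hyphens.
        subject, relation, obj = chunk.split("-", 2)
        inferred = relation.endswith("(infer)")
        if inferred:
            relation = relation[: -len("(infer)")]
        predicates.append(
            Predicate(subject.strip(), relation.strip(), obj.strip(), inferred)
        )
    return predicates


example_output = (
    "hemofiltration-treats-patients; "
    "digoxin overdose-process_of-patients; "
    "hyperkalemia-complicates-digoxin overdose; "
    "hemofiltration-treats(infer)-digoxin overdose"
)
parsed = parse_predicates(example_output)
```

the last triple is flagged as inferred, which keeps the distinction between relations stated in the sentence and relations semrep derives from them.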
allennlp open information extraction. allennlp [ ] is a powerful library developed by allenai that uses a pytorch backend to provide deep learning models for various natural language processing tasks. specifically, allennlp open information extraction provides a trained deep bi-lstm model for extracting predicates from unstructured text. an api is provided for running inference in both single-sentence and batch modes.

pipeline summary

we briefly summarize the agatha semantic graph construction pipeline. it is described in greater detail in the original paper [ ].

text pre-processing. the input for our system is a corpus of scientific citations from the medline and cord-19 datasets. these files contain titles and abstracts for millions of biomedical papers. we filter out non-english documents, using the fasttext language identification model [ ] if the language is not provided. after that we split all abstracts into sentences and process all sentences with the scispacy library. from each sentence we extract pos-annotated lemmas and entities, and perform 𝑛-gram mining, where 𝑛 ∈ [ , , ] and 𝑛-grams are composed of frequently co-occurring lemmas. additionally, we associate each sentence with any relevant metadata, such as the mesh/umls keywords provided along with the citation.

semantic graph construction. we construct a semantic graph containing different types of nodes, namely, sentences, entities, coded terms (from umls and mesh), 𝑛-grams, lemmas, and predicates, following the schema depicted in figure . edges between sentences are induced from the nearest-neighbors network of sentence embeddings. we also include an edge between two sentences that appear sequentially within the same abstract, counting the title as the first sentence. other edges can be inferred directly from the recorded metadata. for instance, the node representing the entity “covid-19” is connected to every sentence and predicate that discusses covid-19.

nlm umls implementation.
the prior agatha semantic network only includes umls terms that appear in semmeddb predicates [ ], which is a major limitation. in this work we enrich the “coded term” layer by introducing an additional preprocessing phase in which we run the semrep tool, with the full-fielded output option, on the entire input corpus ourselves. this phase would be necessary in any case, as cord-19 and the most recent medline citations are not represented in the slowly updated semmeddb. however, we also find that we can substantially increase the quality of recovered terms by applying these tools ourselves. by doing so we not only enrich the “coded terms” layer of the semantic network, but also introduce a significant number of previously uncovered semantic predicates. this happens because semmeddb is a cumulative database whose citations were processed over many years with various versions of semrep and various umls releases available at different times.

to illustrate, consider the following example (pmid: ): "the results showed that v. cholerae o and also other related enteric pathogens have the essential cass components (crispr and cas genes) to mediate a rnai-like pathway." the current semrep version extracts the predicate crispr-affects-rnai, while semmeddb does not contain any predicates for this sentence. the year of publication of the corresponding paper is , but the crispr term (c ) did not exist in the umls metathesaurus on or before , which is why a crispr-involved relation could not be identified at the time this citation was added to semmeddb.
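the edge rules from the semantic graph construction step summarized above (sequential sentences, entity mentions, and nearest neighbors of sentence embeddings) can be sketched as follows. this is a toy illustration under stated assumptions, not the agatha implementation: it uses brute-force cosine similarity in place of faiss, and the node-id scheme, the function `build_edges`, and the sample data are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_edges(abstracts, embeddings, k=1):
    """Return a set of undirected edges, each stored as a sorted pair.

    abstracts:  {abstract_id: [(sentence_text, [entities]), ...]}, title first.
    embeddings: {sentence_node_id: vector} for every sentence node.
    """
    edges = set()

    def add_edge(u, v):
        edges.add(tuple(sorted((u, v))))

    for abstract_id, sentences in abstracts.items():
        for i, (_text, entities) in enumerate(sentences):
            sentence_id = f"s:{abstract_id}:{i}"
            if i > 0:
                # Consecutive sentences of one abstract are linked.
                add_edge(f"s:{abstract_id}:{i - 1}", sentence_id)
            for entity in entities:
                # Each entity node is linked to the sentences that mention it.
                add_edge(sentence_id, f"e:{entity}")

    # Brute-force k-nearest-neighbor edges between sentence embeddings
    # (a stand-in for the approximate search done at scale).
    ids = sorted(embeddings)
    for sid in ids:
        others = [o for o in ids if o != sid]
        others.sort(key=lambda o: cosine(embeddings[sid], embeddings[o]),
                    reverse=True)
        for neighbor in others[:k]:
            add_edge(sid, neighbor)
    return edges


toy_abstracts = {
    "p1": [("title about covid-19", ["covid-19"]), ("second sentence", [])],
    "p2": [("another covid-19 title", ["covid-19"])],
}
toy_embeddings = {
    "s:p1:0": [1.0, 0.0],
    "s:p1:1": [0.0, 1.0],
    "s:p2:0": [0.9, 0.1],
}
toy_edges = build_edges(toy_abstracts, toy_embeddings, k=1)
```

the brute-force neighbor search is quadratic in the number of sentences, which is exactly why a library such as faiss is needed once the graph reaches millions of sentence nodes.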
graph embedding. we embed our large semantic graph using a heterogeneous technique that captures node similarity through a biased transformed dot product. by explicitly including a bias term for each node, we capture a concept's overall affinity within the network, which is critical for such general terms as “coronavirus.” by learning transformations between each pair of node types (e.g., between sentences and lemmas), we enable each type to occupy embedding spaces with differing characteristics. specifically, we fit an embedding model that optimizes the following similarity measure:

$$ s(u, v) = \hat{u}_0 + \hat{v}_0 + T_{uv,0} + \sum_{i=1}^{d} \hat{u}_i \left( \hat{v}_i \, T_{uv,i} \right) $$

where 𝑢, 𝑣 are nodes in the semantic graph with embeddings $\hat{u}$, $\hat{v}$, and 𝑇𝑢𝑣 is the directional transformation vector from nodes of 𝑢’s type to nodes of 𝑣’s type. we use the ptbg heterogeneous graph embedding library to learn 𝑑 = dimensional embeddings for each node of our large semantic graph. while fitting embeddings ($\hat{u}$) and transformation vectors (𝑇𝑢𝑣), we represent each edge of the semantic graph as two directed edges. these learned values are optimized using softmax loss, where the similarity for one edge is compared against the similarities of negative samples.

ranking semantic predicates (transformer model). after we obtain an embedding for each node in the semantic graph, we train the agatha ranking model. this model is trained to rank published subject-object pairs above randomly composed pairs of umls concepts (negative samples). two coded terms, along with a fixed-size random subsample of predicates containing each term, are the input to this model. graph embeddings for each term and predicate are fed into stacked transformer encoder layers, which apply multi-headed self-attention across the embedding set.
the last set of encodings is averaged and the result is projected to the unit interval, forming a scalar prediction for the input’s “plausibility.”

figure : predicate extraction pipeline with deep learning based open ie system. (panel labels: medline, cord-19, process abstracts, umls concept tagging, allennlp predictor, semnet filter, final predicates.)

formally, the model to evaluate term pairs is defined as:

$$
\begin{aligned}
f(x, y) &= g\left(\left[\hat{x}\;\hat{y}\;\hat{x}'_1 \dots \hat{x}'_k\;\hat{y}'_1 \dots \hat{y}'_k\right]\right) \\
g(X) &= \mathrm{sigmoid}(M\Theta) \\
M &= \frac{1}{|X|}\,\mathrm{ColSum}\left(E_N(\mathrm{FeedForward}(X))\right) \\
E_0(X) &= X \\
E_{i+1}(X) &= \mathrm{LayerNorm}\left(\mathrm{FeedForward}(A(X)) + A(X)\right) \\
A(X) &= \mathrm{LayerNorm}\left(\mathrm{MultiHeadAttention}(X) + X\right)
\end{aligned}
$$

where each 𝑥′ and 𝑦′ is randomly sampled from the neighborhoods of 𝑥 and 𝑦 respectively, and each $\hat{\cdot}$ denotes the graph embedding of the given node. furthermore, Θ represents a free parameter, which is fit along with the parameters internal to each feedforward and multiheadattention layer, following the standard conventions for each. the above model is fit using margin ranking loss, where predicates from the training set are compared against a large set of negative samples. additional details on specific optimization choices for this model are given in the work that originally proposed it [ ].

augmenting semantic predicates with deep learning

we used the semrep predicate extraction system in the first system, agatha-c, to extract predicates from the abstracts. however, semrep relies on expert-coded rules and heuristics to extract biomedical relations, which leads to significantly fewer predicates for training. thus, in order to augment the predicates (for the second system, agatha-gp), we decided to use a deep learning based information extraction system by stanovsky et al. [ ]. figure shows our overall predicate extraction pipeline.

abstract pre-processing.
the input for the proposed semantic predicate extraction system is the set of output files generated by the semrep tool with the full-fielded output option enabled, obtained from the pre-processing stage described in sec. . as mentioned previously, semrep not only extracts semantic triples but also maps entities found in the input corpus to their corresponding umls concept ids; this mapping is the data used by the following method. the initial set of records includes the raw sentence texts and the umls terms extracted from them, and it is augmented throughout the pipeline, making it easier to extract the final predicates for downstream training.

raw predicate extraction. we use a pre-trained instance of rnnoie [ ] provided as an api by allennlp. the model was trained on the oie corpus. at a high level, the model aims to learn a joint embedding of individual words and their corresponding beginning-inside-outside (bio) tags. the output of the model is a probability distribution over the bio tags. during inference the model selects specific phrases and groups them into arg0, v, arg1 tags. by convention, we treat arg0 as the subject and arg1 as the object in a subject-verb-object tuple. to speed up processing and scale it to thousands of abstracts, we leverage model-parallelism across different machines and run batch-mode inference on chunks of abstracts. once the model predictions have been obtained, we extract the phrases with relevant tags into raw predicates and add them to the record.
a subsequent filtering step extracts the terms that match umls concepts previously detected in the sentence.

semnet filtering. using a general-purpose rnnoie model has its own challenges. during processing we noted that many raw predicates were either too general or carried too little meaning to be useful for training a prediction model. to overcome this challenge we designed a corrective filter that reduces noise and retains the most useful predicates. we call this filter the semnet filter. each umls concept has an associated semantic type (e.g., covid-19 has the associated semantic type dsyn (disease)). this is useful for summarizing a large set of diverse text concepts into a smaller number of categories. we used the metadata from semantic types to construct two networks: a semantic network and a hierarchical network. the semantic network consists of semantic types as nodes, and its edges imply a corresponding direct relation between them. the hierarchical network connects each semantic type to its more general semantic types. for example, the semantic type dsyn (disease) is more generally associated with biof (biological function) or pathf (pathological function). in order to filter a predicate, all edges emanating from the subject’s semantic types are computed on a per-predicate basis. these edges also include any specific-general concept relationships. if the object’s semantic type is found in the candidate edge set, we deem the predicate valid. in our experiments, we found that this filtering method eliminates a significant share of the predicates that do not directly pertain to the biomedical domain.

processing abstracts at scale. building a pipeline that scales to thousands of abstracts is not a trivial task.
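returning to the semnet filter described above, its acceptance rule can be sketched as a small reachability check. this is one possible reading of the description, not the authors' code: it assumes the candidate set is built from the semantic-network edges of the subject's types and of their more general types, plus those general types themselves. the toy networks below are hypothetical and are not the actual umls semantic network.

```python
def generalizations(sem_type, hierarchy):
    """Return sem_type together with every more general type above it."""
    seen, stack = {sem_type}, [sem_type]
    while stack:
        for parent in hierarchy.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_valid_predicate(subject_types, object_types, semantic_edges, hierarchy):
    """Keep a predicate if an object type lies in the subject's candidate set.

    semantic_edges: {semantic_type: set of directly related semantic types}
    hierarchy:      {semantic_type: set of more general semantic types}
    """
    candidates = set()
    for subject_type in subject_types:
        for sem_type in generalizations(subject_type, hierarchy):
            # Direct relations of the type (or of one of its generalizations).
            candidates |= set(semantic_edges.get(sem_type, ()))
            # Specific-to-general relationships count as edges too.
            if sem_type != subject_type:
                candidates.add(sem_type)
    return any(t in candidates for t in object_types)


# Hypothetical toy networks for illustration only.
toy_hierarchy = {"dsyn": {"pathf", "biof"}}
toy_semantic_edges = {"phsu": {"dsyn"}, "pathf": {"orgf"}}
```

for example, with these toy networks a (phsu, dsyn) predicate is kept through a direct semantic edge, a (dsyn, orgf) predicate is kept through the generalization of dsyn to pathf, and a (phsu, geoa) predicate is dropped.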
in order to extract predicates with the rnnoie model and obtain quality terms of interest, we have to contend not only with the problem of running inference on a deep neural network but also with the task of aligning the extracted terms with the entities recognized by semrep.

deployment details: the rnnoie model by stanovsky et al. uses a deep bi-lstm [ ] model to learn the joint word embedding and predict the resulting semantic position tags. since lstms are inherently sequential models, the inference time per sentence is considerable. we first tried processing an entire collection of abstracts at once on a cluster of machines, each consisting of cpus, using the dask [ ] library. the entire process took more than hours. considering that we had about such collections, this inference time was prohibitively high. in order to speed up inference we read each collection once and distributed chunks of abstracts over the machines. this change helped us cut the processing time from over a week to just over days for the medline corpus. for the cord-19 corpus the processing time was even shorter, at days. the next step was to align the extracted predicates with the biomedical concepts recognized by semrep. we achieved this alignment by first building an index of the files that contained a specific abstract id and then processing the rnnoie predicates with this index. we further optimized the indexing phase by updating the existing index each time we processed more than 𝜏 abstracts. the semnet filter does not introduce additional computational overhead and can process a thousand abstracts in under second. hence, to obtain the most relevant set of predicates we were able to parallelize over “checkpoints” (each of which contained k abstracts) in an hour.

validation

a fair validation of hg systems is extremely challenging, as these models are designed to predict novel connections that are unknown even to those who evaluate the system [ ].
in addition, even when a system is validated by rediscovering findings from historical data, the process is computationally expensive: multiple models must be trained to understand how many months (or years) ahead the hg system can predict the findings, which requires careful filtering of the papers, vocabulary and other types of data used. to present our results in terms of their usefulness for urgent cord-19-related hg, we use a historical benchmark, which is conceptually described in [ ]. this technique is fully automated and does not require any intervention by domain experts.

positive samples collection. we use semrep and the approach proposed in sec. to process the most recent cord-19 citations, which were published after the specified cut date, making sure that these citations are not included in the training set. after that we extract all subject-object pairs from the obtained results and explicitly check that none of these pairs is present in the training set. pairs mentioned in cord-19 fewer than twice are filtered out of the validation set; almost all of them are either noisy or represent information that already appears in other pairs (e.g., because of a difference in grammar). we also use the strategy of subdomain recommendation, which works in the following way. for each umls term we collect its semantic type (which is part of the metadata provided in the umls metathesaurus) and group all extracted semrep pairs by the term-pair criterion (the combination of subject and object types). then we identify the top- most common term-pair subdomains and construct the validation set from pairs belonging to these subdomains.

negative samples generation. to generate negative samples per domain, random sampling is used: for each positive sample we keep its subject and randomly sample an object belonging to the same semantic type as the object of the source pair. we do this times, thus obtaining negative domain-specific samples for each positive sample.
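the negative sample generation just described can be sketched as follows. this is a minimal illustration: the function name, the toy vocabulary, and the exclusion of objects that already form a known positive pair with the subject are assumptions that are not stated in the text, and sampling is done with replacement.

```python
import random


def negative_samples(positive, type_of, known_pairs, n, rng):
    """Generate n domain-specific negatives for one positive (subject, object).

    The subject is kept; the object is replaced by a random term with the
    same semantic type as the original object.
    """
    subject, obj = positive
    pool = sorted(
        term
        for term, sem_type in type_of.items()
        if sem_type == type_of[obj]
        and term != obj
        # Assumption: avoid sampling a pair that is itself a known positive.
        and (subject, term) not in known_pairs
    )
    if not pool:
        return []
    # Sampling with replacement, so duplicates are possible for small pools.
    return [(subject, rng.choice(pool)) for _ in range(n)]


# Hypothetical toy vocabulary: term -> semantic type code.
toy_types = {"d1": "dsyn", "h1": "horm", "h2": "horm", "h3": "horm", "g1": "gngm"}
toy_known = {("d1", "h1"), ("d1", "h2")}
toy_negatives = negative_samples(("d1", "h1"), toy_types, toy_known,
                                 n=3, rng=random.Random(0))
```

keeping the object's semantic type fixed is what makes the negatives domain-specific: the ranking model cannot separate positives from negatives by type alone.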
when the validation set is generated, we apply our ranking criteria to it, obtaining a numerical score 𝑠 for each sample, where 𝑠 ∈ [ , ].

evaluation metrics. we propose our approach as a recommendation system, and to report our results we use a combination of the following classification and recommendation metrics.
• classification metrics: area under the receiver-operating-characteristic curve (auc roc) and area under the precision-recall curve (auc pr).
• recommendation metrics: top-k precision (p.@k), average precision (ap.@k), and overall reciprocal rank (rr).
we report these numbers per subdomain to better understand how the system performs with respect to a specific task (e.g., drug repurposing).

results

to report results, we provide the performance measures for three agatha models trained on the same input data (medline corpus and cord-19 abstracts dataset):
(1) agatha-o: baseline agatha model [ ];
(2) agatha-c: agatha-o with the new umls layer and semrep enrichment;
(3) agatha-gp: agatha-c with additional deep learning-based extracted and further filtered predicates.
we compare these particular models because learning the proposed ranking criteria depends heavily on the quality and number of extracted semantic predicates, as they form the training set for the agatha ranking module.
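the recommendation metrics listed above can be computed from a ranked list of binary labels. the sketch below uses common textbook definitions, which may differ in detail from the ones used in the evaluation; in particular, normalizing average precision by the number of hits within the top k is an assumed convention, and all function names are hypothetical.

```python
def ranked_labels(scored):
    """Sort (score, label) pairs by descending score and return the labels."""
    return [label for _, label in sorted(scored, key=lambda p: p[0], reverse=True)]


def precision_at_k(labels, k):
    """Fraction of the top-k ranked samples that are positive."""
    return sum(labels[:k]) / k


def average_precision_at_k(labels, k):
    """Mean of precision@i over the positions i <= k that hold a positive."""
    hits, total = 0, 0.0
    for i, label in enumerate(labels[:k], start=1):
        if label:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def reciprocal_rank(labels):
    """1 / rank of the first positive sample, or 0.0 if there is none."""
    for i, label in enumerate(labels, start=1):
        if label:
            return 1.0 / i
    return 0.0


# Toy scores: label 1 marks a positive pair, label 0 a negative sample.
toy_scored = [(0.9, 1), (0.4, 1), (0.8, 0), (0.1, 0)]
toy_labels = ranked_labels(toy_scored)  # ranking by score: [1, 0, 1, 0]
```

on the toy ranking, precision at 2 is 0.5, average precision at 4 is (1/1 + 2/3) / 2, and the reciprocal rank is 1.0 because the top-ranked sample is positive.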
at the moment of writing, no other publicly available general-purpose hg system that meets our three validation criteria, namely, (a) ability to run thousands of queries in a reasonable time, (b) ability to process covid-19 related vocabulary, and (c) ability to operate in multiple domains, was available for comparison. both agatha-c and agatha-gp can run thousands of queries in a very short time (on the order of minutes), making validation on a large number of samples possible. unfortunately, given the current circumstances, large-scale validation for this specific scientific subdomain (covid-19 related hypotheses) is hard to implement, because a well-established and reliable factual base is still being actively developed and a big historical gap for the vocabulary simply does not exist (e.g., the covid-19 term is just approximately one year old). we, however, provide a validation set of positive connections extracted from cord-19 dataset citations added within the time frame from october , to january , , which numbered thousand abstracts.

table : graph metrics (m = millions, b = billions). columns: node type, agatha-o, agatha-c, agatha-gp. rows: sentence, predicate, lemma, entity, coded term, 𝑛-grams, total nodes, total edges.

in table , we share some basic graph metrics for the models agatha-o, agatha-c and agatha-gp. the most significant change is observed in the number of semantic predicates and coded terms, which clearly reflects the purpose of introducing the additional preprocessing steps. in table , we compare the aforementioned models using the metrics described in sec. . we present predicate types with nlm semantic type codes [ ] due to space restrictions. both the agatha-c and agatha-gp models show significant gains when compared to the agatha-o baseline model.
the gains in the areas most problematic for the baseline model (e.g., (gene) → (gene), denoted by (gngm,gngm)) are the best illustration of this, showing up to almost percent advantage in roc auc. all of the most popular biomedical subdomains are now covered by the proposed models and show auc roc results of at least . . the average roc auc value is increased by . . our validation strategy involves a large number of many-to-many queries, making the area under the precision-recall curve another very illustrative metric. this is where the newly proposed models show even more drastic improvements over the baseline agatha-o. for some subdomains, like (gene or genome) → (gene or genome) (gngm,gngm) or (amino acid, peptide, or protein) → (gene or genome) (aapp,gngm), we observe that the new models take recommendation performance to a new quality level. the average pr auc value is increased by . .

the approximate running time with the corresponding types of hardware used is presented in table . each row corresponds to a stage in the agatha-c/agatha-gp pipelines. the columns “m” (machines) and “cpu” show the number of machines and required cpus, respectively. in the column “gpu” we indicate whether a gpu was required or optional. for agatha training we used two nvidia v per machine. the minimal requirements for ram per machine are in the column “ram”. the running time of queries is negligible.

case study

the proactive discovery of ongoing research findings is an important component in the validation of hypothesis generation systems [ ]. in particular, in the current uncertain situation, when many unintentionally incorrect discoveries are published, the validation must include a human-in-the-loop part, even in limited capacity, such as in [ , ]. to demonstrate the predictive potential of agatha-c we perform a case study on three covid-19-related novel connections manually selected by the domain expert.
these connections were published after the cut date before which any data used in training was available for download at nih. at a low level, all agatha models use entity subsampling to calculate pairwise ranking criteria, which means that the absolute numbers may fluctuate slightly. thus, to present the numeric scores, each experiment was repeated times to compute the average and standard deviation that we present in table .

table : classification and recommendation quality metrics across recently popular covid-19-related biomedical subdomains. labels o, c and gp stand for the agatha-o, agatha-c and agatha-gp models, respectively. columns (each reported for o, c and gp): roc auc, pr auc, rr, p.@ , p.@ , ap.@ , ap.@ . rows: orch:dsyn, aapp:dsyn, phsu:dsyn, orch:orch, phsu:phsu, orch:phsu, fndg:dsyn, orch:aapp, geoa:spco, geoa:idcn, topp:dsyn, hlca:dsyn, gngm:dsyn, fndg:humn, gngm:gngm, dsyn:fndg, phsu:fndg, dsyn:humn, dsyn:dsyn, aapp:gngm, mean.

table : running time and hardware requirements. columns: stage, time, m, cpu, gpu, ram. stages: semrep processing, allennlp predicates, graph construction, graph conversion, graph embedding, agatha training, network adjacency. a gpu is required only for agatha training and is optional for all other stages.

table : scores for valid recently published connections obtained by different agatha models, reported as average values over runs with standard deviation. columns: agatha-o, agatha-c, agatha-gp. rows: covid-19:melatonin, covid-19:oxytocin, covid-19:bst2 gene.

agatha-c was tested on whether it would be able to predict compounds potentially applicable to the treatment of covid-19 and the genes involved in sars-cov-2 pathogenesis. data confirming cardiovascular protective effects of the hormone oxytocin were published recently [ , ]. the protective effect is linked to the anti-inflammatory activity of the hormone. for this connection agatha-c generated a score of . .

similarly, we tested the prediction of the effects of another hormone, melatonin. several publications, starting from november [ , , , ], show the protective effects of melatonin, specifically for covid-19 neurological complications. the activity was linked to the anti-oxidative effects of melatonin. for this connection agatha-c generated a score of . .

our system accurately predicted, with a score of . , the involvement of tetherin (bst2). the results published in [ ] show that tetherin restricts the secretion of sars-cov-2 viral particles and is downregulated by sars-cov-2. therefore, pharmacological activation of tetherin expression, or inhibition of its degradation, could be a promising direction for the development of sars-cov-2 treatment.
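because the case study scores depend on random entity subsampling, each one is reported as an average with a standard deviation over repeated runs. the sketch below shows that reporting scheme on a stand-in scorer. the functions `repeated_score` and `subsampled_score` and the toy neighbor scores are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import random
import statistics


def repeated_score(score_once, runs, seed=0):
    """Run a stochastic scorer several times; return (mean, standard deviation).

    score_once: callable taking a random.Random instance and returning a float.
    """
    rng = random.Random(seed)
    scores = [score_once(rng) for _ in range(runs)]
    # Assumption: population standard deviation; statistics.stdev would give
    # the sample estimate instead.
    return statistics.mean(scores), statistics.pstdev(scores)


def subsampled_score(rng, neighbor_scores=(0.6, 0.7, 0.8, 0.9), sample_size=2):
    """Stand-in for a ranking score that depends on a random subsample."""
    sample = rng.sample(neighbor_scores, sample_size)
    return sum(sample) / sample_size


mean_score, std_score = repeated_score(subsampled_score, runs=10)
report = f"{mean_score:.2f} ± {std_score:.2f}"
```

fixing the seed makes a reported pair reproducible, while the standard deviation shows how much a single query could move because of subsampling alone.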
lessons learned and open problems

quality of the information retrieval pipelines. information retrieval is an important part of any hg pipeline. in order to uncover implicit connections, the system should be able to capture existing explicit connections with as much quality as possible. given that human knowledge is usually stored in a non-structured manner (e.g., scientific texts), the quality of systems that process raw textual data, such as those that solve the named entity recognition or word sense disambiguation problems, is crucial. we observed that the semrep system performs better concept and relation recognition when full abstracts are used as input data instead of single sentences. semrep also offers optional sortal anaphora resolution to extract co-references to entities from neighbouring sentences, which was shown to be useful in [ ] and is used in this work.

"positive" research bias. the absence of published negative research results is a big problem for the hg field. with mostly positive results available, we often have to generate negative examples through some kind of random sampling. these negative samples likely do not adequately represent the real nature of negatively confirmed scientific findings. likely, one of the most important future work directions in the area of hg is to accurately distinguish and leverage positive and negative proposed results.

domain experts involvement.
when any hypothesis generation system is built, one of the first questions a designer should address is the extent to which domain experts are expected to participate in the pipeline. modern decision-making systems allow a fully automated discovery process (like the agatha system), but this may not be sufficient. a domain expert who interfaces with an hg system as a black box may not trust the generated results or know how best to interpret them. the challenge of interpretable hypothesis generation remains a significant barrier to widespread adoption of these kinds of research tools. for this purpose we advocate using our “structural” learning hg system moliere [ ], in which topic modeling and network analytic measures are used to interpret and explain the results.

the nature of input corpora. the question of what should be used as input to a topic-modeling based hypothesis generation system is raised in [ ]. using full-text papers shows an improvement, but the trade-off between run time and output quality was barely justifiable. however, deep learning models have greater potential for extracting useful information from large input sources and, as demonstrated in our previous work [ ], show significant performance advancements. thus the question of using full-text papers in deep learning-based hypothesis generation systems should be addressed. unfortunately, this is currently too computationally expensive for our resources, as the number of sentences, and thus of predicates and edges, would be significantly larger.

knowledge resolution. our newly proposed systems showed that knowledge resolution plays a major role in subdomain recommendation. to increase the scope of model expertise (and the scope of potential applications beyond the biomedical fields) we deliberately incorporate a general-purpose information retrieval system, rnnoie, into agatha-gp.
this additional information results in significant gains in broad subdomains like (geographic area) → (idea or concept) (geoa,idcn). at the same time, we observe that agatha-c performs better in “microscopic” biomedical areas, e.g. (organic chemical) → (organic chemical) (orch,orch), which raises the question of choosing the appropriate model for every specific use case. although both systems process all types of queries, the general-purpose predicates included in training significantly improve “macroscopic” types of queries.

related work

a number of works have been proposed to organize the cord- literature into a structured knowledge graph for different purposes. for instance, basu et al. [ ] propose erlkg, a knowledge graph built on cord- with entities corresponding to gene/chemical/disease names and edges forming relations between the concepts. they use a fine-tuned scibert model for both entity and relation extraction. the main purpose of the knowledge graph is to predict a link between a given chemical-disease and chemical-protein pair using a trained gcn autoencoder [ ] approach. in another similar work, oniani et al. [ ] build a co-occurrence network on a subset of cord- with the edges corresponding to either gene-disease, gene-mutation or chemical-disease type. the network is then embedded into latent space using a node vec walk. link prediction is performed on the nodes by training different classical machine learning algorithms. a major shortcoming of these approaches is that they limit themselves to specific kinds of entities, relations, or both; as a result, not only is the scope of possible new literature narrowed, but a lot of additional useful knowledge is filtered out of the system. in contrast, our system does not limit itself to a specific entity or relation type and is able to capture much more information from the same corpus.
a major interest in constructing knowledge graphs is to allow medical researchers to re-purpose existing drugs for treating covid- . zhang et al. [ ] develop a system that uses combined semantic predications from semmeddb and cord- (extracted using semrep) to recommend drugs for covid- treatment. to improve the predications from cord- , the authors fine-tune various transformer based models on a manually annotated internal dataset. their resulting knowledge graph consists of , nodes and , , edges. our work, on the other hand, utilizes similar technologies and produces a bigger graph with , , nodes and , , , edges. moreover, we do not post-process extracted relations from semrep and are still able to achieve a higher roc metric. another system proposed by martinc et al. [ ] uses a fine-tuned scibert model to generate contextualized embeddings of cord- articles and, using an initial seed set of targets, proposes possible therapy targets. however, this system is very different from ours as it treats the entire article as a bag of words and directly trains a word embedding model on cord- .

it was noted earlier that kinderminer [ ] provides a web-based literature discovery tool and supports covid- queries. the underlying algorithm is based on a simple keyword co-count between source and target words in a given corpus. while co-count is a fast and scalable approach, it suffers from a lack of “discrimination”, i.e. two keywords occurring together more frequently do not always imply a high degree of correlation.

the vastness of covid- literature also spurred the need for systems that allow researchers and base users alike to get their covid- queries answered. systems like ckg (wise et al.) [ ] and scisight (hope et al.) [ ] currently provide this functionality. while we do aim to provide an easy-to-use web framework for medical researchers, the scope of the aforementioned systems is beyond the scope of our work.
unfortunately, none of the existing systems trained to accept terms related to covid- or sars-cov- provided open access for massive validation, which a fair comparison would require, or could be tested in multiple domains like agatha-c .

conclusions

we present two graph mining transformer based models, agatha-c and agatha-gp , for micro- and macroscopic scales of queries respectively, which are designed to help domain experts solve high-priority research problems and accelerate scientific discovery. we perform per-subdomain validation of these new models on a rapidly changing covid- focused dataset composed of recently published concept pairs, and demonstrate that the proposed models achieve state-of-the-art prediction quality. both models significantly outperform the existing baseline system agatha-o . we deploy the proposed models to the broad scientific community and believe that our contribution can raise more interest in prospective hypothesis generation applications.

references

[ ] [n.d.]. citations added to medline by fiscal year. https://www.nlm.nih.gov/bsd/stats/cit_added.html
[ ] marina aksenova, justin sybrandt, biyun cui, vitali sikirzhytski, hao ji, diana odhiambo, matthew d lucius, jill r turner, eugenia broude, edsel peña, et al. . inhibition of the dead box rna helicase prevents hiv- tat and cocaine-induced neurotoxicity by targeting microglia activation. journal of neuroimmune pharmacology ( ), – .
[ ] lise alschuler, ann marie chiasson, randy horwitz, esther sternberg, robert crocker, andrew weil, and victoria maizes. .
integrative medicine considerations for convalescence from mild-to-moderate covid- disease. explore ( ).
[ ] patrick arnold and erhard rahm. . semrep: a repository for semantic mapping. datenbanksysteme für business, technologie und web (btw ) ( ).
[ ] sayantan basu, sinchani chakraborty, atif hassan, sana siddique, and ashish anand. . erlkg: entity representation learning and knowledge graph based association analysis of covid- through mining of unstructured biomedical corpora. in proceedings of the first workshop on scholarly document processing. association for computational linguistics, online, – . https://doi.org/ . /v / .sdp- .
[ ] iz beltagy, arman cohan, and kyle lo. . scibert: pretrained contextualized embeddings for scientific text. arxiv preprint arxiv: . ( ).
[ ] olivier bodenreider. . the unified medical language system (umls): integrating biomedical terminology.
[ ] daniel p cardinali, gregory m brown, and seithikurippu r pandi-perumal. . can melatonin be a potential “silver bullet” in treating covid- patients? diseases , ( ), .
[ ] phuoc-tan diep. . is there an underlying link between covid- , ace , oxytocin and vitamin d? medical hypotheses ( ), .
[ ] r. a. digiacomo, j. m. kremer, and d. m. shah. . fish-oil dietary supplementation in patients with raynaud’s phenomenon: a double-blind, controlled, prospective study. am j med , (feb ), – .
[ ] matt gardner, joel grus, mark neumann, oyvind tafjord, pradeep dasigi, nelson f. liu, matthew peters, michael schmitz, and luke s. zettlemoyer. . allennlp: a deep semantic natural language processing platform. arxiv:arxiv: .
[ ] vishrawas gopalakrishnan, kishlay jha, wei jin, and aidong zhang. . a survey on literature based discovery approaches in biomedical domain. journal of biomedical informatics ( ), .
[ ] ping ho, jing-quan zheng, chia-chao wu, yi-chou hou, wen-chih liu, chien-lin lu, cai-mei zheng, kuo-cheng lu, and you-chen chao. . perspective adjunctive therapies for covid- : beyond antiviral therapy. international journal of medical sciences , ( ), .
[ ] tom hope, jason portenoy, kishore vasan, jonathan borchardt, eric horvitz, daniel s. weld, marti a. hearst, and jevin west. . scisight: combining faceted navigation and research group detection for covid- exploratory scientific search. arxiv: . [cs.ir]
[ ] jeff johnson, matthijs douze, and hervé jégou. . billion-scale similarity search with gpus. arxiv preprint arxiv: . ( ).
[ ] armand joulin, edouard grave, piotr bojanowski, matthijs douze, hervé jégou, and tomas mikolov. . fasttext.zip: compressing text classification models. arxiv preprint arxiv: . ( ).
[ ] h. kilicoglu, g. rosemblat, m. fiszman, and t. c. rindflesch. . sortal anaphora resolution to enhance relation extraction from biomedical literature. bmc bioinformatics (apr ), .
[ ] halil kilicoglu, dongwook shin, marcelo fiszman, graciela rosemblat, and thomas c. rindflesch. . semmeddb: a pubmed-scale repository of biomedical semantic predications. bioinform. , ( ), – . http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics .html#kilicoglusfrr
[ ] thomas n. kipf and max welling. . semi-supervised classification with graph convolutional networks. in international conference on learning representations (iclr).
[ ] f. kuusisto, j. steill, z. kuang, j. thomson, d. page, and r. stewart. . a simple text mining approach for ranking pairwise associations in biomedical applications. amia jt summits transl sci proc ( ), – .
[ ] adam lerer, ledell wu, jiajun shen, timothee lacroix, luca wehrstedt, abhijit bose, and alex peysakhovich. . pytorch-biggraph: a large-scale graph embedding system. in proceedings of the nd sysml conference. palo alto, ca, usa.
[ ] matej martinc, blaž Škrlj, sergej pirkmajer, nada lavrač, bojan cestnik, martin marzidovšek, and senja pollak. . covid- therapy target discovery with context-aware literature mining. in discovery science, annalisa appice, grigorios tsoumakas, yannis manolopoulos, and stan matwin (eds.). springer international publishing, cham, – .
[ ] a. t. mccray, a. burgun, and o. bodenreider. . aggregating umls semantic types for reducing conceptual complexity. stud health technol inform , pt ( ), – .
[ ] mark neumann, daniel king, iz beltagy, and waleed ammar. . scispacy: fast and robust models for biomedical natural language processing. arxiv preprint arxiv: . ( ).
[ ] david oniani, guoqian jiang, hongfang liu, and feichen shen. . constructing co-occurrence network embeddings to assist association extraction for covid- and other coronavirus infectious diseases. journal of the american medical informatics association , ( ), – .
[ ] matthew rocklin. . dask: parallel computation with blocked algorithms and task scheduling. in proceedings of the th python in science conference, kathryn huff and james bergstra (eds.). – .
[ ] m. schuster and k. k. paliwal. . bidirectional recurrent neural networks. ieee transactions on signal processing , ( ), – . https://doi.org/ . / .
[ ] neil r smalheiser. . rediscovering don swanson: the past, present and future of literature-based discovery. journal of data and information science , ( ), – .
[ ] scott spangler. . accelerating discovery: mining unstructured information for hypothesis generation. chapman and hall/crc.
[ ] scott spangler, angela d wilkins, benjamin j bachman, meena nagarajan, tajhal dayaram, peter haas, sam regenbogen, curtis r pickering, austin comer, jeffrey n myers, et al. . automated hypothesis generation based on mining scientific literature. in proceedings of the th acm sigkdd international conference on knowledge discovery and data mining. – .
[ ] gabriel stanovsky, julian michael, luke zettlemoyer, and ido dagan. . supervised open information extraction. in proceedings of the th annual conference of the north american chapter of the association for computational linguistics: human language technologies (naacl hlt). association for computational linguistics, new orleans, louisiana, (to appear).
[ ] hazel stewart, kristoffer h johansen, naomi mcgovern, roberta palmulli, george w carnell, jonathan luke heeney, klaus okkenhaug, andrew firth, andrew a peden, and james r edgar. . sars-cov- spike downregulates tetherin to enhance viral spread. biorxiv ( ), – .
[ ] don r swanson. . fish oil, raynaud’s syndrome, and undiscovered public knowledge. perspectives in biology and medicine , ( ), – .
[ ] justin sybrandt, angelo carrabba, alexander herzog, and ilya safro. . are abstracts enough for hypothesis generation?. in ieee international conference on big data (big data). – . https://doi.org/ . /bigdata. .
[ ] justin sybrandt, michael shtutman, and ilya safro. . moliere: automatic biomedical hypothesis generation system. in proceedings of the rd acm sigkdd international conference on knowledge discovery and data mining (halifax, ns, canada) (kdd ’ ). acm, new york, ny, usa, – . https://doi.org/ . / .
[ ] justin sybrandt, michael shtutman, and ilya safro. . large-scale validation of hypothesis generation systems via candidate ranking. in ieee international conference on big data (big data). – . https://doi.org/ . /bigdata. .
[ ] justin sybrandt, ilya tyagin, michael shtutman, and ilya safro. . agatha: automatic graph mining and transformer based hypothesis generation approach. association for computing machinery, new york, ny, usa, – . https://doi.org/ . / .
[ ] huijun wang, ying ding, jie tang, xiao dong, bing he, judy qiu, and david j wild. . finding complex biological relationships in recent pubmed articles using bio-lda. plos one , ( ), e .
[ ] lucy lu wang, kyle lo, yoganand chandrasekhar, russell reas, jiangjiang yang, darrin eide, k. funk, rodney michael kinney, ziyang liu, w. merrill, p. mooney, d.
murdick, devvret rishi, jerry sheehan, zhihong shen, brandon stilson, alex d wade, kuansan wang, christopher wilhelm, boya xie, douglas m. raymond, daniel s. weld, oren etzioni, and sebastian kohlmeier. . cord- : the covid- open research dataset. arxiv ( ).
[ ] stephani c wang and yu-feng wang. . cardiovascular protective properties of oxytocin against covid- . life sciences ( ), .
[ ] colby wise, vassilis n. ioannidis, miguel romero calvo, xiang song, george price, ninad kulkarni, ryan brand, parminder bhatia, and george karypis. . covid- knowledge graph: accelerating information retrieval and discovery for scientific literature. arxiv: . [cs.ir]
[ ] rui zhang, dimitar hristovski, dalton schutte, andrej kastrin, marcelo fiszman, and halil kilicoglu. . drug repurposing for covid- via knowledge graph completion. arxiv: . [cs.cl]
[ ] petra zimmermann and nigel curtis. . why is covid- less severe in children? a review of the proposed mechanisms underlying the age-related difference in severity of sars-cov- infections. archives of disease in childhood ( ).
prediction of adverse drug reactions associated with drug-drug interactions using hierarchical classification

catherine kim * and nicholas tatonetti

. jericho senior high school, cedar swamp rd, jericho, ny
. department of biomedical informatics, department of systems biology, & department of medicine, columbia university, west th st. ph new york, ny

*corresponding author: cathy.kim@jerichoapps.org

abstract

adverse drug reactions (adrs) associated with drug-drug interactions (ddis) represent a significant threat to public health. unfortunately, most conventional methods for prediction of ddi-associated adrs suffer from limited applicability and/or provide no mechanistic insight into ddis. in this study, a hierarchical machine learning model was created to predict ddi-associated adrs and pharmacological insight thereof for any drug pair. briefly, the model takes drugs’ chemical structures as inputs to predict their target, enzyme, and transporter (tet) profiles, which are subsequently utilized to assess occurrences of adrs, with an overall accuracy of ~ %.
the robustness of the model for adr classification was validated with ddis involving three widely prescribed drugs. the model was then applied to interstitial lung disease (ild) associated with ddis involving atorvastatin, identifying the involvement of multiple targets, enzymes, and transporters in ild. the model presented here is anticipated to serve as a versatile tool for enhancing drug safety.

introduction

adverse drug reactions (adrs) represent a significant threat to public health worldwide, accounting for considerable morbidity and mortality with estimated costs of ~$ billion annually [ , ]. as adrs continue to present a growing concern in modern health care systems, their identification and prevention are essential for improved drug safety and patient care. while drugs are subjected to preclinical in vitro safety profiling and clinical drug safety trials to assess drug safety, many adrs occur in small subsets of the human population, making adrs not readily detectable in advance [ ]. moreover, adrs are more difficult to analyze when multiple, rather than single, drugs are administered, which has become common amongst a growing elderly population [ ]. drug-drug interactions (ddis) between co-administered drugs manifest as various adrs through different mechanisms, adding further complexity [ ]. to better address ddi-associated adrs, an understanding of their pharmacological mechanisms is needed. ddis can occur when drugs compete for the same target [ ]. ddis also involve drug metabolizing enzymes (e.g.
cytochrome p (cyp) enzymes) and influx and efflux drug transporters, all of which determine the absorption, distribution, metabolism, and excretion (adme) of drugs [ ]. thus, interference with target binding, enzyme-mediated metabolism, and/or uptake and excretion of drugs may cause ddis [ , - ]. moreover, the comprehensive evaluation of entire tet profiles, many of which are dependent on the chemical structures of drugs, and their interplay between the drugs is critical for an enhanced understanding of ddi-associated adrs [ ].

with the current inability to reliably assess ddis in preclinical testing and clinical trials and the complex nature of ddi-associated adrs, a data-driven computational approach is well-suited for predicting such adrs. this approach may benefit from extensive adr databases, such as the fda adverse event reporting system, where data representative of a large population are collected from patients, clinicians, and pharmaceutical companies [ ]. while various machine learning models have been previously developed for predicting ddi-associated adrs with considerable accuracy, they suffer from major limitations. most currently available models are based on drug similarity, providing accurate prediction only when the drug in question is similar to existing drugs with known tet profiles and/or adr information [ - ]. this requirement makes these models not readily applicable when such information is unavailable, for example, when a drug is still under development. moreover, conventional
models provide no pharmacological insight into ddi-associated adrs. the availability of such a priori mechanistic understanding can lay the theoretical foundations on which a novel, effective pharmacological strategy can be developed. overall, a novel computational approach to evaluate associations between ddis and adrs and to determine their molecular basis is urgently needed for better drug design and enhanced drug safety.

this study reports the development of a hierarchical machine learning model to predict risks of various ddi-associated adrs and their underlying pharmacological mechanisms. this model consists of two layers of classifiers for the prediction of tet profiles and occurrences of adrs from chemical structures of a drug pair, requiring no drug similarity. the model was tested for its robustness with three case studies and then employed to elucidate the origin of an adr of a rare disease, interstitial lung disease (ild), associated with ddis.

methods

all computations were performed with python . . on jupyter notebook . . anaconda navigator . . , unless otherwise noted.

statistical analyses of adrs

proportional reporting ratios (prrs) were calculated for each drug pair and corresponding adverse drug reactions (adrs) from the twosides v .
database [ ] using the following equation, as described previously [ , ]:

$$\mathrm{PRR} = \frac{a/(a+b)}{c/(c+d)}$$

where a = the number of patients who were administered the drug pair and were reported for the adr, b = the number of patients who were administered the drug pair and were not reported for the adr, c = the number of patients who were not administered the drug pair and were reported for the adr, and d = the number of patients who were not administered the drug pair and were not reported for the adr. the twosides v . database was created by applying propensity score matching to the fda adverse event reporting system [ ] in order to account for covariates in the dataset and eliminate potential bias [ ], and was used directly for the prr calculations in this study. the numbers of unique drug pairs and adrs used in this study were , and , , respectively. for drug pairs containing one of three widely prescribed drugs (levothyroxine, omeprazole, and atorvastatin), all of their reported adrs were extracted from the twosides v . database using the python pandas . . library [ ].

determination of chemical fingerprints of drugs

for all the drugs listed in the drugbank . . database [ ], their chemical structures in the format of the simplified molecular-input line-entry system (smiles) were obtained directly from the database or pubchem v . . .b [ , ]. the smiles were stored in a d representation with the python rdkit . . library and used to produce a chemical fingerprint for each drug by calculating its molecular access system (maccs) keys [ ]. binary string representations of the maccs keys were stored in a python pandas . . dataframe [ ].

construction of target, enzyme, and transporter profiles of drugs
annotations about , unique targets, unique enzymes, and unique transporters were collected from drugbank . . [ ] to create tet vectors of drugs using the python numpy . . library [ ]. each unique tet was assigned a position in a tet vector. for each drug, a tet vector was created with the python numpy . . library to represent its pharmacological profile. briefly, in each position of a drug’s tet vector, the value of “ ” was assigned if any action of the drug (e.g., as a ligand, substrate, inhibitor, activator, agonist, antagonist) on the corresponding target, enzyme, or transporter was noted in drugbank . . , and the value of “ ” otherwise.

development of rfcs for prediction of target, enzyme, and transporter profiles of drugs

random forest classifiers (rfcs) were constructed for prediction of targets, enzymes, and transporters from the chemical structure of a drug using the python scikit-learn . . library [ ]. these models formed the first layer of the hierarchical model. the dataset of the drugs’ maccs keys and tet vectors was split into training ( % of dataset) and testing ( % of dataset) sets. rfcs were trained and tested to predict tet profiles from maccs key representations (i.e., chemical fingerprints) of the drugs. during training and testing of the rfcs, model accuracies were measured and averaged.

development of a model for prediction of ddi-associated adrs from tet profiles of drugs

tet vectors of a drug pair were combined to form its tet matrix, which was then matched to the drug pair’s prrs for various adrs reported in the twosides v . database. from the twosides v . database, the calculated prrs for different adrs of each drug pair were categorically encoded with the value of “ ” when ≤ prrs < , “ ” when prrs = , and “ ” when prrs > .
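The two PRR steps described in the methods (computing a proportional reporting ratio from the four patient counts a, b, c, and d, then encoding it into three classes) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names, the class labels 0/1/2, and the neutral threshold of 1.0 are assumptions, since the numeric labels and thresholds are not legible in the text above.

```python
import math


def proportional_reporting_ratio(a, b, c, d):
    """PRR = [a / (a + b)] / [c / (c + d)] for a 2x2 reporting table.

    a: given the drug pair, ADR reported
    b: given the drug pair, ADR not reported
    c: not given the drug pair, ADR reported
    d: not given the drug pair, ADR not reported
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("both exposure groups must contain patients")
    unexposed_rate = c / (c + d)
    if unexposed_rate == 0:
        raise ValueError("PRR is undefined when no unexposed patient reports the ADR")
    return (a / (a + b)) / unexposed_rate


def encode_prr(prr, neutral=1.0):
    """Map a PRR to one of three classes.

    The labels (0 = below neutral, 1 = neutral, 2 = above neutral) and the
    neutral value of 1.0 are hypothetical choices for this sketch.
    """
    if prr < 0:
        raise ValueError("a PRR cannot be negative")
    if math.isclose(prr, neutral):
        return 1
    return 0 if prr < neutral else 2


if __name__ == "__main__":
    # Hypothetical counts: 20 of 100 exposed patients and 100 of 1000
    # unexposed patients were reported for the ADR.
    prr = proportional_reporting_ratio(20, 80, 100, 900)
    print(prr, encode_prr(prr))
```

The neutral class is tested with `math.isclose` because an exact floating-point comparison would rarely match a computed ratio.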
the processed prr dataset with the matched tet matrices for the drug pairs was split into training ( % of dataset) and testing ( % of dataset) sets. the machine learning algorithms, random forest classifiers (rfc) [ ], logistic regression (lr) [ ], and support vector machines (svm) [ ], were constructed as classifiers for adr prediction using the python scikit-learn . . library. the models were fit with the default tuning parameters of the python scikit-learn . . library. model accuracies were measured using a -fold cross-validation, as described elsewhere [ ]. the svm model was chosen as the second layer of the hierarchical model.

pathway analysis of adrs

the key genes/proteins involved in ild associated with ddis involving atorvastatin were determined from the pathway database reactome [ ]. various repositories on gene/protein interactions and pathways, such as biogrid [ ], proteomics db [ ], string [ ], and corum [ ], were applied to identify interactions between drug targets and genes/proteins involved in adrs. the ncbi gene database [ ] was used to determine tissue-specific gene expression levels.
results and discussion

model overview

a hierarchical model to predict adrs from the chemical structures of a drug pair was developed. in this model, the input variables were the chemical structures of drugs (fig. a) and the output variables were the prrs for various adrs (fig. b). a value of prr > indicates a high risk of an adr for a given drug pair, whereas a prr < suggests that a given adr is less commonly reported for a drug pair, relative to other drug pairs [ , ]. a prr of indicates a statistically neutral association between a drug pair and an adr. instead of attempting to correlate chemical structures of drugs directly with prrs for adrs, an intermediate tier of a pharmacological profile, namely a target, enzyme, and transporter (tet) profile, of drugs was introduced to connect chemical fingerprints of drugs with various adrs (fig. ). the tet profiles depend on the drugs’ chemical structures [ , ]. in turn, the tet profiles determine the drugs’ adme processes and their ultimate pharmacological actions through on- and off-targeting, all of which play a dominant role in adrs [ - ]. thus, the intermediate tier of the tet profiles serves as an essential component in the hierarchical machine learning model that connects the input variables (i.e., chemical structures of drugs) and the output variables (i.e., prrs for various adrs), while allowing for a deeper mechanistic understanding of adrs (fig. ).

prediction of target, enzyme, and transporter profiles from chemical fingerprints of drugs

to predict the pharmacological profiles (i.e. tet profiles) from the chemical fingerprints (i.e. maccs keys) of drugs, random forest classifiers (rfcs) were constructed. the rfcs achieved high (> %) accuracy across tet profile prediction (table ). the accuracies of these models were higher than those of other machine learning algorithms previously developed for the classification of drugs inhibiting a specific transporter [ ].
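To make the intermediate tier concrete, the sketch below builds binary TET vectors for two drugs and stacks them into a pair-level TET matrix, using plain Python lists. It is an illustration under stated assumptions rather than the authors' implementation: the 1/0 encoding, the row-stacking of the two vectors, the helper names, and the placeholder entries `ENZYME_X`, `TRANSPORTER_Y`, and the second drug are all hypothetical. Only the levothyroxine targets ITGAV, THRA, and THRB are taken from the text.

```python
def build_tet_index(tet_names):
    """Assign each unique target/enzyme/transporter a fixed vector position."""
    return {name: pos for pos, name in enumerate(sorted(set(tet_names)))}


def build_tet_vector(drug_tets, tet_index):
    """Return a binary vector: 1 where the drug acts on that TET, else 0."""
    vector = [0] * len(tet_index)
    for name in drug_tets:
        vector[tet_index[name]] = 1
    return vector


def build_tet_matrix(vector_a, vector_b):
    """Combine two drugs' TET vectors into one matrix (row-stacking is assumed)."""
    if len(vector_a) != len(vector_b):
        raise ValueError("both vectors must use the same TET index")
    return [list(vector_a), list(vector_b)]


def shared_tets(tet_matrix, tet_index):
    """List the TETs on which both drugs of the pair act."""
    row_a, row_b = tet_matrix
    ordered = sorted(tet_index, key=tet_index.get)
    return [name for name in ordered if row_a[tet_index[name]] and row_b[tet_index[name]]]


if __name__ == "__main__":
    index = build_tet_index(["ITGAV", "THRA", "THRB", "ENZYME_X", "TRANSPORTER_Y"])
    levothyroxine = build_tet_vector({"ITGAV", "THRA", "THRB"}, index)
    second_drug = build_tet_vector({"ITGAV", "TRANSPORTER_Y"}, index)  # hypothetical drug
    matrix = build_tet_matrix(levothyroxine, second_drug)
    print(matrix, shared_tets(matrix, index))
```

Keeping one shared index for all drugs is what makes the two rows comparable position by position, which is also what allows a shared target to be read off directly.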
for prediction of the entire tet profile, the testing accuracy of the rfc models is estimated to be . % (= . % × . % × . %; table ). the tet prediction method presented in this study may address limitations of costly and often time-inefficient preclinical in vitro experiments to determine tet profiles [ ]. moreover, in vitro methods to assess a drug’s action on transporters are not well established, presenting another limitation [ ]. other computational approaches, such as molecular docking, require the d chemical structures of tets [ ], which are often lacking for newly identified drug targets and many transporters [ ]. requiring no such d structural information, the rfcs presented here allow for accurate, thorough, and inexpensive evaluations of tet profiles of drugs, even those still under development.

adr prediction from target, enzyme, and transporter profiles of drug pairs

to predict adrs of a drug pair from its tet profiles, random forest classifier (rfc), logistic regression (lr), and support vector machine (svm) models were developed and evaluated using a -fold cross-validation. compared to the rfc and lr models, which reached mean classification accuracies (i.e. the fraction of adrs correctly classified from a drug pair’s tet matrix) of . % (fig. a) and . % (fig. b), respectively, the svm model achieved a higher mean classification accuracy of . % (fig. c).
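The cross-validated mean accuracy used to compare the three classifiers can be illustrated with a small, dependency-free Python sketch. The number of folds is not legible in the text, so `k` is left as a parameter; the helper names, the toy labels, and the majority-class stand-in classifier are assumptions made for illustration and do not reproduce the rfc, lr, or svm models.

```python
from collections import Counter


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validated_accuracy(features, labels, fit, predict, k):
    """Average per-fold accuracy of a classifier given as fit/predict callables."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(labels), k):
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority(train_features, train_labels):
    """Stand-in classifier: remember the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    """Always predict the remembered majority label."""
    return model


if __name__ == "__main__":
    toy_labels = [2, 2, 2, 0, 2, 2, 1, 2, 2, 2]   # hypothetical encoded PRR classes
    toy_features = [[0]] * len(toy_labels)        # placeholder TET matrices
    print(cross_validated_accuracy(toy_features, toy_labels,
                                   fit_majority, predict_majority, k=5))
```

The folds here are contiguous and unshuffled to keep the sketch deterministic; in practice the data would normally be shuffled or stratified before splitting.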
application of the svm model for ddi-associated adrs involving three major drugs

the svm model was further tested for its robustness with ddis involving three commonly prescribed drugs: levothyroxine, a synthetic hormone to treat hypothyroidism [ ]; omeprazole, a proton pump inhibitor for gastric acid-related disorders [ ]; and atorvastatin, an inhibitor of -hydroxy- -methyl-glutaryl-coa (hmg-coa) reductase used for lowering lipid concentrations to treat hypercholesterolemia [ ].

( ) case study : levothyroxine

to apply the svm model to ddis associated with levothyroxine, eptifibatide was chosen as the concomitant drug, since the co-administration of levothyroxine and a blood thinner (e.g., eptifibatide) was previously found to cause a bleeding-related adr [ , ] via inhibition of platelet aggregation [ , ]. levothyroxine has four major targets (integrin subunit αv (itgav), integrin subunit βiii (itgb ), thyroid hormone receptor α (thra), and thyroid hormone receptor β (thrb)), two metabolizing enzymes (cytochrome p (cyp) c (cyp c ) and udp-glucuronosyltransferase a (ugt a )), and nine transporters (atp-binding cassette sub-family b member (abcb ), solute carrier (slc) family member (slc a ), slc a , solute carrier organic anion transporter (slco) a (slco a ), slco b , slco b , slco c , slco b , and slco a ; fig. a). thra and thrb are nuclear receptors for levothyroxine, regulating transcription of hormone-responsive genes (referred to as genomic actions) [ , ].
a heterodimeric complex, itgav-itgb , consisting of itgav and itgb is another receptor for levothyroxine [ , ], mediating the drug’s nongenomic actions, such as the proliferation of endothelial cells [ , ]. eptifibatide’s tet profile contains only one target, itgb (fig. a), which is complexed with integrin subunit αiib [ ] to form a heterodimeric complex, itgb -itga b, mediating platelet aggregation [ , ]. the co-administration of eptifibatide, which inhibits itgb -itga b’s binding to fibrinogen, can reduce platelet aggregation [ , ]. thus, the sharing of itgb by levothyroxine and eptifibatide (fig. a) may be responsible for adrs, as is the case with other combinations of drugs binding to the same pharmacological targets [ ]. the predictive power of the svm model, particularly in the role of the shared target (i.e. itgb ) in ddi-associated adrs, was evaluated through comparisons with statistical results. briefly, the prrs for various adrs associated with the co-administration of levothyroxine and eptifibatide were calculated from statistical analyses of twosides v . . the prrs for levothyroxine alone and eptifibatide alone were calculated similarly using offsides v . [ ]. then, for a given adr, its prr for the co-administration of levothyroxine and eptifibatide was subtracted from the average prr for the single administrations of levothyroxine and eptifibatide (i.e. Δ prr = average prr for single administrations of levothyroxine and eptifibatide – prr for their co-administration). highly negative values of this difference (e.g., Δ prr < the % confidence interval of Δ prrs for the “no change” group) are indicative of strong ddis. the calculated prr differences were then compared with prediction results, which were obtained using the svm model upon removal of the itgb as a target from the tet profile of eptifibatide. 
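a minimal python sketch of the ddi index described above. the Δ prr formula follows the text; the `prr` helper uses the standard 2×2 contingency-table definition of the proportional reporting ratio, which is an assumption here because the source takes its prrs from twosides and offsides. all function and parameter names are hypothetical.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: reports with the drug (or drug pair) and the ADR
    b: reports with the drug (or drug pair) without the ADR
    c: reports with all other drugs and the ADR
    d: reports with all other drugs without the ADR
    """
    if a + b == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


def delta_prr(prr_drug_a, prr_drug_b, prr_pair):
    """Delta PRR = average single-drug PRR minus the co-administration PRR."""
    return (prr_drug_a + prr_drug_b) / 2.0 - prr_pair


def is_strong_ddi(delta, no_change_lower_bound):
    """Flag a strong DDI when Delta PRR lies below the lower bound of the
    interval estimated for the "no change" group (supplied by the caller)."""
    return delta < no_change_lower_bound
```

a strongly negative value means the adr is reported far more often for the pair than the single-drug prrs would suggest on average.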
the comparison between calculated and predicted prr changes indicates that the risks of most adrs associated with strong ddis between levothyroxine and eptifibatide are predicted to decrease if the tet profile of eptifibatide lacks itgb as a target, suggesting the critical role of shared itgb in the ddis (fig. b).

( ) case study : omeprazole

for a subsequent analysis, clopidogrel, an antiplatelet drug for the treatment of cardiovascular diseases [ ], was chosen as the concomitant drug with omeprazole. omeprazole has two major targets (aryl hydrocarbon receptor (ahr) and potassium-transporting atpase α chain (atp a)), nine metabolizing enzymes (cyp a , cyp a , cyp b , cyp c , cyp c , cyp c , cyp c , cyp d , and cyp a ), and three transporters (abcb , abc subfamily c member (abcc ), and abc subfamily g member (abcg ); fig. a). while omeprazole and clopidogrel share no pharmacological targets, they have multiple enzymes and a single transporter in common (fig. a). the concomitant use of omeprazole was found to lower the platelet inhibitory effects of clopidogrel [ , ], increasing the risk of reinfarction [ , ] and major cardiovascular events [ ] compared with patients receiving clopidogrel alone. accordingly, the fda has recommended not using omeprazole together with clopidogrel unless absolutely required [ ].
to identify the major pharmacological determinants responsible for ddis between omeprazole and clopidogrel, each of the targets, enzymes and transporters was removed one at a time from the tet profile of omeprazole and the effects of each removal on the ddi-associated adrs were predicted using the svm model developed in this study. the result from this analysis showed that cyp a , cyp c , cyp c , cyp c , cyp a , abcb , and abcc may play key roles in ddis between omeprazole and clopidogrel (fig. b). consistent with this result, cyp c was previously identified as a key enzyme mediating ddis between omeprazole and clopidogrel [ ]. for its anti-platelet aggregation effect, clopidogrel needs to be converted by cyp c to an active metabolite [ , ], which prevents activation of p ry required for platelet activation and aggregation [ ]. thus, omeprazole, an inhibitor of cyp c [ , ], can prevent the biotransformation of clopidogrel required for efficacy, causing ddi-associated adrs [ - ]. the analysis also suggests the possible involvement of other enzymes and transporters in ddis between omeprazole and clopidogrel (fig. b), as supported by previous reports. for example, the metabolic activation of clopidogrel was also found to be mediated by cyp a [ , ]. in addition, the high likelihood of cyp a mediating ddis involving omeprazole was previously proposed based on omeprazole’s ability to induce cyp a activity [ , ], though this is still under debate [ , , ]. omeprazole is a weak inhibitor of cyp d relative to cyp c and cyp a [ ], making cyp d -mediated ddis less likely. an efflux transporter, abcb , may be an active player in these ddis, as omeprazole interferes with the efflux of other drugs (e.g., digoxin [ ] and nifedipine [ ]) by abcb [ ]. out of these enzymes and transporters, cyp c was chosen as a key enzyme for subsequent comparative analyses. for this examination, Δ prr (= the average prr of omeprazole alone and clopidogrel alone − the prr for the co-administration of omeprazole and clopidogrel) was calculated as a ddi index for each adr, as described above. Δ prr values were then compared with the prr changes predicted by the svm model when cyp c was removed from the tet profile of omeprazole. overall, the predictions and the calculations were in good agreement for adrs significantly associated with ddis (either negatively or positively, as judged by Δ prr relative to the % interval of Δ prrs for the “no change” group) between omeprazole and clopidogrel (fig. c), supporting the role of cyp c in their ddis, as described elsewhere [ ]. the comparative result also indicates that correct predictions for drug pairs with little to no ddis (that is, Δ prr ~ ) are difficult with this model.

( ) case study : atorvastatin

to further validate the svm model, a similar computational approach was applied to atorvastatin for its well-known ddi-associated adr, myopathy [ - ]. atorvastatin has five major targets (ahr, dipeptidyl peptidase (dpp ), histone deacetylase (hdac ), -hydroxy- -methylglutaryl-coenzyme a reductase (hmgcr), and nuclear receptor subfamily group i member (nr i )), ten metabolizing enzymes (cyp b , cyp c , cyp c , cyp c , cyp d , cyp a , cyp a , cyp a , ugt a , and udp-glucuronosyltransferase a (ugt a )), and ten transporters (abcb , atp-binding cassette sub-family b member (abcb ), atp-binding cassette sub-family c member (abcc) (abcc ), abcc , abcc , abcc , slco a , slco b , slco b , and slco b ; fig. a).
atorvastatin’s actions on multiple tets indicate its pharmacological complexity. the svm model was used to derive tets important in atorvastatin-induced myopathy through predicted prr changes of drug pairs upon removal of each of the targets, enzymes, and transporters from atorvastatin’s tet profile. the svm model predicted the importance of cyp c , cyp c , cyp a , ugt a , abcb , abcb , abcc , abcc , abcc , slco a , and slco b in atorvastatin ddi-associated myopathy (fig. b). consistent with this result, the co-administration of drugs that are either inhibitors or substrates of cyp a was found to decrease the metabolism of atorvastatin [ ]. as a result, the plasma concentration of atorvastatin increases, leading to the onset of adrs [ ], including myopathy [ ]. in addition, polymorphisms in the cyp c , ugt a , abcb and slco b genes are associated with systemic exposure of atorvastatin, an important risk factor for myopathy [ , ]. drugs inhibiting slco b and abcb , most of which are cyp a inhibitors [ ], can cause ddis with atorvastatin [ ]. while the crucial role of cyp c in ddis involving other statins (e.g., simvastatin and lovastatin) has been documented [ , ], the involvement of this enzyme in atorvastatin-mediated ddis remains unclear. the model was then tested through comparisons between the predicted and calculated prr changes of drug combinations for myopathy. out of the identified key molecules, cyp a was chosen for further analyses due to its direct involvement in the onset of myopathy associated with ddis involving atorvastatin, as reported previously [ , ].
prrs of drugs with higher degrees of ddis (as judged by Δ prr relative to the % confidence interval of Δ prrs for the “no change” group) with atorvastatin were predicted to decrease upon removal of atorvastatin’s cyp a interaction (fig. c), consistent with the literature reports on the importance of this cyp enzyme in atorvastatin-mediated myopathy [ , ]. similar to the results with omeprazole (case study ), the accurate prediction of prr changes for drug combinations with Δ prr ~ was difficult (fig. c). overall, the results obtained with levothyroxine, omeprazole and atorvastatin in this study demonstrate the high applicability of the machine learning model for predicting ddi-associated adrs and providing underlying pharmacological insight.

model application for interstitial lung disease involving ddis with atorvastatin

motivated by its high prediction power, the model developed in this study was applied to a rare yet life-threatening adr, interstitial lung disease (ild), associated with ddis involving atorvastatin. the prr of the single administration of atorvastatin for ild was calculated from offsides v . to analyze its statistical association with ild. a similar statistical analysis was extended to drug pairs containing atorvastatin for ild. Δ prrs (i.e., ddi indices) were calculated and plotted against prrs of the co-administration for ild (fig. ). the prr of atorvastatin alone for ild was . , a value indicative of a statistically neutral association between atorvastatin and ild. between Δ prrs and prrs for the co-administration of atorvastatin, a strong negative linear relationship was detected (fig. ). the implication is that the high risks of ild reported for pairs of atorvastatin and a concomitant drug are due to ddis between the drugs. when calculated from the linear regression line, Δ prr is . (~ , no ddi) for drug pairs showing no associations with ild (i.e., prr = ; fig. ), as expected, validating this analysis. to identify important tets in ild associated with ddis involving atorvastatin, the prr changes of drug pairs were predicted by the model upon removal of each of the targets, enzymes, and transporters from atorvastatin’s tet profile (fig. a). among atorvastatin’s five targets, ahr, dpp , hdac , hmgcr and nr i , the importance of dpp was minimal (fig. a). in this analysis, different metabolizing enzymes seemed equally important, suggesting that ddi-associated ild involving atorvastatin may be mediated by a set of multiple enzymes, which may be responsible for previous contradictory findings on the role of metabolizing enzymes in this type of adr [ , ]. among transporters, abcb , slco b and slco b , all of which are primarily expressed in liver [ ], were found to be important. abcb , a primary transporter of bile salts [ ], was found to be involved in the biliary excretion of statins [ ]. slco b and slco b are responsible for the uptake of atorvastatin into hepatocytes [ , ]. thus, the removal of these three transporters from atorvastatin’s tet profile may increase its plasma concentration, increasing risks of ild [ , ]. to further distinguish among the five targets, a similar procedure was conducted to calculate the number of drug pairs with predicted increases in prrs for ild when a target was removed from atorvastatin’s tet profile (fig. b). interestingly, when atorvastatin’s action on hmgcr and dpp became nullified, prrs for ild further increased (fig. b).
the implication is that when its binding to hmgcr and dpp becomes ineffective, atorvastatin may bind to the other three targets more strongly, increasing risks of ddis. no such prr increases were observed with the removal of the other three targets (fig. b). thus, ahr, hdac , and nr i were identified as important targets for ddi-associated ild involving atorvastatin. to validate these computational results, two key molecules, nr i and abcb , which were identified by the svm model in ddi-associated ild with atorvastatin, were used for further analyses. for this examination, Δ prrs (the average prr for single administrations of atorvastatin and a concomitant drug – the prr for their co-administration) were calculated and compared with prr predictions by the svm model for the drug pairs upon the removal of nr i (fig. c) and abcb (fig. d) from the tet profile of atorvastatin. in these analyses, prrs for ild were predicted to decrease with most drug pairs involving significant ddis, when judged by Δ prr < the % confidence interval of Δ prrs for the “no change” group (fig. c-d), supporting the critical roles of nr i and abcb in ddi-associated ild involving atorvastatin. a few potential pathological pathways underlying ddi-associated ild involving a high plasma concentration of atorvastatin were determined around the three important targets (ahr, nr i , and hdac ) identified in this study. in this analysis, only genes/proteins significantly expressed in lung, as recorded in the ncbi gene database [ ], were considered.
the analyses revealed the high likelihood that atorvastatin binding to ahr, nr i , and hdac may cause ild through a major ild mechanism: the dysregulation of surfactant production and homeostasis [ , ]. many interactors, such as sp transcription factor and estrogen receptor (esr ), are highly interconnected in the pathways around ahr, nr i , and hdac (fig. a-b). genes/proteins important in surfactant metabolism also create a strong network (fig. a-b). thus, any impact from ahr, nr i and hdac can be amplified, influencing one another in these pathways. a literature survey identified a few plausible routes by which atorvastatin can cause ild. ahr is a transcription factor inducible by aromatic hydrocarbon-based xenobiotics, such as atorvastatin [ , ]. upon binding to a ligand, ahr is complexed with aryl hydrocarbon receptor nuclear translocator (arnt; fig. a) [ , ]. the ahr/arnt complex can then induce expression of ahr’s target genes, which code for enzymes and transporters required for xenobiotic metabolism [ , ]. activated ahr inhibits estrogen receptor (esr ) activity [ ], by redirecting esr away from esr target genes [ ], such as atp-binding cassette sub-family a member (abca ; fig. a) [ ]. abca plays a critical role in the formation of pulmonary surfactant by transporting phospholipids from the endoplasmic reticulum to a surfactant storage organelle in type ii epithelial cells [ , ]. thus, the binding of atorvastatin to ahr may cause pulmonary surfactant metabolism dysfunction by downregulating the abca gene via inhibition of esr activity [ ]. different interactors (e.g., histone acetyltransferase p (ep )) may be involved in ild, amplifying the effects of ahr through networks of other interactors and ild genes/proteins. in addition, nr i is a nuclear receptor that mediates transcriptional activation of target genes required for the metabolism and elimination of xenobiotics [ , , ], such as cyp b and cyp a [ , ].
upon the binding of xenobiotics, nr i is dephosphorylated for nuclear translocation and transactivation [ ], which requires reduced src kinase activity [ ]. on the other hand, p y purinoceptor (p ry ), a g protein-coupled receptor, activates src [ , ] in order to promote surfactant secretion from alveolar type ii cells [ ]. thus, the binding of atorvastatin to nr i [ ] can dysregulate normal surfactant secretion via interference with src kinase activity. atorvastatin inhibits hdac [ ]. hdac and ild genes/proteins are highly interconnected and also share many interactors with ahr and nr i , suggesting the existence of many different paths that cause ild from hdac inhibition (fig. b). interestingly, the binding of atorvastatin to hdac may be related to atorvastatin’s beneficial effect against cancer [ ]. hdac plays a key role in the epigenetic regulation of gene expression in cancer [ ] and hdac inhibitors (e.g., atorvastatin [ ]) can display anti-cancer activities [ ]. the anticancer effect of atorvastatin was also statistically analyzed. in this analysis, combinations of atorvastatin and a drug that have prrs > for ild were identified and their prrs for lung cancer were also calculated. none of the combinations of atorvastatin and concomitant drugs that had prrs > for ild had prrs > for lung cancer, supporting atorvastatin’s anti-cancer effects. overall, the model-based computational examinations and pathway analyses revealed key molecules important in ddi-associated ild involving atorvastatin and proposed underlying pathological pathways.
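the removal analyses used in the case studies share one pattern: drop a single target, enzyme, or transporter from a drug’s tet profile, re-run the prediction, and record the prr change. the python sketch below illustrates that loop under stated assumptions: `predict_prr` is a hypothetical stand-in for the trained model (which in the study operates on a drug pair’s tet matrix), the tet profile is encoded as a set of names, and `toy_predictor` with its weights is invented purely for demonstration.

```python
def removal_effects(tet_profile, predict_prr):
    """Predicted PRR change when each TET entry is removed in turn.

    tet_profile: iterable of TET names for one drug (hypothetical encoding)
    predict_prr: callable mapping a frozenset of TET names to a predicted
                 PRR; a stand-in for the trained model, assumed here
    """
    profile = frozenset(tet_profile)
    baseline = predict_prr(profile)
    effects = {}
    for entry in sorted(profile):
        reduced = profile - {entry}
        # Negative values mean the predicted PRR drops without this entry.
        effects[entry] = predict_prr(reduced) - baseline
    return effects


def toy_predictor(profile):
    """Toy stand-in model with hand-picked weights (not fitted values)."""
    weights = {"enzyme_x": 1.5, "transporter_y": 0.5}
    return 1.0 + sum(weights.get(name, 0.0) for name in profile)


effects = removal_effects(["target_w", "enzyme_x", "transporter_y"], toy_predictor)
```

entries whose removal produces the largest predicted decrease are the candidates for mediating the ddi-associated adr.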
conclusions

this study presented a novel computational approach to accurately predict the occurrences of adrs using a machine learning model consisting of hierarchically structured classifiers. the hierarchical model presented here addresses the limitations of conventional models relying on drug similarity for the prediction of adrs. the method developed here is based on tet profile-dependencies of adrs derived from drugs’ chemical structures, requiring no high chemical similarity of drugs. given basic structural characteristics of drugs, this hierarchical model integrating the rfcs for tet profile prediction and the svm for specific ddi-associated adrs can accurately predict adrs with an overall ~ % (= . % for tet prediction × . % for adr prediction) accuracy. the tendency of ddis to appear as various forms of adrs has been another primary issue in past predictions of ddi-associated adrs [ ]; the presented model deconvolutes this complexity, as judged by its accurate prediction of various adrs for any drug pair. in addition, pharmacological insight offered by the hierarchical model was successfully connected to pathway analyses underlying adrs, making the described computational approach powerful for not only predicting the occurrence of ddi-associated adrs but also enhancing mechanistic understanding. notably, the constructed model can accurately predict tet profiles and ddi-associated adrs for any drug pair from the most basic information about drugs, their chemical structures. thus, the model presented in this study can also be used for drug design.
for example, the maccs keys descriptors can be manipulated and fed into the hierarchical model to identify a drug’s key structural characteristics that increase the risk of adrs. once the structural hot spots are identified, an array of drug variants with different chemical moieties at the locations can be designed and evaluated for ddis and adrs prior to synthesis. as a result, many drugs can readily be evaluated for their potential ddis in advance, avoiding costly preclinical and clinical tests. thus, the hierarchical model developed is anticipated to pave a new way to enhance drug safety and reduce drug development costs.

references

kovacevic, m., vezmar kovacevic, s., radovanovic, s., stevanovic, p. & miljkovic, b. ( ). adverse drug reactions caused by drug-drug interactions in cardiovascular disease patients: introduction of a simple prediction tool using electronic screening database items. curr med res opin, ( ), - .
luo, j., eldredge, c., cho, c. c. & cisler, r. a. ( ). population analysis of adverse events in different age groups using big clinical trials data. jmir med inform, ( ), e .
tatonetti, n. p., ye, p. p., daneshjou, r. & altman, r. b. ( ). data-driven prediction of drug effects and interactions. sci transl med, ( ), ra .
nguyen, t., wong, e. & ciummo, f. ( ). polypharmacy in older adults: practical applications alongside a patient case. the journal for nurse practitioners, ( ), - .
kaneko, s. & nagashima, t. ( ). drug repositioning and target finding based on clinical evidence. biol pharm bull, ( ), - .
neve, e.
p., artursson, p., ingelman-sundberg, m. & karlgren, m. ( ). an integrated in vitro model for simultaneous assessment of drug uptake, metabolism, and efflux. mol pharm, ( ), - .
benet, l. z., cummins, c. l. & wu, c. y. ( ). transporter-enzyme interactions: implications for predicting drug-drug interactions from in vitro data. curr drug metab, ( ), - .
kalliokoski, a. & niemi, m. ( ). impact of oatp transporters on pharmacokinetics. br j pharmacol, ( ), - .
poirier, a., funk, c., lave, t. & noe, j. ( ). new strategies to address drug-drug interactions involving oatps. curr opin drug discov devel, ( ), - .
jamal, s., goyal, s., shanker, a. & grover, a. ( ). predicting neurological adverse drug reactions based on biological, chemical and phenotypic properties of drugs using machine learning models. sci rep, ( ), .
sakaeda, t., tamon, a., kadoyama, k. & okuno, y. ( ). data mining of the public version of the fda adverse event reporting system. int j med sci, ( ), - .
brouwers, l., iskar, m., zeller, g., van noort, v. & bork, p. ( ). network neighbors of drug targets contribute to drug side-effect similarity. plos one, ( ), e .
munoz, e., novacek, v. & vandenbussche, p. y. ( ). using drug similarities for discovery of possible adverse reactions. amia annu symp proc, , - .
seo, s., lee, t., kim, m. h. & yoon, y. ( ). prediction of side effects using comprehensive similarity measures. biomed res int, , .
vilar, s., uriarte, e., santana, l., lorberbaum, t., hripcsak, g., friedman, c. & tatonetti, n. p. ( ).
similarity-based modeling in large-scale prediction of drug-drug interactions. nat protoc, ( ), - .
noguchi, y., ueno, a., otsubo, m., katsuno, h., sugita, i., kanematsu, y., yoshida, a., esaki, h., tachi, t. & teramachi, h. ( ). a simple method for exploring adverse drug events in patients with different primary diseases using spontaneous reporting system. bmc bioinformatics, ( ), .
stancin, i. & jovic, a. ( ). an overview and comparison of free python libraries for data mining and big data analysis. nd international convention on information and communication technology, electronics and microelectronics (mipro). anal chem, ( ), - .
wishart, d. s., feunang, y. d., guo, a. c., lo, e. j., marcu, a., grant, j. r., sajed, t., johnson, d., li, c., sayeeda, z., assempour, n., iynkkaran, i., liu, y., maciejewski, a., gale, n., wilson, a., chin, l., cummings, r., le, d., pon, a., knox, c. & wilson, m. ( ). drugbank . : a major update to the drugbank database for . nucleic acids res, (d ), d -d .
kim, s., chen, j., cheng, t., gindulyte, a., he, j., he, s., li, q., shoemaker, b. a., thiessen, p. a., yu, b., zaslavsky, l., zhang, j. & bolton, e. e. ( ). pubchem update: improved access to chemical data. nucleic acids res, (d ), d -d .
du, h., cai, y., yang, h., zhang, h., xue, y., liu, g., tang, y. & li, w. ( ). in silico prediction of chemicals binding to aromatase with machine learning methods. chem res toxicol, ( ), - .
breiman, l. ( ). random forests. machine learning, , - .
dreiseitl, s. & ohno-machado, l. ( ).
logistic regression and artificial neural network classification models: a methodology review. j biomed inform, ( - ), - .
cortes, c. & vapnik, v. ( ). support-vector networks. machine learning, , − .
khuri, n., zur, a. a., wittwer, m. b., lin, l., yee, s. w., sali, a. & giacomini, k. m. ( ). computational discovery and experimental validation of inhibitors of the human intestinal transporter oatp b . j chem inf model, ( ), - .
jassal, b., matthews, l., viteri, g., gong, c., lorente, p., fabregat, a., sidiropoulos, k., cook, j., gillespie, m., haw, r., loney, f., may, b., milacic, m., rothfels, k., sevilla, c., shamovsky, v., shorser, s., varusai, t., weiser, j., wu, g., stein, l., hermjakob, h. & d'eustachio, p. ( ). the reactome pathway knowledgebase. nucleic acids res, (d ), d -d .
oughtred, r., stark, c., breitkreutz, b. j., rust, j., boucher, l., chang, c., kolas, n., o'donnell, l., leung, g., mcadam, r., zhang, f., dolma, s., willems, a., coulombe-huntington, j., chatr-aryamontri, a., dolinski, k. & tyers, m. ( ). the biogrid interaction database: update. nucleic acids res, (d ), d -d .
schmidt, t., samaras, p., frejno, m., gessulat, s., barnert, m., kienegger, h., krcmar, h., schlegl, j., ehrlich, h. c., aiche, s., kuster, b. & wilhelm, m. ( ). proteomicsdb. nucleic acids res, (d ), d -d .
szklarczyk, d., gable, a. l., lyon, d., junge, a., wyder, s., huerta-cepas, j., simonovic, m., doncheva, n. t., morris, j. h., bork, p., jensen, l. j. & mering, c. v. ( ). string v : protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. nucleic acids res, (d ), d -d .
giurgiu, m., reinhard, j., brauner, b., dunger-kaltenbach, i., fobo, g., frishman, g., montrone, c. & ruepp, a. ( ). corum: the comprehensive resource of mammalian protein complexes- . nucleic acids res, (d ), d -d .
brown, g. r., hem, v., katz, k.
s., ovetsky, m., wallin, c., ermolaeva, o., tolstoy, i., tatusova, t., pruitt, k. d., maglott, d. r. & murphy, t. d. ( ). gene: a gene-centered information resource at ncbi. nucleic acids res, (database issue), d - .
yu, w. & mackerell, a. d., jr. ( ). computer-aided drug design methods. methods mol biol, , - .
meibohm, b. & derendorf, h. ( ). pharmacokinetic/pharmacodynamic studies in drug product development. j pharm sci, ( ), - .
sudsakorn, s., bahadduri, p., fretland, j. & lu, c. ( ). fda drug-drug interaction guidance: a comparison analysis and action plan by pharmaceutical industrial scientists. curr drug metab, ( ), - .
el-hachem, n., haibe-kains, b., khalil, a., kobeissy, f. h. & nemer, g. ( ). autodock and autodocktools for protein-ligand docking: beta-site amyloid precursor protein cleaving enzyme (bace ) as a case study. methods mol biol, , - .
jonklaas, j., bianco, a. c., bauer, a. j., burman, k. d., cappola, a. r., celi, f. s., cooper, d. s., kim, b. w., peeters, r. p., rosenthal, m. s., sawka, a. m. & american thyroid association task force on thyroid hormone, r. ( ). guidelines for the treatment of hypothyroidism: prepared by the american thyroid association task force on thyroid hormone replacement. thyroid, ( ), - .
shi, s. & klotz, u. ( ). proton pump inhibitors: an update of their clinical use and pharmacokinetics. eur j clin pharmacol, ( ), - .
mohammadkhani, n., gharbi, s., rajani, h. f., farzaneh, a., mahjoob, g., hoseinsalari, a. & korsching, e. ( ). statins: complex outcomes but increasingly helpful treatment options for patients.
eur j pharmacol, , . . amoroso, g., van boven, a. j., van veldhuisen, d. j., tio, r. a., balje-volkers, c. p., petronio, a. s. & van oeveren, w. ( ). eptifibatide and abciximab exhibit equivalent antiplatelet efficacy in an experimental model of stenting in both healthy volunteers and patients with coronary artery disease. j cardiovasc pharmacol, ( ), - . . coutinho, j., field, j. b. & sule, a. a. ( ). armour(r) thyroid rage - a dangerous mixture. cureus, ( ), e . . schror, k. & weber, a. a. ( ). comparative pharmacology of gp iib/iiia antagonists. j thromb thrombolysis, ( ), - . . davis, p. j., leonard, j. l. & davis, f. b. ( ). mechanisms of nongenomic actions of thyroid hormone. front neuroendocrinol, ( ), - . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / . hammes, s. r. & davis, p. j. ( ). overlapping nongenomic and genomic actions of thyroid hormone and steroids. best pract res clin endocrinol metab, ( ), - . . phillips, d. r., charo, i. f. & scarborough, r. m. ( ). gpiib-iiia: the responsive integrin. cell, ( ), - . . torres, n. b. & altafini, c. ( ). drug combinatorics and side effect estimation on the signed human drug-target network. bmc syst biol, ( ), . . lee, c. h., franchi, f. & angiolillo, d. j. ( ). clopidogrel drug interactions: a review of the evidence and clinical implications. expert opin drug metab toxicol, ( ), - . . ho, p. m., maddox, t. m., wang, l., fihn, s. d., jesse, r. l., peterson, e. d. & rumsfeld, j. s. ( ). risk of adverse outcomes associated with concomitant use of clopidogrel and proton pump inhibitors following acute coronary syndrome. jama, ( ), - . . ogawa, r. & echizen, h. ( ). 
drug-drug interaction profiles of proton pump inhibitors. clin pharmacokinet, ( ), - . . evanchan, j., donnally, m. r., binkley, p. & mazzaferri, e. ( ). recurrence of acute myocardial infarction in patients discharged on clopidogrel and a proton pump inhibitor after stent placement for acute myocardial infarction. clin cardiol, ( ), - . . stockl, k. m., le, l., zakharyan, a., harada, a. s., solow, b. k., addiego, j. e. & ramsey, s. ( ). risk of rehospitalization for patients using clopidogrel with a proton pump inhibitor. arch intern med, ( ), - . . gaglia, m. a., jr., torguson, r., hanna, n., gonzalez, m. a., collins, s. d., syed, a. i., ben-dor, i., maluenda, g., delhaye, c., wakabayashi, k., xue, z., suddath, w. o., kent, k. m., satler, l. f., pichard, a. d. & waksman, r. ( ). relation of proton pump inhibitor use after percutaneous coronary intervention with drug-eluting stents to outcomes. am j cardiol, ( ), - . . savi, p., pereillo, j. m., uzabiaga, m. f., combalbert, j., picard, c., maffrand, j. p., pascal, m. & herbert, j. m. ( ). identification and biological activity of the active metabolite of clopidogrel. thromb haemost, ( ), - . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / . umemura, k., furuta, t. & kondo, k. ( ). the common gene variants of cyp c affect pharmacokinetics and pharmacodynamics in an active metabolite of clopidogrel in healthy subjects. j thromb haemost, ( ), - . . herbert, j. m. & savi, p. ( ). p y , a new platelet adp receptor, target of clopidogrel. semin vasc med, ( ), - . . li, x. q., andersson, t. b., ahlstrom, m. & weidolf, l. ( ). 
comparison of inhibitory effects of the proton pump-inhibiting drugs omeprazole, esomeprazole, lansoprazole, pantoprazole, and rabeprazole on human cytochrome p activities. drug metab dispos, ( ), - . . clarke, t. a. & waskell, l. a. ( ). the metabolism of clopidogrel is catalyzed by human cytochrome p a and is inhibited by atorvastatin. drug metab dispos, ( ), - . . farid, n. a., payne, c. d., small, d. s., winters, k. j., ernest, c. s., nd, brandt, j. t., darstein, c., jakubowski, j. a. & salazar, d. e. ( ). cytochrome p a inhibition by ketoconazole affects prasugrel and clopidogrel pharmacokinetics and pharmacodynamics differently. clin pharmacol ther, ( ), - . . diaz, d., fabre, i., daujat, m., saint aubert, b., bories, p., michel, h. & maurel, p. ( ). omeprazole is an aryl hydrocarbon-like inducer of human hepatic cytochrome p . gastroenterology, ( ), - . . rost, k. l., brosicke, h., brockmoller, j., scheffler, m., helge, h. & roots, i. ( ). increase of cytochrome p ia activity by omeprazole: evidence by the c-[n- - methyl]-caffeine breath test in poor and extensive metabolizers of s-mephenytoin. clin pharmacol ther, ( ), - . . rizzo, n., padoin, c., palombo, s., scherrmann, j. m. & girre, c. ( ). omeprazole and lansoprazole are not inducers of cytochrome p a under conventional therapeutic conditions. eur j clin pharmacol, ( ), - . . xiaodong, s., gatti, g., bartoli, a., cipolla, g., crema, f. & perucca, e. ( ). omeprazole does not enhance the metabolism of phenacetin, a marker of cyp a activity, in healthy volunteers. ther drug monit, ( ), - . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / . 
oosterhuis, b., jonkman, j. h., andersson, t., zuiderwijk, p. b. & jedema, j. n. ( ). minor effect of multiple dose omeprazole on the pharmacokinetics of digoxin after a single oral dose. br j clin pharmacol, ( ), - . . soons, p. a., van den berg, g., danhof, m., van brummelen, p., jansen, j. b., lamers, c. b. & breimer, d. d. ( ). influence of single- and multiple-dose omeprazole treatment on nifedipine pharmacokinetics and effects in healthy subjects. eur j clin pharmacol, ( ), - . . bataillard, m., beyens, m. n., mounier, g., vergnon-miszczycha, d., bagheri, h. & cathebras, p. ( ). muscle damage due to fusidic acid-statin interaction: review of cases from the french pharmacovigilance database and literature reports. am j ther, ( ), e -e . . boonmuang, p., nathisuwan, s., chaiyakunapruk, n., suwankesawong, w., pokhagul, p., teerawattanapong, n. & supsongserm, p. ( ). characterization of statin- associated myopathy case reports in thailand using the health product vigilance center database. drug saf, ( ), - . . brahmachari, b. & chatterjee, s. ( ). myopathy induced by statin-ezetimibe combination: evaluation of potential risk factors. indian j pharmacol, ( ), - . . du souich, p., roederer, g. & dufour, r. ( ). myotoxicity of statins: mechanism of action. pharmacol ther, , - . . hirota, t. & ieiri, i. ( ). drug-drug interactions that interfere with statin metabolism. expert opin drug metab toxicol, ( ), - . . marusic, s., lisicic, a., horvatic, i., bacic-vrca, v. & bozina, n. ( ). atorvastatin- related rhabdomyolysis and acute renal failure in a genetically predisposed patient with potential drug-drug interaction. int j clin pharm, ( ), - . . stormo, c., bogsrud, m. p., hermann, m., asberg, a., piehler, a. p., retterstol, k. & kringen, m. k. ( ). ugt a * is associated with decreased systemic exposure of atorvastatin lactone. mol diagn ther, ( ), - . . neuvonen, p. j., niemi, m. & backman, j. t. ( ). 
drug interactions with lipid- lowering drugs: mechanisms and clinical relevance. clin pharmacol ther, ( ), - . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / . ogilvie, b. w., zhang, d., li, w., rodrigues, a. d., gipson, a. e., holsapple, j., toren, p. & parkinson, a. ( ). glucuronidation converts gemfibrozil to a potent, metabolism-dependent inhibitor of cyp c : implications for drug-drug interactions. drug metab dispos, ( ), - . . canestaro, w. j., austin, m. a. & thummel, k. e. ( ). genetic factors affecting statin concentrations and subsequent myopathy: a hugenet systematic review. genet med, ( ), - . . gluba-brzozka, a., franczyk, b., toth, p. p., rysz, j. & banach, m. ( ). molecular mechanisms of statin intolerance. arch med sci, ( ), - . . fernandez, a. b., karas, r. h., alsheikh-ali, a. a. & thompson, p. d. ( ). statins and interstitial lung disease: a systematic review of the literature and of food and drug administration adverse event reports. chest, ( ), - . . zanger, u. m. & schwab, m. ( ). cytochrome p enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation. pharmacol ther, ( ), - . . pedersen, j. m., matsson, p., bergstrom, c. a., hoogstraate, j., noren, a., lecluyse, e. l. & artursson, p. ( ). early identification of clinically relevant drug interactions with the human bile salt export pump (bsep/abcb ). toxicol sci, ( ), - . . hirano, m., maeda, k., hayashi, h., kusuhara, h. & sugiyama, y. ( ). bile salt export pump (bsep/abcb ) can transport a nonbile acid substrate, pravastatin. j pharmacol exp ther, ( ), - . . hirota, t., fujita, y. & ieiri, i. 
( ). an updated review of pharmacokinetic drug interactions and pharmacogenetics of statins. expert opin drug metab toxicol, ( ), - . . zhang, l., lv, h., zhang, q., wang, d., kang, x., zhang, g. & li, x. ( ). association of slco b and abcb genetic variants with atorvastatin-induced myopathy in patients with acute ischemic stroke. curr pharm des, ( ), - . . akella, a. & deshpande, s. b. ( ). pulmonary surfactants and their role in pathophysiology of lung disorders. indian j exp biol, ( ), - . . whitsett, j. a., wert, s. e. & weaver, t. e. ( ). alveolar surfactant homeostasis and the pathogenesis of pulmonary disease. annu rev med, , - . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / . larigot, l., juricek, l., dairou, j. & coumoul, x. ( ). ahr signaling pathways and regulatory functions. biochim open, , - . . matthews, j. & gustafsson, j. a. ( ). estrogen receptor and aryl hydrocarbon receptor signaling pathways. nucl recept signal, , e . . hankinson, o. ( ). the aryl hydrocarbon receptor complex. annu rev pharmacol toxicol, , - . . matthews, j., wihlen, b., thomsen, j. & gustafsson, j. a. ( ). aryl hydrocarbon receptor-mediated transcription: ligand-dependent recruitment of estrogen receptor alpha to , , , -tetrachlorodibenzo-p-dioxin-responsive promoters. mol cell biol, ( ), - . . lin, c. y., strom, a., vega, v. b., kong, s. l., yeo, a. l., thomsen, j. s., chan, w. c., doray, b., bangarusamy, d. k., ramasamy, a., vergara, l. a., tang, s., chong, a., bajic, v. b., miller, l. d., gustafsson, j. a. & liu, e. t. ( ). discovery of estrogen receptor alpha target genes and response elements in breast tumor cells. 
genome biol, ( ), r . . klugbauer, n. & hofmann, f. ( ). primary structure of a novel abc transporter with a chromosomal localization on the band encoding the multidrug resistance-associated protein. febs lett, ( - ), - . . yamano, g., funahashi, h., kawanami, o., zhao, l. x., ban, n., uchida, y., morohoshi, t., ogawa, j., shioda, s. & inagaki, n. ( ). abca is a lamellar body membrane protein in human lung alveolar type ii cells. febs lett, ( ), - . . shulenin, s., nogee, l. m., annilo, t., wert, s. e., whitsett, j. a. & dean, m. ( ). abca gene mutations in newborns with fatal surfactant deficiency. n engl j med, ( ), - . . auerbach, s. s., dekeyser, j. g., stoner, m. a. & omiecinski, c. j. ( ). car displays unique ligand binding and rxralpha heterodimerization characteristics. drug metab dispos, ( ), - . . qatanani, m. & moore, d. d. ( ). car, the continuously advancing receptor, in drug metabolism and disease. curr drug metab, ( ), - . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / . goodwin, b., hodgson, e., d'costa, d. j., robertson, g. r. & liddle, c. ( ). transcriptional regulation of the human cyp a gene by the constitutive androstane receptor. mol pharmacol, ( ), - . . mutoh, s., osabe, m., inoue, k., moore, r., pedersen, l., perera, l., rebolloso, y., sueyoshi, t. & negishi, m. ( ). dephosphorylation of threonine is required for nuclear translocation and activation of human xenobiotic receptor car (nr i ). j biol chem, ( ), - . . groll, n., petrikat, t., vetter, s., wenz, c., dengjel, j., gretzmeier, c., weiss, f., poetz, o., joos, t. o., schwarz, m. & braeuning, a. ( ). 
inhibition of beta-catenin signaling by phenobarbital in hepatoma cells in vitro. toxicology, , - . . liu, j., liao, z., camden, j., griffin, k. d., garrad, r. c., santiago-perez, l. i., gonzalez, f. a., seye, c. i., weisman, g. a. & erb, l. ( ). src homology binding sites in the p y nucleotide receptor interact with src and regulate activities of src, proline-rich tyrosine kinase , and growth factor receptors. j biol chem, ( ), - . . woods, l. t., jasmer, k. j., munoz forti, k., shanbhag, v. c., camden, j. m., erb, l., petris, m. j. & weisman, g. a. ( ). p y receptors mediate nucleotide-induced egfr phosphorylation and stimulate proliferation and tumorigenesis of head and neck squamous cell carcinoma cell lines. oral oncol, , . . rice, w. r. & singleton, f. m. ( ). p -purinoceptors regulate surfactant secretion from rat isolated alveolar type ii cells. br j pharmacol, ( ), - . . rezen, t., hafner, m., kortagere, s., ekins, s., hodnik, v. & rozman, d. ( ). rosuvastatin and atorvastatin are ligands of the human constitutive androstane receptor/retinoid x receptor alpha complex. drug metab dispos, ( ), - . . lin, y. c., lin, j. h., chou, c. w., chang, y. f., yeh, s. h. & chen, c. c. ( ). statins increase p through inhibition of histone deacetylase activity and release of promoter-associated hdac / . cancer res, ( ), - . . archibugi, l., arcidiacono, p. g. & capurso, g. ( ). statin use is associated to a reduced risk of pancreatic cancer: a meta-analysis. dig liver dis, ( ), - . . bolden, j. e., peart, m. j. & johnstone, r. w. ( ). anticancer activities of histone deacetylase inhibitors. nat rev drug discov, ( ), - . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . 
              training accuracy    testing accuracy
target        . %                  . %
enzyme        . %                  . %
transporter   . %                  . %

table . average training and testing accuracies for target, enzyme, and transporter prediction by the rfc models.

figure captions

fig. . hierarchical classification model overview for the prediction of ddi-associated adrs from drugs’ chemical structures through predictions of (a) tet profiles from chemical fingerprints and (b) adrs from tet matrices of drug pairs. (a) drugs’ chemical structures were represented with maccs keys and used as features to predict tet profiles of drugs using a random forest classifier (rfc). (b) tet profiles of a drug pair were combined into a tet matrix, which was then used as a feature to predict encoded prrs for all adrs in rfc, logistic regression (lr), and support vector machine (svm) models.

fig. . repeated -fold cross-validation for (a) random forest classifier (rfc), (b) logistic regression (lr), and (c) support vector machine (svm) models.

fig. .
adverse drug reactions associated with drug-drug interactions (ddis) between levothyroxine and eptifibatide. (a) target, enzyme, and transporter (tet) profiles of levothyroxine and eptifibatide. y: the presence of a drug’s action on tets. n: the absence of a drug’s action on tets. (b) comparisons between Δprrs (as ddi indices) and prr predictions upon removal of itgb from eptifibatide’s tet profile for adrs associated with the co-administration of levothyroxine and eptifibatide. for a given adr, Δprr (the average prr of the single administrations of levothyroxine and eptifibatide minus the prr of their co-administration) was calculated, and the prr change for the co-administration of levothyroxine and eptifibatide was predicted upon alteration of eptifibatide’s tet profile for integrin β- (itgb ) from y to n. y: inclusion of itgb in eptifibatide’s tet profile. n: removal of itgb from eptifibatide’s tet profile. **: outside of the % confidence interval of the “no change” group.

fig. . adverse drug reactions associated with drug-drug interactions (ddis) between omeprazole and clopidogrel. (a) target, enzyme, and transporter (tet) profiles of omeprazole and clopidogrel. y: the presence of a drug’s action on tets. n: the absence of a drug’s action on tets. (b) the impacts of omeprazole’s tet profile on its ddi-associated adrs with clopidogrel. the prr changes for adrs associated with the co-administration of omeprazole and clopidogrel were calculated using the svm model when each of omeprazole’s tets was removed.
(c) comparisons between Δprrs (as ddi indices) and prr predictions upon removal of cyp c from omeprazole’s tet profile for adrs associated with the co-administration of omeprazole and clopidogrel. for a given adr, Δprr (the average prr of the single administrations of omeprazole and clopidogrel minus the prr of their co-administration) was calculated, and the prr change for the co-administration of omeprazole and clopidogrel was predicted using the svm model upon alteration of omeprazole’s tet profile for cyp c from y to n. y: inclusion of cyp c in omeprazole’s tet profile. n: removal of cyp c from omeprazole’s tet profile. **: outside of the % confidence interval of the “no change” group.

fig. . myopathy associated with drug-drug interactions (ddis) involving atorvastatin. (a) target, enzyme, and transporter (tet) profiles of atorvastatin and concomitant drugs, such as ramipril and warfarin. y: the presence of a drug’s action on tets. n: the absence of a drug’s action on tets. (b) the impacts of atorvastatin’s tet profile on its ddi-associated adr of myopathy with various concomitant drugs. the prr changes for myopathy associated with the co-administration of atorvastatin and other drugs were calculated using the svm model when each of atorvastatin’s tets was removed. (c) comparisons between Δprrs (as ddi indices) and prr predictions upon removal of cyp a from atorvastatin’s tet profile for myopathy associated with the co-administration of atorvastatin and a concomitant drug. for myopathy, Δprrs (the average prr of the single administrations of atorvastatin and a concomitant drug minus the prr of their co-administration) were calculated, and prr changes for the co-administration of atorvastatin and the drug were predicted using the svm model upon alteration of atorvastatin’s tet profile for cytochrome p a (cyp a ) from y to n. y: inclusion of cyp a in atorvastatin’s tet profile. n: removal of cyp a from atorvastatin’s tet profile.
**: outside of the % confidence interval of the “no change” group.

fig. . drug-drug interactions between atorvastatin and concomitant drugs for interstitial lung disease (ild). for ild, Δprr (= the average prr of the single administrations of atorvastatin and the concomitant drug minus the prr of their co-administration) was calculated and plotted with the prr for their co-administration.

fig. . interstitial lung disease (ild) associated with drug-drug interactions (ddis) involving atorvastatin. (a) the impacts of atorvastatin’s tet profile on its ddi-associated ild with various concomitant drugs. the prr changes for ild associated with the co-administration of atorvastatin and other drugs were calculated using the svm model when each of atorvastatin’s tets was removed. (b) the number of drug pairs containing atorvastatin with a predicted increase in prrs for ild when each of atorvastatin’s targets was removed. (c) comparisons between Δprrs (as ddi indices) and prr predictions upon removal of (a) nr i and (b) abcb from atorvastatin’s tet profile for ild associated with the co-administration of atorvastatin and a concomitant drug. for ild, Δprr (the average prr of the single administrations of atorvastatin and a concomitant drug minus the prr of their co-administration) was calculated, and the prr changes for the co-administration of atorvastatin and the drug were predicted using the svm model upon alteration of atorvastatin’s tet profile for (a) nr i and (b) abcb from y to n. y: inclusion of (a) nr i and (b) abcb in atorvastatin’s tet profile.
n: removal of (a) nr i and (b) abcb from atorvastatin’s tet profile. **: outside of the % confidence interval of the “no change” group.

fig. . pathway analyses for the enhanced risk of ild associated with ddis involving atorvastatin, created around (a) ahr and nr i and (b) hdac . interactions among genes/proteins were determined using an array of bioinformatics databases, including biogrid, proteomicsdb, string, corum, and reactome.
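The Δprr index that recurs in the captions above can be illustrated with a short sketch. `delta_prr` follows the definition given in the captions (the average PRR of the two single administrations minus the PRR of the co-administration). The `prr` helper uses the standard 2x2 contingency-table definition of the proportional reporting ratio, which is an assumption here because the captions do not restate it. Both function names and all counts are hypothetical and serve only as an illustration.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: reports with the drug (or drug pair) that mention the ADR
    b: reports with the drug (or drug pair) that do not mention the ADR
    c: reports with all other drugs that mention the ADR
    d: reports with all other drugs that do not mention the ADR
    """
    return (a / (a + b)) / (c / (c + d))


def delta_prr(prr_drug_a, prr_drug_b, prr_combination):
    """Average single-drug PRR minus the co-administration PRR."""
    return (prr_drug_a + prr_drug_b) / 2.0 - prr_combination


# Hypothetical report counts, for illustration only.
prr_a = prr(10, 90, 50, 950)    # drug A alone
prr_b = prr(20, 80, 50, 950)    # drug B alone
prr_ab = prr(30, 30, 50, 950)   # drugs A and B co-administered
print(delta_prr(prr_a, prr_b, prr_ab))
```

Under this sign convention, a negative value means that the PRR of the co-administration exceeds the average PRR of the two single administrations.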
keywords

adverse drug reactions; drug-drug interaction; drug safety; hierarchical classification; machine learning; prediction; metabolizing enzyme; target; transporter.

via: generalized and scalable trajectory inference in single-cell omics data

shobana v. stassen, gwinky g. k. yip, kenneth k. y. wong, joshua w. k. ho, and kevin k. tsia

department of electrical & electronic engineering, the university of hong kong, pokfulam road, hong kong
school of biomedical sciences, li ka shing faculty of medicine, the university of hong kong, pokfulam, hong kong
advanced biomedical instrumentation centre, hong kong science park, shatin, new territories, hong kong
laboratory of data discovery for health, hong kong science park, shatin, new territories, hong kong

abstract

inferring cellular trajectories using a variety of omic data is a critical task in single-cell data science. however, prediction and thus biologically meaningful discovery of cell fates are challenged by the sheer size of single-cell data, diverse omic data types, and their complex data topologies. we present via, a scalable trajectory inference algorithm that uses lazy-teleporting random walks to accurately reconstruct complex cellular trajectories beyond tree-like pathways (e.g. cyclic or disconnected structures), and to discover less populous lineages or those otherwise obscured in other methods.
via outperforms existing algorithms in recapitulating cell fates/lineages, and also mitigates loss of global connectivity information in large datasets beyond a million cells. furthermore, via demonstrates versatility by distilling cellular trajectories in single-cell transcriptomic, epigenomic, proteomic and morphological data, showing new promise in scalable, multifaceted single-cell analysis to explore novel biological processes.

introduction

single-cell omics data capture snapshots of cells that catalog cell types and molecular states with high precision. these high-content single-cell readouts can be harnessed to model evolving cellular heterogeneity and track dynamical changes of cell fates in tissues, tumours, and cell populations. however, current computational methods face four critical challenges. first, it remains difficult to accurately reconstruct high-resolution cell trajectories and detect cell fates embedded within them. even the few algorithms which automate cell fate detection (e.g., slingshot and palantir) exhibit low sensitivity and are highly susceptible to changes in input parameters. second, current trajectory inference (ti) methods predominantly work well on tree-like trajectories (e.g. slingshot, monocle), but lack the generalisability to infer disconnected, cyclic or hybrid topologies without imposing restrictions on transitions and causality. third, the growing scale of single-cell data, notably cell atlases of whole organisms, embryos, and human organs, exceeds the capacity of existing ti methods, not just in runtime and memory, but in preserving global connectivity, which is often lost after extensive dimension reduction or subsampling. fourth, fueling the advance in single-cell technologies is the ongoing pursuit to understand cellular heterogeneity from a broader perspective beyond transcriptomics. however, the applicability of ti to a broader spectrum of single-cell data has yet to be fully exploited.
to overcome these recurring challenges, we present via, a graph-based ti algorithm that uses a new strategy to compute pseudotime, and reconstruct cell lineages based on lazy-teleporting random walks integrated with markov chain monte carlo (mcmc) refinement. via relaxes common constraints on traversing the graph by allowing cyclic and temporally reversed movements, and thus robustly detects cell fates involving complex transitions that are otherwise obscured in other methods. via outperforms popular ti algorithms in terms of capturing cellular trajectories not limited to multi-furcations and trees, but also disconnected and cyclic topologies ( supplementary fig. s ) . compared to existing ti methods, .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://docs.google.com/document/d/ hkqd b gmabgnhrduyigpd_-ue ofupuwd jhxsou/edit#smartreference=s aw ujvuwn https://docs.google.com/document/d/ hkqd b gmabgnhrduyigpd_-ue ofupuwd jhxsou/edit#smartreference= ruyzz p https://docs.google.com/document/d/ hkqd b gmabgnhrduyigpd_-ue ofupuwd jhxsou/edit#smartreference=i cha y s https://docs.google.com/document/d/ hkqd b gmabgnhrduyigpd_-ue ofupuwd jhxsou/edit#smartreference=pwzzqt v https://docs.google.com/document/d/ hkqd b gmabgnhrduyigpd_-ue ofupuwd jhxsou/edit#smartreference= glf ij qkb https://docs.google.com/document/d/ hkqd b gmabgnhrduyigpd_-ue ofupuwd jhxsou/edit#smartreference=mtxpwbv o fn https://docs.google.com/document/d/ hkqd b gmabgnhrduyigpd_-ue ofupuwd jhxsou/edit#smartreference=gsc l p cfhh https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . 
via is highly scalable with respect to number of cells ( to > cells) and features, without requiring extensive dimensionality reduction or subsampling which compromise global information. we demonstrate via’s accuracy, scalability, topological-generalizability and multi-omic versatility across multiple modalities by investigating simulated and experimental datasets (supplementary table s ), ranging from single-cell rna-sequencing (scrna-seq), single-cell sequencing assay for transposase-accessible chromatin (scatac-seq), multi-omics integration, to mass and imaging cytometry.

figure . general workflow of via algorithm. step : single-cell level graph is clustered such that each node represents a cluster of single cells (computed by our clustering algorithm parc ). the resulting cluster graph forms the basis for subsequent random walks. step : two-stage pseudotime computation: (i) the pseudotime (relative to a user defined start cell) is first computed by the expected hitting time for a lazy-teleporting random walk along an undirected graph. at each step, the walk (with small probability) can remain (orange arrows) or teleport (red arrows) to any other state. (ii) edges are then forward biased based on the expected hitting time (see forward biased edges illustrated as the imbalance of double-arrowhead size). the pseudotime is further refined on the directed graph by running markov chain monte carlo (mcmc) simulations (see highlighted paths starting at root). step : consensus vote on terminal states based on vertex connectivity properties of the directed graph. step : lineage likelihoods computed as the visitation frequency under lazy-teleporting mcmc simulations. step : visualization that combines network topology and single-cell level pseudotime/lineage probability properties onto an embedding using gams, as well as unsupervised downstream analysis (e.g. gene expression trend along pseudotime for each lineage).
results

algorithm

via first represents the single-cell data as a cluster graph (i.e. each node is a cluster of single cells), computed by our recently developed data-driven community-detection algorithm, parc, which allows scalable clustering whilst preserving global properties of the topology needed for accurate ti (step in fig. ). the cell fates and their lineage pathways are then computed by a two-stage probabilistic method, which is the key algorithmic contribution of this work (step in fig. , see methods). in the first stage, via models the cellular process as a modified random walk that allows degrees of laziness (remaining at a node/state) and teleportation (jumping to any other node/state) with pre-defined probabilities. the pseudotime, and thus the graph directionality, can be computed based on the theoretical hitting times of nodes (see the theory and derivation in methods and supplementary note ). the lazy-teleporting behavior prevents the expected hitting time from converging to a local distribution in the graph as otherwise occurs in regular random walks, especially when the sample size grows . more specifically, the laziness and teleportation factors regulate the weights given to each eigenvector-value pair in the expected hitting time formulation such that the stationary distribution (given by the local-node degree-properties in regular walks) does not overwhelm the global information provided by other ‘eigen-pairs’.
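as a rough illustration of this first stage, the sketch below builds a lazy-teleporting transition matrix for a toy cluster graph and computes expected hitting times from a root node. it is not the via implementation: via uses a closed-form spectral expression, whereas this sketch simply iterates the first-step equations of the markov chain. the function names, the toy path graph, and the values `x = 0.5` and `alpha = 0.9` are assumptions chosen for illustration only.

```python
def lazy_teleporting_matrix(W, x, alpha):
    """Transition matrix of a lazy-teleporting walk on a weighted graph.

    With probability (1 - x) the walk stays put (laziness); with
    probability (1 - alpha) it jumps to a uniformly chosen node
    (teleportation). W is a symmetric list-of-lists weight matrix.
    """
    n = len(W)
    T = []
    for i in range(n):
        degree = sum(W[i])
        row = []
        for j in range(n):
            p_ij = W[i][j] / degree                      # standard walk
            z_ij = x * p_ij + (1.0 - x) * (1.0 if i == j else 0.0)
            row.append(alpha * z_ij + (1.0 - alpha) / n)
        T.append(row)
    return T


def expected_hitting_times(T, root, sweeps=5000):
    """Expected number of steps for a walk started at `root` to first
    reach each node, found by iterating h_i = 1 + sum_k T[i][k] * h_k
    with h_target fixed at 0. Teleportation guarantees convergence."""
    n = len(T)
    times = []
    for target in range(n):
        h = [0.0] * n
        for _ in range(sweeps):
            h = [
                0.0 if i == target
                else 1.0 + sum(T[i][k] * h[k] for k in range(n) if k != target)
                for i in range(n)
            ]
        times.append(h[root])
    return times


# Toy cluster graph (hypothetical): a path 0 - 1 - 2 - 3 - 4, unit weights.
W = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
T = lazy_teleporting_matrix(W, x=0.5, alpha=0.9)
pseudotime = expected_hitting_times(T, root=0)  # one value per cluster
```

in this toy setting the hitting time from the root serves as a cluster-level pseudotime; because the matrix has one row per cluster rather than per cell, its size stays small even for very large datasets.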
moreover, the computation does not require subsetting the first k eigenvectors (bypassing the need for the user to select a suitable threshold or subset of eigenvectors) since the dimensionality is not on the order of number of cells, but equal to the number of clusters. hence all eigenvalue-eigenvector pairs can be incorporated without causing a bottleneck in runtime. consequently in via, the modified walk on a cluster-graph not only enables scalable pseudotime computation for large datasets in terms of runtime, but also preserves information about the global neighborhood relationships within the graph. in the second stage of step , via infers the directionality of the graph by biasing the edge-weights with the initial pseudotime computations, and refines the pseudotime through mcmc simulations. next (step in fig . ), the mcmc-refined graph-edges of the lazy-teleporting random walk enable accurate predictions of terminal cell fates through a consensus vote of various vertex connectivity properties derived from the directed graph. the cell fate predictions obtained using this approach are more robust to changes in input data and parameters compared to other ti methods ( supplementary fig. s and fig. s ) . trajectories towards identified terminal states are resolved using lazy-teleporting mcmc simulations ( step in fig. ). the probabilistic approach and relaxation of edge constraints allowed by via in computing differentiation pathways and pseudotime enables greater sensitivity to cell fates and complex trajectories, and makes allowances for asynchrony in differentiation processes by avoiding prematurely imposing constraints on node-to-node mobility. other methods resort to constraints such as reducing the graph to a tree, imposing unidirectionality by thresholding edges based on pseudotime directionality, removing outgoing edges from terminal states , and computing shortest paths for pseudotime , . 
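the second stage and the lineage step can be sketched in the same spirit. the code below forward-biases the edges of a toy bifurcating cluster graph using a given pseudotime, then runs plain monte carlo simulations of the resulting markov chain and records how often each cluster is visited on the way to a chosen terminal state. the logistic bias function, the parameter values, and all names are assumptions made for illustration; the source does not specify them here, and this is not the via implementation.

```python
import math
import random


def forward_bias(W, pseudotime, strength=1.0):
    """Directed edge weights: edges that increase pseudotime are favoured.
    The logistic form is an assumed, illustrative choice of bias."""
    n = len(W)
    directed = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if W[i][j] > 0:
                gain = pseudotime[j] - pseudotime[i]
                directed[i][j] = W[i][j] / (1.0 + math.exp(-strength * gain))
    return directed


def visitation_frequency(directed, root, terminal, n_walks=2000,
                         max_steps=200, lazy=0.1, teleport=0.01, seed=0):
    """Fraction of successful walks (root -> terminal) visiting each node."""
    rng = random.Random(seed)
    n = len(directed)
    visits = [0] * n
    reached = 0
    for _ in range(n_walks):
        node = root
        seen = {root}
        for _ in range(max_steps):
            if node == terminal:
                break
            r = rng.random()
            if r < lazy:
                continue                      # lazy step: stay at this node
            if r < lazy + teleport:
                node = rng.randrange(n)       # teleport to any node
            else:
                node = rng.choices(range(n), weights=directed[node])[0]
            seen.add(node)
        if node == terminal:
            reached += 1
            for v in seen:
                visits[v] += 1
    return [v / reached if reached else 0.0 for v in visits]


# Toy bifurcation (hypothetical): 0 - 1, then 1 - 2 - 3 and 1 - 4 - 5.
edges = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]
W = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    W[a][b] = W[b][a] = 1.0
pseudotime = [0.0, 1.0, 2.0, 3.0, 2.0, 3.0]

directed = forward_bias(W, pseudotime)
freq = visitation_frequency(directed, root=0, terminal=3)
```

because backward edges are down-weighted rather than removed, a walk can still wander into the other branch and return, which is the kind of relaxed traversal described above; clusters on the path to the terminal state are nonetheless visited more often than clusters on the competing branch.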
via’s probabilistic approach to graph-traversal allows it to infer cell fates when the underlying data spans combinations of multifurcating trees and cyclic/disconnected topologies - topologies and hence lineages often obscured in existing ti methods (supplementary fig. s ). together, these four steps facilitate holistic topological visualization of ti on the single-cell level (e.g. using umap or phate embeddings , ) and other data-driven downstream analyses such as recovering gene expression trends (methods; step in fig. ).

via accurately infers trajectories in diverse scrna-seq data

via recapitulates differentiation topologies and identifies elusive cell fates across a wide range of transcriptomic data. we first showcase the ability of via to explore large single-cell transcriptomic datasets by employing the . -million-cell mouse organogenesis cell atlas (moca) .
while this dataset is inaccessible to most ti methods from a runtime and memory perspective, via can efficiently resolve the underlying developmental heterogeneity, including major trajectories ( fig. a,b ) with a runtime of ~ min, compared to the next fastest method which has a runtime of at least hours ( supplementary table s ). via preserves wider neighborhood information and reveals a globally connected topology of moca which is otherwise lost in the previous method. broadly speaking, the overall cluster graph of via consists of three main branches that concur with the known developmental process at early organogenesis. ( fig. a) . it starts from the root stem which has a high concentration of e . early epithelial cells made of multiple sub-trajectories (e.g. epidermis, nose and foregut/hindgut epithelial cells derived from the ectoderm and endoderm). the stem is connected to two distinct lineages: ) mesenchymal cells originated from the mesoderm which arises from interactions between the ectoderm and endoderm and ) neural tube/crest cells derived from neurulation when the ectoderm folds inwards . the sparsity of early cells (only ~ % are e . ) and the absence of earlier ancestral cells make it particularly challenging to capture the simultaneous development of trajectories. however, the overall pseudotime structure presented by via is reasonable. for instance, at the junction of the epithelial-mesenchymal branch, we find early mesenchymal cells from e . -e . . cells from later mesenchymal developmental stages (e.g. myocytes from e . - e . ) reside at the leaves of branches. similarly, at the junction of epithelial-neural tube, we find dorsal tube neural cells and notochord plate cells which are predominantly from e . -e . and more developed neural cells at the tips (e.g. excitatory and inhibitory neurons from e . -e . ). via also places the other dispersed groups of trajectories (e.g. 
endothelial, hematopoietic) in biologically relevant neighborhoods (supplementary notes , supplementary fig. s ). while via’s connected topology offers a coarse-grained holistic view, it does not compromise the ability to delineate individual lineage pathways (consistent with those found by cao et al., ) as shown in fig. c and supplementary fig. s . ti using via uniquely preserves both the global and local structures of the data and is thus particularly favorable for biological exploration involving large datasets, especially for comparative studies involving cell atlases . whilst manifold-learning methods are often used to extensively reduce dimensionality to mitigate the computational burden of large single-cell datasets, they tend to incur loss of global information and be sensitive to input parameters. via is sufficiently scalable to bypass such a step, and therefore retains a higher degree of neighborhood information when mapping large datasets. this is in contrast to monocle ’s umap-reduced inputs that reveal different disconnected super-groups and fluctuating connectivity depending on input parameters (see supplementary fig. s - for the biologically consistent structures proposed by via across a range of parameters compared to the contradicting cell super groups and connectivity suggested by a umap based ti interpretation). we next demonstrated the applicability of via in single-cell multi-omics analysis by inferring murine isl + cardiac progenitor cell (cpc) transition states using both single-cell transcriptomic and chromatin accessibility information (fig. d-i). via consistently uncovers the bifurcating lineages towards the
endothelial and cardiomyocyte fates based on the scrna-seq, scatac-seq datasets and their data integration (see methods for data integration). other methods such as palantir and slingshot, that are also applicable to non-transcriptomic data, fail to uncover the two main lineages in the individual as well as the more challenging integrated multi-omic data. they typically only detect one of the two lineages and instead falsely detect several intermediate and early stages as final cell fates (see fig. i for prediction accuracy). paga does not offer automated cell fate prediction and is therefore not benchmarked for this dataset. via detects lineage pathways in both the scrna-seq and scatac-seq that can be used to interpret relationships between transcription factor dynamics and gene expression in an unsupervised manner. via automatically generates a pseudotemporal ordering of cells (without requiring manual selection of relevant cells as done in jia et al.
) along respective lineages and their marker-tf pairs (see fig. f and supplementary fig. s e). the highlighted gene and tf pairs in the cardiac lineage show a strong correlation between expression and accessibility of gata and homeobox hox genes which are known to be related to the regulation of cardiomyocyte proliferation , , . via’s reliable performance against user-reconfiguration (choice of components, individual or integrated omic data) suggests it can be used for transferable interpretation between scrna-seq and chromatin accessibility data. we further tested via on a wider scope of (small-to mid-sized) scrna-seq datasets, including b-cell differentiation , hematopoiesis , , embryonic stem (es) cell differentiation in embryoid bodies , and endocrine differentiation (~ - cells). by comparing via with top-performing and popular ti algorithms, e.g. paga , palantir, slingshot and cellrank (see methods, and supplementary fig. s - for full analysis), we showed that via consistently outperforms other methods in terms of both runtime (in some cases by several orders of magnitude; see supplementary table s for runtime comparison) and the robustness and accuracy of lineage prediction across a wide range of pre-processing and algorithmic parameters. via’s relaxation of graph traversal to permit cyclic sub-paths (see supplementary fig. s ) and temporally reversed movements augments its sensitivity to lineages. notably, across a wide range of input parameter choices, via more consistently identified less populous lineages that other methods detected, at best, only within a narrow sweet spot of parameters. for example, via reliably delineates the megakaryocyte, conventional and plasmacytoid dendritic cell (cdc and pdc) lineages in human hematopoiesis (fig. m-o, supplementary fig. s - for pseudotime and graph-topological gene trends for all lineages); and delta cells ( %) during endocrine progenitor cell differentiation (fig. j-l, supplementary fig.
s for pseudotime and topological gene trends for all lineages), as evidenced by the corresponding gene-expression trend analysis and parameter stress tests. interestingly, we find that via often detects beta cell subpopulations (supplementary fig. s b,d,f) that express typical beta markers like dlk , pdx , but differ in their expression of ins and ins (supplementary fig. s d). such a beta cell heterogeneity , , whereby the immature beta- population strongly expresses ins , and weakly expresses ins , and the mature beta- population expresses both types of ins , can also be reconciled based on the position of the beta- cluster on the via graph (supplementary fig. s f).
figure via accurately infers trajectories in diverse scrna-seq datasets. (a) via cluster-graph trajectory where nodes are colored by pseudotime, and branches are shaded according to major lineages of . -million-cell mouse organogenesis cell atlas (moca). the via analysis (which is independent of the choice of visualization) produces a connected structure with linkages between some of the major cell types that have a tendency to become segregated in a umap based ti analysis (see supplementary fig. s - ). the stem (root) branch consists of epithelial cells derived from ectoderm and endoderm, leading to two main branches: ) the mesenchymal and ) the neural tube and neural crest. other major groups are placed in the biologically relevant neighborhoods, such as the adjacencies between hepatocyte and epithelial trajectories; the neural crest (comprising glial cells and pns neurons) and the neural tube; as well as the links between early mesenchyme with both the hematopoietic cells and the endothelial cells (see supplementary note ). (b) (left) single-cell phate embedding colored by major cell groups. (right) single-cell phate embedding colored by via pseudotime. (c) lineage pathways and probabilities of neuronal, myocyte and wbc lineages (see supplementary fig. s for other lineages). (d) scrna-seq and scatac-seq data of isl + cardiac progenitors (cps) integrated using seurat before via ti analysis and phate visualization.
cells are colored by annotated cell-type and experimental modality (e) cells are colored by via pseudotime with the via-inferred trajectory towards endothelial and myocyte lineages projected on top. (f) marker gene expression and chromatin accessibility for gene-tf pairs along pseudotime axis for cardiomyocyte lineage (g) via-graph trajectory with nodes colored by pseudotime shows bifurcation to endothelial and myocyte cells in scrna-seq cells (h) scatac-seq of isl + cps: via-graph again shows bifurcation after intermediate cp stage. (i) lineage prediction accuracy (f -score) for methods that offer automated lineage detection and are not limited to transcriptomic data. (k) pancreatic islets: colored by via pseudotime with detected terminal states shown in red and annotated based on known cell type as alpha, beta- , beta- , delta and epsilon lineages where beta- is ins low ins + beta subtype ( supplementary fig. s ). (l) via inferred cluster-level pathway shows gene regulation along endocrine progenitor (ep) to delta lineage specification (top) and sst gene-expression trend shows rise of sst in delta lineage ( see supplementary fig. s for remaining ). (m) prediction accuracy of the major endocrine cell types when varying the number of hvgs selected in pre-processing, and the number of pcs. (n) human cd + hematopoiesis with detected cell fates annotated (o) lineage pathway and gene-pseudotime trend shown for the cd megakaryocytic cells ( see supplementary fig. s for other lineages ). (p) prediction accuracy of cell fates when varying number of k (nearest neighbors) and pcs. note slingshot on default mode (“v ”) uses gmm clustering and “v ” uses k-means clustering (allowing for over-clustering k= , to increase sensitivity). runtime of each method is also highlighted below the chart. 
via enables multi-omic analysis beyond transcriptomic data

broad applicability of ti beyond transcriptomic analysis is increasingly critical, but existing methods have limitations contending with the disparity in the data structure (e.g. sparsity and dimensionality) across a variety of single-cell data types and oftentimes are designed with a view to only handling transcriptomic data (e.g. methods using rna velocity to infer directionality). first, we employ via to analyze human scatac-seq profiles (from cd + human bone marrow) (fig. a), and find that the continuous landscape of hematopoiesis generally mirrors the scrna-seq human hematopoietic data (fig. c). the intrinsic sparsity of scatac-seq data poses a challenge that can be alleviated by choice of pre-processing pipelines, and we see that via consistently predicts the expected hierarchy of furcations that leads to the lymphoid, myeloid and erythroid lineages for two commonly accepted pre-processing protocols , (methods). this again holds across a wide range of input parameters, as shown by the changes in the accessibility of tf motifs associated with known regulators, e.g. gata (erythroid), cebpd (myeloid) (fig b-d, supplementary fig. s ).

we next investigated whether via can cope with a significant drop in data dimensionality ( - ), as often presented in flow/mass cytometry data, and still delineate continuous biological processes.
we ran via on time-series mass cytometry data ( antibodies, k cells) capturing murine embryonic stem cell (esc) differentiation toward mesoderm cells (day - day ) . unlike previous analysis of the same data, which required chronological labels to visualize the developmental hierarchy, we ran via without such supervised adjustments and accurately captured the sequential development. via computed the trajectories with faster runtime (minutes, versus slingshot, which required hours; see table s ), detecting terminal states corresponding to cells in the final developmental stages: the main region of day - (marked by pdgfra , cd and gata expressions), and a small population of day - cells expressing epcam (~ . % of cells), which is otherwise obscured in other methods (e.g. palantir, slingshot) (fig e-h, fig. s e,f). finally we tested the adaptability of via to infer cell-cycle stages based on label-free single-cell biophysical morphology ( features, see supplementary table s and table s ) profiled by our recently developed high-throughput imaging flow cytometer, called faced . via reliably reconstructed the continuous cell-cycle progressions from g -s-g /m phase of two different types of live breast cancer cells as validated by the single-cell fluorescent (dna dye) images captured by the same system (methods; fig. i-k for mcf , supplementary fig. s for mda-mb ). intriguingly, the pseudotime ordered by via reveals not only the known cell growth in size and mass , and general conservation of cell mass density (as derived from the faced images (methods)) throughout the g /s/g phases, but also a slow-down trend during the g /s transition, consistent with the lower protein-accumulation rate during s phase (fig. l, supplementary fig. s f,g). the variation in biophysical textures (e.g.
phase entropy) along the via pseudotime likely relates to known architectural changes of chromosomes and cytoskeletons during the cell cycles (fig. l, fig. s f,g). these results further substantiate the growing body of work , , , on imaging biophysical cytometry for gaining a mechanistic understanding of biological systems, especially when combined with omics analysis .

concluding remarks

overall, via offers an advancement to ti methods to study a diverse range of single-cell omic data, including those targeted by many cell-atlas initiatives. by combining lazy-teleporting random walks and mcmc simulations, via relaxes common constraints on graph traversal and causality. this enables accurate lineage prediction that is robust to parameter configuration for a variety of complex topologies and rarer lineages obscured in other methods. our stress tests showed that the modeled developmental landscape in other methods is vulnerable to user parameter choice which can incur fragmentation or spurious linkages, and consequently only yield biologically sensible lineages for a narrow sweet spot of parameters (see the summary in supplementary fig. s ). for example, due to algorithmic measures taken to restrict permissible graph-edge transitions and progressively reduce the inherent dimensionality (e.g. pca followed by subsetting the number of diffusion components) other algorithms struggle to delineate obscure lineages and maintain neighborhood relationships. via’s wider bandwidth of accuracy, superior runtime and preservation of global graph properties for very large datasets, offers a unique and well-suited approach for multifaceted exploratory analysis to uncover novel biological processes, potentially those deviated from healthy trajectories.
figure via infers trajectories in single-cell multi-omic and image datasets (a) major lineages of human hematopoiesis (profiled by scatac-seq) projected onto the umap embedding. lineages are colored by facs sorted labels . (b) via cluster-graph topology colored by via pseudotime. (c) trajectory, pseudotime and detected terminal states (red) projected onto the umap embedding. (d) f -scores (on the k-mer z-score input) for terminal state prediction by different ti methods (for a fixed knn = ). terminal states include megakaryocyte–erythroid progenitor (mep), common lymphoid progenitor (clp), plasmacytoid dendritic cell (pdc) and monocytes (mono) lineages. the comparisons show that via's accuracy remains high across a wide range of pcs. (e) differentiation of mesc to mesoderm cells measured by single-cell mass cytometry. umap embedding is colored by different measurement time points (day - ).
(f) via cluster graph with detected terminal nodes (red) and colored by pseudotime. (g) via results projected onto single-cell umap embedding shows terminal states correspond to day / regions. (h) correlation of inferred pseudotime and day-labels achieved by different ti methods. the benchmark was done across different numbers of knn (using all antibodies). (i) label-free cell cycle progression tracking based on faced imaging cytometry. the phate embedding is constructed using biophysical/morphological features computed from images of human breast cancer cells (mcf ) (see supplementary fig. s for additional results using another breast cancer cell type (mda-mb )). the embedding is colored by the known cell cycle stages given by the dna fluorescence images (obtained from the same system). (j) via graph topology colored by pseudotime. (k) via trajectory and pseudotime projected on embedding. (l) “biophysical” feature expressions (z-score normalized) over pseudotime. (see supplementary table s - for detailed definitions of the features).

methods

via algorithm

via applies a scalable probabilistic method to infer cell state dynamics and differentiation hierarchies by organizing cells into trajectories along a pseudotime axis in a nearest-neighbor graph which is the basis for subsequent random walks. single cells are represented by graph nodes that are connected based on their feature similarity, e.g.
gene expression, transcription factor accessibility motif, protein expression or morphological features of cell images. a typical routine in via mainly consists of four steps:

1. accelerated and scalable cluster-graph construction. via first represents the single-cell data in a k-nearest-neighbor (knn) graph where each node is a cluster of single cells. the clusters are computed by our recently developed clustering algorithm, parc . in brief, parc is built on hierarchical navigable small world (hnsw ) accelerated knn graph construction and a fast community-detection algorithm (leiden method ), which is further refined by data-driven pruning. the combination of these steps enables parc to outperform other clustering algorithms in computational run-time, scalability in data size and dimension (without relying on subsampling of large-scale, high-dimensional single-cell data (> million cells)), and sensitivity of rare-cell detection. we employ the cluster-level topology, instead of a single-cell-level graph, for ti as it provides a coarser but clearer view of the key linkages and pathways of the underlying cell dynamics without imposing constraints on the graph edges. together with the strength of parc in clustering scalability and sensitivity, this step critically allows via to faithfully reveal complex topologies, namely cyclic, disconnected and multifurcating trajectories (supplementary fig. s ).

2. probabilistic pseudotime computation. the trajectories are then modeled in via as (i) lazy-teleporting random walk paths along which the pseudotime is computed and further refined by (ii) mcmc simulations. the root is a single cell chosen by the user. these two sub-steps are detailed as follows:

(i) lazy-teleporting random walk: we first compute the pseudotime as the expected hitting time of a lazy-teleporting random walk on an undirected cluster-graph generated in step 1.
the lazy-teleporting nature of this random walk ensures that as the sample size grows, the expected hitting time of each node does not converge to the stationary probability given by local node properties, but instead continues to incorporate the wider global neighborhood information. here we highlight the main steps leading to the closed form expression of the hitting time of this modified random walk; a detailed derivation is given in supplementary note .

the cluster graph constructed in via is mathematically defined as a weighted connected graph $G(V, E, W)$ with a vertex set $V$ of $n$ vertices (or nodes), i.e. $V = \{v_1, \cdots, v_n\}$, and an edge set $E$, i.e. a set of ordered pairs of distinct nodes. $W$ is an $n \times n$ weight matrix that describes the set of edge weights between node $i$ and $j$; weights $w_{ij} \geq 0$ are assigned to the edges $(v_i, v_j)$. for an undirected graph, $w_{ij} = w_{ji}$, and the $n \times n$ probability transition matrix $P$ of a standard random walk on this graph $G$ can be given by

$$P = D^{-1} W \qquad (1)$$

where $D$ is the $n \times n$ degree matrix, which is a diagonal matrix of the weighted sum of the degree of each node, i.e. the matrix elements are expressed as

$$d_{ij} = \begin{cases} \sum_{k} w_{ik} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

where $k$ are the neighbouring nodes connected to node $i$. hence, $d_{ii}$ (which can be reduced as $d_i$) is the degree of node $i$. we next consider a lazy random walk, defined as $Z$, with probability $(1 - x)$ of being lazy (where $0 < x < 1$), i.e. staying at the same node, then

$$Z = xP + (1 - x) I \qquad (2)$$

where $I$ is the identity matrix. when teleportation occurs with a probability $(1 - \alpha)$, the modified lazy-teleporting random walk $Z'$ can be written as follows, where $J$ is an $n \times n$ matrix of ones:

$$Z' = \alpha Z + (1 - \alpha)\,\frac{1}{n} J \qquad (3)$$

here we adapt the concept of the personalized pagerank vector, originally used for recording (or ranking) personal preferences of a web-surfer toward particular website pages, to rank the importance of other nodes (clusters of cells) to a given node, depending on the similarities among nodes (related to $P$ in the graph) and the lazy-teleporting random walk characteristics in the graph (set by the probabilities of teleporting and being lazy). based on this concept, one can model the likelihood to transit from one node (cluster of cells) to another, and thus construct the pseudotime based on the hitting time, which is a parameter describing the expected number of steps it takes for a random walk that starts at node $i$ to visit node $j$ for the first time. consider the teleporting probability of $(1 - \alpha)$ and a seed vector $s$ specifying the initial probability distribution across the $n$ nodes (such that $\sum_m s_m = 1$, where $s_m$ is the probability of starting at node $m$). the personalized pagerank vector $pr_\alpha(s)$ (which is defined as a column vector) is the unique solution to

$$pr_\alpha(s)^T = \alpha\, pr_\alpha(s)^T Z + (1 - \alpha)\, s^T \qquad (4)$$

substituting $Z$ (eq. (2)) into eq. (4), we can express the personalized pagerank vector $pr_\alpha(s)$ in terms of $R_{\beta,NL}$, the inverse of the $\beta$-normalized laplacian of the modified random walk (supplementary note ), i.e.

$$pr_\alpha(s)^T = \beta\, s^T D^{-0.5} R_{\beta,NL} D^{0.5} \qquad (5)$$

where

$$\beta = \frac{2(1 - \alpha)}{2 - \alpha}, \qquad R_{\beta,NL} = \sum_{m=1}^{n} \frac{\Phi_m^T \Phi_m}{\beta + 2x(1 - \beta)\,\eta_m}$$

and $\Phi_m$ and $\eta_m$ are the $m$-th eigenvector and eigenvalue of the normalized laplacian. in the expression of $R_{\beta,NL}$, $\beta$ and $x$ regulate the weight of contribution of each eigenvalue-eigenvector pair of the summation such that the first eigenvalue-eigenvector pair (corresponding to the stationary distribution and given by the local-node degree-properties) remains included in the overall expression, but does not overwhelm the global information provided by subsequent ‘eigen-pairs’. moreover, computation of $R_{\beta,NL}$ is not limited to a subset of the first $k$ eigenvectors (bypassing the need for the user to select a suitable threshold or subset of eigenvectors): the dimensionality is not on the order of the number of cells, but equal to the number of clusters, and hence all eigenvalue-eigenvector pairs can be incorporated without causing a bottleneck in runtime. the expected hitting time from node $q$ to node $r$ is given by

$$h_\alpha(q, r) = \frac{\left[pr_\alpha(e_r)^T\right](r)}{d_r} - \frac{\left[pr_\alpha(e_r)^T\right](q)}{d_q} \qquad (6)$$

where $e_i$ is an indicator vector with 1 in the $i$-th entry and 0 elsewhere (i.e. $s_m = 1$ if $m = i$ and $s_m = 0$ if $m \neq i$). we can substitute eq. (5) into eq. (6), making use of the fact that $d_r^{-1} = \left[D^{-1} e_r\right](r)$ and that $D^{-0.5} R_{\beta,NL} D^{-0.5}$ is symmetric, to obtain a closed form expression of the hitting time in terms of $R_{\beta,NL}$:

$$h_\alpha(q, r) = \beta\, (e_r - e_q)^T D^{-0.5} R_{\beta,NL} D^{-0.5} e_r \qquad (7)$$

(ii) mcmc simulation: the hitting time metric computed in sub-step (i) is used to infer graph-directionality. instead of pruning edges in the ‘reverse’ direction, edge-weights are biased based on the time difference between nodes using the logistic function with growth factor b = :

$$f = \frac{1}{1 + e^{-b\,\Delta t}}$$

where $\Delta t$ is the difference between the pseudotimes of the two nodes joined by the edge. we then recompute the pseudotimes on the forward biased graph: since there is no closed form solution of hitting times on a directed graph, we perform mcmc simulations (processed in parallel to enable fast simulation of large numbers of teleporting, lazy random walks starting at the root node of the cluster graph) and use the first quartile of the simulated pseudotime values for a respective node as the refined pseudotime for that node relative to the root. this refinement step ensures that the pseudotime is robust to spurious links (or conversely, links that are too weakly weighted) that can distort calculations based purely on the closed form solution of hitting times (supplementary fig. s d). by using this two-step pseudotime computation, via mitigates convergence issues and spurious edge-weights, both of which are common in random-walk pseudotime computation on large and complex datasets.

3. automated terminal-state detection. the algorithm then uses the refined directed and weighted graph (the edges are re-weighted using the refined pseudotimes) to predict which nodes represent the terminal states based on a consensus vote of pseudotime and multiple vertex connectivity properties, including out-degree (i.e. the number of edges directing out from the node), closeness $C(q)$, and betweenness $B(q)$:

$$C(q) = \frac{1}{\sum_{q \neq r} l(q, r)}, \qquad B(q) = \sum_{r \neq q \neq t} \frac{\sigma_{rt}(q)}{\sigma_{rt}}$$

$l(q, r)$ is the distance between node $q$ and node $r$ (i.e. the sum of edges in a shortest path connecting them). $\sigma_{rt}$ is the total number of shortest paths from node $r$ to node $t$. $\sigma_{rt}(q)$ is the number of these paths passing through node $q$. the consensus vote is performed on nodes that score above (or below for out-degree) the median in terms of connectivity properties. we show on multiple simulated and real biological datasets that via more accurately predicts the terminal states across a range of input data dimensions and key algorithm parameters than other methods attempting the same (supplementary fig. s ).

4. automated trajectory reconstruction. via then identifies the most likely path of each lineage by computing the likelihood of a node traversing towards a particular terminal state (e.g. differentiation). these lineage likelihoods are computed as the visitation frequency under lazy-teleporting mcmc simulations from the root to a particular terminal state, i.e. the probability of node i reaching terminal-state j is the number of times cell i is visited along a successful path (i.e. terminal-state j is reached) divided by the number of times cell i is visited along all of the simulations.
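the visitation-frequency definition of the lineage likelihood can be illustrated with a small simulation. the following is a minimal sketch, not the via implementation: the function name `lineage_likelihoods`, the dictionary graph format, the default values of `x`, `alpha`, `n_sims` and `max_steps`, and the choice to count a node at most once per simulated walk are all assumptions made for illustration.

```python
import random


def lineage_likelihoods(adj, root, terminals, n_sims=2000, x=0.5,
                        alpha=0.99, max_steps=500, seed=0):
    """Estimate lineage likelihoods by simulating lazy-teleporting walks.

    adj: dict mapping node -> {successor: edge weight} (directed graph).
    Returns {terminal: {node: likelihood}} where likelihood is
    (# walks through node that end in terminal) / (# walks through node).
    Parameter defaults are illustrative assumptions, not VIA's settings.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    terminals = set(terminals)
    visits_all = {v: 0 for v in nodes}
    visits_ok = {t: {v: 0 for v in nodes} for t in terminals}

    for _ in range(n_sims):
        current, visited, steps = root, {root}, 0
        while current not in terminals and steps < max_steps:
            if rng.random() >= alpha:
                # teleport: jump to a uniformly chosen node
                current = rng.choice(nodes)
            elif rng.random() < x and adj[current]:
                # ordinary step: move to a successor, weighted by edge weight
                succ = list(adj[current])
                weights = [adj[current][v] for v in succ]
                current = rng.choices(succ, weights=weights)[0]
            # otherwise the walk is lazy and stays at the same node
            visited.add(current)
            steps += 1
        reached = current if current in terminals else None
        # assumption: each node is counted at most once per simulated walk
        for v in visited:
            visits_all[v] += 1
            if reached is not None:
                visits_ok[reached][v] += 1

    return {t: {v: (visits_ok[t][v] / visits_all[v]) if visits_all[v] else 0.0
                for v in nodes}
            for t in terminals}


if __name__ == "__main__":
    # toy bifurcation: 0 -> 1 -> 2 (terminal) and 0 -> 3 -> 4 (terminal)
    toy = {0: {1: 1.0, 3: 1.0}, 1: {2: 1.0}, 2: {}, 3: {4: 1.0}, 4: {}}
    print(lineage_likelihoods(toy, root=0, terminals=[2, 4]))
```

in this toy graph, intermediate nodes that lie on only one branch receive a likelihood close to one for that branch, while the root splits its likelihood between the two terminal states.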
in contrast to other trajectory reconstruction methods, which compute the shortest paths between root and terminal node, the lazy-teleporting mcmc simulations in via offer a probabilistic view of pathways under relaxed conditions: they are not restricted to random walks along a tree-like graph, but generalize to other types of topologies, such as cyclic or connected/disconnected paths. in the same vein, we avoid confining the graph to an absorbing markov chain (amc), as this places prematurely strict and potentially inaccurate constraints on node-to-node mobility and can impede sensitivity to cell fates (as demonstrated by via’s superior cell fate detection across numerous datasets, supplementary fig. s ).

downstream visualization and analysis

via generates a visualization that combines the network topology and single-cell level pseudotime/lineage probability properties onto an embedding based on umap or phate. generalized additive models (gams) are used to draw edges found in the high-dimensional graph onto the lower dimensional visualization (fig. ). an unsupervised downstream analysis of cell features (e.g. marker gene expression, protein expression or image phenotype) along pseudotime for each lineage is performed (fig. ). specifically, via plots the expression of features across pseudotime for each lineage by using the lineage likelihood properties to weight the gams. a cluster-level lineage pathway is automatically produced by via to visualize feature heat maps at the cluster-level along a lineage-path and show the regulation of genes. via provides the option of gene imputation before plotting the lineage specific gene trends. the imputation is fast as it relies on the single-cell knn (scknn) graph computed in step 1. using an affinity-based imputation method, this step computes a “diffused” transition matrix on the scknn graph, which is used to impute and denoise the original gene expressions.
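to make the hitting-time computation of the probabilistic pseudotime step concrete, the sketch below evaluates it on a toy undirected graph. it is an illustration under stated assumptions rather than via's code: instead of the eigen-decomposition form, it solves the personalized pagerank fixed-point equation directly as a linear system (which yields the same vector), and the function names, the dense list-of-lists weight matrix and the values of `alpha` and `x` are hypothetical choices.

```python
def solve_linear(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def personalized_pagerank(W, seed, alpha, x):
    """Solve pr^T = alpha * pr^T Z + (1 - alpha) * s^T for pr.

    W: symmetric weight matrix (list of lists); Z = x*P + (1 - x)*I is the
    lazy walk built from P = D^-1 W.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    Z = [[x * W[i][j] / deg[i] + ((1.0 - x) if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # rearranged: (I - alpha * Z)^T pr = (1 - alpha) * s
    A = [[(1.0 if i == j else 0.0) - alpha * Z[j][i] for j in range(n)]
         for i in range(n)]
    b = [(1.0 - alpha) * s for s in seed]
    return solve_linear(A, b)


def hitting_time(W, q, r, alpha=0.9, x=0.5):
    """h(q, r) = pr(e_r)(r) / d_r - pr(e_r)(q) / d_q, seeded at node r."""
    n = len(W)
    deg = [sum(row) for row in W]
    e_r = [1.0 if m == r else 0.0 for m in range(n)]
    pr = personalized_pagerank(W, e_r, alpha, x)
    return pr[r] / deg[r] - pr[q] / deg[q]


if __name__ == "__main__":
    # toy path graph 0 - 1 - 2 - 3 with unit edge weights
    W = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
    for q in range(4):
        print(q, hitting_time(W, q, r=3))
```

on the path graph the value is zero for the target node itself and grows with graph distance from it, which is the ordering a pseudotime is expected to preserve.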
benchmarked methods

the methods were mainly chosen based on their superior performance in a recent large-scale benchmarking study, together with a select few recent methods claiming to supersede those in the study. specifically, recent and popular methods exhibiting reasonable scalability and automated cell fate prediction in multi-lineage trajectories were favoured as candidates for benchmarking (see supplementary table s for the key characteristics of methods). performance stress-tests in terms of lineage detection for each biological dataset, and pseudotime correlation for time-series data, were conducted over a range of key input parameters (e.g. numbers of k-nearest neighbors, highly variable genes (hvgs), and principal components (pcs)) and pre-processing protocols (see fig. m,p, supplementary fig. ). all comparisons were run on a computer with an intel(r) xeon(r) w- central processing unit ( . ghz, cores) and gb ram.

quantifying terminal state prediction accuracy for parameter tests was done using the f1-score, defined as the harmonic mean of recall and precision and calculated as:

$$F_1 = \frac{tp}{tp + 0.5\,(fp + fn)}$$

where tp is a true positive: the identification of a terminal cluster that is in fact a final differentiated cell fate; fp is a false positive: the identification of a cluster as terminal when in fact it represents an intermediate state; and fn is a false negative: a known cell fate that fails to be identified.

paga. it uses a cluster-graph representation to capture the underlying topology. paga computes a unified pseudotime by averaging the single-cell level diffusion pseudotime computed by dpt, but requires manual specification of terminal cell fates and clusters that contribute to lineages of interest in order to compare gene expression trends across lineages.

palantir. it uses diffusion-map components to represent the underlying trajectory. pseudotimes are computed as the shortest path along a knn-graph constructed in a low-dimensional diffusion component space, with edges weighted such that the distance between nodes corresponds to the diffusion pseudotime (dpt). terminal states are identified as extrema of the diffusion maps that are also outliers of the stationary distribution. the lineage-likelihood probabilities are computed using absorbing markov chains (constructed by removing outgoing edges of terminal states, and thresholding reverse edges).

slingshot. it is designed to process low-dimensional embeddings of the single-cell data. by default slingshot runs clustering based on gaussian mixture modeling and recommends using the first few pcs as input. slingshot connects the clusters using a minimum spanning tree and then fits principal curves for each detected branch. it uses the orthogonal projection against each principal curve to fit a separate pseudotime for each lineage, and hence the gene expressions cannot be compared across lineages. also, the runtimes are prohibitively long for large datasets or high input dimensions.

cellrank. this method combines the information of rna velocity (computed using scvelo) and gene-expression to infer trajectories. given that it is mainly suited for scrna-seq data, with the rna-velocity computation limiting the overall runtime for larger datasets, we limit our comparison to the pancreatic dataset which the authors of cellrank used to highlight its performance.

simulated data

we employed the dyntoy (https://github.com/dynverse/dyntoy) package, which generates synthetic single-cell gene expression data (~ cells x ‘genes’), to simulate different complex trajectory models. using these datasets, we tested that via consistently and more accurately captures both tree and non-tree like structures (multifurcating, cyclic, and disconnected) compared to other methods (supplementary fig. s ). all methods are subject to the same data pre-processing steps, pca dimension reduction and root-cell to initialize the path.

multifurcating structure. this dataset consists of ‘cells’ multifurcating into terminal states. via robustly captures all four terminal cell fates across a range of input pcs and the pseudotimes are well inferred relative to the root node (supplementary fig. s a). note that two terminal states (m and m ), which are very close to each other, are easily merged by the other methods (slingshot, palantir and paga).

cyclic structure. we ran via and other methods for different values of k nearest neighbors. via unambiguously shows a cyclic network for a range of k (in knn). slingshot does not use a knn parameter and shows fragmented lineages (top to bottom). paga fails to capture the connected cyclic structure at k = and , while palantir visually shows a linear (k = , ) or disconnected structure (k = ). van den berge et al. note that the challenge of cyclic trajectory reconstruction is also common in other popular methods, such as monocle , which consistently fragments or fits branching structures onto cyclic simulated datasets.

disconnected structure. this dataset comprises two disconnected trajectories (t and t ). t is cyclic with an extra branch (m to m ), t has a bifurcation at m (supplementary fig. s c). via captures the two disconnected structures as well as the m branch in the cyclic structure, and the bifurcation in the smaller structure. paga captures the underlying structure at pc = but becomes fragmented for other numbers of pcs. palantir also yields multiple fragments and is not able to capture the overall structure, while slingshot (using the default clustering based on gaussian mixture modeling) connects t and t , and only captures one of the bifurcations in t .
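terminal-state accuracy in these comparisons is scored with the f1-score defined under benchmarked methods. as a worked illustration, the sketch below computes it from a set of predicted terminal clusters and a set of known cell fates; the function name and the toy labels are hypothetical, and returning zero when there are no true positives is an assumed convention.

```python
def terminal_state_f1(predicted, known):
    """F1 = tp / (tp + 0.5 * (fp + fn)) for predicted vs. known terminal states."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # terminal clusters that are true cell fates
    fp = len(predicted - known)   # intermediate states wrongly called terminal
    fn = len(known - predicted)   # known cell fates that were missed
    if tp == 0:
        return 0.0                # assumed convention when nothing is recovered
    return tp / (tp + 0.5 * (fp + fn))


if __name__ == "__main__":
    # toy example: two of three known fates found, plus one spurious call
    print(terminal_state_f1({"m1", "m2", "x"}, {"m1", "m2", "m3"}))
```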
biological data

the pre-processing steps described below for each dataset are not included in the reported runtimes, as these steps are typically very fast (typically less than - % of the total runtime depending on the method, e.g. only a few minutes for pre-processing , s of cells) and only need to be performed once as they remain the same for all subsequent analyses. it should also be noted that visualizations (e.g. umap, t-sne) are not included in the runtimes. via provides a subsampling option at the visualization stage to accelerate this process for large datasets without impacting the previous computational steps. however, to ensure fair comparisons between ti methods (e.g. other methods do not have an option to compute the embedding on a subsampled input and transfer the results between the full trajectory and the sampled visualization, or rely on a slow version of tsne), we simply provide each ti method with a pre-computed visualization embedding on which the computed results are projected.

scrna-seq of mouse pre-b cells. this dataset models the pre-bi cell (hardy fraction c’) process during which cells progress to the pre-bii stage and b cell progenitors undergo growth arrest and differentiation. measurements were obtained at , , , , and hours (h) for a total of cells x , genes. we follow a standard scanpy preprocessing recipe that filters cells with low counts, and genes that occur in less than cells. the filtered cells are normalized by library size and log transformed. the top highly variable genes (hvg) are retained. cells are renormalized by library count and scaled to unit variance and zero mean. via identifies the terminal state at - h and accurately recapitulates the gene expression trends along inferred pseudotime of igii , slc a , fox , myc , ldha and lig (supplementary fig. s a). we show that the results generalize across a range of pcs for two values of k of the graph, with higher accuracy in locating the later cell fates than slingshot and palantir (supplementary fig. s b).

scrna-seq of human cd + bone marrow cells. this is a scrna-seq dataset of cells representing human hematopoiesis. we used the filtered, normalized and log-transformed count matrix provided by setty et al., with pca performed on all the remaining genes. the cells were annotated using singler, which automatically labeled cells based on the hematopoietic reference dataset novershtern hematopoietic cell data - gse . the annotations are in agreement with the labels inferred by setty et al. for the clusters, including the root hscs cluster that differentiates into different lineages: monocytes, erythrocytes, and b cells, as well as the less populous megakaryocytes, cdcs and pdcs. via consistently identifies these lineages across a wider range of input parameters and data dimensions (e.g. the number of k and pcs provided as input to the algorithms; see fig. p and supplementary fig. s c). notably, the upregulated gene expression trends of the small populations can be recovered in via, i.e. pdc and cdc show elevated cd and csf r levels relative to other lineages, and megakaryocytes show upregulated cd expression (supplementary fig. s -s ).

scrna-seq of human embryoid body.
this is a midsized scrna-seq dataset of , human cells in embryoid bodies (ebs). we followed the same pre-processing steps as moon et al. to filter out dead cells and those with too high or low library count. cells are normalized by library count followed by a square root transform. finally, the transformed counts are scaled to unit variance and zero mean. the filtered data contained cells × genes. pca is performed on the processed data before running each ti method. via identifies cell fates which, based on the upregulation of marker genes as cells proceed towards respective lineages, are in accord with the annotations given by moon et al. (see the gene heatmap and changes in gene expression along respective lineage trajectories in supplementary fig. s ). note that palantir and slingshot do not capture the cardiac cell fate, and slingshot also misses the neural crest (see the f1-scores summary for terminal state detection, supplementary fig. s ).

scrna-seq of mouse organogenesis cell atlas. this is a large and complex scrna-seq dataset of the mouse organogenesis cell atlas (moca) consisting of . million cells. the dataset contains cells from embryos spanning developmental stages from early organogenesis (e . -e . ) to organogenesis (e . ). of the million cells profiled, . million are ‘high-quality’ cells that are analysed by via. the runtime is approximately minutes, which is in stark contrast to the next fastest tool, palantir, which takes hours (excluding visualization). the authors of moca manually annotated cell-types based on the differentially expressed genes of the clusters. in general, each cell type exclusively falls under one of major and disjoint trajectories inferred by applying monocle to the umap of moca. the authors attributed the disconnected nature of the trajectories to the paucity of earlier stage common predecessor cells. we followed the same steps as cao et al. to retain high-quality cells (i.e. remove cells with fewer than mrna, and remove doublet cells and cells from doublet-derived sub-clusters). pca was applied to the top hvgs with the top pcs selected for analysis. via analyzed the data in the high-dimensional pc space. we bypass the step in monocle which applies umap on the pcs prior to ti, as this incurs an additional bias from the choice of manifold-learning parameters and a further loss in neighborhood information.
as a result, via produces a more connected structure with linkages between some of the major cell types that become segregated in umap (and hence monocle ), and favors a biologically relevant interpretation (fig. , supplementary fig. s ). a detailed explanation of these connections (graph-edges) extending between certain major groups, using references to literature on organogenesis, is presented in supplementary note .

scrna-seq of murine endocrine development. this is an scrna-seq dataset of e . murine pancreatic cells spanning all developmental stages from an initial endocrine progenitor-precursor (ep) state (low level of ngn , or ngn low), to the intermediate ep (high level of ngn , or ngn high) and fev + states, to the terminal states of hormone-producing alpha, beta, epsilon and delta cells. following steps by lange et al., we preprocessed the data using scvelo to filter genes, normalize each cell by total counts over all genes, keep the top most variable genes, and take the log-transform. pca was applied to the processed gene matrix. we assessed the performance of via and other ti methods (cellrank, palantir, slingshot) across a range of numbers of retained hvgs and input pcs (fig. m, supplementary fig. s ).

scatac-seq of human bone marrow cells. this scatac-seq data profiles cells isolated from human bone marrow using fluorescence activated cell sorting (facs), yielding populations: hsc, mpp, cmp, clp, lmpp, gmp, mep, mono and plasmacytoid dcs (fig. a and supplementary fig. s ). we examined ti results for two different preprocessing pipelines to gauge how robust via is on scatac-seq analysis, which is known to be challenging because of its extreme intrinsic sparsity. we used the pre-processed data consisting of pca applied to the z-scores of the transcription factor (tf) motifs used by buenrostro et al.
their approach corrects for batch effects in select populations and weights pcs based on reference populations, and hence involves manual curation. we also employed a more general approach used by chen et al., which employs chromvar to compute k-mer accessibility z-scores across cells. via infers the correct trajectories and the terminal cell fates for both of these inputs, again across a wide range of input parameters (fig. d and supplementary fig. s ).

scrna-seq and scatac-seq of isl + cardiac progenitor cells. this time-series dataset captures murine isl + cardiac progenitor cells (cpcs) from e . to e . characterized by scrna-seq ( cells) and scatac-seq ( cells). the isl + cpcs are known to undergo multipotent differentiation to cardiomyocytes or endothelial cells. for the scrna-seq data, the quality filtered genes and the size-factor normalized expression values are provided by jia et al. as a “single cell expression set” object in r. similarly, the cells in the scatac-seq experiment were provided in a “singlecellexperiment” object with low quality cells excluded from further analysis. the accessibility of peaks was transformed to a binary representation as input for tf-idf (term frequency-inverse document frequency) weighting prior to singular value decomposition (svd). the highlighted tf motifs in the heatmap (fig. j) correspond to those highlighted by jia et al. we tested the performance when varying the number of svd components used. we also considered the outcome when merging the scatac-seq and scrna-seq data using seurat. despite the relatively low cell count of both datasets, and the relatively under-represented scrna-seq cell count, the two datasets overlapped reasonably well and allowed us to infer the expected lineages in an unsupervised manner (fig. d and supplementary fig. s ).
in contrast, jia et al. performed a supervised ti by manually selecting cells relevant to the different lineages (for the scatac-seq cells) and choosing the two diffusion components that best characterize the developmental trajectories in low dimension.

mass cytometry data of mouse embryonic stem cells (mesc).
this is a mass cytometry (or cytof) dataset, consisting of , cells and antibodies (corresponding to ~ cells each from day - measurements), that represents differentiation of mesc to mesoderm cells. an arcsinh transform with a scaling factor of was applied to all features, a standard procedure for cytof datasets, followed by normalization to zero mean and unit variance. given the small feature set, no pca is required (supplementary fig. s ). via identifies main terminal states corresponding to day and day . palantir, on the other hand, identifies three terminal states that all correspond to days in the first half of the experiment, and its pseudotime is heavily influenced by the root node being very weakly connected to the other stages of the process. slingshot appears to capture the overall pseudotime, but the lineages imposed onto the low-dimensional representation are difficult to interpret and distinguish. to improve palantir performance we used waypoints, but this takes almost minutes to complete (excluding the time taken for embedding the visualization). via runs in ~ minutes and produces results consistent with the known ordering. the pseudotime reflects the range of days very well, even capturing the small population of day cells on the left-hand side of the day cells in the embedding (fig. , and supplementary fig. s ).

single-cell biophysical phenotypes derived from imaging flow cytometry. this is the in-house dataset of single-cell biophysical phenotypes of two different human breast cancer types (mda-mb and mcf ). following our recent image-based biophysical phenotyping strategy, we defined the spatially-resolved biophysical features of a cell in a hierarchical manner based on both bright-field and quantitative phase images captured by the faced imaging flow cytometer (i.e., from the bulk features to the subcellular textures). at the bulk level, we extracted the cell size, dry mass density, and cell shape.
at the subcellular texture level, we parameterized the global and local textural characteristics of optical density and mass density at both the coarse and fine scales (e.g., local variation of mass density, its higher-order statistics, phase entropy radial distribution, etc.). this hierarchical phenotyping approach allowed us to establish a single-cell biophysical profile of features, which were normalized based on the z-score (see supplementary table s and table s ). all these features, without any pca, are used as input to via. in order to weight the features, we use a mutual information classifier to rank them, based on the integrated fluorescence intensity of the fluorescence faced images of the cells (which serves as the ground truth of the cell-cycle stages). following normalization, the top features (which relate to cell size) are weighted (using a factor between - ).
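the z-score normalization, mutual-information ranking and up-weighting of the top features described above can be sketched as follows. this is a simplified stand-in rather than the authors' implementation: the histogram-based mutual information estimate replaces whatever mutual information classifier was actually used, and the feature names, toy values, `top_k` and `weight` are hypothetical.

```python
import math
from collections import Counter

def z_score(column):
    """Standardize one feature to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(var) or 1.0  # constant features are left at zero
    return [(v - mean) / std for v in column]

def mutual_information(column, labels, n_bins=3):
    """Histogram estimate (in nats) of MI between a feature and class labels."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0
    bins = [min(int((v - lo) / width), n_bins - 1) for v in column]
    n = len(column)
    joint = Counter(zip(bins, labels))
    p_bin = Counter(bins)
    p_lab = Counter(labels)
    mi = 0.0
    for (b, lab), c in joint.items():
        # p(b, lab) * log(p(b, lab) / (p(b) * p(lab))) written with raw counts
        mi += (c / n) * math.log(c * n / (p_bin[b] * p_lab[lab]))
    return mi

def rank_and_weight(features, labels, top_k, weight):
    """Z-score every feature, rank by MI, and up-weight the top_k features."""
    names = list(features)
    scaled = {name: z_score(features[name]) for name in names}
    scores = {name: mutual_information(scaled[name], labels) for name in names}
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    for name in ranked[:top_k]:
        scaled[name] = [v * weight for v in scaled[name]]
    return ranked, scaled

# Toy data: labels stand in for cell-cycle stages derived from fluorescence.
labels = ["g1", "g1", "g1", "s", "s", "s", "g2", "g2", "g2"]
features = {
    "cell_size": [1.0, 1.1, 0.9, 2.0, 2.1, 1.9, 3.0, 3.1, 2.9],  # tracks the stage
    "texture": [0.5, 2.5, 1.5, 0.6, 2.4, 1.4, 0.4, 2.6, 1.6],    # unrelated to stage
}
ranked, weighted = rank_and_weight(features, labels, top_k=1, weight=2.0)
```

in this toy example the size-like feature separates the three stages perfectly and is ranked first, so only its standardized values are scaled by the weight; the remaining features stay at unit variance before being passed on as input.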
imaging flow cytometry experiment

faced imaging flow cytometer setup
a multimodal faced imaging flow cytometry (ifc) platform was used to obtain the quantitative phase and fluorescence images of single cells in microfluidic flow at an imaging throughput of ~ , cells/sec. the light source consisted of an nd:yvo picosecond laser (center wavelength = nm, time-bandwidth) and a periodically-poled lithium niobate (ppln) crystal (covesion) for second harmonic generation of a green pulsed beam (center wavelength = nm) with a repetition rate of mhz. the beam was then directed to the faced module, which mainly consists of a pair of almost-parallel plane mirrors. this module generated a linear array of beamlets (foci), which were projected by an objective lens ( x, . na, mrh , nikon) onto the flowing cells in the microfluidic channel for imaging. each beamlet was designed to have a time delay of ns relative to the neighboring beamlet in order to minimize fluorescence crosstalk due to the fluorescence decay. the detailed configuration of the faced module is described by wu et al. the epi-fluorescence image signal was collected by the same objective lens and directed through a band-pass dichroic beamsplitter (center: nm, bandwidth: nm). the filtered orange fluorescence signal was collected by a photomultiplier tube (pmt) (rise time: . ns, hamamatsu). the light transmitted through the cell, on the other hand, was collected by another objective lens ( x, . na, mrd , nikon). the light was then split equally by the : beamsplitter into two paths, each of which encodes a different phase-gradient image contrast of the same cell (a concept similar to schlieren photography). the two beams were combined, time-interleaved, and directed to the photodetector (pd) (bandwidth: > ghz, alphalas) for detection. the signals obtained from both the pmt and the pd were then passed to a real-time high-bandwidth digitizer ( ghz, gs/s, lecroy) for data recording.
cell culture and preparation
mda-mb (atcc) and mcf (atcc), which are two different breast cancer cell lines, were used for the cell cycle study. the culture medium for mda-mb was atcc modified rpmi (gibco) supplemented with % fetal bovine serum (fbs) (gibco) and % antibiotic-antimycotic (anti-anti) (gibco), while that for mcf was dmem supplemented with % fbs (gibco) and % anti-anti (gibco). the cells were cultured inside an incubator under % co and °c, and subcultured twice a week. e cells were pipetted out from each cell line and stained with vybrant dyecycle orange stain (invitrogen).

data availability
data used in figures - (as well as supplementary figures s -s ) are available as follows:
pancreatic data: gene expression omnibus (geo) under accession code gse .
cardiac progenitor data: ena repository under the accession code prjeb , or from https://github.com/loosolab/cardiac-progenitors
b-cell: stategradata github repository, https://github.com/stategradata/stategradata
mass cytometry mesoderm: cytobank, https://community.cytobank.org/cytobank/experiments/
scrna-seq human hematopoiesis (raw and processed data): human cell atlas data portal at https://data.humancellatlas.org/explore/projects/ cf b- bc- e - -f a c a
embryoid body: mendeley data repository at https://doi.org/ . /v n h ng. .
mouse organogenesis: ncbi gene expression omnibus under accession number gse .
faced cell cycle: https://github.com/shobistassen/via and on figshare https://doi.org/ . /m .figshare. .v
scatac-seq hematopoiesis: geo: gse . processed scatac-seq data, which include pc values and tf scores per cell, can be found in data s of https://doi.org/ . /j.cell. . .
toy data: https://github.com/shobistassen/via

code availability
via is available as a pip-installable python library “pyvia”, with tutorials and sample data available at https://github.com/shobistassen/via and https://pypi.org/project/pyvia/

references
street, k. et al. slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. bmc genomics , ( ).
setty m, kiseliovas v, levine j, gayoso a, mazutis l, pe'er d. characterization of cell fate probabilities in single-cell data with palantir [published correction appears in nat biotechnol. oct; ( ): ]. nat biotechnol. ; ( ): - . doi: . /s - - -
qiu, x., mao, q., tang, y. et al. reversed graph embedding resolves complex single-cell trajectories. nat methods , – ( ). https://doi.org/ . /nmeth.
saelens, w., cannoodt, r., todorov, h. et al. a comparison of single-cell trajectory inference methods. nat biotechnol , – ( ). https://doi.org/ . /s - - -
bastidas-ponce, a. et al. comprehensive single cell mrna profiling reveals a detailed roadmap for pancreatic endocrinogenesis. development , ( ).
cao, j. et al. comprehensive single-cell transcriptional profiling of a multicellular organism. science , – ( ).
packer, j. s. et al. a lineage-resolved molecular atlas of c. elegans embryogenesis at single-cell resolution. science , eaax ( ).
cao, j., spielmann, m., qiu, x. et al. the single-cell transcriptional landscape of mammalian organogenesis. nature , – ( ).
briggs, j. a. et al. the dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. science , eaar ( ).
litviňuková, m., talavera-lópez, c., maatz, h. et al. cells of the adult human heart. nature ( ).
stassen sv, siu dmd, lee kcm, ho jwk, so hkh, tsia kk. parc: ultrafast and accurate clustering of phenotypic data of millions of single cells. bioinformatics. may ; ( ): - . doi: . /bioinformatics/btaa .
ulrike von luxburg, agnes rad, matthias hein. hitting and commute times in large random neighborhood graphs. journal of machine learning research , - ( )
marius lange, volker bergen, michal klein, manu setty, bernhard reuter, mostafa bakhti, heiko lickert, meshal ansari, janine schniering, herbert b. schiller, dana pe’er, fabian j. theis. cellrank for directed single-cell fate mapping. biorxiv . . . ; doi: https://doi.org/ . / . . .
mcinnes, l., healy, j., saul, n. & großberger, l. umap: uniform manifold approximation and projection. j. open source software. , ( ).
moon, k.r., van dijk, d., wang, z. et al.
visualizing structure and transitions in high-dimensional biological data. nat biotechnol , – ( ). https://doi.org/ . /s - - -
tam pp, behringer rr. mouse gastrulation: the formation of a mammalian body plan. mech dev. ; ( - ): - . doi: . /s - ( ) -
chin am, hill dr, aurora m, spence jr. morphogenesis and maturation of the embryonic and postnatal intestine. semin cell dev biol. jun; : - . doi: . /j.semcdb. . . . epub feb .
gilbert sf. developmental biology. th edition. sunderland (ma): sinauer associates; . the neural crest. available from: https://www.ncbi.nlm.nih.gov/books/nbk /
the human body at cellular resolution: the nih human biomolecular atlas program, nature ( ) https://doi.org/ . /s - - -x
jia g, preussner j, chen x, guenther s, yuan x, yekelchyk m, kuenne c, looso m, zhou y, teichmann s, braun t. single cell rna-seq and atac-seq analysis of cardiac progenitor cell transition states and lineage settlement. nat commun. nov ; ( ): .
tanya e. foley, bradley hess, joanne g. a. savory, randy ringuette, david lohnes. role of cdx factors in early mesodermal fate decisions. development : dev doi: . /dev. published april
yao y, yao j, boström ki. sox transcription factors in endothelial differentiation and endothelial-mesenchymal transitions. front cardiovasc med. ; : . published mar . doi: . /fcvm. .
potta sp, liang h, winkler j, doss mx, chen s, wagh v, pfannkuche k, hescheler j, sachinidis a. isolation and functional characterization of alpha-smooth muscle actin expressing
cardiomyocytes from embryonic stem cells. cell physiol biochem. ; ( ): - . doi: . / . epub may . pmid: .
warkman as, whitman sa, miller mk, garriock rj, schwach cm, gregorio cc, krieg pa. developmental expression and cardiac transcriptional regulation of myh b, a third myosin heavy chain in the vertebrate heart. cytoskeleton (hoboken). may; ( ): - . doi: . /cm. . epub apr . erratum in: cytoskeleton (hoboken). dec; ( ): . pmid: ; pmcid: pmc .
mahmoud ai, kocabas f, muralidhar sa, et al. meis regulates postnatal cardiomyocyte cell cycle arrest. nature. ; ( ): - . doi: . /nature
gomez-cabrero, d., tarazona, s., ferreirós-vidal, i. et al. stategra, a comprehensive multi-omics dataset of b-cell differentiation in mouse. sci data , ( ). https://doi.org/ . /s - - -
jason d. buenrostro, m. ryan corces, caleb a. lareau, beijing wu, alicia n. schep, martin j. aryee, ravindra majeti, howard y. chang, william j. greenleaf, integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation, cell, , - .e , ( ) https://doi.org/ . /j.cell. . .
wolf, f. a. et al. paga: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. genome biol. , ( ).
gutierrez gd, gromada j, sussel l. heterogeneity of the pancreatic beta cell. front genet. ; : . published mar . doi: . /fgene. .
krentz naj, lee myy, xu ee, sproul slj, maslova a, sasaki s, lynn fc. single-cell transcriptome profiling of mouse and hesc-derived pancreatic progenitors. stem cell reports. dec ; ( ): - . doi: . /j.stemcr. . . . pmid: ; pmcid: pmc .
chen, h., lareau, c., andreani, t. et al. assessment of computational methods for the analysis of single-cell atac-seq data. genome biol , ( ). https://doi.org/ . /s - - -
ko, m.e., williams, c.m., fread, k.i. et al.
flow-map: a graph-based, force-directed layout algorithm for trajectory mapping in single-cell time course datasets. nat protoc , – ( ). https://doi.org/ . /s - - -
wu j. l., xu y. q., xu j. j., wei x. x., chan a. c. s., tang a. h. l., lau a. k. s., chung b. m. f., cheung shum h., lam e. y., wong k. k. y., tsia k. k., “ultrafast laser-scanning time-stretch imaging at visible wavelengths,” light sci. appl. ( ), e ( ). . /lsa. .
popescu g, park y, lue n, best-popescu c, deflores l, dasari rr, feld ms, badizadegan k. optical imaging of cell mass and growth dynamics. am j physiol cell physiol. aug; ( ):c - . doi: . /ajpcell. . . epub jun .
kyoohyun kim, jochen guck. the relative densities of cytoplasm and nuclear compartments are robust against strong perturbation. biophysical journal. volume , issue , november , pages -
kafri r, levy j, ginzberg mb, oh s, lahav g, kirschner mw. dynamics extracted from fixed cells reveal feedback linking cell growth to cell cycle. nature. feb ; ( ): - . doi: . /nature . pmid: ; pmcid: pmc .
park sr, namkoong s, friesen l, cho cs, zhang zz, chen yc, yoon e, kim ch, kwak h, kang hm, lee jh. single-cell transcriptome analysis of colon cancer cell response to -fluorouracil-induced dna damage. cell rep. aug ; ( ): . doi: . /j.celrep. . .
zangle ta, teitell ma. live-cell mass profiling: an emerging approach in quantitative biophysics. nat methods. dec; ( ): - . doi: . /nmeth. . pmid: ; pmcid: pmc .
tse ht, gossett dr, moon ys, masaeli m, sohsman m, ying y, mislick k, adams rp, rao j, di carlo d. quantitative diagnosis of malignant pleural effusions by single-cell mechanophenotyping. sci transl med. nov ; ( ): ra . doi: . /scitranslmed. . pmid: .
otto, o., rosendahl, p., mietke, a. et al. real-time deformability cytometry: on-the-fly cell mechanical phenotyping. nat methods , – ( ). https://doi.org/ . /nmeth.
kimmerling, r.j., prakadan, s.m., gupta, a.j. et al. linking single-cell measurements of mass, growth rate, and gene expression. genome biol , ( ). https://doi.org/ . /s - - -
traag, v.a., waltman, l. & van eck, n.j. from louvain to leiden: guaranteeing well-connected communities. sci rep , ( ). https://doi.org/ . /s - - -z
langville, amy n., and carl d. meyer. google's pagerank and beyond: the science of search engine rankings. princeton university press, .
chung f., zhao w. ( ) pagerank and random walks on graphs. in: katona g.o.h., schrijver a., szőnyi t., sági g. (eds) fete of combinatorics and computer science. bolyai society mathematical studies, vol . springer, berlin, heidelberg.
van dijk d, sharma r, nainys j, et al. recovering gene interactions from single-cell data using data diffusion. cell. ; ( ): - .e . doi: . /j.cell. . .
coifman, r. r. et al. geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. proc. natl acad. sci. usa , – ( ).
haghverdi l, büttner m, wolf fa, buettner f, theis fj. diffusion pseudotime robustly reconstructs lineage branching. nat methods. ; ( ): - . doi: . /nmeth.
bergen, v., lange, m., peidli, s. et al. generalizing rna velocity to transient cell states through dynamical modeling. nat biotechnol , – ( ). https://doi.org/ . /s - - -
zheng gx, terry jm, et al. massively parallel digital transcriptional profiling of single cells. nat commun. jan ; : . doi: . /ncomms .
aran d et al., ( ).
“reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” nat. immunol., , - .
novershtern n. et al., densely interconnected transcriptional circuits control cell states in human hematopoiesis. cell. jan ; ( ): - .
stuart t, butler a, hoffman p, hafemeister c, papalexi e, mauck wm rd, hao y, stoeckius m, smibert p, satija r. comprehensive integration of single-cell data. cell. jun ; ( ): - .e . doi: . /j.cell. . . . epub jun .
siu, kcm lee, mck lo, sv stassen, m wang, izq zhang, hkh so. deep-learning-assisted biophysical imaging cytometry at massive throughput delineates cell population heterogeneity. lab on a chip ( ), -
kcm lee, m wang, kse cheah, gcf chan, hkh so, kky wong, kk tsia. quantitative phase imaging flow cytometry for ultra‐large‐scale single‐cell biophysical phenotyping. cytometry part a ( ), -
wenwei yan, jianglai wu, kenneth k. y. wong, kevin k. tsia, a high‐throughput all‐optical laser‐scanning imaging flow cytometer with biomolecular specificity and subcellular resolution, j. biophotonics ( ) https://onlinelibrary.wiley.com/doi/abs/ . /jbio.
f. chung and s.-t. yau, discrete green’s functions. journal of combinatorial theory, series a, ( - ) ( ), pp. – .
van den berge, k., roux de bézieux, h., street, k. et al. trajectory-based differential expression analysis for single-cell sequencing data. nat communications .
yury a. malkov, d. yashunin.
efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. ieee transactions on pattern analysis and machine intelligence.

neuronmotif: deciphering transcriptional cis-regulatory codes from deep neural networks

zheng wei, kui hua, lei wei, shining ma, rui jiang, yanda li, wing hung wong, xiaowo wang*

ministry of education key laboratory of bioinformatics; center for synthetic and systems biology; beijing national research center for information science and technology; department of automation, tsinghua university, beijing, , china
department of statistics, department of biomedical data science, stanford university, stanford, ca , usa

abstract
discovering dna regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. such complicated motif grammars are difficult to summarize from shallow models. although deep convolutional neural networks (dcnns) have achieved great success in annotating cis-regulatory elements, few combinatorial motif grammars have been accurately interpreted due to the mixed signals in dcnns. to address this problem, we propose neuronmotif, a general backward decoupling algorithm, to reveal the homo-/hetero-typic motif combinations and arrangements embedded in convolutional neurons. we applied neuronmotif to several widely used dcnn models.
many uncovered motif grammars of deep convolutional neurons are supported by literature or atac-seq footprinting. we further diagnosed the sick neurons that are sensitive to adversarial noise, which can guide dcnn architecture optimization for better prediction performance and motif feature extraction. overall, neuronmotif enables decoding cis-regulatory codes from deep convolutional neurons and understanding dcnns from a novel perspective.

biorxiv preprint, this version posted february , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission.

fig. . the overview of neuronmotif and existing methods. a, a trained dcnn model can annotate genome function with the corresponding genome sequence as input. interpreting regulatory grammar from a dcnn includes discovering the motif glossary and syntax. a motif is similar to the different word-forms of a lexeme, the smallest isolatable meaningful unit. soft/hard hetero/homo-multimer motifs are organized by a motif syntax tree. b,c, max activation and saliency map methods adapted from cv. d,e, how neuronmotif decouples a layer- neuron based on the mechanism of the dcnn. d, eight sequences 𝒙, matched by two ctcf- n-ddit ::cebpa motifs with four different relative positions, are sampled by an adapted genetic algorithm. in each layer, the masked subsequences are detected by the neurons of the corresponding colors. a convolutional neuron combines the motif sequences recognized by previous layers (rectangles with black border) and fills the gap between them. the max-pooling operation aligns the recognized regions by extending their length. the chaotic signals of nucleotide bases in 𝒙 with similar function are unified layer by layer into a similar signal 𝑦 = 𝑦( ) (feature map 𝒚(𝑙); only the key components of the feature maps in layers 𝑙 = , , are shown in the figure).
𝒚( ) and 𝑦( ) (𝑦) are independent of the different motif sequences and of shift diversity. e, from layer to , the feature maps of the sequences can first be distinguished at layer . to reverse the max-pooling operation of size , k-means (k = ) is applied twice, recursively, to the feature maps 𝒚( ). the sequences 𝒙 are divided into groups, and a ppm is calculated for each group. a is the max activation in each group.

fig. . details of neuronmotif. a, in experiments like selex, the sequences (𝒛) bound by a tf are filtered and aligned for motif estimation 𝔼𝑿. to simulate this process, enumerating the valid sequences (𝒙) to estimate the motif for a neuron is not correct (given the distribution of 𝑿). the frequency/weight of the sequences should be proportional to the affinity level (given the distribution of 𝑌 in b). within the whole dcnn structure, the dcnn sub-structure of the neuron in red is equivalent to the function 𝑦 = 𝑓(𝒙). the abbreviation 𝑠.𝑡. means subject to. b, distribution of the neuron activation values (𝑦) in a. a sequence collection with a higher activation level contains more information in the sequence logo. c, the sequences are sampled during the optimization process of the seed sequence. d, two types of latent variables lead to motif mixture in the neuron model. the shifted motifs can be decoupled under the control of the shift latent variable, which determines the position and . the synonymous latent variable determines the different replaceable motifs with similar function at the same position. the example is the original motif and its reverse complementary motif. for some tfs, function is not sensitive to orientation. e, comparison of the neuron with (left column) and without (right column) synonymous mixture motifs.
under the control of the synonymous latent variable 𝑆 = 𝑆 ,𝑆 , the sequences and corresponding motifs are similar in the single model but different in the mixture model. the sequences with the max activation value in the two models are 𝒙𝟏,𝒙𝟐,𝒙𝟑 (𝑓-(𝒙𝟏) > 𝑓.(𝒙𝟐) ≈ 𝑓.(𝒙𝟑), where the subscripts of 𝑓 denote the two models). both of the models share a consensus sequence (𝒙/).

fig. . use of neuronmotif to annotate the basset model. a, four motifs of a second-layer neuron decoupled by neuronmotif (row - ). the decoupled motifs with the same size (the receptive field size of a neuron in the second layer is bp) are aligned with bp offsets. they are matched to the jaspar motif nfib using tomtom (row ). the interpretation results using the methods of kelley et al., alipanahi et al. and saliency map are shown in rows - . b, motifs of a third-layer neuron decoupled by neuronmotif (row - ). the decoupled motifs with the same size (the receptive field size of neurons in the third layer is bp) are aligned with bp offsets. they are matched to the jaspar motifs cebpb, ctcf and ddit ::cebpa using tomtom (row ). the interpretation results using the methods of kelley et al., alipanahi et al. and saliency map are shown in rows - . c, the neurons in the second layer learn the reverse complementary motifs. they represent the motifs of aac triplet repeats (row - ) and gtt triplet repeats (row - ) respectively, which were decoupled by neuronmotif.

fig. . neuronmotif diagnoses model defects and guides dcnn architecture design for better performance.
a, dead kernel definition. the activation value distribution of a dead neuron (pink) is entirely negative, so it is filtered by the relu activation function. the output of a dead neuron is zero, and the downstream neuron output does not depend on this neuron. b,c, diagnosis of motif mixture in the decoupled motifs from the basset, bd- , deepsea and dd- models. each point is a decoupled motif generated from a sample set of sequences. the points of motifs generated from a sample set with fewer than sequences are marked in red; otherwise, they are marked in blue. the distribution of the max activation value is used to show whether the relative max activation values of most of the motifs are too low. b, diagnosis of models trained on the deepsea dataset. c, diagnosis of models trained on the basset dataset. only the max activation values of the decoupled motifs in fig. b are significantly higher than those of the decoupled motifs of other neurons in layer of the basset- model. d, the meaning of each region in the sub-plots of b,c. e, schematic of the receptive field coupling of previous-layer neurons in the neuron sub-structure. f,g, auprc is used as an indicator to compare the prediction performance of models. for each model pair, a one-sided t-test of Δ𝐴𝑈𝑃𝑅𝐶, the difference in auprc between the two models of the pair, is used to assess the level of difference in model performance. f, comparison between the deepsea and dd- models. g, comparison between the basset and bd- models.

fig. .
motif discovery performance for different layers in different models. a, accuracy analysis for the discovered motifs of different models. three columns of box plots describe the similarity between neuron motifs and jaspar motifs in the basset, bd- and bd- models respectively. for each selected layer in the model, the -log (q-value) distribution of the top jaspar motifs matched to neurons (q-value < . ) is shown with box and jittered points. the color of the box indicates the interpretation method applied. the first row shows the result for the input (first) layer of the basset model and for shallow layers in the bd- and bd- models with receptive field sizes ( bp) similar to that of the basset input layer neuron ( bp). the second row shows the result for the convolutional output layer (the convolutional layer in front of the dense layer) of the three models. b, the number of motifs discovered (q-value < . ) from the neurons in the convolutional output layer of the basset, bd- and bd- models. c, the number of motifs discovered (q-value < . ) from the neurons in layer of the basset model using different interpretation methods, including kelley et al., alipanahi et al. and neuronmotif. d, discovered motifs from the neurons of the top convolutional layer in the bd- model (q-value < . ). these motifs can be matched to the jaspar database. only the one with the smallest q-value for each jaspar motif is shown.

fig. . verification of neuron motifs. a, motif syntax represented by a neuron of the top convolutional layer (layer ) in the dd- model applied to deepsea data. the neuron motifs ( bp) are matched to the tfs ctcf and ddit ::cebpa. ctcf-ddit ::cebpa is a hard hetero-trimer. the distance between two ctcf-ddit ::cebpa trimers is flexible.
b, motif syntax represented by another neuron of the top convolutional layer (layer ) in the dd- model applied to deepsea data. the neuron motifs ( bp) are matched to the tf nfix. c, atac-seq footprinting from five different cell types ( bp upstream and downstream of the motif-matched midpoint are shown) for the motifs in a. cut-site counts at each position are normalized by the total cut-site counts within the bp window. d, similar to c, the footprinting of the neuron motif in b. e, ctcf-ddit ::cebpa motif-matched count for each relative motif midpoint position. the soft homodimer relations of the ctcf-ddit ::cebpa heterotrimer are shown at the bottom. f, nfix motif-matched count for each relative motif midpoint position. the soft homotrimer relation of nfix is shown at the bottom.
introduction
the dna sequence is a language of life . to understand life processes, it is essential to decode the grammar of dna. one of the most important problems is deciphering the transcriptional cis-regulatory code from functional dna sequences. deep sequencing techniques such as chip-seq, atac-seq , etc. have been developed to discover sequences with specific functions or characteristics, such as transcription factor binding sites (tfbss), histone marks (hms) and chromatin openness. but the logic of these sequences is difficult to summarize directly. with the development of deep learning techniques, a growing number of researchers have turned to the deep convolutional neural network (dcnn) for its significant advantages, including automatic extraction of sequence motifs (fig. d) and higher prediction accuracy .
for example, the deepsea and basset models successfully use dna sequence to predict chromatin-profiling data, including tfbss, hm profiles and dnase i sensitivity. among these common functions, in cis-regulatory modules, transcription factors (tfs) regulate gene expression by binding or co-binding to specific preferred dna sequences that occur at particular genome positions . accurately characterizing tf binding specificities and interpreting the relative positions of tfs from a dcnn are vital to understanding the logic of gene regulation (fig. a). unfortunately, a dcnn is a black box: it is difficult to interpret exactly what motif glossary, or even motif grammar, it has learned. interpretation of the dcnn black box has not progressed as smoothly as function annotation. most existing methods , , , seek to interpret a dcnn by detecting the correlation between the predicted genome function (the model output) and the dna sequence at single-nucleotide resolution (the input), via different approaches adapted from computer vision (cv) (fig. b, fig. c and fig s , see supplementary information for details). however, from the viewpoint of linguistics, the letters of nucleotide bases have no actual meaning unless they are combined into words, i.e. motif sequences . thus, interpreting a single nucleotide base while ignoring its context dependence yields polysemous or even meaningless results, and averaging polysemous interpretations produces a confusing mixture. due to the lack of interpretation methods, the design of deeper dcnn structures with better prediction performance is limited. unlike the ever-deepening dcnns applied in cv, such as the -convolutional-layer vgg- and the -convolutional-layer resnet , most dcnn models for studying genome functions contain up to convolutional layers to guarantee clear interpretations , . the interpretation of the first layer in a shallow -convolutional-layer dcnn is more reliable with existing interpretation methods.
these shallow models avoid the serious motif mixing problems of deeper layers and the motif fragmentation that occurs in the first layer of deeper dcnns . in these shallow models, however, the kernel size in the first layer has to be large enough to learn a single complete motif , and deeper dcnns show better performance in genomics , . hence, performance and interpretability seem to be a trade-off determined, to a large extent, by the dcnn architecture. here, we propose neuronmotif to decipher transcriptional cis-regulatory grammar from dcnns (fig. d,e). this algorithm treats the sequences recognized by an artificial neuron (an) as a mixture model depending on latent variables. working backward from the output of an an to its input, it automatically discovers the latent variables that reflect the neural network structure, and uses them to decouple the an mixture model and extract motif grammar. we applied neuronmotif to several existing shallow dcnns (deepsea and basset ). a large portion of the uncovered motifs and the syntaxes of their combinations are supported by the literature or by atac-seq profiles, and neuronmotif outperforms existing state-of-the-art methods. the results of neuronmotif reveal the origin of adversarial noise in the model, which can be used to guide the design of dcnn architectures that suppress noise. with the help of neuronmotif, we further built and interpreted deeper -convolutional-layer dcnns.
results
the neuronmotif algorithm for uncovering motifs and decoupling motif mixtures from dcnn models
tfs are proteins that can recognize and bind to specific dna sequences. the preferred sequences bound by a given tf are usually summarized as a motif. a motif is a model that typically refers to a position weight matrix (pwm), which can be converted from a position probability matrix (ppm) . at each base position in the ppm, the four scores represent the probabilities of the four bases occurring at that relative position of the tfbs. these probabilities can be estimated by collecting the dna sequences bound by the tf through experiments such as systematic evolution of ligands by exponential enrichment (selex) (fig. a). this process is similar to sampling tfbs sequences (𝒙) of length 𝑁 bases from a 4 × 𝑁 random variable matrix $\mathbf{X} \sim p_{\mathrm{PPM}_{4 \times N}}(\mathbf{x})$ and estimating the ppm (𝔼𝑿) by the element-wise average $\bar{\mathbf{x}}$ (see methods). here, 𝒙 is the 4 × 𝑁 one-hot code of the sequence, and each column of 𝑿 follows a different, independent categorical distribution. the sampling process in the experiment reflects tf binding affinities to sequences: sequences with stronger affinities may occur at higher frequency. inspired by selex screening of tf-preferred sequences, we attempted to imitate this process by sampling an-preferred sequences to study an an. the sub-structure of an an processes the sequence input (𝒙) with a non-linear function 𝑦 = 𝑓(𝒙) and outputs an activation value (𝑦) (fig. a and fig. s a). this process is quite similar to selex screening, because sequences (𝒙) with higher activation 𝑦 are preferred by the an and have a stronger effect on downstream ans and the final prediction, which reflects sequence affinity. hence, the input random variable matrix $\mathbf{X} \sim p_{\mathrm{PPM}_{4 \times N}}(\mathbf{x})$ depends on the output random variable 𝑌 ∼ 𝑝(𝑦) through 𝑌 = 𝑓(𝑿). to obtain a ppm that reflects binding affinity rather than binding probability, we adopt a linear function as the distribution 𝑝(𝑦) (fig. b, see methods for a detailed explanation).
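as an illustration of this estimate, the following python sketch bins sampled sequences by activation value, averages the one-hot codes in each bin, and combines the per-bin ppms with weights proportional to the mean activation of each bin, which corresponds to a linear 𝑝(𝑦). it is a minimal sketch and not the neuronmotif implementation: the function names, the default number of bins and the toy input are assumptions made for illustration only.

```python
import math
from collections import defaultdict

BASES = "ACGT"  # row order of the 4 x N matrices


def one_hot(seq):
    """Return the 4 x N one-hot code of a DNA string."""
    return [[1.0 if ch == base else 0.0 for ch in seq] for base in BASES]


def weighted_ppm(seqs, activations, m=10):
    """Activation-weighted PPM estimate (hypothetical helper).

    seqs        -- equal-length DNA strings sampled for one neuron
    activations -- activation value f(x) of each sequence
    m           -- number of activation bins (assumed default)
    """
    n = len(seqs[0])
    # Only sequences with positive activation are valid samples.
    valid = [(s, y) for s, y in zip(seqs, activations) if y > 0]
    a_max = max(y for _, y in valid)

    # Bin i (1..m) holds sequences with (i-1)/m * A < f(x) <= i/m * A.
    bins = defaultdict(list)
    for s, y in valid:
        i = min(m, max(1, math.ceil(y / a_max * m)))
        bins[i].append((s, y))

    # Mean activation and mean one-hot code (per-bin PPM) of each bin.
    mean_act, bin_ppm = {}, {}
    for i, members in bins.items():
        mean_act[i] = sum(y for _, y in members) / len(members)
        acc = [[0.0] * n for _ in BASES]
        for s, _ in members:
            for r, row in enumerate(one_hot(s)):
                for j, v in enumerate(row):
                    acc[r][j] += v
        bin_ppm[i] = [[v / len(members) for v in row] for row in acc]

    # Bin weight is proportional to its mean activation (linear p(y)).
    total = sum(mean_act.values())
    ppm = [[0.0] * n for _ in BASES]
    for i, mat in bin_ppm.items():
        w = mean_act[i] / total
        for r in range(4):
            for j in range(n):
                ppm[r][j] += w * mat[r][j]
    return ppm


if __name__ == "__main__":
    toy = weighted_ppm(["ACGT", "ACGT", "TCGT"], [2.0, 2.0, 1.0], m=2)
    for base, row in zip(BASES, toy):
        print(base, [round(v, 3) for v in row])
```

with the toy input, the first column assigns a the weight 2/3 and t the weight 1/3, because the bin holding the sequences that start with a has twice the mean activation of the bin holding the sequence that starts with t.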
in other words, the sampling weight or frequency of each unique sequence (𝒙) should be positively proportional to its activation value (𝑦) (fig. a and b). this can be implemented by sampling 𝑿 at the same level of 𝑦 to estimate 𝔼(𝑿|𝑌 = 𝑦) (bottom of fig. b) and then taking the weighted average of these estimates (𝔼𝑿 = 𝔼[𝔼(𝑿|𝑌)]) to estimate the ppm (fig. b, see methods for details). this method improves on previous studies in representing tf binding affinity to dna sequences. adapted back-propagation (bp) methods such as saliency map and deeplift do not model the sequence preference with 𝑿, and their importance scores (e.g. 𝜕𝑦/𝜕𝒙) do not directly reflect a ppm or pwm (fig. c). adapted max activation methods, such as those developed by kelley et al. and alipanahi et al. for interpreting the basset and deepbind models, try to follow the ppm model, but they estimate 𝔼𝑿 by $\bar{\mathbf{x}}$ over $\{\mathbf{x} \mid f(\mathbf{x}) > 0\}$ or $\{\mathbf{x} \mid f(\mathbf{x}) > y_{\max}/2\}$ without depending on the level of 𝑌, and thus can hardly reflect the activation preference (fig. b). here, we assume that tfbss are located at the same relative position in the input sequences without shifting; we can then define the motif, or ppm, for the sequences recognized by an an as 𝔼𝑿 given the distribution of 𝑌. we call it the an motif, or the ppm of the an. however, we found that, due to the max-pooling operation in dcnns, tfbss may be located at different relative positions in the input sequences and still activate the an. in the max-pooling layer, the key input feature maps reflecting the shifting diversity of tfbss are unified into similar output feature maps (fig. d, e and s b). the downstream key ans, including the output an, therefore share similar activation values (𝑦) for different sequences with shifted tfbss.
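the unifying effect of max-pooling can be seen in a toy example. the python sketch below uses an exact-match detector as a stand-in for a trained first-layer neuron (an assumption made purely for illustration; the motif, the sequences and the function names are hypothetical) and shows that sequences whose motif is shifted within one pooling window yield different convolutional feature maps but the same pooled feature map.

```python
def conv_relu(seq, motif):
    """Toy first-layer neuron: 1.0 where the motif matches exactly, else 0.0."""
    width = len(motif)
    return [1.0 if seq[j:j + width] == motif else 0.0
            for j in range(len(seq) - width + 1)]


def max_pool(values, size):
    """Non-overlapping max-pooling with stride equal to the pooling size."""
    return [max(values[j:j + size])
            for j in range(0, len(values) - size + 1, size)]


def shift_signature(seq, motif, pool_size):
    """Return (offset of the first match inside its pooling window, pooled map)."""
    conv = conv_relu(seq, motif)
    offset = conv.index(1.0) % pool_size if 1.0 in conv else None
    return offset, max_pool(conv, pool_size)


if __name__ == "__main__":
    for seq in ["GATACCCCC", "CGATACCCC", "CCGATACCC"]:
        # The offsets differ, but the pooled maps are identical.
        print(seq, conv_relu(seq, "GATA"), shift_signature(seq, "GATA", 3))
```

the offset returned by `shift_signature` plays the role of a shift latent variable in this toy setting: it is visible in the input of the pooling layer but is no longer recoverable from its output.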
hence, the motif sequences of tfbss recognized by an an can be regarded as a latent variable mixture model. to decouple motif mixtures, we have to find shift latent variables that reflect the different positions of the motif sequence (the top part of fig. d). only by controlling these latent variables can we obtain consistent, real sequence motifs (𝔼(𝑿|𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛)). this key issue is neglected by all existing methods (fig. b, fig. c and fig. s ). we further found that the tfbss in the sequences may not share the same pattern . this indicates that we can find more than one motif by separately stacking groups of tfbss with a consistent pattern . one such case is reverse complementary sequences (bottom part of fig. d). the mixing of these sequences can be controlled by another important type of latent variable in the mixture model, which we name synonymous latent variables, and we call the decoupled motifs synonymous motifs. the synonymous motifs represented by an an should satisfy: (1) they are not shifted motifs; (2) all or part of the input variables 𝑿 are conditionally independent when the synonymous latent variables are controlled; (3) the sampled sequences grouped by synonymous motif should share similar maximum activation values, so that they are all preferred by the an. if these conditionally independent positions have little effect on activation values, then the an can be regarded as a single model (sm, $y = f_s(\mathbf{x})$, the left column of fig. e). otherwise, the motif sequences recognized by the an form a mixture model (mm, $y = f_m(\mathbf{x})$, the right column of fig. e). sm and mm share a similar motif (the bottom part of fig. e), but the sequences of maximum activation value (sm: $\mathbf{x}_0$; mm: $\mathbf{x}_1$, $\mathbf{x}_2$) and the consensus sequence ($\mathbf{x}_c$) reveal their difference. whereas $f_s(\mathbf{x}_c) \approx f_s(\mathbf{x}_0)$ in the sm, $f_m(\mathbf{x}_c)$ in the mm usually deviates strongly from $f_m(\mathbf{x}_1)$ and $f_m(\mathbf{x}_2)$, and can even be negative (the top-right part of fig. e).
this is because $\mathbf{x}_c$ may not match any of the conditionally dependent motifs embedded in the an (the bottom-right of fig. e). thus, base flipping at the conditionally dependent positions of the sequence is a kind of adversarial noise, as discussed in cv , that can dramatically change the an activation level or even destroy the prediction result. in addition, we proved that severe mixing of synonymous motifs in an an correlates with lower maximum activation and weights of the an (see methods). these results suggest that, because of its vulnerability, a mixture of synonymous motifs seems to be noise rather than motif signal. hence, for a well-trained model with weak noise, we only need to decouple the signal of each an according to the max-pooling structure. one of the most widely used types of dcnn model is composed of general convolutional layers and max-pooling layers. we took this type of dcnn as an example and developed the neuronmotif algorithm to uncover motif combinatorial grammar from it. first, we designed a sampling algorithm adapted from the genetic algorithm to optimize seed samples, and recorded the intermediate valid sequences as the sampling result (fig c, see methods for details). second, we used k-means (k is the pooling size) to decouple the mixture signal from different sequences: clustering the shifted, similar sub-patterns in the input feature map of the max-pooling layer splits the sequence set (fig d,e and fig. s b). the decoupling process can be performed backward and recursively from the deepest layer to the first layer.
third, the algorithm can annotate an an with motifs by estimating 𝔼[𝔼(𝑿|𝑌,𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑠𝑢𝑏𝑠𝑒𝑡)] from each subset of samples clustered by k-means (fig e, see methods for details). the steps above can only decouple the mixture of a single hard an motif with shifting diversity. a hard motif refers to a motif, or combination of motifs, with a fixed gap; it characterizes homodimer, heterodimer or multimer tfs that can be considered as a stable molecular cluster binding to dna. however, in a large portion of tf co-binding events the gap is a flexible interval. the corresponding sequence pattern is a soft motif, which is composed of more than one hard motif, with the space between any two adjacent hard motifs falling within a certain range. to decouple the hard motifs in a soft motif represented by an an, users should run the decoupling algorithm in neuronmotif (the second step) iteratively, with the number of iterations based on the number of hard motifs (fig e, see methods for details).
neuronmotif successfully decouples the motif mixture
to evaluate the performance of neuronmotif in decoupling the motif mixture signal, we applied it to annotate two well-known models, deepsea and basset , both of which are dna-sequence-based dcnn models with general convolutional layers for genome function annotation. basset annotates open chromatin regions and is trained on dnase-seq data. in addition to chromatin accessibility, deepsea also annotates tfbss and hms and is trained on chip-seq data. neuronmotif successfully decoupled the shifted mixture motifs from layer (l ) and layer (l ) of both models (see supplementary information for all results). in the basset model, the first- and second-layer pooling sizes are and , so the numbers of shifted signals are and × = for l and l ans, respectively (fig. a-c). in the deepsea model, the first- and second-layer pooling sizes are both , so the numbers of shifted signals are and × = for l and l ans, respectively. in fig , all adjacent an motifs are shifted by bp and are highly consistent. in contrast, state-of-the-art methods such as kelley et al. , alipanahi et al. and saliency map cannot deal with the mixture signal, which leads to much lower information content and very noisy signals. the neuronmotif-annotated results of the basset and deepsea models show that ans extract various kinds of motifs. here, we take basset as an example. some ans extract important tf motifs correlated with basset’s prediction targets (dnase i sensitivity). many an motifs can be matched to known motifs in the jaspar database (fig. a and b). some matched tfs, such as nfi and the co-binding tfs ctcf-cebp, are highly correlated with chromatin openness , . in comparison, the interpretation results of existing methods can hardly be matched to any known motifs in jaspar. statistically, neuronmotif found more motifs, and more accurate motifs, from the jaspar database (fig a and c). in addition, some important functional sequence features and their reverse complements can also be identified from an motifs. a typical example is the aac triplet repeat feature extracted by the basset model (fig. c). repeated aac triplets have been reported to be enriched in introns . as intron regions are usually open for gene transcription, it is reasonable that the basset model extracts this feature.
dcnn diagnosis and architecture design guidance
from the neuronmotif results for the deepsea and basset models, we found that the outputs of some ans were always zero no matter how we changed the input sequences. we call them dead ans (fig a). dead ans are redundant because they cannot affect the downstream network.
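for a first-layer neuron, whose pre-activation is linear in the one-hot input, deadness can also be checked exactly without sampling. the python sketch below is an assumption-laden illustration and not the procedure used in this work (which relies on the sampling algorithm): it computes the largest reachable pre-activation by picking the best base at every position and declares the neuron dead when even that value is not positive. the function names and toy kernels are hypothetical, and the check does not apply to deeper layers, where the input is not a free one-hot sequence.

```python
BASES = "ACGT"  # row order of the 4 x N kernel


def max_preactivation(kernel, bias):
    """Largest value of k.x + b over all one-hot sequences of length N.

    Each position contributes exactly one weight, so the maximum is reached
    by choosing the largest weight in every column.
    """
    n = len(kernel[0])
    return bias + sum(max(kernel[r][j] for r in range(4)) for j in range(n))


def best_sequence(kernel):
    """Sequence that attains the maximum pre-activation."""
    n = len(kernel[0])
    return "".join(BASES[max(range(4), key=lambda r: kernel[r][j])]
                   for j in range(n))


def is_dead(kernel, bias):
    """True if ReLU(k.x + b) is zero for every possible input sequence."""
    return max_preactivation(kernel, bias) <= 0.0


if __name__ == "__main__":
    dead = [[-1.0, -2.0], [-0.5, -1.0], [-3.0, -1.5], [-2.0, -4.0]]
    live = [[1.0, -2.0], [-0.5, 0.5], [-3.0, -1.5], [-2.0, -4.0]]
    print(is_dead(dead, -0.1), best_sequence(dead))
    print(is_dead(live, -0.2), best_sequence(live))
```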
during the sampling process of annotating the deepsea model with neuronmotif, we found that the sampling algorithm could not sample even one sequence that activates some ans in l and l . for example, a total of and ans in l and l are dead ans in the deepsea model. another problem is that some ans may recognize synonymous motifs. we diagnosed this problem with two indicators, based on the phenomenon that an an may represent synonymous motifs. one indicator is the activation value of the motif consensus sequence, and the other is the maximum activation value of the sequences sampled for motif estimation. we found that if the two indicators deviate severely from each other, or the max activation value is close to zero, then the corresponding an may suffer from the synonymous motifs problem. the two indicators of basset l ans are almost consistent (the first column of fig. c). however, in l , the activation values of many decoupled motifs’ consensus sequences are negative and deviate severely from the maximum activation value (the second column of fig. c). most problematic motifs are caused by stacking sequences of different synonymous motifs whose maximum activation values are close to zero (see methods for details and supplementary information for a case). to overcome these problems, the dcnn architecture should be optimized to avoid the mixed signals of synonymous motifs. both deepsea and basset use large convolutional kernel sizes (> ) and large pooling sizes ( or ) in each layer. for an an with a given receptive field, implementing its sub-structure with larger kernel and pooling sizes tends to cause weaker coupling among the sub-structures of the previous-layer ans (fig. e).
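to make the relation between kernel size, pooling size and receptive field concrete, the python sketch below computes, for a stack of convolution plus max-pooling blocks, the receptive field of the neurons in each convolutional layer and the number of shifted signals that the preceding pooling layers can merge (the product of their pooling sizes). it assumes stride-1 convolutions and a pooling stride equal to the pooling size; the function name and the block sizes in the example are illustrative assumptions and are not the settings of deepsea, basset or the models built in this work.

```python
def layer_geometry(blocks):
    """Receptive field and shift count for each convolutional layer.

    blocks -- list of (kernel_size, pool_size), one pair per conv + max-pool
              block, ordered from the input to the deepest layer
    Returns one (receptive_field_bp, shifted_signals) pair per conv layer.
    """
    rf, jump, shifts = 1, 1, 1
    out = []
    for kernel, pool in blocks:
        rf += (kernel - 1) * jump   # the convolution widens the field
        out.append((rf, shifts))    # geometry of this layer's neurons
        rf += (pool - 1) * jump     # the pooling window widens it further
        jump *= pool                # neighbouring outputs are now farther apart
        shifts *= pool              # offsets merged by all pooling so far
    return out


if __name__ == "__main__":
    # Hypothetical three-block network: (kernel size, pooling size).
    for layer, (rf, shifts) in enumerate(layer_geometry([(5, 2), (3, 3), (3, 2)]), 1):
        print(f"layer {layer}: receptive field {rf} bp, {shifts} shifted signal(s)")
```

such a helper can be used to compare candidate architectures that reach a similar receptive field with different combinations of kernel and pooling sizes.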
when the same training set and optimization method are used to train the model, we found that the weaker the coupling among the ans, the more sensitive the model is to noise generated by synonymous motifs (see methods and fig. s ). deepsea adopted strong regularization methods that successfully suppress the learning of this noise (fig. b), but at the cost of producing dead ans. in the field of cv, building deeper networks with smaller kernels and pooling structures has been found to be a more robust strategy with better performance . thus, we built new -convolution-layer models and trained them on the basset dataset (bd- ) and the deepsea dataset (dd- ), respectively. the synonymous motif problem was significantly suppressed in bd- and dd- (the third column of fig. b and fig. c), and few dead kernels were found. furthermore, both bd- and dd- show much better prediction performance than the original models (fig. f and fig. g). these results demonstrate how neuronmotif can be used to help diagnose a dcnn and guide architecture design.
accuracy and completeness of motif discovery in different layers of dcnn
to study which layer of a dcnn model is better for motif discovery, we used neuronmotif to interpret the shallow convolutional layers with a receptive field of around bp and the deepest convolutional layers in three models with , and convolutional layers (basset, bd- and bd- ) trained on the same data as in the basset paper. to measure interpretation performance, we matched the decoupled motifs to the motifs in the jaspar database using tomtom . for each an matched to known motifs (q-value < . ), we selected the best-matched motif in jaspar and took the similarity measurement between the discovered motif and the jaspar motif (q-value) as the performance of the an. as the number of ans differs between layers, we selected only the q-values of the top ans for further analysis.
given a dcnn model, we found that the motifs discovered from the deepest convolutional layer are better than those from the shallow layers with a receptive field of around bp (each column in fig a). we further compared the layers with similar receptive fields (the first row in fig a) and the deepest convolutional layers (the second row in fig a) among different models. the deeper models (bd- and bd- ) outperform the basset model: they discover many more known motifs (fig. b), and their motifs match the jaspar database better (each row in fig a). based on these comparisons, we recommend using deep ans for motif discovery and representation. for example, we built a motif dictionary from layer (l ) of the bd- model. this dictionary contains motifs. among them, motifs are matched to at least one jaspar motif by tomtom (q-value < . ; one of the best-discovered motifs matched to each jaspar motif is shown in fig d), and the remaining motifs are novel.
neuronmotif successfully uncovers motif grammar
in previous works , , motif combination grammar is usually represented by hard motifs, and soft motifs are depicted by enumerating different intervals among the component hard motifs (fig a and b). in comparison, the dcnn structure is more powerful for describing soft motifs when the receptive field is long enough. here, we take l ans in dd- as examples to study an soft motifs. we assume that the l ans represent no more than two hard motifs, so we ran the decoupling algorithm in neuronmotif twice, generating a total of an motifs (fig. a and b, see methods for details). these an motifs enumerate the combinations of hard motifs with various gap sizes.
from these an motifs, we can slice out the shared hard motifs to build a motif dictionary. based on the dictionary and all an motifs, we can summarize the interval range between any two adjacent hard motifs and build the syntax tree (fig. a and b). some of the soft motifs are supported by the literature. for example, an an in deepsea represents the soft ctcf homodimer with an interval of around bp, which plays important roles in the transcriptional processes of cancer and germ cell development (fig. a). we also found that ddit ::cebpa can co-bind with ctcf, which has not been reported in the previous literature. interestingly, ctcf-ddit ::cebpa is shown to be a conserved hard trimeric motif that also occurs in the basset model (fig. b), which supports the reliability of this discovery. we further used atac-seq footprinting to validate the discovered an motif grammars. atac-seq uses the tn transposase to cut dna into fragments. if tfs or other molecules are bound to the dna, the cutting frequency is affected. for each an, we aligned the corresponding tn transposase cutting frequencies of the top sequences ( bp) with the maximum an activation values in the test dataset. we extended the footprinting region to bp in total. most ans have their own footprints generated from atac-seq data of five cell types or tissues (fig. c and fig. d, see supplementary information for other ans). the soft ctcf-ddit ::cebpa homodimer footprints from the five cell types or tissues share a pattern of three peaks and two valleys (fig. c), but the soft nfi homodimer footprinting signals are significant only in prostate tissue and lncap cell lines, which supports the notion that the nfi family can regulate prostate-specific gene expression (fig. d).
the results indicate that some motif grammars of multimers are cell-type specific. to further confirm that the footprints are caused by binding of the specific tfs, we calculated the distribution of motif-matched positions for both the ctcf–ddit ::cebpa and nfic motifs (fig. e and fig. f). the peaks of the motif-matched positions are consistent with the footprinting valleys. together, these results suggest that neuronmotif provides a novel way to discover soft multimer motif grammar on the genome and to better depict multimeric tf motifs.
discussion
in summary, we presented neuronmotif as an effective algorithm for revealing the cis-regulatory motif grammar learned by dcnn models that use dna sequence to annotate genome function. we proposed a statistical form of an motif representation and a latent variable mixture model to understand each convolutional neuron. taking the max-pooling-convolutional structure as an example, we uncovered the signal mixing mechanism, including shifting latent variables and synonymous latent variables. neuronmotif uses a k-means-based algorithm to decouple the latent variable mixture and a sampling strategy adapted from the genetic algorithm for motif estimation. we evaluated the interpretation performance of neuronmotif on deepsea, basset and several in-house deeper models. many of the uncovered motif combinatorial grammars are supported by the literature and atac-seq data. finally, we showed that neuronmotif results can be used for model diagnosis and to guide model structure design for better prediction performance and motif extraction. beyond interpreting cis-regulatory motif grammar from dcnns, the application of neuronmotif may be extended to many other problems. dna sequence is a special kind of one-dimensional discrete data with four elements. it is possible to apply neuronmotif to dcnns for amino acid sequences of proteins, or for continuous sequences such as different kinds of sequencing profiles.
there are still some issues to be addressed to further expand the application of neuronmotif. for instance, neuronmotif focuses only on the max-pooling-cnn structure. many new dcnn structures, such as resnet and densenet , have been put forward in recent years. as these structures show better performance in cv, it would be valuable to adapt the neuronmotif method to these more general and complex dcnn structures in genomics studies. in the future, we envision that dcnn models interpreted by neuronmotif will advance our ability to discover and summarize complicated regulatory rules, model the transcriptional cis-regulatory process and understand the dcnn black box itself.
acknowledgements
we thank z. duren and h. fang for valuable suggestions on motif discovery and related biological issues. this work was supported by the national natural science foundation of china (no. , ), and the national key r&d program of china (no. yfa ).
competing interests
tsinghua university has a patent pending for neuronmotif.
author contributions
z.w., w.h.w. and x.w. conceived the main idea of the study. z.w. completed the theorem proofs and formula derivations. k.h. repeated and checked the proofs and inferences. z.w. developed the algorithm, trained the dcnn models, designed experiments and implemented all the experiments. r.j. provided and maintained the computing cluster. w.h.w. and x.w. designed some experiments and supervised the study. all authors wrote and revised the manuscript.
methods
statistical definition and estimation of ppm represented by a convolutional neuron
an an has its own sub-structure in the dcnn model (fig. a and s a). the sub-structure includes an input 𝒙 and an output 𝑦. the relation between 𝒙 and 𝑦 is defined by a non-linear function 𝑦 = 𝑓(𝒙), which depends on the sub-structure of the an, because all upstream ans in the sub-structure can affect the characteristics of the an. the input of each an is the output of the ans in the previous layer. the sub-structure also determines the receptive field size 𝑁 (the length of 𝒙). for each valid dna input sequence (𝒙 s.t. 𝑓(𝒙) > 0), 𝒙 is a 4 × 𝑁 matrix of one-hot code. it can be sampled from a random variable matrix 𝑿 ∼ 𝑝(𝒙). 𝑿 contains 4 × 𝑁 random variables $X_{b,j}$ ($b$ = a, c, g, t; $j = 1, 2, \ldots, N$). each column can be modeled as an independent multinomial distribution ($\mathbf{X}_{\cdot,j} \sim \mathrm{Multi}(1, \boldsymbol{\pi}_{\cdot,j})$), where the 4 × 𝑁 probability matrix 𝝅 is the ppm that characterizes the preference for each nucleotide base in the sequence motif. by the nature of the multinomial distribution, the parameter $\boldsymbol{\pi}_{\cdot,j} = \mathbb{E}\mathbf{X}_{\cdot,j}$, so the ppm can be estimated by sampling $\mathbf{X}_{\cdot,j}$ and calculating the element-wise average $\bar{\mathbf{x}}_{\cdot,j}$. as the unknown distribution 𝑝(𝒙) is what we want to estimate, we cannot sample 𝑿 directly. we know that 𝑿 is not a free random variable, but depends on the free output random variable 𝑌 ∼ 𝑝(𝑦) through 𝑌 = 𝑓(𝑿). based on the identity 𝔼𝑿 = 𝔼[𝔼(𝑿|𝑌)], we can first sample 𝑿|𝑌 = 𝑦 to estimate 𝔼(𝑿|𝑌 = 𝑦), which represents the ppm for a specific activation value or affinity (𝑦). given an arbitrary distribution 𝑝(𝑦), we can obtain the ppm by taking a weighted average of these ppms with different affinities:
$$
\begin{aligned}
\mathbb{E}(\mathbf{X}) &= \mathbb{E}_{Y}\left[\mathbb{E}_{\mathbf{X}}(\mathbf{X} \mid Y)\right] = \int_{0}^{A} \mathbb{E}[\mathbf{X} \mid Y = y]\, p(y)\, dy \\
&= \lim_{m \to \infty} \sum_{i=1}^{m} \mathbb{E}\left(\mathbf{X} \,\middle|\, Y = \tfrac{i}{m}A\right) p\left(\tfrac{i}{m}A\right) \Delta y \\
&= \lim_{m \to \infty} \sum_{i=1}^{m} \mathbb{E}\left[\mathbf{X} \,\middle|\, Y = y,\ y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right] P\left(y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right)
\end{aligned}
$$
numerical estimation of the ppm for an an requires enough valid sequence (𝒙) samples. in this work, we set relu[𝑓(𝒙)] = max{𝑓(𝒙), 0} as the activation function of each convolutional neuron.
therefore, the valid sequence dataset is $\mathbf{X} = \{\mathbf{x} \mid f(\mathbf{x}) > 0 \wedge |\mathbf{x}| = N\}$. here, $f(\mathbf{x}) > 0$ constrains the activation value of a valid sequence to be positive so that it can activate the an, and $|\mathbf{x}| = N$ constrains the length of 𝒙 to match the an receptive field size 𝑁. for convenience, we rewrite the sequence dataset as $X = \{\mathbf{x}_i\}_{i=1}^{|X|}$ and the corresponding activation value set as $V = \{f(\mathbf{x}_i)\}_{i=1}^{|X|}$. the max activation value is $A = \max(V)$. the probability of a tf binding to a dna sequence depends on the binding affinity . to sample sequences in a way that reflects their affinity levels (𝑦), sequences with high affinity should be sampled at higher frequency. here, for ease of calculation, we set the probability density function of 𝑌 to a linear function, $p(y) = \frac{2}{A^{2}}\, y,\ y \in [0, A]$ (fig. b). in practice, we split the interval $[0, A]$ into 𝑚 (𝑚 = ) bins to merge sequences with similar activation values into ppms (fig. b). in this way, we obtain the average of the ppms weighted by activation values. for each bin 𝑖 ($i = 1, 2, \ldots, m$), the sequence index set is
$$
J_i = \left\{ j \,\middle|\, \tfrac{i-1}{m}A < f(\mathbf{x}_j) \le \tfrac{i}{m}A \ \wedge\ \mathbf{x}_j \in \mathbf{X} \right\}.
$$
sequences in bin 𝑖 share similar activation values. thus, their average activation value and ppm can be calculated by
$$
\bar{V}_{\left[\frac{i-1}{m}, \frac{i}{m}\right]} = \frac{\sum_{j \in J_i} f(\mathbf{x}_j)}{|J_i|}
$$
$$
\mathbb{E}\left[\mathbf{X} \,\middle|\, Y = y,\ y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right] \approx \mathrm{PPM}_{\left[\frac{i-1}{m}, \frac{i}{m}\right]} = \frac{\sum_{j \in J_i} \mathbf{x}_j}{|J_i|}
$$
where $|J_i|$ is the number of sequences in the sequence index set $J_i$. the probability or weight of each bin can be estimated by
$$
P\left(y \in \left(\tfrac{i-1}{m}A, \tfrac{i}{m}A\right]\right) \approx P_{\left[\frac{i-1}{m}, \frac{i}{m}\right]} = \frac{\bar{V}_{\left[\frac{i-1}{m}, \frac{i}{m}\right]}}{\sum_{i'=1}^{m} \bar{V}_{\left[\frac{i'-1}{m}, \frac{i'}{m}\right]}}
$$
finally, $\mathrm{PPM}_{4 \times N}$ of the an can be estimated by the average of the ppms weighted by the activation value.
$$\mathbb{E}(\mathbf{X}) = \mathbf{PPM}_{4 \times N} \approx \sum_{i=1}^{m} \mathbf{PPM}_{\left[\frac{i-1}{m},\frac{i}{m}\right]}\, P_{\left[\frac{i-1}{m},\frac{i}{m}\right]}$$

the estimation above assumes that the relative position of the tfbs is the same in all input sequences and that all of them share the same motif. in other words, it only works for sm neurons. however, this assumption is not suitable for most ans, especially ans in deeper layers, where the estimation result is a mixture of different motifs. the random variable matrix $\mathbf{X}$ can then be considered as a mm. hence, we first need to find the latent variables that can split the dataset $X$ into subsets $X_1, X_2, X_3, \dots$, each of which can be considered as an sm. the estimation should be applied to each subset separately. it will generate several motifs $\mathbf{PPM}_1, \mathbf{PPM}_2, \mathbf{PPM}_3, \dots$ which are controlled by different conditions of the latent variables.

discovering latent variables in a mixture model of neuron

the activation value of the an is the only indicator of how well a sequence matches. sequences that strongly activate a given an may be composed of completely different key tfbss at various relative positions, owing to the powerful representation ability of the neural network. this characteristic shows that the sequences recognized by a given an can be considered as a latent variable mixture model. the sequences matched by different sub-models in the mm are able to activate the an at the same level. hence, the activation value of the an is a unified or mixed signal that cannot distinguish sequences with different tfbss. to find the mechanism of the mixing process for a given an, we can investigate the activation values of all upstream ans (feature maps), which reflect tfbs diversity in the sequences and define the latent variables.
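the binned, activation-weighted ppm estimate described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the neuronmotif implementation: the function names `one_hot` and `estimate_ppm` are hypothetical, sequences are assumed to be upper-case strings over a, c, g, t, activations are assumed to be precomputed, and empty bins are simply skipped so that the weights are renormalized over occupied bins.

```python
import math
from typing import List, Sequence

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a 4 x N one-hot matrix (rows: A, C, G, T)."""
    return [[1.0 if ch == base else 0.0 for ch in seq] for base in BASES]


def estimate_ppm(seqs: Sequence[str], activations: Sequence[float], m: int) -> List[List[float]]:
    """Average per-bin PPMs, weighting each bin by its mean activation."""
    valid = [(s, a) for s, a in zip(seqs, activations) if a > 0]  # keep f(x) > 0
    if not valid:
        raise ValueError("no sequence activates the neuron")
    n = len(valid[0][0])
    a_max = max(a for _, a in valid)

    # Bin i (1-based) holds activations in ((i-1)/m * A, i/m * A].
    bins: List[List[str]] = [[] for _ in range(m)]
    sums = [0.0] * m
    for s, a in valid:
        i = min(m, max(1, math.ceil(a / a_max * m)))
        bins[i - 1].append(s)
        sums[i - 1] += a

    # Mean activation of each occupied bin; empty bins are skipped (assumption).
    means = {i: sums[i] / len(bins[i]) for i in range(m) if bins[i]}
    total = sum(means.values())

    ppm = [[0.0] * n for _ in BASES]
    for i, mean_act in means.items():
        weight = mean_act / total  # bin weight, proportional to mean activation
        for s in bins[i]:
            x = one_hot(s)
            for r in range(4):
                for j in range(n):
                    ppm[r][j] += weight * x[r][j] / len(bins[i])
    return ppm
```

because every per-bin ppm has columns summing to one and the bin weights sum to one, each column of the returned matrix also sums to one.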
when these latent variables are controlled, the sampled sequences share the same pattern (hard motif). the sequences are matched by only one sub-model in the mm. in this way, different sampled sequences share similar tfbss that are located at the same relative position. only with such sampled sequences can we estimate the an motif. in practice, when analyzing the feature maps of sampled sequences, we used k-means to cluster the feature maps of the convolutional layer and found shifted signals among the clusters. however, these clusters cannot be recovered from the feature map of the downstream max-pooling layer. so, the max-pooling operation unifies the shifted signals of the various sequences, which removes the difference among clusters. in other words, the an only tries to detect whether a tfbs exists in the sequence; the position of the tfbs in the sequence is not important to the final an output. subsequently, we found that the best cluster number k (the maximum shifting offset) is the same as the max-pooling size. each offset within k defines a shifting latent variable.

the side-effect for the ans representing different synonymous motifs

following the definition of synonymous motifs for a given an, if an an ($y = f(\mathbf{x})$) represents the mixture of two synonymous motifs, let $\mathbf{x}_1, \mathbf{x}_2$ be the vectors of flattened one-hot code of the maximum activation sequences for the two motifs respectively; then they should satisfy $f(\mathbf{x}_1) \approx f(\mathbf{x}_2)$, i.e. $f(\mathbf{x}_1) - f(\mathbf{x}_2) \to 0$. first, we studied the an in the first layer ($y = f(\mathbf{x}) = \mathbf{k}\mathbf{x}^T + b$). the activation values of $\mathbf{x}_1, \mathbf{x}_2$ are

$$\begin{cases} y_1 = \mathbf{k}\mathbf{x}_1^T + b \\ y_2 = \mathbf{k}\mathbf{x}_2^T + b \end{cases}$$

where $\mathbf{k}$ is the weight of the an, and $b$ is the bias or intercept. based on these two equations, we can easily obtain the following equation

$$(y_1 - y_2)^2 = |\mathbf{k}|^2 \left[ |\mathbf{x}_1|^2 - 2\mathbf{x}_1\mathbf{x}_2^T + |\mathbf{x}_2|^2 \right]$$
where both $|\mathbf{x}_1|^2$ and $|\mathbf{x}_2|^2$ are equal to the length of the sequence. if the difference between two synonymous motifs is very large, the sequences $\mathbf{x}_1, \mathbf{x}_2$ matched by these two motifs respectively should share far fewer bases ($\mathbf{x}_1\mathbf{x}_2^T \to 0$). in the extreme case, the two sequences are totally different ($\mathbf{x}_1 \perp \mathbf{x}_2$, $\mathbf{x}_1\mathbf{x}_2^T = 0$). based on the conditions above, including $(y_1 - y_2)^2 \to 0$, $\mathbf{x}_1\mathbf{x}_2^T \to 0$ and the constant value of $|\mathbf{x}_1|^2 + |\mathbf{x}_2|^2$, we can infer that $|\mathbf{k}|^2 \to 0$. thus, the maximum activation value follows $y = \mathbf{k}\mathbf{x}_1^T + b \to b$. the an becomes a dead an if $b \le 0$. so, compared with an sm an, which cannot represent two synonymous motifs, an mm an representing the mixture of synonymous motifs exhibits a lower maximum activation value and a smaller weight. the importance of this kind of an for downstream ans will be suppressed.

in a dcnn without pooling layers, we further investigated an an representing two synonymous motifs in a deeper convolutional layer $i$. we assumed that there is no an representing the mixture of synonymous motifs in layer 1 to layer $i - 1$. based on this assumption, the feature maps ($\mathbf{x}_1^{(i-1)}, \mathbf{x}_2^{(i-1)}$) of $\mathbf{x}_1, \mathbf{x}_2$ at layer $i - 1$ differ greatly, especially for the key features with high activation values. the negative values of the feature map are set to zero by the relu activation function ($\mathbf{x}_1^{(i-1)} \ge 0$, $\mathbf{x}_2^{(i-1)} \ge 0$). a key feature with high activation in $\mathbf{x}_1^{(i-1)}$ may be weakly activated or 0 in $\mathbf{x}_2^{(i-1)}$ (for key feature $j$, $\left(x_{1j}^{(i-1)} - x_{2j}^{(i-1)}\right)^2$ will be larger than the value for similar sequences). this indicates that we are able to distinguish the sequences matched to the two synonymous motifs using the feature map of layer $i - 1$. the activation of the an in layer $i$ is a linear combination of the previous layer feature map ($y = f(\mathbf{x}) = g[\mathbf{x}^{(i-1)}] = \mathbf{k}[\mathbf{x}^{(i-1)}]^T + b$). similarly, for an an in layer $i$, we can obtain the following equation

$$(y_1 - y_2)^2 = \sum_j k_j^2 \left( x_{1j}^{(i-1)} - x_{2j}^{(i-1)} \right)^2$$

where $j$ is the feature number in layer $i - 1$.
if this an mixes the signals of $\mathbf{x}_1^{(i-1)}, \mathbf{x}_2^{(i-1)}$ ($(y_1 - y_2)^2 \to 0$), the result is the same as for the first layer ($\sum_j \left( x_{1j}^{(i-1)} - x_{2j}^{(i-1)} \right)^2 \uparrow \;\Rightarrow\; |\mathbf{k}|^2 \to 0$). however, an an representing the mixture of synonymous motifs usually also represents a strong, consistent main motif (fig. e). in layer $i - 1$ of this an, for feature $s$ representing the strong consistent main motif and feature $j$ representing synonymous motifs, they may satisfy $x_{1s}^{(i-1)} \approx x_{2s}^{(i-1)} \gg x_{1j}^{(i-1)} \ne x_{2j}^{(i-1)}$ or $k_s \gg k_j$. although $|\mathbf{k}|^2$ of this kind of an is smaller than that of an an that cannot recognize synonymous motifs, it can still obtain a similar activation value $y = \mathbf{k}[\mathbf{x}^{(i-1)}]^T + b$ in two ways rather than becoming a dead kernel. one way is increasing $x_{1s}^{(i-1)}$ and $x_{2s}^{(i-1)}$ through the weight $\mathbf{k}^{(i-2)}$ of the previous layer neurons ($[\mathbf{x}^{(i-1)}]^T = \mathrm{relu}(\mathbf{k}^{(i-2)}[\mathbf{x}^{(i-2)}]^T + b)$). the other way is increasing $k_s$ and decreasing $k_j$, which may greatly reduce the side-effect of the mixture of synonymous motifs. in a well-trained model, compared with the large weight on the high activations of the subsequence matched by the main motif, the signal generated by the subsequence matched by a synonymous motif can be neglected. otherwise, representing only strong synonymous motifs will destroy the robustness of the an (fig. e).

weighted sampling algorithm adapted from the genetic algorithm

the sequence sampling process is necessary to estimate the an motifs. the first operation is the initialization of seed sequences. we randomly generated seed sequences that match the receptive field size. for each sequence, we randomly replaced a specific sub-sequence with one motif sequence of the ans in the previous layer.
the position and the previous layer an were randomly selected based on the value of the normalized maximum contribution score:

$$c_{ij} = \frac{\max\left(0, w_{ij}A_j\right)}{\sum_{i,j} \max\left(0, w_{ij}A_j\right)}$$

where $i$ is the position number, $j$ is the previous layer an number and $A_j$ is the maximum activation value of the previous layer an $j$.

the second operation is sequence optimization. the sequence ($\mathbf{x}$) is discrete, so we cannot use the gradient descent method directly; instead, we adopted and adjusted the genetic algorithm. in one generation, we used the normalized gradient value $\mathbf{g} = \partial f(\mathbf{x}) / \partial \mathbf{x}$ as the probability that guides the random selection of better mutation bases:

$$g'_{ij} = \begin{cases} g_{ij}, & g_{ij} > 0 \\ e^{g_{ij}}, & g_{ij} < 0 \end{cases} \qquad p_{ij} = \frac{g'_{ij}}{\sum g'_{ij}}$$

where $i$ is the base (a, c, g, t) and $j$ is the position. we kept % samples with top activation values in each generation. we randomly shifted % sequence samples based on the dcnn structure. the remaining samples were generated by roulette wheel selection and the crossover operation. the total number of sequences did not change in each generation. the optimization did not stop until the maximum record of the mean activation value of each generation had not increased for iterations.

the third operation is sampling. at the end of each iteration of the genetic algorithm, sequences with positive activations were collected as samples. duplicated sequences were removed. based on the maximum activation value of the existing samples, we split the activation value interval into bins. we kept the number of samples in each bin less than . if a bin overflowed, we randomly selected samples from it. see supplementary information for the pseudo code of this algorithm.
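the seed-initialization step above, in which a position and a previous-layer an are drawn according to the normalized maximum contribution score, can be sketched as follows. this is a hedged illustration only: the function names, the nested-list layout of the weights (`weights[i][j]` for position $i$ and previous-layer an $j$) and the lower bound of 0 inside the max are assumptions made for the example and are not taken from the neuronmotif code.

```python
import random
from typing import List, Sequence, Tuple


def contribution_scores(weights: Sequence[Sequence[float]],
                        max_activations: Sequence[float]) -> List[List[float]]:
    """c_ij = max(0, w_ij * A_j), normalized so that all scores sum to one."""
    raw = [[max(0.0, w * a) for w, a in zip(row, max_activations)] for row in weights]
    total = sum(sum(row) for row in raw)
    if total <= 0:
        raise ValueError("no positive contribution to sample from")
    return [[v / total for v in row] for row in raw]


def sample_position_and_neuron(scores: Sequence[Sequence[float]],
                               rng: random.Random) -> Tuple[int, int]:
    """Draw one (position, previous-layer neuron) pair with probability c_ij."""
    pairs = [(i, j) for i, row in enumerate(scores) for j in range(len(row))]
    probs = [scores[i][j] for i, j in pairs]
    return rng.choices(pairs, weights=probs, k=1)[0]
```

pairs with a non-positive contribution receive a score of zero and are never drawn, so seeds are only planted where a previous-layer motif can increase the activation.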
shifting latent variable discovery and decoupling algorithm

for a given an, we need an algorithm that splits the sample set according to the latent variables, which depend on the dcnn structure. going from deep layers to shallow layers of the dcnn model, whenever the result of a convolutional layer was the input of a max-pooling layer, the algorithm calculated the feature map of the convolutional layer and used k-means (k is the max-pooling size) to cluster the sequence samples into k subsets according to the features in the feature maps. the algorithm continued to cluster and split each subset recursively each time it found that the result of a convolutional layer was the input of a max-pooling layer. finally, the number of subsets is $\prod_{l=1}^{L-1} k_l$, where $L$ is the layer number of the an and $k_l$ is the pooling size of the pooling operation applied to each convolutional layer. based on each subset, we obtained the numerical estimation of the ppms. the algorithm can be applied to a newly generated subset again to decouple the secondary important shifting motif if there are enough samples. this process is shown in fig. d and e. see supplementary information for details and the pseudo code of this algorithm.

algorithm implementation

neuronmotif was implemented in python. it depends on the tensorflow and keras packages. the current version of neuronmotif can only be applied to dcnns implemented with tensorflow or keras. we only implemented a cpu version of neuronmotif, so it does not depend on a gpu. the scripts are parallelized and can be run across the nodes of a computing cluster. the memory consumption depends on the dcnn structure and the an receptive field size. we ran the program on servers. each server contains cpus with cores (intel e - ) and gb memory. for each dcnn model mentioned in this work, the program can finish the decoupling of all convolutional ans in about days.
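the recursive splitting in the decoupling algorithm, and the resulting subset count $\prod_{l=1}^{L-1} k_l$, can be illustrated with a short sketch. it is a simplified stand-in rather than the published algorithm: `cluster` is a hypothetical callable that takes the place of k-means on convolutional feature maps, and the pooling sizes in the usage example are made-up values chosen only to show the counting.

```python
import math
from typing import Callable, List, Sequence


def expected_subsets(pool_sizes: Sequence[int], layer: int) -> int:
    """Number of subsets for a neuron in `layer` (1-based): product of k_l for l < layer."""
    if not 1 <= layer <= len(pool_sizes) + 1:
        raise ValueError("layer out of range")
    return math.prod(pool_sizes[: layer - 1])


def decouple(samples: List, pool_sizes: Sequence[int],
             cluster: Callable[[List, int, int], List[List]]) -> List[List]:
    """Split samples recursively, handling the deepest pooling layer first."""
    if not pool_sizes:
        return [samples]
    k = pool_sizes[-1]                      # pooling size of the current layer
    layer = len(pool_sizes)                 # 1-based index of that layer
    result: List[List] = []
    for subset in cluster(samples, k, layer):
        result.extend(decouple(subset, pool_sizes[:-1], cluster))
    return result


def toy_cluster(samples: List, k: int, layer: int) -> List[List]:
    """Placeholder clustering: deal samples into k groups by index (not k-means)."""
    return [samples[i::k] for i in range(k)]
```

for example, with two hypothetical pooling layers of size 2, `decouple(list(range(8)), [2, 2], toy_cluster)` returns four subsets, matching `expected_subsets([2, 2], 3)`.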
rebuilding and decoupling the deepsea and basset models

deepsea and basset are both -convolutional-layer models implemented in torch, which is not compatible with neuronmotif. we rewrote these two models with tensorflow and keras. we tried to keep the architecture, regularization, optimizer and so on consistent with the previous studies. we trained the models on the datasets that were used for training the original models. each dataset was split into a training set, a validation set and a test set according to the original papers. we also followed the training strategy described in the papers. we trained these two models with a single nvidia p gpu card.

we applied neuronmotif to deepsea and basset. deepsea had ( , , ) kernels in each convolutional layer. the max-pooling size was for every convolutional layer. theoretically, we would obtain , × = , × × = an motifs for l , l and l . however, some of them were absent because dead kernels and low information content motifs should be excluded. similarly, basset had ( , , ) kernels and its max-pooling sizes were ( , , ). theoretically we would obtain , × = , × × = motifs from l , l and l of the basset model.

synonymous motif mixture detection and diagnosis

for the motif estimated from each sample subset, we calculated its maximum activation value and the activation value of its consensus sequence. the maximum activation value was obtained by feeding all sample sequences to the substructure of the an. the consensus sequence was obtained from the ppm of the motif. for each position, the nucleotide base with the largest probability among the bases in the ppm was selected as the nucleotide in the consensus sequence.
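deriving a consensus sequence from a ppm, as just described, amounts to a column-wise argmax. a minimal sketch follows; the function name `consensus` and the row order a, c, g, t are assumptions for illustration, and ties are broken in favour of the first base in that order.

```python
from typing import Sequence

BASES = "ACGT"


def consensus(ppm: Sequence[Sequence[float]]) -> str:
    """Pick, for each position, the base with the largest probability in the PPM."""
    n = len(ppm[0])
    return "".join(
        BASES[max(range(4), key=lambda r: ppm[r][j])]  # row index of the column maximum
        for j in range(n)
    )


# Toy 4 x 2 PPM: position 0 prefers A, position 1 prefers T.
toy_ppm = [
    [0.7, 0.1],  # A
    [0.1, 0.1],  # C
    [0.1, 0.2],  # G
    [0.1, 0.6],  # T
]
print(consensus(toy_ppm))  # AT
```

the resulting string can then be one-hot encoded and passed through the an substructure to obtain the consensus activation that is compared with the maximum activation.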
we fed the consensus sequence to the substructure of the an and obtained its activation value. for all motifs in the same layer of the dcnn model, we can draw a scatter plot to find whether serious synonymous motif mixture exists. it can be diagnosed by observing whether the activation values of the consensus sequences deviate from the corresponding maximum activation values. a larger number of low consensus activation values indicates more synonymous motif mixture in the dcnn model.

problematic neuron analysis

we investigated some problematic ans in the basset model to find which part of the discovered motif prevents the consensus sequence from activating the an. this is caused by inconsistent sub-sequences, at a certain position of the various sampled sequences, that play a key role in activating the an. as a result, the consensus sequence of the motif cannot represent these sampled sequences. we call such motifs and their consensus sequences inconsistent; otherwise we call them consistent. among the motifs of a given an, the consensus sequences of some motifs are consistent and can activate the an, but the inconsistent ones cannot. it is difficult to distinguish them by eye because the information contents at different positions are almost the same.

here, we took one inconsistent and one consistent motif consensus sequence as examples. we aligned the two sequences and slid a bp window along them. for each position, we replaced the sub-sequence ( bp) of the inconsistent consensus sequence with the corresponding sub-sequence of the consensus sequence that can activate the an, to test whether the modified sequence can activate the an.
we found the valid position and tried to find the latent variables that make it a mixture by clustering the sub-sequence-related feature maps or the sub-sequence one-hot code. finally, we found that they were mixtures of different synonymous motifs rather than the same shifted motif. see supplementary information for details.

dcnn architecture optimization and deeper dcnn model construction

we tried to optimize the architecture alone, without using regularization methods. following the strategy of small kernels and max-pooling sizes, the kernel size and max-pooling size were set to and respectively. we used relu as the activation function for each layer except for the last fully-connected layer, which used the sigmoid function. for basset, we built a -convolutional-layer model bd- (kernel number and pooling operation: , pooling, , pooling, , pooling, , pooling, ) and a -convolutional-layer model bd- (kernel number and pooling operation: , , pooling, , , pooling, , , pooling, , , pooling, , ). at the end of the convolutional layers, two fully-connected layers with and ans were appended. the number of kernels was doubled relative to the previous layer because the receptive field size was doubled for the deeper an. in a longer receptive field, more combinations of the motifs need to be represented. for deepsea, we built a -convolutional-layer model dd- (kernel number and pooling operation: , , pooling, , , pooling, , , pooling, , , pooling, , ). at the end of the convolutional layers, two fully-connected layers with and ans were appended. however, the prediction performance of dd- was similar to that of deepsea. we found that the overlap of the first-layer receptive fields is very small for the ans of the second layer. if we set the kernel size in the first layer (receptive field size is bp), then the overlap proportion of the adjacent ans is / (receptive field size is bp). we need to get a longer overlap by extending the kernel size in the first layer.
we tried to train dd- with the first-layer kernel size equal to (overlap proportion: / ≈ %), (overlap proportion: / ≈ %) and (overlap proportion: / ≈ %). the best one is the model with kernel size equal to in the first layer. this result also matched the top convolution-pooling model in the imagenet competition. there seems to be a trade-off in the first-layer kernel size. if it is too small, the structure is not good for training the second layer; if it is too large, the structure is not good for training the first layer. hence, we finally set the first-layer kernel size to for the dd- and bd- models.

prediction performance comparison

five models were involved in this work: basset, bd- , bd- , deepsea and dd- . after they had been trained on the training set and the validation set, they were tested on the test set. for each prediction target, we calculated the area under the precision-recall curve (auprc). we used auprc rather than the area under the receiver operating characteristic curve (auroc) because auprc is more sensitive to unbalanced data. in the datasets of deepsea and basset, there were many more negative samples than positive samples, so auprc is a better indicator. to compare and test the performance difference between models, we assumed that if the performance of two models is the same, the difference in auprc for the same prediction target follows $\Delta\mathrm{auprc} \sim N(0, \sigma^2)$. we performed a one-sided t-test for each pair of models for comparison.

motif discovery

for each decoupled motif represented by the an, we needed to filter and slice the motifs for regulatory elements. the decoupled motifs were generated from a sequence set. when the number of sequences is very small, the motif is not reliable.
we first applied the laplace smoothing method to the ppms of the decoupled motifs. the smoothed ppm ($\mathbf{PPM}'$) can be obtained by

$$\mathbf{PPM}' = \frac{\mathbf{PPM} \times N + [0.25]_{4 \times L} \times M}{N + M}$$

where $N$ is the number of sequences that generate the ppm, $[0.25]_{4 \times L}$ is the $4 \times L$ matrix with all elements equal to 0.25, and $M$ is the smoothing parameter. a larger $M$ means a stronger smoothing process. we set $M = $ in our work. we regarded a nucleotide base position as part of a motif region if its information content was greater than . we extended these motif regions by bp both upstream and downstream. we merged these regions if they overlapped. regions longer than bp were regarded as motifs. we sliced these regions of the ppm as the final discovered motifs. a large portion of these motifs can be matched with motifs in the jaspar database. we showed a small portion of the motifs in fig. d with the motifstack package.

motif syntax discovery and validation

we used the ans of layer in bd- and dd- for the motif syntax discovery. we applied the decoupling algorithm twice for each an and obtained decoupled motifs. the decoupled motifs of the same an usually shared similar shifted motifs. for convenience, we summarized the motifs by using tomtom to match them to motifs in jaspar. based on the summarized tf motif set, we knew the arrangement of these tf motifs. for instance, in fig. a, the tf motif set includes ctcf and ddit :cebpa and the arrangement of these two motifs is ctcf- n-ddit :cebpa-[ - n]-ddit :cebpa- n-ctcf, which is the motif syntax of the an. in addition to literature validation, such as the jaspar database or published papers, we also used atac-seq data to validate the motif syntax.
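the laplace smoothing of decoupled ppms described in the motif discovery subsection, together with the per-position information content used to delimit motif regions, can be sketched as below. this is an illustration under assumptions rather than the authors' code: the uniform background of 0.25 per base and the standard definition of information content for dna (2 bits minus the shannon entropy of the column) are assumed, and the function names and the values of `n_seqs` and `m` in the usage note are hypothetical.

```python
import math
from typing import List, Sequence


def laplace_smooth(ppm: Sequence[Sequence[float]], n_seqs: int, m: float) -> List[List[float]]:
    """PPM' = (PPM * N + 0.25 * M) / (N + M), applied element-wise."""
    return [[(p * n_seqs + 0.25 * m) / (n_seqs + m) for p in row] for row in ppm]


def information_content(ppm: Sequence[Sequence[float]]) -> List[float]:
    """Per-position information content in bits: 2 minus the column entropy."""
    n = len(ppm[0])
    ic = []
    for j in range(n):
        entropy = -sum(
            ppm[r][j] * math.log2(ppm[r][j]) for r in range(4) if ppm[r][j] > 0
        )
        ic.append(2.0 - entropy)
    return ic
```

with `n_seqs = 10` and `m = 10`, a column that is entirely a becomes (0.625, 0.125, 0.125, 0.125), so a motif supported by few sequences loses information content, which is the intended effect of the smoothing.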
if the motif syntax is real in the genome, the regions matched by the motif syntax should interact with important molecules such as tfs. thus, the tn transposase cutting frequency in the aligned regions may show footprinting. we collected five atac-seq datasets from five cell types or tissues, including gm , h , k , lncap and prostate (gsm , gsm , gsm , gsm , gsm ). we used the esatac package, which we developed, to preprocess the datasets. for an an of interest, we used it to scan the test data of basset or deepsea. we collected the top activated regions, extended the regions to bp and stacked their tn cutting frequencies. we also counted the hard motif matching frequency at each position of these bp regions with motifmatchr.

code and related results

neuronmotif code will be available at: https://github.com/wzthu/neuronmotif
related results will be exhibited at: https://wzthu.github.io/neuronmotif

references

searls, d. b. the language of genes. nature , - , doi: . /nature ( ). buenrostro, j. d., giresi, p. g., zaba, l. c., chang, h. y. & greenleaf, w. j. transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, dna-binding proteins and nucleosome position. nature methods , - , doi: . /nmeth. ( ). eraslan, g., avsec, z., gagneur, j. & theis, f. j. deep learning: new computational modelling techniques for genomics. nat rev genet , - , doi: . /s - - - ( ). zhou, j. & troyanskaya, o. g. predicting effects of noncoding variants with deep learning-based sequence model. nature methods , - , doi: . /nmeth. ( ). kelley, d. r., snoek, j. & rinn, j. l. basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. genome research , - , doi: . 
/gr. . ( ). spitz, f. & furlong, e. e. m. transcription factors: from enhancer binding to developmental control. nat rev genet , - ( ). shrikumar, a., greenside, p. & kundaje, a. learning important features through propagating activation differences. arxiv preprint arxiv: . ( ). alipanahi, b., delong, a., weirauch, m. t. & frey, b. j. predicting the sequence specificities of dna- and rna-binding proteins by deep learning. nature biotechnology , - , doi: . /nbt. ( ). searls, d. b. the linguistics of dna. am sci , - ( ). simonyan, k. & zisserman, a. very deep convolutional networks for large-scale image recognition. arxiv preprint arxiv: . ( ). zou, j. et al. a primer on deep learning in genomics. nature genetics , - ( ). he, y., shen, z., zhang, q., wang, s. & huang, d.-s. a survey on deep learning in dna/rna motif mining. briefings in bioinformatics, doi: . /bib/bbaa ( ). nguyen, a., yosinski, j. & clune, j. multifaceted feature visualization: uncovering the different types of features learned by each neuron in deep neural networks. arxiv preprint arxiv: . ( ). koo, p. k. & eddy, s. r. representation learning of genomic sequence motifs with convolutional neural networks. plos computational biology , doi: . /journal.pcbi. ( ). jaganathan, k. et al. predicting splicing from primary sequence with deep learning. cell , - e , doi: . /j.cell. . . ( ). bogard, n., linder, j., rosenberg, a. b. & seelig, g. a deep neural network for predicting and engineering alternative polyadenylation. cell , - e , doi: . /j.cell. . . ( ). stormo, g. d. introduction to protein-dna interactions: structure, thermodynamics, and bioinformatics. (cold spring harbor laboratory press, ). jolma, a. et al. 
dna-binding specificities of human transcription factors. cell , - , doi: . /j.cell. . . ( ). goodfellow, i. j., shlens, j. & szegedy, c. explaining and harnessing adversarial examples. arxiv preprint arxiv: . ( ). simonyan, k., vedaldi, a. & zisserman, a. deep inside convolutional networks: visualising image classification models and saliency maps. arxiv preprint arxiv: . ( ). fornes, o. et al. jaspar : update of the open-access database of transcription factor binding profiles. nucleic acids res , d -d ( ). klemm, s. l., shipony, z. & greenleaf, w. j. chromatin accessibility and the regulatory epigenome. nat rev genet , - , doi: . /s - - - ( ). schwalie, p. c. et al. co-binding by yy identifies the transcriptionally active, highly conserved set of ctcf-bound regions in primate genomes. genome biology , doi: . /gb- - - -r ( ). molla, m., delcher, a., sunyaev, s., cantor, c. & kasif, s. triplet repeat length bias and variation in the human transcriptome. proceedings of the national academy of sciences of the united states of america , - , doi: . /pnas. ( ). russakovsky, o. et al. imagenet large scale visual recognition challenge. int j comput vision , - , doi: . /s - - -y ( ). gupta, s., stamatoyannopoulos, j. a., bailey, t. l. & noble, w. s. quantifying similarity between motifs. genome biology , doi: . /gb- - - -r ( ). jolma, a. et al. dna-dependent formation of transcription factor pairs alters their binding specificity. nature , - , doi: . /nature ( ). pugacheva, e. m. et al. comparative analyses of ctcf and boris occupancies uncover two distinct classes of ctcf binding genomic regions. genome biology , doi: . /s - - - ( ). grabowska, m. m. et al. nfi transcription factors interact with foxa to regulate prostate-specific gene expression. mol endocrinol , - , doi: . /me. - ( ). stormo, g. d. & zhao, y. determining the specificity of protein-dna interactions. nat rev genet , - , doi: . /nrg ( ). ou, j. h., wolfe, s. a., brodsky, m. h. & zhu, l. h. j. 
motifstack for the analysis of transcription factor binding site evolution. nature methods , - , doi: . /nmeth. ( ). liu, q. et al. genome-wide temporal profiling of transcriptome and open chromatin of early cardiomyocyte differentiation derived from hipscs and hescs. circ res , - , doi: . /circresaha. . ( ). calviello, a. k., hirsekorn, a., wurmus, r., yusuf, d. & ohler, u. reproducible inference of transcription factor footprints in atac-seq and dnase-seq datasets using protocol-specific bias modeling. genome biology , doi: . /s - - -y ( ). zhang, z. d. et al. loss of chd promotes heterogeneous mechanisms of resistance to ar-targeted therapy via chromatin dysregulation. cancer cell , - e , doi: . /j.ccell. . . ( ). park, j. w. et al. reprogramming normal human epithelial tissues to a common, lethal neuroendocrine cancer lineage. science , - , doi: . /science.aat ( ). wei, z., zhang, w., fang, h., li, y. d. & wang, x. w. esatac: an easy-to-use systematic pipeline for atac-seq data analysis. bioinformatics , - , doi: . /bioinformatics/bty ( ). schep, a. n., wu, b. j., buenrostro, j. d. & greenleaf, w. j. chromvar : inferring transcription-factor-associated accessibility from single-cell epigenomic data. nature methods , - , doi: . /nmeth. ( ).

comparative evaluation of full-length isoform quantification from rna-seq

dimitra sarantopoulou ,#a ¶, thomas g. 
brooks ¶, soumyashant nayak , anthonijo mrcela , nicholas f. lahens , gregory r. grant , *

institute for translational medicine and therapeutics, university of pennsylvania, philadelphia, pennsylvania, united states of america
department of genetics, university of pennsylvania, philadelphia, pennsylvania, united states of america
#a current address: national institute on aging, national institutes of health, baltimore, maryland, united states of america
¶ equal contributors
* corresponding author email: ggrant@pennmedicine.upenn.edu (gg)

biorxiv preprint doi: https://doi.org/ . / ; this version posted february , . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc-nd . international license.

abstract

full-length isoform quantification from rna-seq is a key goal in transcriptomics analyses and has been an area of active development since the beginning. the fundamental difficulty stems from the fact that rna transcripts are long, while rna-seq reads are short. here we use simulated benchmarking data that reflects many properties of real data, including polymorphisms, intron signal and non-uniform coverage, allowing for systematic comparative analyses of isoform quantification accuracy and its impact on differential expression analysis. genome, transcriptome and pseudo alignment-based methods are included; and a simple approach is included as a baseline control. salmon, kallisto, rsem, and cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform dramatically better than the simple approach. we determine the structural parameters with the greatest impact on quantification accuracy to be length and sequence compression complexity and not so much the number of isoforms. the effect of incomplete annotation on performance is also investigated. overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification and isoform level de should still be employed selectively.

keywords

benchmarking, isoform quantification, simulated data, pseudo-alignment, rna-seq, short reads

background

alternative splicing and isoform switching play central roles in cell function; and disruption of the splicing mechanism is associated with many diseases and drug targets ( , ). the function of a protein is ultimately determined by its full complement of functional domains. differential splicing typically involves a reshuffling of the functional domains to construct a functionally different protein. gene level analyses must ignore these differences. before things like pathway enrichment analysis can be brought down to the transcript level, it will be necessary to quantify expression of full-length isoforms. for investigations specifically focused on splicing, one also has the option of working at the local splicing level (e.g., majiq ( )). if, for example, full-length isoform quantification simply leads to an exon skipping event, that would have also been found by local splicing methods. investigators must therefore carefully factor in the goals of their analysis to decide at which level features should be quantified. another reason for estimating isoform level expression is to give better estimates of gene level expression. indeed, it is not clear how to achieve gene level quantification from local splicing information.
for various purposes, full-length isoform quantification must be more informative than local splicing information when it can be achieved, and the primary reason local splicing methods are popular right now is the relative difficulty of working with full-length isoforms. the fact that isoform quantification is a key goal for modern transcriptomic profiling is reflected in how active the community has been in developing methods and how popular those methods have been, in spite of their notoriously high false positive rates. despite many published algorithms, in practice effective quantification of full-length isoforms from short-read rna-seq remains problematic and therefore has never been routine. the fundamental limitation is that individual short reads do not contain information on long-range interactions that would associate splicing events that are separated by more than the fragment length. regardless, methods can exploit additional biological and stochastic information, like canonical splice sites, which combined with alignment information can increase accuracy ( – ). although long sequence read technology is improving, compared to short read technology it continues to be lower throughput with a much higher base-wise error rate and is generally more expensive. therefore, most rna-seq studies are still performed with short reads and this will likely remain the case until competing technologies mature. short reads are typically - bases long, and usually obtained from both ends of short - base fragments.
meanwhile a significant portion of rna transcripts are over bases and many are much longer. given the difficulty in full-length isoform quantification, many rna-seq studies simply quantify at the gene level, which is much easier because uniquely aligning reads are rarely ambiguous at the gene level. indeed, unless the investigator is specifically interested in splicing, gene level analysis will likely lead to the same conclusions, since all isoforms of the same gene typically have the same pathway annotations. meaningful unbiased benchmarking conclusions rely on independent investigations and realistic benchmarking data where the ground truth is known or well-approximated. there are in fact a few independent studies that compare the performance of transcript quantification methods using simulated data ( ), real data ( ), or a hybrid approach with both real and simulated data ( – ). so why did we embark on another comparative study? angelini et al ( ) and kanitz et al ( ) are five and six years old, respectively, and hence they do not reflect the recent developments in this fast-changing field. for instance, they do not include the popular pseudo-alignment-based methods kallisto ( ) and salmon ( ). angelini et al ( ) take the approach of using simulated data, which is most similar to the approach employed here, however, they utilize the flux simulator which does not allow for many of the effects of real data we can model using the beers simulator ( ). also, the primary focus is on detection of isoforms as being “present/absent” in the sample and accuracy of quantification was presented as tables of quantiles. their conclusion was that “all tables indicate that the problem of obtaining reliable estimates is still open.” therefore, these methods require ongoing evaluation by the user community. 
zhang et al ( ) use the human universal reference sample (uhrr) and the human brain universal reference (hbrr) which are such artificial samples that it is not clear what practical guidance can be drawn from the results. in particular the uhrr is a mixture of cancer cell lines. cancer transcriptomes are notoriously scrambled and mutated, and therefore represent a very special case, particularly with regards to annotation-based quantification. moreover, a mixture of ten such cell lines gives a sample so different from what researchers use in practice that it precludes the possibility of evaluating the methods in the context of a typical differential expression analysis, which is the main goal of most rna-seq studies. with the uhrr and hbrr samples only technical replicates can be generated, while all de methods require biological replicates. simulated data which mimic real samples are arguably more realistic than real data obtained from mixtures of cancer cell lines. in silico simulated data offer more control as the truth is known exactly, but these data invariably simplify some of the inherent complexities of real data. in , teng et al ( ) published very nice guidance on quantification benchmarking.
their approach assumes one has benchmarking data where the truth is known on the level of differential expression, without assuming as known the actual quantified values. since the goal here is to investigate quantification accuracy directly, the methods in teng et al are not directly applicable. other studies focus only on single-cell data ( ), or on differential splicing ( ). commonly, rna-seq transcript level quantification is validated by pcr. however, pcr is low-throughput and is based on probes that interrogate only a small part of a given transcript; it is also sensitive to biases at the amplification step. on the other hand, in in silico simulated data the truth is known exactly. hayer et al ( ) investigated de novo transcriptome assembly, where isoform structures need to be inferred directly from the rna-seq data and concluded that none of the evaluated methods is accurate enough for routine use and further method development is required. the problem we investigate is considerably easier; isoform level annotation is given and reads must just be assigned to the correct isoform. approaches for quantifying isoform expression can be divided into three main categories. the first approach uses reads mapped to the genome by an intron-aware aligner, e.g. star ( ). the genome alignment information is then used to assign quantified values to transcripts ( – ). the second approach is similar to the first, except it is based on reads aligned directly to the transcriptome, rather than the genome ( , ). the third approach follows the concept of pseudo-alignment which prioritizes execution performance and does not involve bona fide alignment ( , ). in reality, all genome aligners are transcriptome- aware, and transcriptome alignments are genome aware, so the distinction is not as cut and dried as it once was. but nonetheless, we continue to distinguish the two, with caveats. 
there are many published methods for quantifying full-length isoforms; however, the vast majority of studies performing isoform-specific analysis have used cufflinks, rsem or some simple counting method following genome alignment (fig ) ( – ). pseudo-aligners were introduced more recently and therefore have lower adoption but are beginning to see wider usage ( , ). here we present a benchmarking analysis of the six most popular isoform quantification methods: kallisto, salmon, rsem, cufflinks, htseq, and featurecounts, based on a survey of the literature (fig ). htseq and featurecounts are not recommended by their authors for full-length isoform quantification; however, they were included for the purpose of comparison and because they are used in practice. we also include a naïve read proportioning method, which uses the distribution of signal inferred from the unambiguous read alignments to portion out the ambiguous read alignments, similar to the method first described by mortazavi et al ( ). we generated datasets from two mouse tissues, liver and hippocampus, which are known to be quite different in terms of splicing, with brain generally being more complex than any other tissue.
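A minimal sketch of the naïve read proportioning idea described above, under a simplified reading of it: reads compatible with exactly one isoform are counted first, and each ambiguous read is then split among its candidate isoforms in proportion to those unambiguous counts. The function name, the input format, and the uniform split used when no candidate has unambiguous support are illustrative assumptions, not the implementation evaluated in this study.

```python
from collections import defaultdict


def naive_read_proportioning(read_alignments):
    """Assign fractional read counts to isoforms.

    read_alignments: iterable of collections, one per read, each holding the
    ids of the isoforms that read is compatible with (hypothetical format).
    """
    unique = defaultdict(float)   # counts from unambiguous reads only
    ambiguous = []                # candidate isoform lists of ambiguous reads

    for isoforms in read_alignments:
        isoforms = sorted(set(isoforms))
        if len(isoforms) == 1:
            unique[isoforms[0]] += 1.0
        elif isoforms:
            ambiguous.append(isoforms)

    counts = defaultdict(float, unique)
    for isoforms in ambiguous:
        total = sum(unique.get(i, 0.0) for i in isoforms)
        for i in isoforms:
            if total > 0:
                # Split the read in proportion to the unambiguous signal.
                counts[i] += unique.get(i, 0.0) / total
            else:
                # Assumption: fall back to a uniform split when no candidate
                # isoform has any unambiguous support.
                counts[i] += 1.0 / len(isoforms)
    return dict(counts)


if __name__ == "__main__":
    reads = [{"A"}, {"A"}, {"A"}, {"B"}, {"A", "B"}, {"A", "B"}, {"C", "D"}]
    print(naive_read_proportioning(reads))
```

Every read contributes a total weight of one, so the estimated counts sum to the number of aligned reads.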
a hybrid approach is taken to obtaining benchmarking data, where real samples are emulated to generate simulated data where the true isoform abundances are known; this was done using a modified version of the beers simulator ( ). idealized data were generated to obtain upper bounds on the accuracy of all methods. data were also generated with variants, sequencing errors, intron signal and non-uniform coverage, to assess how they affect performance. since annotation is never perfect, we evaluate performance while varying annotation completeness.

fig . most popular quantification methods. ranking of quantification methods by the number of times found in the most recent rna-seq studies (published during march-may, ), which reported the quantification method used.

usually, the aim of an rna-seq analysis is to inform a downstream differential expression (de) analysis. therefore, we also evaluate the methods on this level, using both real and simulated data. however, it is much more challenging to produce realistic data with known ground truth at the de level. unlike isoform level quantification which is sample-specific, de ground truth is established at the population level, and therefore involves much more complex benchmarking data. our simulated samples reflect the complex joint distribution of expression across biological replicates, and thus it is meaningful to perform a de analysis on them.
this is described in more detail below but briefly, in lieu of knowing the ground truth in terms of which isoforms are differentially expressed, for each method we compare the de analysis performed on the known true isoform quantifications of the simulated data to the de analyses performed on the estimated counts determined using the particular method. the more different the two analyses are, the less accurate the quantification method must be in informing the de analysis. this then allows us to compare the methods in terms of their accuracy of quantification. it is possible that a method underperforms another method at the level of quantification, but outperforms it in the de analysis.

results

hybrid benchmarking study using both real and simulated data

for the simulated data we started with real rna-seq samples: six liver and six hippocampus samples from the mouse genome project ( ). isoform expression distributions were estimated from these samples in ( ) which were then used to generate simulated data for which the source isoform of every read is known. two types of simulated datasets were generated with the beers simulator ( ). first, idealized simulated data were generated, with no snps, indels, or sequencing errors, no intron signal and uniform coverage across each isoform ( ). second, simulated data were generated with polymorphisms (snps and indels), sequencing errors, intron signal, and empirically inferred non-uniform coverage ( ). relative performance on idealized data does not necessarily reflect relative performance on real data, but we do expect the accuracy of the methods on idealized data to be upper bounds on the accuracy in practice. if a bound on idealized data is below what one would tolerate in practice, then it cannot be expected to be viable in practice. the (more) realistic data provide insight into the effect of the various factors on the method performance.
the realistic data probably also gives bounds on accuracy of real data since it was designed to be no more complex than real data. for simplicity of exposition, we will refer to the data with the complexities as the “realistic” data, keeping in mind it does not reflect every property of real data, just the five properties listed above (snps, indels, sequencing error, intron signal and non-uniform coverage). for both the idealized and realistic simulated data, we use three liver and three hippocampus samples to evaluate isoform quantification, and six liver and five hippocampus samples to evaluate de analysis, as in [ ]. all samples were obtained from independent animals raised as biological replicates. comparisons between tissues were employed to assess consistency and differential expression; brain has a more complex transcriptome than other tissues ( ), and thus isoform level analysis is expected to be more challenging for the algorithms. we performed a comparative analysis of seven of the most commonly used full-length isoform quantification algorithms: kallisto ( ), salmon ( ), rsem ( ), cufflinks ( ), htseq ( ), featurecounts ( ) and a naïve read proportioning approach similar to the method first described by mortazavi et al ( ) (nrp; see methods).
kallisto and salmon are pseudo-aligners; rsem, cufflinks, htseq, and featurecounts are genome alignment-based approaches where the alignments are guided by incorporating transcriptome information, and nrp is a transcriptome alignment-based approach. these methods were evaluated at the isoform expression level using idealized and realistic simulated data, with full and incomplete annotation, and also at the differential expression level using both realistic and real data.

comparison of full-length quantification methods

idealized data

the idealized data has no indels, snps, or errors, includes no intron signal, and deviates from uniform coverage across each isoform only as much as may happen due to random sampling. under such perfect conditions we expect that all methods will achieve their best performance. the data were aligned to the reference genome or transcriptome with star ( ) and quantified with the seven methods. in fig a, estimated expression is plotted against the true transcript counts, for each method in liver. each point represents the average of the three replicates of that tissue. a point on the diagonal indicates a perfect estimate. a point on the x-axis indicates an unexpressed transcript which was erroneously given positive expression. a point on the y-axis indicates an expressed transcript which was erroneously given zero expression.

fig . comparison of estimated quantification with the truth in simulated data. (a,b) scatter plots between the inferred and true counts. each point represents the average expression of three samples. a) idealized data b) realistic data. (c,d) percentiles of the |logfc| (relative to true counts), for the set of expressed isoforms in sample in c) idealized and d) realistic data. a point (x,y) on a graph means x% of the transcripts have |logfc| . specifically, a point (x,y) on the graph means x% of transcripts have |logfc|y.

fig shows the percentile plots of the |logfc|. hippocampus sample hip is shown but all samples hip and liv look very similar. the first thing to note is that removing the maximally expressed isoform has dramatically decreased the accuracy of all methods except for htseq and featurecounts. and removing the non-expressed isoforms has marginally increased accuracy for those methods. in contrast, for htseq and featurecounts we observe the opposite. removing the non-expressed isoforms has dramatically decreased accuracy and removing the highest expressed isoform has made very little difference, particularly with featurecounts. fig compares for the different methods the percentile plots for removing the maximally expressed isoform. this eliminates the isoform of the majority of the reads so should have a dramatic effect on accuracy. here salmon has the most difficulty and htseq and featurecounts are the most robust to this, followed by nrp.
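A minimal sketch of how the points of such a percentile plot could be computed, assuming the reading in which a point (x, y) means that x% of expressed isoforms have |logFC| > y. The pseudocount of 1, the base-2 logarithm and the function name are illustrative assumptions and are not taken from this study.

```python
import math


def abs_logfc_exceedance_points(true_counts, est_counts, thresholds, pseudocount=1.0):
    """Return (x, y) points: x percent of expressed isoforms have |logFC| > y.

    Only isoforms with positive true count are considered "expressed".
    The pseudocount (an assumption) keeps the ratio finite when the
    estimate is zero.
    """
    abs_lfc = [
        abs(math.log2((est + pseudocount) / (true + pseudocount)))
        for true, est in zip(true_counts, est_counts)
        if true > 0
    ]
    if not abs_lfc:
        raise ValueError("no expressed isoforms")
    n = len(abs_lfc)
    return [(100.0 * sum(v > y for v in abs_lfc) / n, y) for y in thresholds]


if __name__ == "__main__":
    true = [1, 3, 7, 15, 0]   # last isoform is unexpressed and is skipped
    est = [1, 7, 7, 63, 5]
    for x, y in abs_logfc_exceedance_points(true, est, [0, 0.5, 1, 2]):
        print(f"{x:.0f}% of expressed isoforms have |logFC| > {y}")
```

Under this reading a lower curve indicates a more accurate method, because fewer isoforms exceed any given error threshold.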
here we see a significant difference between salmon and kallisto that goes in the opposite direction of the differences seen by the other perspectives.

fig . removal of highest expressed isoforms. the annotation was modified by removing the highest expressed isoform of every gene. for each method the percentile plots are shown. here a point (x,y) on a curve means x% of isoforms have |logfc|>y. the lower the curve, the better. surprisingly, salmon has the most difficulty and htseq the least.

effects on differential expression

next, we use differential expression to assess quantification accuracy. if differential expression analysis is the downstream goal for the quantified values, then it does not matter if the absolute abundances differ from the truth, as long as the de p-values are unaffected. to investigate this, the two tissues were compared against each other; the tissues are different enough that there is an abundance of differentially expressed genes. six hippocampus samples and six liver samples of the realistic data were quantified with each of the seven methods, and the resulting quantified values were used as input for de analyses with ebseq ( ), which is optimized for isoform differential expression. the p-values generated from the true counts are compared to p-values from the inferred counts; the assumption is that the closer a de analysis on the inferred counts is to the corresponding de analysis on the true counts, the more effectively the method has quantified the expression, with respect to informing the de analysis.
kallisto and salmon are recommended to be run with sleuth; however, since sleuth cannot take true counts as input, the comparison would not be meaningful. since we are comparing ebseq (truth) to ebseq (inferred) for all methods, it should be meaningful to compare methods to each other with this metric.

fig . method effect on differential expression analysis, using realistic data. for each method, a de analysis with ebseq was performed between the two tissues. (a) a point (x,y) on a curve means the top x de transcripts using the true counts and the top x de transcripts using the inferred quants have jaccard index y. (b) a point (x,y) on a curve means there are y isoforms with q-value < x. the curves should be evaluated in relation to the truth, which is the gray curve. at varying q-value cutoffs between . and . all methods become anti-conservative. salmon and cufflinks track the truth closest at small cutoffs.

comparing two developmentally divergent tissues, we expect the majority of transcripts that are expressed to be differentially expressed. figure a shows the overlap with the truth, for the top n most significant genes, as n varies from to , . since ebseq reports a lot of zero p-values, rounded down from their limit of precision, ties were broken with the logfc. the vertical axis is the jaccard index ( ) of the top n de transcripts determined using the true counts and the top n de transcripts determined using the inferred counts. the jaccard index of two subsets of a set is the size of the intersection divided by the size of the union.
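A minimal sketch of the top-n Jaccard comparison defined above. It assumes the two inputs are lists of transcript ids already ranked by significance (with ties broken upstream, for example by logFC); the function names and the convention that two empty sets have index 1.0 are illustrative assumptions.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def top_n_jaccard_curve(true_ranking, inferred_ranking, ns):
    """Return (n, jaccard) points comparing the top-n entries of two rankings."""
    return [
        (n, jaccard_index(true_ranking[:n], inferred_ranking[:n]))
        for n in ns
    ]


if __name__ == "__main__":
    ranked_from_true_counts = ["t1", "t2", "t3", "t4"]
    ranked_from_inferred_counts = ["t1", "t3", "t5", "t2"]
    for n, j in top_n_jaccard_curve(
        ranked_from_true_counts, ranked_from_inferred_counts, [1, 2, 4]
    ):
        print(n, round(j, 3))
```

An index of 1 at a given n means the two analyses agree exactly on the n most significant transcripts, regardless of their order within that set.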
the higher the curve, the better. salmon and cufflinks are performing best from this perspective, followed by rsem. nrp and kallisto appear roughly equivalent. in fig b the number of de transcripts is plotted as a function of the q-value cutoff (s table). if a curve rises above the truth, then that method must be reporting more false-positives than the q-value indicates. at varying places between . and . all methods become anti-conservative. salmon and cufflinks track the truth closest at small cutoffs.

fig . method effect on differential expression analysis, using realistic data. the roughly , isoforms with zero true expression in both liver and hippocampus serve as a set of null isoforms for the de analysis. (a) gives a lower bound on the true fdr of the isoforms rejected at each q-value cutoff. plots above the black line are anti-conservative. (b) same as a but shows the actual number of null isoforms determined de as a function of the q-value. note that only , isoforms exist in total.

this data can also be used to evaluate the de methods themselves: ebseq, sleuth and deseq . deseq is included for reference, but it was not specifically designed with transcript-level de in mind. in de benchmarking, it is notoriously difficult to determine a benchmark set of either differential, or non-differential, transcripts. however, if an isoform has zero expression in all replicates of both conditions, then it must necessarily be non-differential. a total of , isoforms have zero in all replicates of both conditions.
any transcript called de in this set must be a false positive arising from mistakes in the quantification process. this allows us to define a lower bound on the actual fdr, because it gives a lower bound on the number of false positives, as given by the number of these null isoforms that were called de. this lower bound on the fdr is plotted as a function of the q- value cutoff (fig a). additionally, the actual number of null isoforms called de is plotted as a function of the q-value cutoff in fig b. fig a shows that in all cases the true fdr is much greater than reported. indeed, fig b shows that even at very small q-values ebseq and deseq are reporting thousands of these false positives. at an fdr of . there are at least , isoforms using any method. these cannot simply be the % false positives allowed by an fdr of . since that would then require an additional , true positives, which is more isoforms than are even annotated. why is this happening? when an isoform has zero true expression, but another isoform of the same gene has positive expression, it is easy for reads of the expressed isoform to be misassigned to unexpressed. however, if none of the isoforms of a gene are expressed, it is far less likely that any of the isoforms are assigned spurious reads since it is much less likely that any reads map anywhere to the gene’s locus. therefore, if a gene has no expressed isoforms in liver and has one or more expressed isoform in hippocampus, in addition to one or more unexpressed isoforms, then the unexpressed isoforms will tend to have zero expression in liver and will tend to incur spurious expression in hippocampus. such isoforms are then easily mistaken as differential. an isoform level de method should account for this variability, but we see in fig that both ebseq and sleuth are anti- conservative. the isoform-level de methods do however outperform deseq , which is not intended for transcript-level analysis. 
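A minimal sketch of the lower bound on the FDR described above: every null isoform (zero true expression in all replicates of both conditions) that is called DE is a certain false positive, so the fraction of rejected isoforms that are null bounds the true FDR from below. The function name, the dictionary input and the handling of an empty rejection set are illustrative assumptions.

```python
def fdr_lower_bound(q_values, null_isoforms, cutoff):
    """Return (number of null isoforms called DE, lower bound on the FDR).

    q_values: dict mapping isoform id -> q-value from a DE analysis.
    null_isoforms: set of isoform ids with zero true expression in all
        replicates of both conditions.
    cutoff: isoforms with q-value < cutoff are called DE.
    """
    rejected = [iso for iso, q in q_values.items() if q < cutoff]
    if not rejected:
        # Assumption: with nothing rejected, report a bound of zero.
        return 0, 0.0
    null_called_de = sum(1 for iso in rejected if iso in null_isoforms)
    return null_called_de, null_called_de / len(rejected)


if __name__ == "__main__":
    q = {"a": 0.001, "b": 0.01, "c": 0.04, "d": 0.2, "e": 0.03}
    nulls = {"c", "d", "e"}
    for cutoff in (0.005, 0.05):
        print(cutoff, fdr_lower_bound(q, nulls, cutoff))
```

A method is anti-conservative at a given cutoff whenever this lower bound already exceeds the cutoff itself, since the true FDR can only be higher.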
on the quantification methods where it is applicable, sleuth shows the lowest false positive rate, reflecting the fact that it uses additional variance information from bootstrap samples.

evaluation with real data

in all comparisons performed with the simulated datasets, htseq and featurecounts are very similar and kallisto, salmon, rsem, cufflinks, and nrp are also generally comparable. to explore whether the comparative analyses can be replicated with a real experiment, we used the real data that informed the simulations. here we used six hippocampus and six liver samples. hierarchical clustering was performed with correlation distance, on the average expression of six samples. the results recapitulate these two groups in hippocampus (fig a), while in liver cufflinks clusters further and alone (fig b), as in the realistic simulated data (fig a-b). this suggests that cufflinks is strongly influenced by a tissue-specific effect and confirms that the simulated data successfully capture properties of the real data. furthermore, we compare the seven quantification approaches on how well they inform a de analysis, using the real data. we quantified six samples from each tissue with the seven methods, followed by de analysis between the two tissues using ebseq. the methods cluster similarly for both realistic and real data (figs , ). there is a significant difference in the number of de transcripts identified at various q-value cutoffs, among the seven methods (fig d, s table).

fig . method effect on de analysis, using real data.
hierarchical clustering by correlation distance of the average expression using a) six liver samples or b) six hippocampus samples. c) hierarchical clustering by correlation distance of the logfc of hippocampus over liver samples. for each method, we performed a de analysis between the two tissues. d) number of de transcripts identified at various q-value cutoffs.

discussion

isoform level quantification has been an area of active development since the inception of rna-seq. it got off to a rough start and progressed slowly but steadily, and we see considerable improvement over the last five years. nevertheless, using both realistic simulated and real data, no method achieved high enough accuracy across the board that it can be recommended for general purposes. overall, salmon marginally outperformed the other methods by our benchmarks. it must be kept in mind however that the additional complexities of real data will likely affect those marginal differences in unpredictable ways. therefore, if one is going to do full-length isoform quantification at this stage, then salmon or rsem could be equally effective choices. cufflinks performs well from many perspectives but the erratic behavior in the liver clustering (fig , ) is concerning. salmon, as a pseudo-aligner, has the advantage of efficiency. however, if one is performing small or medium sized rna-seq studies, then genome alignments should in principle always be performed anyway so that coverage plots can be examined in a genome browser.
since there is no shortcut to that process, the advantages of salmon and kallisto in terms of efficiency really only come into play when hundreds or thousands of samples must be processed. since data sets with hundreds of thousands of samples are on the horizon, this is a real concern. but for most targeted rna- seq analyses, as is done routinely in research labs, this will factor less into the decision. salmon ( ) is similar to kallisto, and originally was identical except for incorporating a sample-specific model of fragment gc bias to improve its quantification estimates. our simulated data, generated by beers ( ), do not reflect these biases, and thus this feature of salmon could not be reasonably evaluated in this study. the only simulator currently available that models fragment gc biases is polyester ( ). however, both polyester and salmon use the same underlying model for fragment gc bias ( ), which may bias results towards salmon’s benefit. salmon further has options to control for read start sequence bias (such as from random hexamer priming) and positional bias (such as ’ or ’ bias), which were also not evaluated here. future benchmarking studies will require datasets (both real and simulated) that capture the true sequence properties underlying non-uniform coverage in order to quantitatively assess the performance impact offered by incorporating a fragment bias model. this will be accounted for in beers . . .cc-by-nc-nd . international licenseunder a not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available the copyright holder for this preprint (which wasthis version posted february , . ; https://doi.org/ . / doi: biorxiv preprint https://paperpile.com/c/kjin / uhcx https://paperpile.com/c/kjin /ujyyq https://paperpile.com/c/kjin / drao https://paperpile.com/c/kjin /loeyd https://doi.org/ . / http://creativecommons.org/licenses/by-nc-nd/ . 
additionally, we investigated some extreme cases of inaccuracy, in both simulated and real data, where transcripts were estimated to be highly expressed by one method and non-expressed by the other. in the simulated data, we identified enriched genomic properties that drive the deviation of each method from the true counts. in the real data, we isolated one example of large quantification differences between methods. in this example, the inclusion of a single read causes kallisto and rsem to disagree by counts to , and the difference resolves if that read is removed. this edge case occurred because only two reads were unambiguous to the two isoforms of a highly expressed gene. the transcript-level de method sleuth ( ) uses bootstrap resampling to control for possibilities like this example. ebseq uses the number of sibling isoforms as a factor in its variance computation. however, our analysis indicates that while these methods outperform deseq2, they could still be generating too many false positives. in particular, when all isoforms of a gene are unexpressed in the first condition and one isoform is expressed in the second condition, we observe many false positives on the other unexpressed isoforms of that same gene, due entirely to quantification inaccuracy.

overall, kallisto and salmon, as alignment-free methods, require less computational time while achieving similar or better accuracy compared to other methods, whereas rsem and cufflinks perform well among the alignment-dependent methods. however, our results indicate that all tested methods should be employed selectively, especially when long transcripts with many isoforms or transcripts with low sequence complexity are the candidates of interest for the study. nrp is a straightforward and simple approach that is relatively robust to polymorphisms, non-uniform coverage and intron signal; however, it struggles with a greater number of isoforms.
in any case it performs equally well or in some cases outperforms more sophisticated methods, suggesting that information extraction and inference from short rna-seq reads is largely saturated and that future, more complex models might offer only small benefits in gene isoform quantification.

these results indicate the differing strengths of different approaches to this problem. as such, it may be possible to leverage the different methods to achieve overall greater accuracy. for example, nrp, htseq and featurecounts appear to do better on one-isoform genes, so it may make sense to treat those genes separately. in any case, this must continue to be an active area of research before the technology can transform transcriptomics and realize the advantages of full-length isoform quantification.

methods

data generation

we used the same method for generating simulated data as described in norton et al ( ). for all of the procedures described below, we used gene models from release of the ensembl grcm annotation, and sequence information from the grcm build of the mouse genome. we used the empirical expression levels and percent spliced in (psi) values across all of the mouse genome project (mgp) ( ) liver and hippocampus samples estimated in norton et al ( ). briefly, the samples were aligned with star, and gene-level counts were calculated with htseq-count. next, ensembl transcript models were used to identify local splicing variations (lsvs): loci with exon junctions that start at the same coordinate but end at different coordinates (or vice versa).
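as an illustration of the lsv definition above, the following minimal python sketch groups exon junctions by a shared start or a shared end coordinate. the function name `find_lsvs` and the (start, end) tuple representation of a junction are assumptions made for this example; they are not taken from the pipeline of norton et al, and the sketch ignores chromosome and strand for brevity.

```python
from collections import defaultdict


def find_lsvs(junctions):
    """Group junctions that share one coordinate but differ at the other.

    junctions: iterable of (start, end) tuples for one gene
               (hypothetical input format, single chromosome and strand).
    """
    by_start = defaultdict(set)
    by_end = defaultdict(set)
    for start, end in junctions:
        by_start[start].add(end)
        by_end[end].add(start)

    lsvs = []
    # same start coordinate, different end coordinates
    for start, ends in sorted(by_start.items()):
        if len(ends) > 1:
            lsvs.append(("shared_start", start, sorted(ends)))
    # same end coordinate, different start coordinates ("or vice versa")
    for end, starts in sorted(by_end.items()):
        if len(starts) > 1:
            lsvs.append(("shared_end", end, sorted(starts)))
    return lsvs
```

a junction that has a unique start and a unique end contributes to no lsv in this sketch, which matches the definition only at the level of junction coordinates.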
of the , annotated genes expressed in the mgp data, , were randomly selected to reflect the empirical psi values for their associated transcripts. for this "empirical set" of genes we estimated psi values separately for each sample by comparing the relative ratios of all junction-spanning reads that mapped to an lsv. these psi values reflect the biological noise and real differential splicing (if any) between the two tissues. for each of the remaining genes, we simulated no differential splicing between tissues with the following procedure:

1) for a given gene with n spliceforms, randomly select a gene with the same number of spliceforms from the empirical set.
2) for this empirical gene, randomly select the psi values from one mgp sample.
3) assign these psi values across all samples for the gene in the simulated set.
4) to add inter-sample variability, add or subtract a random number (uniform from - . ) to the psi values in each sample, such that the psi values for the gene/sample still sum to .

these estimated gene expression counts and psi values, for both the empirical set and the remaining set of genes, served as input into the beers simulator ( ). for the idealized data, we used a uniform distribution for read coverage, with no intronic signal and no sequencing errors, substitutions, or indels (parameters: -strandspecific -error -subfreq -indelfreq -intronfreq . -fraglength , , ). for the realistic data, we used a ' biased distribution for read coverage that was inferred empirically from previous data ( ). we also added % intronic signal, and used a sequencing error rate of . %, a substitution frequency of . %, and an indel frequency of . % (parameters: -strandspecific -error . -subfreq . -indelfreq . -intronfreq . -fraglength , , ). lastly, we did not simulate novel (unannotated) splicing events in either dataset (parameter: -palt ).

rna-seq analysis
the two simulated rna-seq datasets were aligned to both the grcm build of the mouse genome and the transcriptome with star-v . . a ( ). for all transcript models we used release of the ensembl grcm annotation. the breakdown of the annotation by number of spliceforms is given in s fig. the raw read counts were quantified at the transcript level using the following methods: the pseudo-aligners kallisto-v . . ( ) and salmon-v . . ( ); the naïve read proportioning approach (nrp: http://bioinf.itmat.upenn.edu/beers/bp /), based on transcriptome alignment; and the genome-alignment-based methods rsem ( ), cuffdiff (cufflinks-v . . ) ( , ), htseq-v . . ( ), and featurecounts (subread-v . . ) ( ). ebseq-v . . ( ) was used for differential analysis, both between hippocampus and liver and between estimated and true transcript counts. all visualizations were done with r-v . . packages ( ). the command line parameters used for each tool are in s table.

differential expression analysis

transcript-level differential expression was assessed via three methods. deseq2-v . and ebseq-v . . were run on the inferred quantified values from all quantification methods. in addition, the sleuth-v . . method was run on the quantifications from salmon and kallisto, using bootstrap samples and the wasabi package (https://github.com/combine-lab/wasabi) to convert salmon output to the sleuth input format.
all methods were run on the realistic simulated data, comparing the five hippocampus samples to the six liver samples, and on the real data, comparing six hippocampus samples to six liver samples. for the simulated data, we also ran deseq2 and ebseq on the true quantified values for comparison with the inferred quantifications. ebseq was configured to perform two-condition isoform-level de with the recommended uncertainty groups of genes with , or or more transcripts. the maxround parameter was set to . since ebseq is a bayesian method, we used the reported posterior probability of equivalent expression as the q-value of the transcript being de ( ). since ebseq yields many transcripts with q= , we broke ties by using the logfc from the quantified values when ranking genes by q-value.

description of the seven quantification methods

kallisto is a pseudo-aligner that uses a hash-based approach to assemble compatibility classes of transcripts for every read by mapping the read’s k-mers, using the transcriptome k-mer de bruijn graph ( ). it requires few computing resources and has a fast runtime. the index was built from the transcript sequences, and transcript abundances were quantified via pseudo-alignment using the index.
the count estimates in the est_counts column were used in our analyses. fifty bootstrap runs were performed for de analysis by sleuth.

salmon is a pseudo-aligner that also accounts for various biases in the data (gc content, read start sequence bias, and position-specific fragment start location bias such as a 5' or 3' bias) ( ). like kallisto, it has a fast runtime and low resource requirements. the index was built from the transcript sequences, with the sequences of the entire genome provided as decoys. the numreads estimate was used in our analysis. fifty bootstrap runs were performed for de analysis by sleuth.

rsem is a gene/isoform abundance tool for rna-seq data that uses a generative model for the rna-seq read sequencing process, with parameters given by the expression level of each isoform ( , ). a set of reference transcript sequences was built using the rsem-prepare-reference script, based on the grcm ensemblv reference genome and the corresponding transcript annotation file. the isoform abundances were then estimated using rsem-calculate-expression. for our analysis, we used the expected_count value in the isoform output file, which contains the sum (taken over all reads) of the posterior probability that each read comes from the isoform.

to prepare input for cufflinks, htseq and featurecounts, the real and simulated data were aligned to a star genome index built with the grcm ensemblv transcript annotation file.

cuffdiff ( ) is an algorithm of the cufflinks suite ( ) that estimates expression at the transcript level and controls for variability across replicates. because of alternative splicing in higher eukaryotes, isoforms of most genes share large numbers of exonic sequences, which leads to ambiguous mapping of reads at the transcript level. cuffdiff first estimates the transcript-level fragment counts and then updates the estimate using a measure of uncertainty that captures the confidence that a given fragment is correctly assigned to the transcript that generated it ( ).
we provided the sorted alignment files and the appropriate annotation file to cuffdiff and used the isoforms.count_tracking file it generated.

for htseq ( ), htseq-count was used to estimate isoform-level abundances from the alignments. we used the recommended default mode, which discards any ambiguously mapped reads and is hence conservative in its estimates. the htseq documentation suggests that one should expect sub-optimal results when it is used for transcript-level estimates and recommends performing exon-level analysis instead (using dexseq). nevertheless, we use it for transcript-level fragment count estimates in order to quantify its underperformance relative to the other methods.

featurecounts ( ) is a read count program that quantifies rna-seq (or dna-seq) reads in terms of any type of genomic feature (such as gene, transcript, or exon). it is very similar to htseq-count, with the main differences being efficient memory management and low runtime.

as a baseline, we considered a naïve read proportioning (nrp) approach. this is essentially the method described by mortazavi et al ( ) but without normalizing by transcript length. nrp uses a transcriptome alignment (provided by star in this case) and, in the first pass, computes the number of reads mapping unambiguously to each transcript. to deal with ambiguous mappers, it then takes a second pass over the alignment file.
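the two-pass scheme can be sketched in python as follows. each ambiguous read is split across its candidate transcripts in proportion to their unambiguous counts from the first pass, with a fallback to transcript length when none of the candidates has unambiguous support. the function name `nrp_counts` and the input format (one set of compatible transcript ids per read) are assumptions made for illustration; this is not the authors' implementation, which works directly on the star alignment file.

```python
from collections import Counter


def nrp_counts(read_hits, lengths):
    """Naive read proportioning over per-read transcript compatibility sets.

    read_hits: list of sets of transcript ids each read maps to
               (hypothetical input format).
    lengths:   dict mapping every transcript id to its length.
    """
    # first pass: count reads that map unambiguously to a single transcript
    unambiguous = Counter()
    for hits in read_hits:
        if len(hits) == 1:
            unambiguous[next(iter(hits))] += 1

    counts = {t: float(unambiguous[t]) for t in lengths}

    # second pass: distribute each ambiguous read across its candidates
    for hits in read_hits:
        if len(hits) < 2:
            continue
        total = sum(unambiguous[t] for t in hits)
        if total > 0:
            # proportional to first-pass unambiguous counts
            weights = {t: unambiguous[t] / total for t in hits}
        else:
            # no candidate has unambiguous reads: weight by transcript length
            total_len = sum(lengths[t] for t in hits)
            weights = {t: lengths[t] / total_len for t in hits}
        for t, w in weights.items():
            counts[t] += w
    return counts
```

because every read contributes a total weight of one, the resulting fragment counts sum to the number of mapped reads, which gives a quick sanity check on any implementation of this scheme.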
if a read maps ambiguously to a set of transcripts $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ and $c_1, c_2, \ldots, c_n$ are the respective fragment counts from unambiguous mappers in the first pass, it increments the fragment count of $T_i$ by $\frac{c_i}{c_1 + \cdots + c_n}$. if all of the $c_i$ are 0, that is, none of the transcripts in $\mathcal{T}$ have any reads mapping unambiguously to them, we increment the fragment count of $T_i$ by $\frac{l_i}{l_1 + \cdots + l_n}$, where $l_i$ is the length of transcript $T_i$.

statistical analysis

as a measure of the accuracy of each method, we compute the absolute value of the log fold-change of the estimated counts relative to the known simulated true counts, where the fold-change is taken after adding a pseudocount to both numerator and denominator. for example, if $x$ is the true count and $y$ is the estimated count for a particular method, we calculate $\left|\log\frac{y + p}{x + p}\right|$ for each transcript, where $p$ denotes the pseudocount. the closer the |logfc| is to 0, the more accurate the method is for that transcript. in order to better represent the distribution of the |logfc| values for each method, we plot (for the set of expressed isoforms) the value of |logfc| corresponding to every tenth percentile starting from . if the method has high accuracy, we expect the graph to be close to 0. thus, if the graph for method a is higher than that for method b, we conclude that method b is more accurate.

moreover, we identify the genomic properties of the data that affect the accuracy of the methods. for each method, we identified the most discordant transcripts by sorting on |logfc|.
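the accuracy measure, its percentile summary, and the ranking of discordant transcripts described above can be sketched in python as follows. the pseudocount value of 1, the base-2 logarithm, the nearest-rank percentile rule, and all function names are assumptions made for this example, not settings confirmed by the text.

```python
import math


def abs_logfc(estimated, true, pseudocount=1.0):
    """|log2((y + p) / (x + p))| for estimated count y and true count x."""
    return abs(math.log2((estimated + pseudocount) / (true + pseudocount)))


def decile_profile(values):
    """Nearest-rank value at the 10th, 20th, ..., 100th percentile."""
    ordered = sorted(values)
    n = len(ordered)
    profile = []
    for q in range(10, 101, 10):
        rank = (q * n + 99) // 100  # integer ceiling of q * n / 100
        profile.append(ordered[max(rank - 1, 0)])
    return profile


def most_discordant(estimates, truth, top=3):
    """Transcript ids ranked by decreasing |logFC| between estimate and truth."""
    scores = {t: abs_logfc(estimates[t], truth[t]) for t in truth}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

a method whose decile profile stays near zero across all ten points is accurate for most transcripts, while the head of the `most_discordant` list is the natural input for the property enrichment analysis.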
using the ensembl annotation and genome sequence for grcm , we created a database of transcript properties (such as number of isoforms, hexamer entropy, transcript length, compression complexity* ( ), exon count, etc.) and their global distributions across the transcriptome. then, for the lists of discordant transcripts, we computed kolmogorov-smirnov two-sample test p-values for each transcript property, followed by bonferroni correction for multiple testing, to identify the properties that exhibit significant deviation from the global distribution.

* transcript sequence compression complexity is a metric that captures the amount of lossless compression of the transcript sequence. the higher the sequence complexity, the lower the compression, which implies higher transcript sequence compression complexity.

list of abbreviations

logfc: log fold change
de analysis: differential expression analysis
de transcripts: differentially expressed transcripts
nrp: naïve read proportioning approach

declarations

acknowledgements

we thank the high performance computing at penn medicine (pmacs hpc), funded by s od nih, for the cluster computing support.

availability of data and materials

all raw and processed rna-seq data used in this study are available at arrayexpress under accession number e-mtab- . all simulated data generated in this study are available at http://bioinf.itmat.upenn.edu/beers/bp /.
additional files

supplemental materials are in sarantopoulou_fliquant_supplemental_material.pdf

authors’ contributions

gg, ds, and sn conceived of and designed the study. ds, tb and sn performed all computational analysis and visualization. nl produced all rna-seq simulated data. all authors contributed to discussions and running the algorithms. ds, tb, sn, and gg wrote the manuscript. all authors read and approved the manuscript.

references

kahles a, lehmann k-v, toussaint nc, hüser m, stark sg, sachsenberg t, et al. comprehensive analysis of alternative splicing across tumors from , patients. cancer cell. aug.
cooper ta, wan l, dreyfuss g. rna and disease. cell. feb.
anders s, pyl pt, huber w. htseq: a python framework to work with high-throughput sequencing data. bioinformatics. jan.
trapnell c, williams ba, pertea g, mortazavi a, kwan g, van baren mj, et al. transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. nat biotechnol. may.
liao y, smyth gk, shi w. featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. bioinformatics. apr.
li b, dewey cn. rsem: accurate transcript quantification from rna-seq data with or without a reference genome. bmc bioinformatics. aug.
norton ss, vaquero-garcia j, lahens nf, grant gr, barash y. outlier detection for improved differential splicing quantification from rna-seq experiments with replicates. bioinformatics. may.
angelini c, de canditiis d, de feis i. computational approaches for isoform detection and estimation: good and bad news. bmc bioinformatics. may.
chandramohan r, wu p-y, phan jh, wang md. benchmarking rna-seq quantification tools. conf proc ieee eng med biol soc.
zhang c, zhang b, lin l-l, zhao s. evaluation and comparison of computational tools for rna-seq isoform quantification. bmc genomics. aug.
hayer ke, pizarro a, lahens nf, hogenesch jb, grant gr. benchmark analysis of algorithms for determining and quantifying full-length mrna splice forms from rna-seq data. bioinformatics. dec.
kanitz a, gypas f, gruber aj, gruber ar, martin g, zavolan m. comparative assessment of methods for the computational inference of transcript isoform abundance from rna-seq data. genome biol. jul.
teng m, love m, davis ca, djebali s, dobin a, graveley br, li s, mason ce, olson s, pervouchine d, sloan ca, wei x, zhan l, irizarry ra. a benchmark for rna-seq quantification pipelines. genome biol.
westoby j, herrera ms, ferguson-smith ac, hemberg m. simulation-based benchmarking of isoform quantification in single-cell rna-seq. genome biol. nov.
merino ga, conesa a, fernández ea. a benchmarking of workflows for detecting differential splicing and differential expression at isoform level in human rna-seq studies. brief bioinform. mar.
grant gr, farkas mh, pizarro ad, lahens nf, schug j, brunk bp, et al. comparative analysis of rna-seq alignment algorithms and the rna-seq unified mapper (rum). bioinformatics. sep.
dobin a, davis ca, schlesinger f, drenkow j, zaleski c, jha s, et al. star: ultrafast universal rna-seq aligner. bioinformatics. jan.
patro r, duggal g, love mi, irizarry ra, kingsford c. salmon provides fast and bias-aware quantification of transcript expression. nat methods. apr.
bray nl, pimentel h, melsted p, pachter l. near-optimal probabilistic rna-seq quantification. nat biotechnol. may.
lateef a, prabhudas sk, natarajan p. rna sequencing and de novo assembly of solanum trilobatum leaf transcriptome to identify putative transcripts for major metabolic pathways. sci rep. oct.
hoang tv, kumar pkr, sutharzan s, tsonis pa, liang c, robinson ml. comparative transcriptome analysis of epithelial and fiber cells in newborn mouse lenses with rna sequencing. mol vis. nov.
wu kc, cui jy, liu j, lu h, zhong x-b, klaassen cd. rna-seq provides new insights on the relative mrna abundance of antioxidant components during mouse liver development. free radic biol med. jan.
del-aguila jl, benitez ba, li z, dube u, mihindukulasuriya ka, budde jp, et al. trem brain transcript-specific studies in ad and trem mutation carriers. mol neurodegener. may.
sharma a, das s, kumar v. transcriptome-wide changes in testes reveal molecular differences in photoperiod-induced seasonal reproductive life-history states in migratory songbirds. mol reprod dev [internet]. apr. available from: http://dx.doi.org/ . /mrd.
keane tm, goodstadt l, danecek p, white ma, wong k, yalcin b, et al. mouse genomic variation and its effect on phenotypes and gene regulation. nature. sep.
zaghlool a, ameur a, cavelier l, feuk l. splicing in the human brain [internet]. international review of neurobiology. available from: http://dx.doi.org/ . /b - - - - . -
kim d, langmead b, salzberg sl. hisat: a fast spliced aligner with low memory requirements. nat methods. apr.
nayak s, lahens nf, kim ej, ricciotti e, paschos g, tishkoff s, et al. iso-relevance functions - a systematic approach to ranking genomic features by differential effect size [internet]. biorxiv. [cited may ]. available from: https://www.biorxiv.org/content/ . / v .abstract
jaccard p. nouvelles recherches sur la distribution florale. bulletin de la société vaudoise des sciences naturelles.
love mi, huber w, anders s. moderated estimation of fold change and dispersion for rna-seq data with deseq2. genome biol.
pimentel h, bray nl, puente s, melsted p, pachter l. differential analysis of rna-seq incorporating quantification uncertainty. nat methods. jul.
lempel a, ziv j. on the complexity of finite sequences. ieee trans inf theory. jan.
frazee ac, jaffe ae, langmead b, leek jt. polyester: simulating rna-seq datasets with differential transcript expression. bioinformatics. sep.
love mi, hogenesch jb, irizarry ra. modeling of rna-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. nat biotechnol. dec.
lahens nf, kavakli ih, zhang r, hayer k, black mb, dueck h, et al. ivt-seq reveals extreme bias in rna sequencing. genome biol. jun.
trapnell c, hendrickson dg, sauvageau m, goff l, rinn jl, pachter l. differential analysis of gene regulation at transcript resolution with rna-seq. nat biotechnol. jan.
r core team. r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria [internet]. available from: http://www.r-project.org/
li b, ruotti v, stewart rm, thomson ja, dewey cn. rna-seq gene expression estimation with read mapping uncertainty. bioinformatics. feb.
roberts a, trapnell c, donaghey j, rinn jl, pachter l. improving rna-seq expression estimates by correcting for fragment bias. genome biol. mar.
mortazavi a, williams ba, mccue k, schaeffer l, wold b. mapping and quantifying mammalian transcriptomes by rna-seq. nat methods. doi: . /nmeth.
leng n, dawson ja, thomson ja, ruotti v, rissman ai, smits bmg, haag jd, gould mn, stewart rm, kendziorski c. ebseq: an empirical bayes hierarchical model for inference in rna-seq experiments. bioinformatics. april.

sarantopoulou et al, benchmarking of fli quantification for rna-seq (supplemental material)

supplemental figures

s fig. method effect on full-length isoform quantification using simulated data. average expression of three hippocampus samples, comparing each method to the truth, using a) idealized and b) realistic data.
percentiles of the cumulative distribution of |logfc| using c) idealized data, d) realistic data, and e-f) idealized and realistic data, respectively, restricted to the set of genes that have at least expressed isoforms.

s fig. effect of transcript length on quantification accuracy. accuracy is given by the adjusted logfc of the average of the three hippocampus samples, using a) idealized and b) realistic data.

s fig. differential distribution of transcript compression complexity. for each method, the foreground and background distributions are shown for transcript compression complexity. the background is over all isoforms; the foreground is over the top , discordant transcripts sorted by absolute adjusted logfc. the foreground distribution is highly enriched for low compression complexity for all methods.
it is made available the copyright holder for this preprint (which wasthis version posted february , . ; https://doi.org/ . / doi: biorxiv preprint https://doi.org/ . / http://creativecommons.org/licenses/by-nc-nd/ . / sarantopoulou et al, benchmarking of fli quantification for rna-seq (supplemental material) - s fig. the distribution of the #genes according to the #annotated isoforms. the distribution of the number of genes for different number of annotated isoforms. .cc-by-nc-nd . international licenseunder a not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available the copyright holder for this preprint (which wasthis version posted february , . ; https://doi.org/ . / doi: biorxiv preprint https://doi.org/ . / http://creativecommons.org/licenses/by-nc-nd/ . / sarantopoulou_fliquant_benchmark sarantopoulou_fliquant_benchmark_supplemental sparc data structure: rationale and design of a fair standard for biomedical research data sparc data structure: rationale and design of a fair standard for biomedical research data anita bandrowskia, jeffrey s. grethea, anna pilkoa, tom gillespiea, gabi pinea, bhavesh patelb, monique surles-zeiglera, and maryann e. martonea,*, auniversity of california, san diego, ca bcalifornia medical innovations institute, san diego, ca *correspondence should be addressed to mmartone@ucsd.edu abstract the nih common fund’s stimulating peripheral activity to relieve conditions (sparc) initiative is a large-scale program that seeks to accelerate the development of therapeutic devices that modulate electrical activity in nerves to improve organ function. integral to the sparc program are the rich anatomical and functional datasets produced by investigators across the sparc consortium that provide key details about organ-specific circuitry, including structural and functional connectivity, mapping of cell types and molecular profiling. 
these datasets are provided to the research community through an open data platform, the sparc portal. to ensure sparc datasets are findable, accessible, interoperable and reusable (fair), they are all submitted to the sparc portal following a standard scheme established by the sparc curation team, called the sparc data structure (sds). inspired by the brain imaging data structure (bids), the sds has been designed to capture the large variety of data generated by sparc investigators, who come from all fields of biomedical research. here we present the rationale and design of the sds, including a description of the sparc curation process and the automated tools for complying with the sds, including the sds validator and software to organize data automatically (soda) for sparc. the objective is to provide detailed guidelines for anyone desiring to comply with the sds. since the sds is suitable for any type of biomedical research data, it can be adopted by any group desiring to follow the fair data principles for managing their data, even outside of the sparc consortium. finally, this manuscript provides a foundational framework that can be used by any organization desiring either to adapt the sds to suit the specific needs of their data or to design their own fair data sharing scheme from scratch.

biorxiv preprint; this version posted february , . ; https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license.

introduction

the nih common fund’s sparc project, stimulating peripheral activity to relieve conditions, is a large-scale program whose mission is to map the peripheral nervous system across multiple species and improve our understanding of nerve-organ interactions.
sparc achieves this aim by providing access to high-value datasets, maps, and computational studies in support of bioelectronic medicine. bioelectronic medicine is defined as “...the convergence of molecular medicine, neuroscience, engineering and computing to develop devices to diagnose and treat diseases”. integral to the sparc program are the rich anatomical and functional datasets produced by investigators across the sparc consortium that provide key details about organ-specific circuitry, including structural and functional connectivity, mapping of cell types and molecular profiling. these datasets are provided to the research community through an open data platform, the sparc portal, available at sparc.science. sparc is also developing new tools and technologies to support modeling and simulation of nerve-end organ interactions.

the data produced by the sparc project is highly heterogeneous, deriving from multiple species, spatial and temporal scales, and anatomical, physiological and molecular techniques. to ensure that sparc data adhere to the principles for making data findable, accessible, interoperable and reusable (fair), the sparc curation team is charged with identifying and implementing community standards and annotating sparc data with rich metadata. standards are integral to fair because they make it easier to combine data across datasets, ensure that necessary metadata is provided, and make it possible to write automated tools to promote reuse of data. community standards are either adopted from other domains or developed by sparc to serve its needs. to date, sparc has been curating data to two primary standards developed by the sparc consortium: 1) the minimal information specification (mis), a semantic metadata scheme capturing key experimental and dataset details; 2) the sparc dataset structure (sds), a file and metadata organizational scheme based on the brain imaging data structure (bids), developed by the neuroimaging community.
sparc investigators are required to organize their data files and metadata according to sds; sparc curators then align the submitted metadata and file pointers to the mis using automated and semi-automated workflows.

in this paper, we explain the rationale behind the design of the sds and give a detailed description of the associated guidelines. this provides a full overview and instructions for anyone wanting to follow these fair data standards for any field of biomedical research. the sds may be useful for fields where fair data standards are yet to be established, as it is agnostic to data type. we also present automated validation and curation tools that have been developed for sparc, which could facilitate use of the sds beyond sparc. this paper also provides a foundational framework that could be used for adapting the sds to suit the specific needs of data from a particular field of research.

overview of sparc curation process

data and curation services and infrastructure for sparc are provided by the sparc data and resource center. currently, sparc data is uploaded to the blackfynn data platform, which provides a private, password-protected space for researchers to store and organize their data. data are uploaded from individual investigators in the sparc consortium according to timelines and milestones negotiated with the us national institutes of health (nih). investigators are required to upload their data within days of completing a particular milestone. each batch of data uploaded to complete a milestone is considered a sparc dataset. investigators are given instructions and templates for organizing their data according to the sds and are expected to upload their data in this format. once uploaded, data are curated by sparc curators, who review them for compliance with the sds, completeness of data and metadata, and overall quality.
certain types of data, e.g., d and d images, undergo spatial registration using the tissuemaker software developed by mbf biosciences, with organ-specific d scaffolds and data visualizations being created by the auckland bioengineering institute (abi). a more detailed curation workflow is described in section .

when complete, a dataset in sparc comprises the following:
1. data files uploaded to the blackfynn platform, organized according to the sparc data structure, that include all required metadata
2. a complete detailed experimental protocol in protocols.io describing any procedures used to obtain the data uploaded
3. if applicable, a set of fiducial mark up of d images for spatial registration of images to scaffolds; converting image files to required formats (performed by mbf biosciences)
4. if applicable, data registered to d spatial scaffolds, which includes creating visualizations of certain types of data, e.g., rnaseq (performed by abi)
5. a set of curator’s notes that accompanies the data file to summarize key parameters of the dataset

in this paper, we outline the rationale and structure of the sds and some of the tooling that has been developed to support it. a separate paper will be prepared for the mis.

development of the sds

to capture data across diverse types of biological data, the sparc consortium has adopted the brain imaging data structure (bids, rrid:scr_ ) format for research objects as a foundation for the sds (see fig. ). the bids format is a simple file folder organization and metadata scheme. at the top level, the bids format functions as a series of folders representing a dataset, consisting of a set of specified files and subfolders containing different types of metadata and data.

rationale

formal data structures like bids aim to increase the integrity of scientific research through the active encouragement and facilitation of fair. “findability” is improved when the names of organisms and organs are standardized to established community ontologies. “accessibility” and “interoperability” are improved as files are organized in more predictable locations across different datasets and when they use common and open formats, such as csv or tiff. “reusability” is improved by ensuring that all contributed data is well annotated, conforms to community standards (e.g., minimal information models, when such are available) and is made available under a clear license. for sparc, all datasets are released under the cc-by- . license.

bids was deliberately and carefully designed to complement likely research practices in the laboratory to ensure accurate capture of complex imaging experiments. towards this end, bids can be used by laboratories with minimal bioinformatics experience or support to manage, exchange, and submit well-annotated data in a human and machine-readable format. the bids format creates a resulting structure sufficiently standardized to support the creation of validation code (e.g., bids validator rrid:scr_ ). the bids validator is an application that checks for the presence of required files, and the completion of required fields within those files. the bids format and bids validation code are already used in several repositories that store imaging data, including openneuro.org (rrid:scr_ ). bids was developed and refined over many years, through many meetings and by many contributors.
this standard has become relatively well accepted in the neuroimaging community as a means to package and describe neuroimaging studies and has been endorsed by the international neuroinformatics coordinating facility (incf) through its standards review process.

the curation team joined sparc in just as the first deadlines for data submission by consortia members were approaching. based on the recent incf endorsement of bids, we recommended the project adopt a modification of bids as an initial effort to coordinate data across different laboratories. although bids was developed originally for neuroimaging, its basic structure is adaptable to various experimental paradigms. because of the diversity of data in sparc, the large number of files and complex structure of the datasets, we felt that without a consistent structure, data in sparc would be very difficult for end users to work with, and very difficult to curate, as each dataset would be organized and documented differently. as bids had already gone through multiple rounds of community review, including the independent incf review, and is a recognized standard for the openneuro data archive supported by the us brain initiative, we felt confident that it provided a solid foundation for sparc in the early stages of data sharing.

sds overview

the bids structure was modified to remove neuroimaging specific aspects, and to accommodate the fact that most data in sparc are derived from animals and animal tissue. thus, unlike in non-invasive neuroimaging studies, data may be acquired at the subject level, e.g., in vivo physiological recordings, or at the specimen level (from an ex-vivo tissue specimen or in vitro cell culture) (fig. ). the proposed modifications to bids were accepted by the sparc data standards committee and we moved forward with working with investigators to organize their data according to the sparc dataset structure. version .
was put in place to organize the first data submitted from january - july in anticipation of the debut of the sparc data portal at the th congress of the international society for the autonomic neuroscience (isan ). the overall structure is shown in fig. b. it defined a set of high-level folders, including one for subjects and one for specimens, and included various spreadsheets into which investigators could enter metadata for the dataset as a whole (dataset_description), subjects and samples. note that the file format chosen for these spreadsheets is .xlsx, rather than an open format like .tsv or .csv. although .csv is the preferred file format for tabular data in sparc, the curation team wanted to make it easier for both investigators and curators by including features such as drop down value sets for certain metadata fields, features which are not supported by these basic formats. in addition, the blackfynn data platform did not have a viewer available for .csv files, but did support on-line viewing of .xlsx through the microsoft open office suite. as with bids, the sds follows the inheritance principle that requires any metadata files in the root directory to apply to all folders and files below it, except when explicitly overridden by a metadata file contained in a lower order folder.

after a review of datasets submitted for isan and interviews with investigators, the curation team modified the basic structure to simplify the folder structure (fig. c), collapsing the subject and samples folders into a single folder named primary. samples may now be nested under their respective subjects. the current release is version . (fig. c). the required folders and files are provided to investigators as a downloadable versioned template via github (https://github.com/scicrunch/sparc-curation/releases/tag/dataset-template- . . ). all datasets are now curated according to version . , including those that were released for the isan meeting. although the sds is modeled on the approach by bids, i.e., file folder organization, naming scheme and provision of critical metadata, it is sufficiently distinct from bids that we do not consider it an extension, but rather a derivative (see fig. for a comparison between bids and sds). we now describe sds v . in more detail.

figure . transformation between dicom and bids .

figure . a comparison of high-level details of bids (a), sds . (b) and sds . (c).

sparc data structure v .

the sparc dataset structure includes the following components (fig. ):
• a set of organized data files in a hierarchical set of predictably named folders and subfolders. folders/subfolders may contain supplementary and additional documentation, i.e. manifest files that describe the files and/or folders contained therein.
• a set of descriptive top-level files that contain information on subjects, experimental information, and dataset descriptions. these descriptive files include both spreadsheets containing structured metadata and text files with additional information.
• a set of file manifests associated with each folder that provides descriptions of the contents.
top-level structure

data files are organized into different top-level folders, depending on the type of data:
• primary: a required dataset dependent folder that contains all folders and files for experimental subjects and/or samples, e.g., time-series data, tabular data, clinical imaging data, genomic, metabolomic, microscopy data. the data generally have been minimally processed so they are in a form ready for analysis. within the primary folder, data is organized by subjects or samples (see section ). all subjects and samples will have a unique folder with a standardized name corresponding to the exact names or ids as referenced in the subjects and samples metadata file (see fig. ).
• source: an optional folder containing unaltered, raw files from an experiment, if they are included in the data. for example, this folder may include the “truly” raw k-space data for a magnetic resonance (mr) image that has not yet been reconstructed, or a set of microscopic images that had not yet been assembled into a mosaic. the reconstructed dicom or nifti files and the image mosaic, for example, would be found within the primary folder.
• derivative: a required folder if derivative data exists. this folder contains derived data files, for example, processed image stacks that are annotated via the microbrightfield (mbf biosciences) tools, segmentation files, or smoothed overlays of current and voltage that demonstrate a particular effect. if files are converted into a format other than what was submitted, these files are included in the derivative folder. derived data should be organized into subject and sample folders, using the subject and sample ids as the folder names, as with the primary data.

other files are organized in three different (optional) folders:
• code: a required folder only if code is used in generation of the data; the folder contains all the source code used in the study, e.g., matlab.
• protocol: an optional folder that contains supplementary files to accompany the experimental protocols submitted to protocols.io. the additional files in this folder are not a substitution for the experimental protocol which should have been submitted to protocols.io/sparc.
• docs: an optional folder that contains all the supporting documents for the dataset, including but not limited to, a representative image for the dataset. unlike the readme file, which is necessarily a text document, docs can contain documents in multiple formats, including images.

figure . the organization structure of the files and folders for a sparc dataset.

descriptive top-level files

a set of descriptive, top-level files contain information on subjects, samples, dataset descriptions and administrative data. these files contain required metadata fields that are aligned to the datacite schema (dataset description), and the hbp’s (minimal information about a neuroscience data set) for subjects and samples. additional recommended fields are included for each (see supplementary material a). investigators are encouraged to add additional columns beyond this core set to thoroughly describe the dataset.
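the top-level folder layout described in the top-level structure section can be sketched in a few lines of python. this is only an illustration under stated assumptions, not the official template: the folder names (primary, source, derivative, code, protocol, docs) come from the text, while the function name `make_sds_skeleton`, its arguments and the example ids `sub-1` and `sam-1` are hypothetical.

```python
from pathlib import Path
import tempfile

# Folder names are taken from the text. Treating "primary" as the only
# folder that is always created is an assumption of this sketch.
REQUIRED_FOLDERS = ["primary"]
CONDITIONAL_FOLDERS = ["source", "derivative", "code", "protocol", "docs"]


def make_sds_skeleton(root, subjects, samples_by_subject=None, extra_folders=()):
    """Create an SDS-like folder skeleton under `root` (hypothetical helper).

    subjects: subject ids; each one becomes a folder under primary/.
    samples_by_subject: optional dict of subject id -> list of sample ids;
        each sample becomes a subfolder of its subject folder.
    extra_folders: conditional or optional top-level folders to create.
    Returns the created folders as sorted POSIX-style relative paths.
    """
    root = Path(root)
    samples_by_subject = samples_by_subject or {}
    for name in extra_folders:
        if name not in CONDITIONAL_FOLDERS:
            raise ValueError(f"unknown top-level folder: {name}")
    for name in [*REQUIRED_FOLDERS, *extra_folders]:
        (root / name).mkdir(parents=True, exist_ok=True)
    for subject_id in subjects:
        subject_dir = root / "primary" / subject_id
        subject_dir.mkdir(exist_ok=True)
        for sample_id in samples_by_subject.get(subject_id, []):
            (subject_dir / sample_id).mkdir(exist_ok=True)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for folder in make_sds_skeleton(tmp, ["sub-1"], {"sub-1": ["sam-1"]}):
            print(folder)
```

naming each folder after the exact subject or sample id mirrors the requirement that folder names match the ids in the metadata files; the metadata spreadsheets themselves are not generated by this sketch.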
while there is a great deal of flexibility built into the metadata templates in order to accommodate the diversity of experimental paradigms and data, for the effective functioning of the validator (described in section ), it is important for data wranglers not to add, edit or delete required columns in the mandatory descriptive files (these are color-coded; see green and blue in appendix a). if there is information that doesn't correspond with available columns, the information should be added to a new column on the right-hand side (subjects and samples) or a new row on the bottom of the sheet (dataset description). if there is information not available to the researcher at the time of submission, fields should be left empty or marked “unknown”.

an overview of the spreadsheet metadata templates is provided below:
• dataset_description (xlsx, csv or json): required file containing basic metadata about a dataset, derived largely from the datacite schema. a full list of metadata and definitions is provided in supplementary table a . investigators provide basic metadata such as title, description, contributors, funding and contact person, that provide provenance for the dataset and also support formal data citation. the version . release includes an additional field specifying the metadata version. this field is not to be changed by data submitters. it allows proper alignment between different metadata releases, securing the data integrity for multiple batches of submissions. we also encourage researchers to describe whether they plan to submit more data for new or for the same subjects, i.e., whether this dataset is part of a larger study. this will help determine when all the primary data has been deposited and help with mapping across the different parts of the dataset.
• submission (xlsx, csv or json): required file containing information relevant to internal sparc bookkeeping, relating milestones negotiated with nih to datasets submitted.
according to the sparc material sharing policy, data is to be deposited within days of milestone completion and will become public no later than year after milestone completion. this file is for internal use only; it must not be released when the data are published.
• subjects (xlsx, csv or json): required file if subjects are used in the experiment producing the dataset. contains required and optional metadata fields providing information about subjects (model organism or animals) involved in data collection. the file contains fields specifying provenance for the subject, e.g., subject_id, pool_id and experimental group (blue fields in appendix a ). each subject and each pool of subjects must be assigned a unique id, as this id is used to name the data folders for individual subjects. for proper mapping of the data, folders containing experimental data need to exactly match the subject id. all subject identifiers must be unique within a dataset and not contain any sensitive, identifiable information (for human subjects). having each lab use unique subject identifiers across datasets is highly desirable to aid in connecting multiple experiments using the same subjects. in the future, we plan to connect subjects across datasets and projects; however, we currently do not map subjects across multiple data submissions. the subjects.xlsx file contains several mandatory fields (green in appendix a ) including species, age, strain and research resource identifier (rrid). additional columns containing additional descriptive metadata, demographic assessment data, etc., largely derived from openminds, are provided for investigators in the template. in the download template, these are highlighted in yellow (see appendix a ) and serve as exemplars of the types of metadata that are important for providing scientific context.
according to the fair principles, data should be described by a “plurality of relevant attributes”, but we are leaving it up to the investigators’ discretion to decide what is sufficient for others to understand and reuse the data. investigators are free to add as many fields as they deem necessary. currently, all metadata provided for subjects and samples is provided in free text, which is then mapped to the sparc vocabularies by the curation team (see section . ). however, we are actively working with investigators on lists of controlled vocabularies for certain fields.
• samples (xlsx, csv or json): conditional file required if measurements are obtained from samples, e.g., tissue slices, derived from individual or pooled subjects. this file contains information about samples used to generate the data. investigators must provide a unique id for each sample that will be used to name the data folders. the sample id must match the folder id exactly. each sample should also reference a subject from the subject file; a single subject (a research animal/donor) may be linked to multiple biological samples derived from that subject. if the samples are pooled from multiple subjects, the complete provenance must be specified in the subject file. the metadata present in the samples file should also explicitly note whether a sample was collected directly or was derived from another sample. required metadata includes the subject or tissue from which the sample was derived and the anatomical location (green in appendix a ). additional fields may be added by the investigator. the template provides some suggested fields derived from the minimal information about a neuroscience dataset (openminds). investigators should only use columns that are relevant to their type of study.

an overview of the descriptive text files is provided below:
• readme (txt): required file provided by investigators that contains necessary details for reuse of the data, beyond that which is captured in structured metadata. information that should be included:
  o how would a user use the files that are provided? e.g., first open file x and then look at file y.
  o what additional details do they need to know? are some subjects missing data?
  o are there warnings about how to use the data or code?
  o are there appropriate/inappropriate uses for this data?
  o are there other places that users can go for more information? e.g., did you provide a github repository or are there additional papers beyond what was provided in the metadata form?
• plog (xlsx or txt): optional performance log file, which can be used to attach information about individual performances of the experiment, e.g. how long they took, what the average room temperature was, or who performed them. there is currently no other place in the data model to attach that kind of information.
• changes (txt): conditional file required if a new version of the dataset is uploaded to document any changes from the previous version.

manifests

manifest (xlsx, csv or json): required file that must be in all folders containing data files (fig. , fig. ) and in folders with subfolders whose meaning is not clear.
this file contains information and metadata about the files and folders that are expected in the folder where it sits. required fields include file name (or file name pattern for folders with many related files), description and file type, although investigators have a lot of flexibility to add additional columns, including notes about pertinent aspects of each file that differentiate the files (e.g., data collection specific protocol, stimulation condition, microscope filter applied, drug applied, etc). the manifest file can apply to collections of files (through the use of a file pattern) or list specific files (e.g. sub??-task -run?? can specify all the files related to task in the protocol). if investigators include folders that organize data along a particular dimension, e.g., datatype or time point, a manifest file should be generated that describes the content of the folders.

folder hierarchy principles

the folders and files pictured in fig. are required and invariant for each sparc dataset. this invariance imposes a standard structure for sparc datasets that allows a user to reliably navigate the often complex experimental data (fig. ). however, given the variety of different experimental protocols and the way in which subjects and samples are treated across different types of experiments, the folder and file structure can vary among different datasets within the primary data folder. some examples are provided with the template download (fig. ). for the majority of sparc datasets, data in the primary data folder are organized into subject folders, with the folder names corresponding to the subject ids provided in the subjects.xlsx file.
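the file-pattern convention for manifests described in the manifests section can be illustrated with a short python sketch. it assumes that `?` in a manifest pattern stands for exactly one character, as in shell-style wildcards; the text does not define the pattern syntax, so this reading is an assumption. the function name, the column keys `filename` and `description`, and the example file names (including the task number, which is elided in the text) are hypothetical.

```python
from fnmatch import fnmatchcase


def files_covered_by_manifest(file_names, manifest_rows):
    """Match file names against manifest rows (hypothetical helper).

    manifest_rows: list of dicts with a "filename" entry (a literal name or
        a shell-style pattern) and a "description" entry.
    Returns (covered, uncovered): covered maps each matched file name to the
    description of the first matching row; uncovered lists the other files.
    """
    covered = {}
    uncovered = []
    for name in file_names:
        for row in manifest_rows:
            # fnmatchcase is case-sensitive and treats "?" as one character.
            if fnmatchcase(name, row["filename"]):
                covered[name] = row["description"]
                break
        else:
            uncovered.append(name)
    return covered, uncovered


if __name__ == "__main__":
    rows = [
        {"filename": "sub??-task1-run??.csv", "description": "task 1 recordings"},
        {"filename": "notes.txt", "description": "free-text notes"},
    ]
    files = ["sub01-task1-run02.csv", "sub01-task2-run01.csv", "notes.txt"]
    covered, uncovered = files_covered_by_manifest(files, rows)
    print(covered)
    print(uncovered)
```

listing the uncovered files separately makes it easy to spot data files that no manifest row describes.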
if samples are derived from these subjects, data files are organized within sample subfolders under the appropriate subject, according to this pattern (pattern ). the inheritance principle applies, so that if sample (sam- ) appears as a subfolder of subject (sub- ), then it is assumed that sample was obtained from subject (fig. ). in some cases, no data may be derived from the subjects directly, i.e., no data files are generated at the subject level. in this case, investigators could omit the subject folder (although the subjects.xlsx file must be included to provide the appropriate metadata).

figure . relationships between metadata files and folder structure. example taken from (morris et al. )

figure . example of a complete manifest. from morris et al ( ).

time series data: for functional studies where measurements are obtained at different time points, the time points should be organized into folders labeled perf- , perf- etc, where the numbering indicates the temporal ordering, under either the sample or subject folder. an example is shown in fig. (pattern ). note that in this case the manifest file would specify information about the data that will be found in the perf- and perf- folders.

pooled samples or subjects: although the majority of data are organized with subjects nested under the primary data folder and samples nested under subjects, this simple hierarchical arrangement does not apply to all datasets.
in some cases, samples may be pooled from multiple subjects, in which case the sample folder lives alongside the subject folder and not nested within it, according to pattern . the sds also accommodates subject pools where the samples folder is replaced with the pooled folder (pattern ; fig. ). note that pool_ids and characteristics must be provided in the subjects.xlsx file. . tooling to support sparc dataset structure . . sds validator to enforce the sds structure and required metadata fields, the curation team developed a sparc dataset structure validator that is used for frequent checks to ensure the integrity of the data across the platform and provide valuable feedback to the curation team. the validator is written in python and uses json schema to specify the expected structure of the dataset files and folders, as well as the structure and contents of the types of metadata files (dataset_description, subjects/samples, submission, and manifest). tabular metadata files are transformed into json, and validated against the schemas. the validator first checks that all the required metadata files are present after which the content of the individual metadata files is validated. for example, in the subjects.xlsx file, checks are performed to ensure that all subject ids are unique, that there are not names in columns that expect numbers (e.g. 'adult' in the 'age' column is an error) and that the files in the primary data folder match the names and number of subjects and samples provided in the metadata files. the validator also checks that organism and anatomical entities are present in the appropriate columns, by matching the content of these columns against the sparc vocabularies (see section . ). this is not an exhaustive list of the checks that are performed, but it gives a flavor for the types of checks that are done (fig. ). 
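the kinds of checks described above can be sketched in a few lines of plain python. this is not the sparc validator, which the text says is built on json schema; the function name, the row layout (dicts with `subject_id` and `age` keys) and the rule that an age is either "unknown" or starts with a number are assumptions made for illustration.

```python
def check_subjects(rows, primary_folders):
    """return error strings for hypothetical rows of a subjects table.

    rows: list of dicts with 'subject_id' and 'age' keys (assumed layout).
    primary_folders: folder names found under the primary data folder.
    """
    errors = []
    ids = [str(row.get("subject_id", "")).strip() for row in rows]

    # every subject id must be unique
    seen = set()
    for subject_id in ids:
        if subject_id in seen:
            errors.append(f"duplicate subject_id: {subject_id}")
        seen.add(subject_id)

    # a column that expects a number must not contain a name such as 'adult'
    for subject_id, row in zip(ids, rows):
        age = str(row.get("age", "")).strip()
        if age == "unknown":
            continue
        try:
            float(age.split(" ")[0])
        except ValueError:
            errors.append(f"non-numeric age for {subject_id}: {age!r}")

    # folders in the primary data folder must match the listed subjects
    for subject_id in sorted(seen - set(primary_folders)):
        errors.append(f"no folder for subject: {subject_id}")
    for folder in sorted(set(primary_folders) - seen):
        errors.append(f"folder without subject: {folder}")
    return errors
```

returning a list of messages rather than raising on the first problem mirrors the report-style output that curators act on.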
some of the most common mistakes detected by the validator arise when investigators remove headers or cells for which they do not have information. the validator then looks for information in the wrong place. for example, if species information is expected in cell f (fig. ) but the investigator deleted column e, the strain information will be read into the species field, producing an error because c bl/ j is not in the ontology as a valid species name; all other information to the right of the deleted column or below the deleted row will also be incorrect. errors are noted per dataset and categorized by type so that curators can act on them. for simple alignment errors, the curators usually replace the affected files by pasting the misaligned data into a fresh template. with the newly released data organization tool, soda (see section . ), these sorts of errors will be less of a problem because at least some of the metadata files will be replaced by a form that asks investigators questions and produces a properly formatted file. validation is run automatically on each dataset, but is only meaningful for datasets that are undergoing curation, where these errors are read and acted upon. while the data are being prepared by the investigator for submission, it is not uncommon for datasets to have very large numbers of errors, as none of the files may be in the right location and metadata fields may be incomplete. the complete curation workflow is described in section .

figure . dataset-template . . folder hierarchy.

figure . workflow for the sds data validator.

in addition to running the validation of the required metadata, the validation code also extracts metadata from the sds and maps them to the mis. during this process, certain metadata fields, e.g., anatomical structure, are mapped to the nif standard ontology (rrid:scr_ ), which in turn imports multiple community ontologies such as ncbi taxonomy, uberon and chebi. additional ontologies, e.g., fma, are used as necessary. a list of identifiers used to map sparc data is provided in table . the validator produces a set of files from the contents of the required metadata files, the ontologies and other data sources such as protocols.io. these are made available in several formats, including the mis “ttl” file (also json and csv), to blackfynn, curation systems, and the drc staff. with each run of the validator code, the metadata in the ttl file will therefore change to reflect the current state of the dataset. the curation team has created a private searchable and sortable table using ucsd’s scicrunch.org infrastructure, https://scicrunch.org/sparc, which allows curators to quickly see which elements are missing in each dataset and determine whether an error can be fixed by curators or whether the investigator needs to help resolve it.

. . software to organize data automatically (soda) for sparc

complying with the above-described guidelines requires an additional time investment from researchers (as with any data curation and submission standard), and the data curation process can become progressively overwhelming as additional data are submitted. if researchers are not currently using any standard way of organizing their data within the laboratory, this work will benefit the laboratory in the long run. however, if researchers already use a formal method for organizing their data, complying with sparc requirements could prove even more burdensome as they must organize their data according to additional rules.
to remediate this issue, a software named software to organize data automatically (soda) for sparc has been developed to assist sparc investigators in easily curating and annotating their datasets. distributed as an open-source (mit license) and cross-platform (windows, macos, linux) desktop application, the goal of soda for sparc is to bridge a long-standing, overlooked gap between comprehensive data standards and their convenient application by researchers. soda for sparc provides an interactive interface that, without requiring any coding knowledge, walks sparc investigators step-by-step through the sparc data curation process, all the while automating repetitive, complex, and time-consuming tasks. besides being time-efficient, soda for sparc also provides the convenience to sparc investigators of organizing their datasets following a custom workflow (e.g., based on personal preferences or to comply with internal guidelines applicable in their labs) and rapidly organize their data according to the sds only when they are ready to submit the dataset for review by the sparc curation team. the soda for sparc installers as well as the source code are accessible via the dedicated github repository . during the first phase of development (may -august ), the following features were integrated into soda for sparc (fig. ): . prepare submission and dataset_description metadata files through an intuitive interface and with assistance from the program that provides access to standard values/terminologies and makes automated suggestions based on previously saved information. . prepare datasets step-by-step via a convenient interface • specify desired local data files to be included in each of the sparc folders. • specify metadata files to be attached. • request manifest files to be generated automatically. 
• check that information provided during the previous steps will generate a sparc-approved dataset using an automated validator (before a thorough validation by the sparc curation team).
• generate a dataset based on information specified during the previous steps either locally or directly on the blackfynn platform (to avoid duplicating files on the user’s computer).
. manage datasets by easily connecting to blackfynn with soda for sparc, then conveniently create datasets, add metadata to blackfynn datasets, manage dataset permissions, upload local files/folders, and share datasets with the sparc curation team for review.

table . list of ontologies and controlled vocabularies used to map sparc metadata.
entity | identifier system or controlled vocabulary
author | orcid
contributor roles | datacite
species | ncbi taxonomy
strains | rrids
antibodies | rrids
cell lines | rrids
software tools and instruments | rrids
anatomical structures | uberon and fma
small molecules | chebi
techniques | nifstd
experimental modalities | controlled list, see appendix b, table b
diseases or conditions | mondo or disease ontology

during the second phase of development (starting september ), more features are being added to the software including a virtual interface for organizing data, support for collaborative data curation, assistance for preparing samples and subjects metadata files, and file-level curation support. the user interface is also being upgraded to make use of the software more intuitive. a screenshot of the user interface from the current version ( . .
) is provided in fig. . a team of beta testers, all of whom receive funding from the sparc program, is reviewing and providing feedback frequently to ensure that soda for sparc meets the needs of the sparc investigators. preliminary testing by the beta testers has shown that computer-assisted curation with soda not only reduces the time required by investigators to organize and submit their data, but also minimizes human errors . more features will be included in the future to enhance further the curation workflow and ensure that sparc datasets are disseminated efficiently. even beyond the sparc consortium, quality data curation is a critical concern. soda for sparc could impact the broader research community by providing an exemplar, foundational tool for convenient and time-efficient data curation, which could then be adopted by other projects. in the future, we expect to modify the bids-inspired sparc sds for computational studies (the changes as they currently stand are in a draft version and will need to be approved by the data sharing committee before being acted on) that are undertaken as part of the sparc project, it is likely that this will involve changes to the soda for sparc tool in compliance. . sparc data submission workflow all investigators in sparc have year from the time a milestone is completed (fig. ), and a draft dataset is submitted (step ) to publish the resulting dataset (step ). a dataset is published when it has been assigned a digital object identifier (doi) and is available for viewing and download by the public. during that year, the dataset will move through several curatorial stages and possibly an embargo period. investigators will have days from the completion of a milestone to formally submit their data to the sparc data repository. data is considered completely submitted only when the data are shared with the data curation team. once curation is complete, the dataset moves into an embargo phase or is published. 
during the embargo phase, the data set is visible only to members of the sparc consortium who have signed a data use agreement. the submission + curation + embargo periods add up to year; that is, the length of the embargo period depends on how long it takes to curate the data to the above standards. curation is a collaborative process that involves a back and forth between the investigator and the curation team, and so the time to completion is difficult to predict. however, if investigators wish to publish before the end of the embargo period, they are encouraged to do so.

figure . overview of the major features included in soda during the first development phase.

figure . user interface (on a windows computer) from version . . of soda for sparc.

creating a sparc dataset in the sds structure involves multiple steps (instructions with detailed steps can be found at https://sparc.science/help/ k nepuw fjoq hus ovsd):
• investigator: create and name a draft dataset in their private space on the sparc data repository hosted by blackfynn (within the “sparc consortium” organization on blackfynn).
• investigator: organize and upload files to the dataset within this space according to the requirements of the sparc data structure, using the template provided by sparc.
• investigator: request a publication review. this step initiates the curation process and locks the dataset so that changes can only be made by the curation team.
• investigator: upload the experimental protocol to protocols.io and share this protocol with the sparc group.
• script: downloads all sparc data from blackfynn.
• curator: logs all sparc datasets, their status and any communication tickets with the investigator in the master spreadsheet.
• script: run weekly to find new datasets by matching the dataset ids in the data dump with those on the master spreadsheet.
• curator: send an email acknowledgment within working days when a new dataset is detected.
• script: run all datasets through the validator.
• curator and investigator: curators will work with the automated validator report and investigators to ensure that required fields are complete and the folder structure is appropriate.
• curator: find mis data elements in the protocol using semi-automated tools, adding these to the structured metadata package that will be sent back to blackfynn as a .ttl file.
• curator: hand off image datasets to mbf biosciences curators for segmentation assistance, spatial registration and conversion to sparc approved formats and to transform the banner image.
• curator: hand off data if genetic or physiology data are present to the auckland curation team to create appropriate data visualizations for those data types.
• curator: finalize the dataset within blackfynn, adding the finalized description once data is aligned to the sparc standards, annotated and sign off is received from the mbf biosciences & auckland team, adding license information and provisioning a doi, if the data are to be published immediately.
• investigator: final check by pi of dataset after curators sign off.
• investigator: request dataset to be published.
• curator: publishes the dataset (principal investigator), or allows it to be published automatically after the embargo period ends.

these steps can be viewed within the private data portal using the “dataset status”, a feature implemented in blackfynn in december . the steps that each dataset goes through are formalized, numbered, and color coded (fig. ).
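the weekly matching step listed in the workflow above amounts to a set difference between the dataset ids in the data dump and those already logged. the sketch below is a minimal illustration of that idea, not the curation team's script; the function name and the id strings in the usage note are hypothetical.

```python
def find_new_datasets(dump_ids, spreadsheet_ids):
    """return dataset ids present in the data dump but not yet logged.

    both arguments are iterables of id strings; duplicates are ignored and
    the result is sorted so that repeated runs give a stable order.
    """
    return sorted(set(dump_ids) - set(spreadsheet_ids))
```

for example, `find_new_datasets(["ds-a", "ds-b", "ds-c"], ["ds-a"])` would report `ds-b` and `ds-c` as datasets that still need to be logged and acknowledged.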
each label is associated with the party that is responsible for setting the particular status. please note that the teams at mbf biosciences and abi are considered curators for this workflow. these teams are responsible for ensuring that sparc data are aligned to common spatial frameworks, as described in the introduction. these steps are not necessarily performed in sequential order. for example, the image registration, conversion and segmentation performed by mbf biosciences may be performed before the imaging files are uploaded to blackfynn. researchers do, however, create the necessary dataset descriptors in blackfynn and often upload the necessary metadata files. this will mean that in some cases the order will go from - - - (mbf biosciences)- - (ucsd curation). fig. is a schematic representation of the workflow described above. it highlights how data are generated by individual investigators, curated by the data curation team, and shared as an embargoed dataset with the sparc embargoed data sharing group. it also shows how the data are made available to the public over time.

figure . ordered status types set by investigators or set by curators.

figure . data submission milestones.

. quality control metrics for sparc datasets

the sparc curation team has developed a set of qc guidelines that are used to check for errors and to ensure consistency in the descriptions of sparc data. sparc datasets are checked by the curation team for the following:
. they conform to the requirements of the sparc data set structure
. the files are appropriately organized into the primary, derivative, docs, code, protocol folders
. manifests are included at each level of the folder tree and contain a sufficient description of the files present
. title and description are clear, appropriate and detailed
. if the data are part of a larger dataset, the relationships are specified
. the species is appropriately identified and referred to consistently across protocol, dataset description and experimental data
. all types of experimental data referred to in the abstract or protocol are contained in the dataset
. all abbreviations used to describe the dataset across the different documents are defined
. all experimental or sample groups referenced in the metadata are defined
. all file types submitted conform to approved file types (upcoming)
. as metadata standards are defined they are used appropriately

a checklist has been developed, which also includes questions that can be asked of the investigator. some of these checks will be incorporated into future versions of soda.

. discussion

the establishment of the sds has proven to be essential to curating the complex and large datasets submitted for sparc. with a common structure, curators can take advantage of tools such as the validator to help with the curation process, allowing them to focus on the scientific aspects of the dataset (e.g., are the description and protocol clear?) rather than simple mechanical tasks such as checking whether the number of subjects listed matches the number of subject folders. we realize that the sds presents extra work for sparc investigators, who must adapt their local lab practices to comply with a new structure. however, with the launch of soda, investigators should find it easier to walk through the curation process.
finally, as the sparc portal evolves, the user interface can take advantage of the regular structure to make it easier to browse sparc data in a consistent manner.

in order for the sparc project to meet its deliverables, the first round of standards needed to be implemented relatively quickly. the first public data release for sparc occurred in july at the isan meeting. at that time, curators were curating to sds . , but many of the datasets released were demo datasets and were not fully structured. the sds was revised in october of in response to the july release and through discussions with investigators. data for the february release were curated to sds . . . at that time, all of the original datasets were also recurated. as the sds continues to evolve - version . is scheduled to be released in spring of - we are not planning on recurating older data, as it would not be feasible to constantly revise the large number of datasets available through sparc. we are, however, extracting larger amounts of structured information from these datasets, e.g., from the experimental protocols, and mapping it to the mis, so some re-curation of metadata does occur. this information will be used to create more powerful and nuanced search across sparc datasets and models.

figure . overview of the entire submission-curation-publishing workflow.

because of the consortium’s variability in experimental methodology and primary data types, we are continuing to evaluate whether the sparc data structure is sufficient for any investigator to unambiguously interpret the datasets from other research labs.
on the technical side, we are examining this new structure’s ability to facilitate datasets to be exchanged and queried freely as well as understood by other scientists. for each use case such as simulation data or physiology data, we look at the relevant sparc protocols and current results to determine the required parameters needed to understand the resulting data. we are using this information as a basis for formalizing modality- specific extensions to the sds and mis and to develop qc guidelines, as outlined in section . there are several additional areas where standardization will benefit sparc. within the next year, sparc will also move to implement more consistent file formats for major data types, ensuring that sparc data is available in non-proprietary formats. for example, all imaging data will have to be submitted as jpeg and biotiff or in a format that can be converted to these formats. guidelines for additional data types will be released in summer of . a third driver of standards in sparc is the requirement to be interoperable with other data repositories, particularly those being created by the us brain initiative and other large brain projects around the world. the us brain initiative is investing in the creation of standards for major data types such as neuroimaging (bids ), neurophysiology (nwb ) and standards for d microscopy. these standards underlie the major archives established for brain data: openneuro, dandi and the brain image library, respectively. sparc will be monitoring these standards for maturity and will create the means for sparc data to be converted into these formats. the establishment of standards for sparc also underwent a governance change after the first data release. 
while in the first phase of the project, data standards were developed or recommended by the data standards committee comprising sparc investigators, after the first sets of data were released, responsibility for recommending and implementing new standards was shifted to the curation team, as they are most familiar with the breadth of sparc data and the areas requiring standardization. the recommendations of the sparc curation team are then put forward for review by the data standards group and the sparc community at large.

references
. olofsson, p. s. & tracey, k. j. bioelectronic medicine: technology targeting molecular mechanisms for therapy. journal of internal medicine vol. – ( ).
. wilkinson, m. d. et al. the fair guiding principles for scientific data management and stewardship. sci. data , ( ).
. gorgolewski, k. j. et al. the brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. sci. data , – ( ).
. blackfynn. https://www.blackfynn.com/.
. abrams, m. b. et al. a standards organization for open and fair neuroscience: the international neuroinformatics coordinating facility. neuroinformatics – ( ) doi: . /s - - - .
. openminds. https://github.com/humanbrainproject/openminds.
. datacite metadata working group. datacite metadata schema documentation for the publication and citation of research data. version . . datacite e.v doi:https://doi.org/ . / xq -zf .
. sparc material sharing policy. nih commons https://commonfund.nih.gov/sparc/materialsharing ( ).
. morris, k. et al. feline brainstem neuron extracellular potential recordings. ( ) doi:https://doi.org/ . / upo-xvkt.
. sparc-curation: code and files for sparc curation workflows. https://github.com/scicrunch/sparc-curation.
. json schema draft- release notes | json schema. https://json-schema.org/draft- /json-schema-release-notes.html.
. bug, w. j. et al. the nifstd and birnlex vocabularies: building comprehensive ontologies for neuroscience. neuroinformatics vol. – ( ).
. patel, b. soda for sparc: simplifying data curation for researchers funded by the nih sparc initiative. https://github.com/bvhpatel/soda.
. patel, b., srivastava, h., aghasafari, p. & helmer, k. sparc: soda, an interactive software for curating sparc datasets. faseb j. , – ( ).
. qc documentation for investigators: titles and descriptions - google docs. https://docs.google.com/document/d/ zo xdkrkpfofj qll f_ mlox qjmbzigbpdwti hjw /edit#.
. qc checklist for sparc - google sheets. https://docs.google.com/spreadsheets/d/ emnkgibvef tsi-rpatrb kgu-v kimhwbm v- mvvuy/edit#gid= .
. the brain imaging data structure - bids v . . . https://bids-specification.readthedocs.io/en/stable/.
. ruebel, o. et al. nwb:n . : an accessible data standard for neurophysiology. biorxiv ( ) doi: . / .

acknowledgements
we acknowledge funding from nih sparc ot od , nih sparc ot od , and nih sparc ot od .

author contributions
a.b., j.g., a.p., t.g, g.p., m s-z, and m.m. form the sparc curation team and have all participated in the development of the sds. b.p. is leading the development of soda for sparc. all have contributed to the writing and revision of this manuscript.

competing interest statement
ab, mm and jg have equity interest in scicrunch.com, a tech start up out of ucsd that develops tools and services for reproducible science, including support for rrids. ab is the ceo of scicrunch.com.

appendix
metadata specifications for sparc datasets v . .

table a : descriptive metadata for v . . .
required fields are highlighted in green while conditional fields (i.e., required if present) are highlighted in yellow.
metadata element | description | example
name | descriptive title for the data set. equivalent to the title of a scientific paper. the metadata associated with the published version of this dataset does not currently make use of this field. | my sparc dataset
description | note this field is not currently used when publishing a sparc dataset. brief description of the study and the data set. equivalent to the abstract of a scientific paper. include the rationale for the approach, the types of data collected, the techniques used, formats and number of files and an approximate size. the metadata associated with the published version of this dataset does not currently make use of this field. | a really cool dataset that i collected to answer some question.
keywords | a set of - keywords other than those in the title that will aid in search | spinal cord, electrophysiology, rna-seq, mouse
contributors | name of any contributors to the dataset. these individuals need not have been authors on any publications describing the data, but should be acknowledged for their role in producing and publishing the data set. if more than one, add each contributor in a new column. | last, first middle
contributor orcid id | orcid id. if you don't have an orcid, we suggest you sign up for one. | https://orcid.org/ - - -
contributor affiliation | institutional affiliation for contributors | https://ror.org/ r w
contributor role | contributor role, e.g., principleinvestigator, creator, coinvestigator, contactperson, datacollector, datacurator, datamanager, distributor, editor, producer, projectleader, projectmanager, projectmember, relatedperson, researcher, researchgroup, sponsor, supervisor, workpackageleader, other. these roles are provided by the data cite schema. if more than one, add additional columns | data collector
is contact person | yes or no if the contributor is a contact person for the dataset | yes
acknowledgements | acknowledgements beyond funding and contributors | thank you everyone!
funding | funding sources | ot od
originating article doi | dois of published articles that were generated from this dataset | https://doi.org/ . / jchdy
protocol url or doi | urls (if still private) / dois (if public) of protocols from protocols.io related to this dataset |
additional links | urls of additional resources used by this dataset (e.g., a link to a code repository) | https://github.com/myuser/code-for-really-cool-data
link description | short description of url content, you do not need to fill this in for originating article doi or protocol url or doi | link to github repository for code used in this study
number of subjects | number of unique subjects in this dataset, should match subjects metadata file. |
number of samples | number of unique samples in this dataset, should match samples metadata file. set to zero if there are no samples. |
completeness of data set | is the data set as uploaded complete or is it part of an ongoing study. use "hasnext" to indicate that you expect more data on different subjects as a continuation of this study. use “haschildren” to indicate that you expect more data on the same subjects or samples derived from those subjects. | hasnext, haschildren
parent dataset id | if this is a part of a larger data set, or references subjects or samples from a parent dataset, what was the accession number of the prior batch. you need only give us the number of the last batch, not all batches. if samples and subjects are from multiple parent datasets please create a comma separated list of all parent ids. | n:dataset:c c f f- be- -bfc - b f cf
title for complete data set | please give us a provisional title for the entire data set. |
metadata version do not change | | . . .

table a : subject metadata. required fields are highlighted in green while recommended fields are highlighted in yellow. blue fields are required (pool_id only if pooled subjects were used) and provide the necessary fields for providing provenance of subjects and subject pools within experiments.
attribute | description | example
subject_id | lab-based schema for identifying each subject, should match folder names | sub-
pool_id | if data is collected on multiple subjects at the same time include the identifier of the pool where the data file will be found. if this is included it should be the name of the top level folder inside primary. | pool-
experimental group | experimental group subject is assigned to in research project | control
age | age of the subject (e.g., hours, days, weeks, years old) or if unknown fill in with “unknown” | weeks
sex | sex of the subject, or if unknown fill in with “unknown” | female
species | subject species | rattus norvegicus
strain | organism strain of the subject | sprague-dawley
rrid for strain | research resource identifier identification (rrid) for the strain for this field | rrid:rgd_
additional fields (e.g. minds); minds = minimal information about a neuroscience dataset
age category | description of age category derived from uberon life cycle stage | prime adult stage
age range (min) | the minimal age (youngest) of the research subjects. the format for this field: numerical value + space + unit (spelled out) | days
age range (max) | the maximal age (oldest) of the research subjects. the format for this field: numerical value + space + unit (spelled out) | days
handedness | preference of the subject to use the right or left hand, if applicable | right
genotype | ignore if rrid is filled in, genetic makeup of genetically modified alleles in transgenic animals belonging to the same subject group | mgi:
reference atlas | the reference atlas and organ | paxinos and watson, the rat brain in stereotaxic coordinates, th ed,
protocol title | once the research protocol is uploaded to protocols.io, the title of the protocol within protocols.io must be noted in this field. | spinal cord extraction
protocol.io location | the protocol.io url for the protocol. once the protocol is uploaded to protocols.io, the protocol must be shared with the sparc group and the protocol.io url is noted in this field. please share with the sparc group. | https://www.protocols.io/view/corchea-paper-based-microfluidic-device-vtwe pe
experimental log file name | a file containing experimental records for each sample. |

table a : sample metadata. the color key is the same as for subjects (a ).
attribute | description | example
subject_id | lab-based schema for identifying each subject | sub-
sample_id | lab-based schema for identifying each sample, must be unique | sub- _sam-
wasderivedfromsample | sample_id of the sample from which the current sample was derived (e.g., slice, tissue punch, biopsy, etc.) | sub- _sam-
pool_id | if data is collected on multiple samples at the same time include the identifier of the pool where the data file will be found. | pool-
experimental group | experimental group subject is assigned to in research project. if you have experimental groups for samples please add another column. | control
specimen type | physical type of the specimen from which the data were extracted | tissue
specimen anatomical location | the organ, or subregion of organ from which the data were extracted | dentate gyrus
additional fields (e.g. minds)
species | subject species | rattus norvegicus
sex | sex of the subject, or if unknown fill in with “unknown” | female
age | age of the subject (e.g., hours, days, weeks, years old) or if unknown fill in with “unknown” | weeks
age category | qualitative description of age category derived from uberon life cycle stage | prime adult stage
age range (min) | the minimal age (youngest) of the research subjects. the format for this field: numerical value + space + unit (spelled out) | days
age range (max) | the maximal age (oldest) of the research subjects. the format for this field: numerical value + space + unit (spelled out) | days
handedness | preference of the subject to use the right or left hand, if applicable | right
strain | organism strain of the subject | sprague-dawley
rrid for strain | rrid for the strain for this field | rrid:rgd_
.cc-by . international licenseperpetuity.
genotype | ignore if rrid is filled in; genetic makeup of genetically modified alleles in transgenic animals belonging to the same subject group | mgi:
reference atlas | the reference atlas and organ | paxinos rat v
protocol title | once the research protocol is uploaded to protocols.io, the title of the protocol within protocols.io must be noted in this field. | spinal cord extraction
protocol.io location | the protocol.io url for the protocol. once the protocol is uploaded to protocols.io, the protocol must be shared with the sparc group and the protocol.io url is noted in this field. please share with the sparc group. | https://www.protocols.io/view/corchea-paper-based-microfluidic-device-vtwe pe
experimental log file name | a file containing experimental records for each sample. |

table a : controlled vocabulary for experimental modes used in sparc. these terms are in the process of being added to the nifstd ontology techniques branch.

name | definition | nifstd id
anatomy | study that aims to understand the structure of organisms or their parts.
behavioral | study that induces and/or measures the behavior of the subject
cell counting | study that is designed to quantify cell populations
cell culture | study that employs cells isolated from the organism or tissue that are kept alive and studied in vitro
cell morphology | study that specifically seeks to understand the shape and structure of individual cells
cell population characterization | study that measures biochemical, molecular and/or physiological characteristics of populations of cells as opposed to individual cells
connectivity | study that maps or measures functional and/or anatomical connections between nerve cells and their targets or connections between populations of neurons in defined anatomical regions.
electrophysiology | study that measures electrical impulses within an organism, cell or tissue or the effects of direct electrical stimulation
epigenomics | study that measures modifications of genetic material that affect transcription but do not alter the organism's dna
expression | study that measures or visualizes gene or protein expression within cells or tissues. focuses on the gene.
expression characterization | study that characterizes the cellular, anatomical, or morphological distribution of gene expression. focuses on population.
genomics | study that measures aspects related to the complete dna genome of an organism
histology | study that investigates the microscopic structure of tissues
microscopy | study that primarily uses light or electron microscopic imaging
models | study that creates or characterizes computational models or simulations of other experimentally observed phenomena
morphology | study designed to determine the shape and structure of tissues and body parts
multimodal | study that employs multiple modalities in significant ways
optical | study that makes measurements using photons in the visible spectrum.
physiology | study that measures the function or behavior of organs and tissues in living systems.
radiology | study that uses at least one of a variety of minimally invasive probes such as x-rays, ultrasound, or nuclear magnetic resonance signals to capture data about the internal structure of intact subjects.
spatial transcriptomics | study used to spatially resolve rna-seq data, and thereby all mrnas, in individual tissue sections (wikipedia).
transcriptomics | study that measures rna transcription in the organism or cell population of interest
topology-based sparsification of graph annotations

daniel danciu,* mikhail karasikov,* harun mustafa, andré kahles,† gunnar rätsch†

biomedical informatics group, department of computer science, eth zurich, zurich, switzerland
biomedical informatics research, university hospital zurich, zurich, switzerland
swiss institute of bioinformatics, zurich, switzerland
department of biology, eth zurich, zurich, switzerland

abstract

since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. labeled de bruijn graphs are a frequently-used approach for representing large sets of sequencing data. while significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. in this paper, we present rowdiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph.
rowdiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. in addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. rowdiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. experiments on , rna-seq datasets show that rowdiff combined with multi-brwt results in a % reduction in annotation footprint over mantis-mst, the most compact previously known annotation representation. experiments on the sparser fungi subset of the refseq collection show that applying rowdiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of . when combining rowdiff with a multi-brwt representation, the resulting annotation is times smaller than mantis-mst.

introduction

the exponential increase in global sequencing capacity [ ] and the resulting growth of public sequence repositories have created an urgent need for the development of compact representation schemes of biological sequences. such schemes should not only maintain all relevant biological sequence variation but also provide fast access for sequence search and extraction. after initial attempts focused on the lossless compression of full sequences, e.g., using the burrows-wheeler transform [ ], the field soon turned towards representing a proxy of the input sequences instead: the sets of all k-mers contained in them. for this, any recurrent occurrence of a substring of length k in the input is represented by a unique k-mer, forming a k-mer set. a query of a given sequence against the input text can then be replaced by exact k-mer matching against the set. longer strings are queried as a succession of k-mers.
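the k-mer set idea above can be illustrated with a minimal python sketch. it is not taken from the paper: the function names (`kmers`, `build_kmer_set`, `contains`) and the toy sequences are hypothetical, and a plain python `set` stands in for the compact representations discussed later.

```python
def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_set(sequences, k):
    """Collapse all input sequences into one set of distinct k-mers."""
    kmer_set = set()
    for seq in sequences:
        kmer_set.update(kmers(seq, k))
    return kmer_set


def contains(kmer_set, query, k):
    """Query a longer string as a succession of exact k-mer lookups."""
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    return all(km in kmer_set for km in kmers(query, k))


# Toy input (hypothetical): two short reads indexed with k = 3.
index = build_kmer_set(["ACGTAC", "GTACGG"], k=3)
hit = contains(index, "CGTACG", 3)   # every 3-mer of the query is in the set
miss = contains(index, "ACGTT", 3)   # the 3-mer "GTT" is absent
```

in this toy case the first query matches even though it is not a substring of either input, because only its k-mers are checked against the set.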
although it is a lossy representation of the input (as, e.g., repeats longer than k are collapsed), constructing k-mer sets has proved highly useful in practice [ , , , ].

*joint-first authors. †joint corresponding authors; contact: andre.kahles@inf.ethz.ch and gunnar.ratsch@ratschlab.org.

representation of k-mer sets

various representations have been developed to balance the trade-off between the space taken by the k-mer set and query time or representation accuracy. conceptually, the k-mer set fully defines a vertex-centric de bruijn graph, where each k-mer forms a vertex and arcs are represented implicitly, based on whether any two vertices share an overlap of $k-1$ characters. the simplest representations are bitmaps or (perfect) hash-tables that indicate the presence or absence of any possible k-mer over the input alphabet in the input text. while non-optimal in space, they offer constant-time query of k-mers. more compact representations use approximate membership query data structures to probabilistically represent a de bruijn graph [ , ] or utilize succinct de bruijn graphs (a generalization of the burrows-wheeler transform) [ ], which usually require less than one byte per input k-mer over the nucleotide alphabet {a,c,g,t}.

de bruijn graph annotation

a major limitation of the above representations is that the identities of any sequence labels contained in the input text set are lost. to alleviate this, the concept of colored de bruijn graphs emerged [ ] (otherwise known as annotated or labeled de bruijn graphs), allowing for the representation of additional annotations per k-mer.
these annotations can either be stored in conjunction with the k-mers or be organized in a separate data structure, using the k-mer representation only as an index space. although the first option is used by a number of conceptually interesting methods, such as mantis [ ], which uses counting quotient filters to represent the k-mers linked to an annotation identifier, here we will only focus on the second option, as it allows for the connection of arbitrary annotations to the k-mer set, without re-processing the k-mer index. conceptually, the set of annotations is a relation between k-mers and labels that can be represented as a binary matrix, where the k-mer set indexes the rows and each annotation label specifies a column. any entry (i,j) in the matrix represents the relation of k-mer i and annotation j.

different methods have been suggested to compress this annotation matrix in a way that still allows for efficient query. vari [ , ] concatenates the rows of the annotation matrix and compresses the result using either rrr [ ] or elias-fano coding [ , ]. rainbowfish [ ] takes advantage of high redundancy in matrix rows by computing a frequency code for the unique rows, compressing the unique rows in a matrix ordered by these codes, then representing the original matrix as a variable-length code vector. however, this method and other frequency coding-based approaches become less effective for data sets with greater levels of noise or inter-sample variability. multi-brwt [ ] compresses the matrix in a hierarchical tree structure exploiting column similarity, but leaving the possible row redundancy unexploited. alongside these methods, there is a rich literature of different compressors for graph annotations developed over the years, each improving on the compression performance of previous methods [ , , ].
all of these methods share the common property that they act as general purpose binary matrix compressors, and thus, they do not take into account any particular domain knowledge in their construction.

leveraging graph topology to improve annotation compression

while the methods mentioned above rely solely on similarities between annotation matrix elements to achieve their compression, a few have additionally leveraged graph topology to increase their compression potential. the bloom filter correction method introduced by [ ] encodes the columns of the annotation matrix in bloom filters with high false positive rate. assuming that all vertices within a graph unitig (a path in which all vertices except for the first and last have in- and out-degree 1) share identical annotations, a row in the annotation matrix (corresponding to all vertices from the same unitig in the graph) is computed as the bit-wise and of the rows stored for every vertex of that unitig. while achieving high accuracy in decoding row annotations, the corrected bloom filters are not able to losslessly decode the rows of the encoded annotation matrix. in addition, the authors introduce a lossless approach based on wavelet tries which leverages graph backbone paths to improve compression performance. however, these paths must be provided by the user and cannot be computed automatically by the method. the more recently introduced mantis mst method [ ] constructs an annotation graph with nodes representing the unique rows of the annotation matrix.
in this annotation graph, a weighted edge between two nodes $v_1$ and $v_2$ is created if there exist adjacent vertices $s_1$ and $s_2$ in the underlying de bruijn graph whose annotations are represented by $v_1$ and $v_2$, respectively. the weight of this edge $(v_1, v_2)$ is then set to the hamming distance of the unique rows $v_1$ and $v_2$. mantis mst computes the minimal spanning tree of the annotation graph and represents the annotation of a node as its bit-wise xor with the annotation of its parent node in the spanning tree, while only the annotation of the root node is represented explicitly.

our contribution

we present a new scheme for representing graph annotations, rowdiff, which takes advantage of similarities between the annotations of neighboring vertices to compress annotation matrices. rowdiff can be constructed using |g|+ m + o(|c|) bits of memory, where |c| is the compressed size of the largest column in the annotation matrix and |g| is the size of the memory representation of the graph and in our case is less than m + o(m) bits [ ], where m is the number of k-mers, thus making it suitable for annotating virtually arbitrarily large graphs. since rowdiff is a transformation of the input annotation matrix attempting to increase its sparsity, rowdiff can be naturally chained with any generic scheme for compressed binary matrix representation to achieve further improvements in compression performance. we demonstrate the compression performance of rowdiff relative to the state-of-the-art lossless rainbow-mst and multibrwt methods on datasets representing different annotation matrix densities.

in the next sections, we define the underlying concepts (sections . and . ) and detail our methods for construction (sections . to . ) and querying (section . ) of the rowdiff data structures. we then describe the test datasets (section . ) and study the representation sizes (sections . , . , and . ), construction time (section . ) and query time (section . ) of rowdiff-compressed annotations. finally, we discuss limitations and directions for future work (section ).

method

notation

we will operate in the following setting. let $k$ be a positive integer. the order-$k$ de bruijn graph over a set of sequences $S$, denoted by $\mathrm{DBG}_k(S)$, is a directed graph $\mathrm{DBG}_k(S) := (V_k, E_k)$, whose vertices $V_k$ are the set of all distinct sub-strings of length $k$ of sequences in $S$ (k-mers), and an arc links $u \in V_k$ to $v \in V_k$ if $u_{2:k} = v_{1:k-1}$, where $s_{i:j}$ denotes the sub-sequence of $s$ from position $i$ up to and including position $j$. we denote with $\deg^-(v)$ and $\deg^+(v)$, $v \in V_k$, the in- and out-degree of a vertex, respectively. vertices $v \in V_k$ with $\deg^-(v) = 0$ are called source vertices and vertices $v \in V_k$ with $\deg^+(v) = 0$ are called sink vertices.

given an arbitrary set of labels $L$, an annotation for a de bruijn graph $\mathrm{DBG}_k(S)$ is a relation $\mathcal{A} \subset V_k \times L$, which assigns to each vertex $v \in V_k$ a set of labels, $l(v) \subset L$. we will trivially represent $\mathcal{A}$ using a binary matrix $A \in \{0,1\}^{|V_k| \times |L|}$, denote with $A_i$ the $i$-th row of $A$, and with $A_i \oplus A_j$ the element-wise xor of rows $i$ and $j$.

rowdiff transformation

rowdiff relies on the observation that adjacent vertices in the graph are likely similarly annotated, and thus, their respective rows in the annotation matrix $A$ are similar as well. this implies that if $(u,v) \in E_k$, storing the difference between $A_u$ and $A_v$ may be more space efficient than storing $A_u$, i.e. $\mathrm{popcount}(A_u \oplus A_v) < \mathrm{popcount}(A_u)$, where $\mathrm{popcount}(x)$ represents the number of set bits in row $x$.
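as a small worked illustration of this inequality, the python sketch below encodes two annotation rows as integers used as bit masks. this encoding and the concrete row values are assumptions made for the example only, not the paper's data layout.

```python
def popcount(row):
    """Number of set bits in an annotation row stored as a Python int."""
    return bin(row).count("1")


# Hypothetical rows of two adjacent vertices u -> v over five labels.
row_u = 0b10110
row_v = 0b10100

# The delta keeps only the labels on which the two rows disagree.
delta = row_u ^ row_v

sparser = popcount(delta) < popcount(row_u)
```

here the delta has a single set bit while the original row has three, so storing the delta for u is cheaper whenever the row of v can be recovered.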
rowdiff is defined as a transformation that converts an annotation matrix $A$ of a de bruijn graph into a new, sparser annotation matrix $A^*$ of the same size and an additional anchor vector $a \in \{0,1\}^{|V_k|}$, that is, $\mathrm{RD}_{\mathrm{DBG}_k(S)} : \{0,1\}^{|V_k| \times |L|} \to \{0,1\}^{|V_k| \times (|L|+1)}$. the anchor vector $a$ records which rows remain unchanged. we show that the original annotation matrix $A$ can be reconstructed from the rowdiff-transformed matrix $A^*$ and the anchor vector $a$. empirically, the rowdiff-transformed matrix compresses significantly better in the typical case where neighboring vertices have similar annotations. we develop an efficient algorithm for defining good anchors and for computing the rowdiff transform $\mathrm{RD}_{\mathrm{DBG}_k(S)}$ and its inverse.

for each vertex $u \in V_k$ we define its rowdiff successor $\mathrm{succ}(u)$ as its lexicographically largest outgoing vertex, i.e. $(u, \mathrm{succ}(u)) \in E_k$ and $\mathrm{succ}(u) \ge v$ for all $(u,v) \in E_k$, if such a vertex exists; the choice of the largest neighbor is arbitrary. rowdiff replaces each row $A_u$ with the (likely sparser) delta relative to its rowdiff successor. for binary rows, the delta is simply the element-wise xor, $A^*_u := A_u \oplus A_{\mathrm{succ}(u)}$, while for non-binary rows, the delta could store the difference between the row and its successor. in this work, we focus on binary matrices. the previous equation implies that $A_i = A^*_i \oplus A_{\mathrm{succ}(i)}$, which gives a simple formula for recursively reconstructing the original row.

in order to be able to reconstruct the original annotation $A$ from $A^*$, some rows are left unchanged. a vertex $v \in V_k$ whose annotation is stored unchanged is called an anchor, and its corresponding value in the anchor bit vector is set to 1, $a_v = 1$. sink vertices do not have a rowdiff successor and must therefore be anchors. algorithm shows the implementation of the inverse transformation $\mathrm{RD}^{-1}_{\mathrm{DBG}_k(S)}$, which reconstructs the original row $A_i$ from the rowdiff representation $A^*$.

algorithm : row annotation reconstruction
    function reconstructannotation(i)
        row ← A*[i]
        while a[i] = 0 do        (current vertex is not an anchor)
            i ← succ(i)
            row ← row ⊕ A*[i]
        end while
        return row
    end function

starting from any vertex in the de bruijn graph, algorithm defines a traversal leading to an anchor vertex, for which the annotation was not transformed. since de bruijn graphs may have cycles, additional anchor vertices might have to be assigned in order to break rowdiff cycles (those cycles where every vertex is the rowdiff successor of its predecessor in the cycle).

proposition . algorithm finishes for every starting vertex if and only if every sink vertex in the graph is an anchor and every rowdiff cycle contains at least one anchor vertex.

proof. for sufficiency, assume that both conditions hold but the algorithm does not finish for a starting vertex $i$. this implies that $a_{\mathrm{succ}^k(i)} = 0$ for all $k \in \mathbb{N}$. since the number of vertices in the graph is finite, there must exist $l, m \in \mathbb{N}$, $l \ne m$, such that $\mathrm{succ}^l(i) = \mathrm{succ}^m(i)$. thus, $(\mathrm{succ}^l(i), \mathrm{succ}^{l+1}(i), \ldots, \mathrm{succ}^m(i))$ is a rowdiff cycle and, hence, must contain at least one anchor vertex, which contradicts the initial assumption. the proof of necessity is equally straightforward.

proposition . algorithm correctly reconstructs the original annotation row $A_i$ for every vertex $i \in V_k$.

proof. the algorithm computes $A^*_i \oplus A^*_{\mathrm{succ}(i)} \oplus \cdots \oplus A^*_{\mathrm{succ}^p(i)}$, where $a_{\mathrm{succ}^p(i)} = 1$ and thus $A^*_{\mathrm{succ}^p(i)} = A_{\mathrm{succ}^p(i)}$. by repeatedly reducing the last two terms using $A^*_{\mathrm{succ}^{p-1}(i)} \oplus A_{\mathrm{succ}^p(i)} = A_{\mathrm{succ}^{p-1}(i)}$, the expression is reduced to $A_i$, which is the desired value.
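the transform and its inverse can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the metagraph implementation: vertices are small integers (so "lexicographically largest" becomes "numerically largest"), rows are integer bitmasks, and the graph, the rows and the anchor set below are hypothetical. the graph contains a path 0 → 1 → 2 ending in a sink and a two-vertex rowdiff cycle 3 → 4 → 3, so the anchors are the sink 2 and one vertex of the cycle.

```python
def popcount(x):
    """Number of set bits in an integer bitmask."""
    return bin(x).count("1")


def rowdiff_successors(out_neighbors):
    """Pick the largest out-neighbor as the rowdiff successor (None for sinks)."""
    return {u: (max(vs) if vs else None) for u, vs in out_neighbors.items()}


def rowdiff_transform(rows, succ, anchor):
    """A*[u] = A[u] xor A[succ(u)] for non-anchors; anchor rows are kept unchanged."""
    return {u: rows[u] if anchor[u] else rows[u] ^ rows[succ[u]] for u in rows}


def reconstruct(i, diff_rows, succ, anchor):
    """Follow rowdiff successors until an anchor is reached, xor-ing the stored rows."""
    row = diff_rows[i]
    while not anchor[i]:  # current vertex is not an anchor
        i = succ[i]
        row ^= diff_rows[i]
    return row


# Hypothetical toy graph and annotation (three labels, rows as bitmasks).
out_neighbors = {0: [1], 1: [0, 2], 2: [], 3: [4], 4: [3]}
rows = {0: 0b011, 1: 0b011, 2: 0b111, 3: 0b100, 4: 0b101}
anchor = {0: False, 1: False, 2: True, 3: False, 4: True}

succ = rowdiff_successors(out_neighbors)
diff_rows = rowdiff_transform(rows, succ, anchor)
restored = {i: reconstruct(i, diff_rows, succ, anchor) for i in rows}

print("set bits before:", sum(popcount(r) for r in rows.values()))
print("set bits after: ", sum(popcount(r) for r in diff_rows.values()))
print("round trip ok:  ", restored == rows)
```

if vertex 4 were not an anchor, `reconstruct(3, ...)` would loop forever, which is the failure case excluded by the termination condition above.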
once the set of anchor vertices satisfies proposition , the rowdiff-transformed matrix $A^*$ together with the anchor indicator bitmap $a$ encodes the original annotation matrix.

anchor assignment

in addition to the small set of anchors described in proposition , we seek to cap the maximum rowdiff path length (i.e. the length of a path taken by algorithm ) at a certain value $m$ (typically between and ) by ensuring that at least every $m$-th vertex in a rowdiff path is an anchor, as described below. this guarantees that the number of iterations in algorithm is bounded by a constant, and thus the average time complexity of reconstructing a single row is $O(l \cdot m)$, where $l \ll |L|$ is the average number of set bits (labels) per row. at the same time, since anchor vertices require storing the original, less sparse annotation row, it is desirable to minimize the total number of anchor vertices in order to keep the popcount (and thus the compressed size) of the rowdiff annotation $A^*$ small.

the following anchor assignment algorithm allocates anchor vertices near-optimally in four steps (see algorithm ).

algorithm : anchor assignment
    function assignanchors(m)
        visited[] ← {false}        (initialize mask of visited vertices)
        anchor[] ← {false}         (initialize mask of anchor vertices)
        for all s ∈ sinks() parallel do
            anchor ← traversebwd(s, visited, anchor, m)
        for all s ∈ sources() parallel do
            anchor ← traversefwd(s, visited, anchor, m)
        for all s ∈ forks() parallel do
            anchor ← traversefwd(s, visited, anchor, m)
        (only vertices in simple cycles (no forks) are left unvisited at this point)
        for all s ∈ nodes() parallel do
            anchor ← traversefwd(s, visited, anchor, m)
        return anchor
    end function

first, we traverse rowdiff paths backwards (in parallel) starting from sink vertices (see algorithm s ). the backward traversal stops either when we reach a source vertex or when we reach a vertex $v \in V_k$ such that $\mathrm{succ}(v) \ne u$ for the previously traversed vertex $u$ (see figure , top).
note that the traversal is not terminated when reaching a vertex with multiple incoming arcs, but explores each of them and continues to traverse these rowdiff paths backwards. when the distance from the current vertex to the next assigned anchor in the current rowdiff path reaches $m$, the vertex is marked as an anchor. in practice, once the backward traversal is finished, the vast majority of the vertices have been traversed, and the anchor assignment is optimal in the sense that no two anchors are closer than $m$ to each other.

in the second step, we start at source vertices and traverse rowdiff paths forwards, i.e. paths of the form $v, \mathrm{succ}(v), \mathrm{succ}^2(v), \ldots$ (see algorithm s ). the traversal stops when we reach an already visited vertex. in the third step, we start traversing forward at all forks with unvisited vertices. after the third step, the only vertices that were not traversed must belong to a simple cycle (a cycle where all vertices have $\deg^-(v) = \deg^+(v) = 1$). the fourth step traverses these cycles (in parallel). each of these traversals sets an anchor every $m$ vertices during the traversal. since we visit each vertex only once, the time complexity of anchor assignment is $O(|V_k|)$.

proposition . the anchors assigned by algorithm guarantee successful termination of algorithm for any input vertex $v \in V_k$.

proof. step of the algorithm trivially guarantees that all sink vertices are anchor vertices. steps and guarantee that all cycles in the graph are traversed and at least one anchor vertex is set in each cycle.
the conditions in proposition are thereby satisfied, so algorithm finishes and successfully reconstructs $A$ from $A^*$.

one important detail in the forward traversal step is handling the situation when the traversal stops due to merging into a visited vertex. not setting an anchor in such cases may result in arbitrarily long paths with no anchors (when such merges are chained). always setting an anchor at a merge would introduce unnecessary anchors and increase the annotation density. we handle merges with the following simple heuristic: use an additional bit vector, nearanchor, to mark all vertices that are known to be at a distance smaller than $m$ to an anchor vertex. during forward traversal, when the traversal merges into a visited vertex that is marked in nearanchor, no anchor is set; if the merge vertex is not marked, an anchor is set (figure , bottom). a better algorithm for deciding whether a merge vertex should create an anchor would require labeling each vertex with the distance to its nearest anchor. in our implementation we preferred the heuristic algorithm due to its significantly reduced space complexity.

figure : top: rowdiff traversal. when traversing backward to assign anchor vertices, the traversal stops at vertex $u$, because $\mathrm{succ}(u) \ne v$. when traversing forward, the last outgoing vertex is selected. bottom: chained merge. dark grey vertices are marked as nearanchor. when traversing the light grey vertices, we merge into $m_1$, which is marked as nearanchor, so no anchor is set. when traversing the blue vertices, an anchor must be set at $m_2$, as $m_2$ is not marked as nearanchor.

anchor optimization

to guarantee that none of the rows $A^*_v$ in $A^*$ has more set bits than the corresponding row $A_v$ in the original annotation, we perform the following anchor optimization procedure. for each $v \in V_k$ such that $\mathrm{popcount}(A^*_v) > \mathrm{popcount}(A_v)$, we make the vertex an anchor, $a_v := 1$, and replace $A^*_v$ with $A_v$. this ensures that all rows in the rowdiff-transformed annotation matrix are at least as sparse as the corresponding rows in the original annotation matrix.

proposition . each row in a rowdiff-transformed annotation matrix has the same or fewer set bits than its corresponding row in the original annotation matrix.

the anchor optimization procedure is implemented similarly to the initial construction of rowdiff (see section ). thus, it has the same time and space complexity.

rowdiff construction

a naive implementation of the rowdiff construction would load the matrix $A$ into memory and gradually replace its rows with their sparsified counterparts while traversing the graph. although fast and simple, this method requires keeping the entire annotation matrix $A$ and the graph in memory. this is often not realistic, as the annotation matrix $A$ alone can easily reach several terabytes in size. we therefore developed a distributed parallel construction algorithm that only loads a few columns of $A$ at a time and, hence, needs a limited amount of memory. in the first stage, we load the graph and, for each vertex, pre-compute the indices of the unique rowdiff successor and the (possibly multiple) rowdiff predecessors, stored in vectors succ and pred, respectively. the pred and succ vectors are used to build $A^*$ in the second stage without the need to query the graph itself or load it into memory.
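the two-stage idea described above can be sketched as follows. this is a simplified in-memory python illustration, not the streaming implementation from the paper: the integer vertex ids, the toy graph, the anchor set, the function names, and the convention that anchors get no successor (`None`) and never appear in a `pred` list are all assumptions. a column is represented as the set of row indices whose bit is set, so the loop visits set bits only.

```python
def build_succ_pred(out_neighbors, anchor):
    """First stage: rowdiff successor of every non-anchor vertex and, for every
    vertex, the list of non-anchor vertices that have it as rowdiff successor."""
    succ = {u: None for u in out_neighbors}
    pred = {u: [] for u in out_neighbors}
    for u, vs in out_neighbors.items():
        if not anchor[u] and vs:
            succ[u] = max(vs)  # largest out-neighbor (assumed integer ordering)
            pred[succ[u]].append(u)
    return succ, pred


def sparsify_column(column, succ, pred):
    """Second stage for one column, touching only its set bits."""
    out = set()
    for i in column:
        # Anchors keep their bit; otherwise the bit survives iff succ[i] is unset.
        if succ[i] is None or succ[i] not in column:
            out.add(i)
        # A predecessor with an unset bit differs from i, so its diff bit is set.
        for p in pred[i]:
            if p not in column:
                out.add(p)
    return out


# Hypothetical toy input: same shape as a small annotated graph with 3 labels.
out_neighbors = {0: [1], 1: [0, 2], 2: [], 3: [4], 4: [3]}
anchor = {0: False, 1: False, 2: True, 3: False, 4: True}
rows = {0: 0b011, 1: 0b011, 2: 0b111, 3: 0b100, 4: 0b101}
num_labels = 3

succ, pred = build_succ_pred(out_neighbors, anchor)
columns = [{i for i, r in rows.items() if (r >> j) & 1} for j in range(num_labels)]
sparse_columns = [sparsify_column(c, succ, pred) for c in columns]

# Row-wise reference: xor with the successor row, anchors unchanged.
reference_rows = {u: rows[u] if succ[u] is None else rows[u] ^ rows[succ[u]] for u in rows}
reference_columns = [{i for i, r in reference_rows.items() if (r >> j) & 1}
                     for j in range(num_labels)]
print(sparse_columns == reference_columns)
```

because each column is processed independently given `succ` and `pred`, columns can be handled in separate batches or on separate machines, which is the property the construction relies on.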
to make the algorithm scale to de bruijn graphs with trillions of vertices, the vectors pred and succ are built and traversed in a streaming manner. they are loaded in small blocks, as described in algorithms and s , and never kept in memory in full. thus, pre-computing the pred and succ vectors essentially makes it possible to query the graph topology during the second stage while only using $O(1)$ additional space. after the rowdiff annotation $A^*$ has been generated, the pred and succ vectors are not required for querying and can be discarded.

algorithm : rowdiff transform
    function sparsify(columns)        (sparsifies a batch loaded in memory)
        for block in steps of blocksize up to numrows do        (process by blocks)
            load pred[block..block+blocksize]
            load succ[block..block+blocksize]
            for all c ∈ columns parallel do
                for all i ∈ c[block..block+blocksize] do        (iterate only set bits)
                    if not c[succ[i]] then
                        (the bits at i and succ[i] are different, hence diff = 1)
                        c*[i] ← true
                    end if
                    for all p ∈ pred[i] do
                        if not c[p] then
                            (the bits at p and i are different, hence diff = 1)
                            c*[p] ← true
                        end if
                    end for
                end for
            end for
        end for
    end function

figure : rowdiff transform algorithm: schematic overview of sparsification on a single machine. top: columns are loaded into memory in batches (until memory is exhausted) and each batch is fully transformed to rowdiff. the result is serialized and the process moves on to the next batch. bottom: each batch is transformed to rowdiff as follows. the algorithm iteratively loads into memory blocks of the pre-computed vectors pred and succ. then, all columns of the batch are processed in parallel. the algorithm iterates only through set bits of each column in the active block and computes the elements of the rowdiff-transformed matrix $A^*$ (see algorithm for a more detailed description).

the second stage of the construction algorithm (the sparsification workflow) is schematically described in figure . the initial sparsification of $A$ can be trivially distributed by dividing the columns of $A$ into groups and processing each group on a different machine. each machine processes its assigned columns in batches. the size of each batch is determined dynamically by loading columns into memory until a desired upper limit is reached. this upper limit must be greater than the size of the largest column being processed in compressed bit-vector format, but is otherwise not restricted. for each column in the batch, we iterate only over the set bits (only those rows corresponding to vertices annotated with the label represented by that column) and compare them with the bits at positions pred and succ in the same column to compute the rowdiff-transformed row, as shown in figure .

scalability and complexity

algorithm only traverses set bits in $A$, and for each set bit in row $i$ it performs $O(\deg^-(i) + 1)$ operations, hence the total time complexity is $O((1 + \alpha)\,\mathrm{popcount}(A))$, where $\alpha$ is the average in-degree of the graph. for de bruijn graphs, $\alpha \le |\Sigma|$, and hence the time complexity is linear in the number of set bits of the original annotation matrix, i.e. $O(\mathrm{popcount}(A))$. algorithm s , for constructing pred and succ, traverses each vertex exactly once, hence its time complexity is $O(|V_k|)$. since the buffer used by algorithm s has a constant size, the space complexity is $|\mathrm{DBG}_k(S)| + O(1)$, where $|\mathrm{DBG}_k(S)|$ denotes the memory footprint of the graph, which, for instance, in the case of the boss representation [ ], typically does not exceed m+ o(m) bits, where m = $|V_k|$. after taking into account algorithm for anchor assignment, which requires additional bits per vertex to indicate anchors and the traversal state, and putting it all together, we get that the rowdiff transform can be performed in $O(\mathrm{popcount}(A) + |V_k|)$ time and in $|\mathrm{DBG}_k(S)| + |V_k| + O(|c|)$ space, where $|c|$ is the memory footprint of the largest (densest) column of $A$ in a compressed bit-vector format. note that the first term in the sum is usually dominant.

in conclusion, we mention again that rowdiff construction can be easily distributed across multiple machines with modest hardware requirements and run in parallel on each machine, which makes the method very attractive for practical use on very large data sets.

querying annotations for paths

when querying annotations for paths in the graph, or sets of rows corresponding to vertices from a local neighborhood in the graph, algorithm leads to redundant reconstruction work, as many of the queried rows belong to the same rowdiff paths. to alleviate this, we perform the traversal first and pre-compute all rowdiff paths from the rows queried. then, we query all diff rows in one batch and reconstruct annotations for each row from the query. this ensures that no arc in these paths is traversed more than once.
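the batched path query can be sketched as three phases: traverse, fetch, reconstruct. the python below is a minimal illustration under assumptions, not the metagraph code: the diff rows, successors and anchors are the hypothetical toy values used for illustration, the "batched query" is modeled by a single dictionary comprehension, and the counter `arcs_followed` exists only to show that shared path segments are walked once.

```python
def query_paths(query, diff_rows, succ, anchor):
    """Reconstruct the rows of all queried vertices, following each arc at most once."""
    seen, paths, arcs_followed = set(), [], 0

    # Phase 1: traverse. Each path stops at an anchor or at a vertex already collected.
    for u in query:
        path = []
        while u not in seen:
            seen.add(u)
            path.append(u)
            if anchor[u]:
                break
            u = succ[u]
            arcs_followed += 1
        paths.append(path)

    # Phase 2: fetch all required diff rows in one batch.
    batch = {v: diff_rows[v] for v in seen}

    # Phase 3: unwind each path from its end, using A[v] = A*[v] xor A[succ(v)].
    rows = {}
    for path in paths:
        for v in reversed(path):
            rows[v] = batch[v] if anchor[v] else batch[v] ^ rows[succ[v]]
    return {u: rows[u] for u in query}, arcs_followed


# Hypothetical rowdiff representation of a 5-vertex graph with 3 labels.
diff_rows = {0: 0b000, 1: 0b100, 2: 0b111, 3: 0b001, 4: 0b101}
succ = {0: 1, 1: 2, 2: None, 3: 4, 4: 3}
anchor = {0: False, 1: False, 2: True, 3: False, 4: True}

result, arcs_followed = query_paths([0, 1, 3], diff_rows, succ, anchor)
print(result, arcs_followed)
```

reconstructing vertices 0, 1 and 3 one at a time would follow four arcs in this example (two for vertex 0, one each for vertices 1 and 3), whereas the batched version follows three, because the path of vertex 1 is a suffix of the path of vertex 0.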
moreover, querying all rows in one batch often makes the query of the underlying representation of the sparsified binary matrix faster by exploiting its intrinsic features (e.g., jointly querying n bits in m columns is more cache-efficient and faster than n queries of single bits in each of the m columns).

implementation details

we implemented rowdiff as part of the metagraph framework [ ]. the code for reproducing the results of the experiments is available at https://github.com/ratschlab/row_diff. for storing the original columns of the annotation matrix as well as the indicator bitmap with anchor vertices, we used the sd vectors from the sdsl-lite library [ ] for compressed representation of bitmaps. for compression of the transformed annotation matrix, we used the multi-brwt representation scheme proposed in [ ], with its improved and scaled-up implementation from metagraph.

results and discussion

in this section, we evaluate the performance of the methods described above both in terms of their final representation sizes and their construction time. in addition, we study the effect of the maximum rowdiff path length on the final rowdiff representation size of the compressed annotations. finally, we evaluate the degree of size reduction that rowdiff provides on a per-column basis.

data sets

we evaluated the compression performance of rowdiff on three data sets with different levels of sequence variability and thus graph density. our first data set consists of all fungi sequences from refseq release [ ], with annotations derived from the taxonomic ids of the sequences’ respective organisms. our second and third data sets are derived from the cohort of , publicly available human rna-seq experiments used in [ ]. we constructed annotated de bruijn graphs from the rna-seq data set in the same manner as in [ ], using a k value of , albeit with two samples discarded due to their withdrawal from the sequence read archive. we will refer to this data set as rna-seq (k= ). the third data set is constructed using the graph cleaning approach implemented in metagraph [ ], using a k value of . we will refer to this data set as rna-seq (k= ). for evaluating construction time and representation size, we shuffled the samples in each data set and generated subsets of increasing size.

we evaluated rowdiff against mst [ ], employed in mantis [ ], which, to the best of our knowledge, is the most compact annotation representation method to date. similarly to rainbowfish [ ], mst reduces the original annotation matrix to a set of unique rows and consists of two components: a vector mapping indexes of rows of the annotation matrix to its unique rows (color classes), and the unique rows compressed in a minimum spanning tree. in mantis, this mapping vector is included in a hash table storing the k-mers of the de bruijn graph, which is usually at least an order of magnitude larger than the compressed annotation. to make a fair comparison, we therefore exclude the large contribution of mantis’ graph representation and only consider the mapping vector, using the same representation as in rainbowfish [ ]. accordingly, we refer to the mst annotation representation as rainbow-mst. note that rainbow-mst forms a graph annotation representation which, similarly to rowdiff, can be used with any de bruijn graph representation with indexed k-mers.

representation size

we now compare the representation size for rowdiff and other state-of-the-art graph annotation compression methods.
figure shows the representation size for the rna-seq (k= ) and refseq (fungi) data sets. on the rna-seq (k= ) data set, rowdiff-multibrwt effectively takes advantage of the topology of the graph annotation and the similarity of rows of the annotation matrix and achieves a nearly -fold size reduction compared to multi-brwt applied on non-sparsified columns. compared to the rainbow-mst method, rowdiff-multibrwt achieves a -fold size reduction. rainbow-mst could not be computed on the subsets with more than samples because mantis did not complete within the day limit of our compute cluster. for this reason, we also plotted the size of the rainbow-mst mapping vector, which, being a subset of the mst annotation data, represents a lower bound for rainbow-mst.

figure : representation size in gb, plotted against the number of sra samples for (a) the rna-seq (k= ) data set and (b) the rna-seq (k= ) data set, and against the number of taxonomy ids for (c) the refseq (fungi) data set. the purple and red lines represent the size of the rowdiff annotation with and without multibrwt, respectively. the blue line indicates the size of the rainbow-mst annotation. the orange line represents the size of the rainbow-mst mapping vector and is a lower bound on the rainbow-mst representation size. rainbow-mst computation on the rna-seq (k= ) data set with > samples did not complete within the day limit of our compute cluster.

on the refseq (fungi) data set, rowdiff takes advantage of the longer stretches of vertices with identical annotations and achieves a -fold size reduction relative to rainbow-mst. notably, this significant difference comes from the fact that virtually all of the space used by rainbow-mst is taken by the mapping vector on this data set. on the rna-seq (k= ) data set, rowdiff-multibrwt achieves a . -fold size reduction relative to multi-brwt and a . -fold reduction relative to rainbow-mst. the rowsparse format stores the indices of set bits in each row in a compressed integer vector. this type of annotation is faster to construct and to query than the multi-brwt representation, but its footprint is significantly larger on denser data sets, such as rna-seq (k= ).

effects of graph density on compression

in this section we analyze how the density of the annotated graph affects rowdiff compression. in a first experiment we take a random subset of entries from the refseq fungi data set and build graphs and corresponding annotations for k-mer sizes ranging from to . table shows that as the sparsity of the graph increases (with increasing k-mer length), the compression ratio $|A|/|A^*|$ increases sharply.

table : compression ratio vs graph density. the sparser the graph, the higher the compression ratio. columns: k-mer size, average node degree, compression ratio $|A|/|A^*|$ (table values are not preserved in this copy).

in a second experiment, we test how the maximum path length $m$ affects the annotation size for graphs of various densities. table shows the annotation size on the rna-seq (k= ), rna-seq (k= ), and refseq (fungi) data sets for various values of $m$. while increasing the maximum path length has a negligible effect on the denser rna-seq graphs (with average node degrees of . and . respectively), it reduces the annotation size by a factor of up to .
on the much sparser refseq (fungi) graph (with an average node degree of . ).

table : annotation size vs maximum path length $m$ for rna-seq (k= , ) and refseq fungi (k= ). a sharp decrease in annotation size can be observed for the sparse refseq (fungi) graph. sizes are shown in gb for rna-seq and in mb for refseq. rows: rna-seq, rna-seq, refseq fungi; one column per value of $m$ (table values are not preserved in this copy).

compression of single columns

in this experiment, we measure how rowdiff compresses individual columns of the annotation matrix. figure shows the compression factor $1 - |A^*_{\cdot,i}| / |A_{\cdot,i}|$ achieved by rowdiff on two data sets representing two different extreme cases of sequence variability. the de bruijn graph constructed from assembled genomes, refseq (fungi), contains significantly fewer branches and bubbles than the graph constructed from reads, rna-seq (k= ), thus its annotation is significantly better compressed by rowdiff, with an average reduction factor of . .

figure : histogram of the column reduction factor $1 - |A^*_{\cdot,i}| / |A_{\cdot,i}|$, i.e. (original - rowdiff) / original, for (a) the rna-seq (k= ) data set and (b) the rna-seq (k= ) data set. on the denser rna-seq (k= ) graph, the reduction factor peaks at . , while on rna-seq (k= ) the reduction factor peaks at . . the columns are stored as sd compressed vectors.

construction time

in figure , we compare the construction times for building rowdiff and mst [ ].
the construction time for rowdiff-multibrwt includes the rowdiff transform from original columns to the rowdiff format (with m = ) in addition to the time for conversion of the transformed columns to the multi-brwt binary matrix representation. for mst, the time does not include construction of the mapping vector and includes only the time for compression of the unique annotation rows, which is a lower bound on the total construction time for the mst method. note that the construction time for rowdiff-multibrwt grows linearly in the number of columns of the annotation matrix, and superlinearly for mst.

figure : construction time in minutes, plotted against the number of annotation columns, for the rowdiff (rowdiff-multibrwt) and mst (rainbow-mst, mst only) annotation representations on the rna-seq (k= ) data set with threads.

query performance

in this experiment, we measured the time needed for querying the rna-seq (k= ) annotation for human transcripts. the query is performed with the algorithm optimized for long paths (see section ). first, we construct a list of annotation rows that have to be reconstructed from the rowdiff format and a list of all diff rows for querying in the rowdiff matrix. then, all these rows are queried and the original annotation rows are reconstructed. table shows the time taken for reconstruction of the original annotations for and random human transcripts, which includes the time for querying the diff rows and reconstruction of the original annotations.
since rowdiff requires traversing the de bruijn graph to get rowdiff paths, the query time for rowdiff depends on the traversal performance of the underlying representation of the de bruijn graph. in this experiment, we used the succinct de bruijn graph representation available in metagraph [ ].

table : time for querying and random human transcripts with rowdiff-rowsparse and rowdiff-multibrwt. the second column shows the total number of original annotation rows reconstructed for the query. all benchmarks were performed with a single thread on intel(r) xeon(r) gold cpu @ . ghz. columns: query data, # rows reconstructed, query time with rowdiff-rowsparse, query time with rowdiff-multibrwt (table values are not preserved in this copy).

conclusions

in this paper, we introduced rowdiff, a new technique for compacting graph labels by leveraging the likely similarities in annotations of nodes adjacent in the graph. we designed a parallel construction algorithm with linear time complexity in the number of node-label pairs and a small memory footprint. in addition, the algorithm can be efficiently distributed and parallelized, making it applicable to arbitrarily large graphs. rowdiff reduced the size of graph annotations by - to -fold when used in combination with multi-brwt relative to mantis-mst, the most efficient state-of-the-art representation. although the row reconstruction method inevitably leads to an increase in ad hoc row query time due to the larger number of required annotation matrix queries, this limitation is alleviated in practice by the tendency of real-world sequences to feature k-mers which co-occur on matching rowdiff paths.

the optimization of anchor assignment is a clear direction for future development of these methods. the anchor assignment method we have presented is designed to reduce the row reconstruction time by setting an upper bound on the traversal length.
however, given that there is a trade-off between the size and the query time of the final representation, designing an objective function and a corresponding algorithm to best optimize these measures is a non-trivial task.

moving beyond the representation of binary relations, a simple extension of the rowdiff method can be used as an efficient way to represent genomic coordinates for indexes of reference genomes. by representing a coordinate at each anchor node, the coordinates of all other nodes in that anchor’s corresponding rowdiff path can be computed via their traversal distance to the anchor.

each improvement in the compression of sequence graphs and their associated annotations opens up further opportunities for their real-world applicability. when handling large annotations, even a -fold difference in the representation size can make a previously unapproachable annotation accessible to the available hardware. with rowdiff, we have demonstrated that there still is great potential for improving the representation of annotations on sequence graphs.

acknowledgements

mikhail karasikov and harun mustafa are funded by the swiss national science foundation grant no. “scalable genome graph data structures for metagenomics and genome annotation” as part of swiss national research programme (nrp) “big data”. a. k. and d. d. are funded from eth core funding to gunnar rätsch.

references

[ ] stephens, z. d. et al. big data: astronomical or genomical? plos biology , e ( ).
[ ] cox, a. j., bauer, m. j., jakobi, t. & rosone, g.
large-scale compression of genomic sequence databases with the burrows–wheeler transform. bioinformatics , – ( ).
[ ] ondov, b. d. et al. mash: fast genome and metagenome distance estimation using minhash. genome biology , ( ).
[ ] breitwieser, f., baker, d. & salzberg, s. l. krakenuniq: confident and fast metagenomics classification using unique k-mer counts. genome biology , ( ).
[ ] bradley, p., den bakker, h. c., rocha, e. p., mcvean, g. & iqbal, z. ultrafast search of all deposited bacterial and viral genomic data. nature biotechnology , ( ).
[ ] karasikov, m. et al. metagraph: indexing and analysing nucleotide archives at petabase-scale. biorxiv ( ).
[ ] chikhi, r. & rizk, g. space-efficient and exact de bruijn graph representation based on a bloom filter. algorithms for molecular biology , ( ).
[ ] benoit, g. et al. reference-free compression of high throughput sequencing data with a probabilistic de bruijn graph. bmc bioinformatics , ( ).
[ ] bowe, a., onodera, t., sadakane, k. & shibuya, t. succinct de bruijn graphs. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) ( ).
[ ] iqbal, z., caccamo, m., turner, i., flicek, p. & mcvean, g. de novo assembly and genotyping of variants using colored de bruijn graphs. nature genetics , – ( ).
[ ] pandey, p. et al. mantis: a fast, small, and exact large-scale sequence-search index. cell systems ( ). url http://dx.doi.org/ . /j.cels. . . .
[ ] muggli, m. d. et al. succinct colored de bruijn graphs. bioinformatics ( ).
[ ] muggli, m.
d., alipanahi, b. & boucher, c. building large updatable colored de bruijn graphs via merging. bioinformatics , i –i ( ). [ ] raman, r., raman, v. & satti, s. r. succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. acm transactions on algorithms (talg) , –es ( ). [ ] elias, p. efficient storage and retrieval by content and address of static files. journal of the acm (jacm) , – ( ). [ ] fano, r. m. on the number of bits required to implement an associative memory (massachusetts institute of technology, project mac, ). [ ] almodaresi, f., pandey, p. & patro, r. rainbowfish: a succinct colored de bruijn graph rep- resentation. in schwartz, r. & reinert, k. (eds.) th international workshop on algorithms in bioinformatics (wabi ), vol. of leibniz international proceedings in informatics (lipics), : – : (schloss dagstuhl–leibniz-zentrum fuer informatik, dagstuhl, germany, ). url http://drops.dagstuhl.de/opus/volltexte/ / . [ ] karasikov, m. et al. sparse binary relation representations for genome graph annotation. journal of computational biology , – ( ). [ ] bingmann, t., bradley, p., gauger, f. & iqbal, z. cobs: a compact bit-sliced signature index. in international symposium on string processing and information retrieval, – (springer, ). [ ] harris, r. s. & medvedev, p. improved representation of sequence bloom trees. bioinformatics , – ( ). [ ] marchet, c. et al. data structures based on k-mers for querying large collections of sequencing data sets. genome research , – ( ). [ ] mustafa, h. et al. dynamic compression schemes for graph coloring. bioinformatics , – ( ). [ ] almodaresi, f., pandey, p., ferdman, m., johnson, r. & patro, r. an efficient, scalable and exact rep- resentation of high-dimensional color information enabled via de bruijn graph search. in international conference on research in computational molecular biology, – (springer, ). [ ] gog, s., beller, t., moffat, a. & petri, m. 
from theory to practice: plug and play with succinct data structures. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) ( ). [ ] o’leary, n. a. et al. reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. nucleic acids research ( ). .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / spacepharer: sensitive identification of phages from crispr spacers in prokaryotic hosts zhang r., mirdita m., levy karin e., norroy c., galiez c., , and söding j. , quantitative and computational biology, max planck institute for biophysical chemistry, göttingen, germany univ. grenoble alpes, cnrs, grenoble inp/institute of engineering univ. grenoble alpes, grenoble, france soeding@mpibpc.mpg.de summary: spacepharer (crispr spacer phage-host pair finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match crispr spacers in genomic or metagenomic data. spacepharer gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. we demonstrate spacepharer by searching a comprehensive spacer list against all complete phage genomes. availability and implementation: spacepharer is available as an open-source (gplv ), user-friendly command-line software for linux and macos: spacepharer.soedinglab.org. i. 
introduction
viruses of bacteria and archaea (phages) are the most abundant biological entities in nature. however, little is known about their roles in the microbial ecosystem and how they interact with their hosts, as cultivating most phages and hosts in the lab is challenging. many prokaryotes ( % of bacteria and % of archaea) possess an adaptive immune system against phages, the clustered regularly interspaced short palindromic repeat (crispr) system [ ]. after surviving a phage infection, they can incorporate a short dna fragment ( - nt) as a spacer in a crispr array. the transcribed spacer will be used with other cas components for a targeted destruction of future invaders. some crispr-cas systems require a - nucleotide long, highly conserved protospacer-adjacent motif (pam) flanking the viral target to prevent autoimmunity. multiple spacers targeting the same invader are not uncommon, due to either multiple infection events or the primed spacer acquisition mechanism identified in some crispr subtypes.

crispr spacers have been previously exploited to identify phage-host relationships [ , , , ]. these methods compare individual crispr spacers with phage genomes using blastn [ ] and apply stringent filtering criteria, e.g. allowing only up to two mismatches. they are thus limited to identifying very close matches. however, a higher sensitivity is crucial because phage reference databases are very incomplete and often will not contain phages highly similar to those to be identified. to increase sensitivity, ( ) we compare protein coding sequences because phage genomes are mostly coding, and, to evade the crispr immune response, are under pressure to mutate their genome with minimal changes on the amino acid level; ( ) we optimized a substitution matrix and gap penalties for short, highly similar protein fragments; ( ) we combine evidence from multiple spacers matching to the same phage genome.

ii. methods
input.
spacepharer accepts spacer sequences as multiple fasta files each containing spacers from a single prokaryotic genome or as multiple output files from the crispr detection tools piler-cr [ ], crt [ ], minced [ ] or crisprdetect [ ]. phage genomes are supplied as separate fasta files or can be downloaded by spacepharer from ncbi genbank [ ]. optionally, additional taxonomic labels can be provided for spacers or phages to be included in the final report.

algorithm. spacepharer is divided into five steps (figure a, supp. materials).
( ) preprocess input: scan the phage genome and crispr spacers in six reading frames, extract and translate all putative coding fragments of at least nt, with user-definable translation tables. each query set Q consists of the translated orfs q of crispr spacers extracted from one prokaryotic genome, and each target set T comprises the putative protein sequences t from a single phage. we refer to similar q and t as a hit, and to an identified host-phage relationship Q−T as a match.
( ) search all q’s against all t’s using the fast, sensitive mmseqs protein search [ ], with vtml substitution matrix [ ], gap open cost of and extension cost of (figure s ). we optimized a short, spaced k-mer pattern for the prefilter stage ( ) with six informative (‘ ’) positions. in addition, align all q−t hits reported in the previous search on nucleotide level and prioritize near-perfect nucleotide hits (supp. materials).
( ) for each Q−T pair, compute the p-value for the best hit p_bh from first-order statistics.
( ) compute a combined score s_comb from best-hit p-values of multiple hits between Q and T using a modified truncated-product method (supp. materials).
( ) compute the false discovery rate (fdr = fp/(tp + fp)) and only retain matches with fdr < . . for that purpose, spacepharer is run on a null model database and the fraction of null matches with s_comb below a cutoff (empirical p-value) is used to estimate the fdr.
( ) scan nt upstream and downstream of the phage’s protospacer for a possible pam.

output is a tab-separated text file. each host-phage match spans two or more lines. the first starts with ‘#’: prokaryote accession, phage accession, s_comb, number of hits in the match. each following line describes an individual hit: spacer accession, phage accession, p_bh, spacer start and end, phage start and end, possible 5’ pam|3’ pam, possible 5’ pam|3’ pam on the reverse strand. if requested, the spacer–phage alignments are included. if taxonomic labels are provided, taxonomic reports based on the weighted lowest common ancestor (lca) procedure described in [ ] are created for host lcas of each phage genome or phage lcas of each spacer as additional tab-separated text files.

fig. . (a) spacepharer algorithm. a query set Q consists of 6-frame translated orfs (q) from crispr spacers, and a target set T consists of 6-frame translated orfs (t) of phage proteins. ( ) search all q’s against all t’s using mmseqs . align the q−t hits on nucleotide level and prioritize near-perfect nucleotide hits. ( ) for each Q−T pair, compute the p-value for the best hit from first-order statistics. ( ) compute score s_comb by combining the best-hit p-values from multiple hits between Q and T using a modified truncated-product method. ( ) estimate the fdr by searching a null database. ( ) scan for possible protospacer adjacent motif (pam). (b) performance comparison between spacepharer (blue) and blastn (red) using inverted phage sequences (solid lines) or eukaryotic viral orfs as null set (dashed lines), shown as the expected number of true positive (tp) predictions at different false discovery rates (fdrs). (c) performance comparison between blastn (left) and spacepharer using the weighted lowest common ancestor procedure (lca, right) at fdr = . , evaluated by the number of correct (blue) and incorrect (red) predictions, for all the host predictions made at each taxonomic rank or below.

iii. results
datasets. we split a previously published spacer dataset [ ] of , unique spacers from , prokaryotic genomes randomly into an optimization set ( %, , genomes) and a test set ( %, , genomes). the performance of spacepharer was evaluated on the spacer test set against a target database of , phage genomes. we used two null databases: , eukaryotic viral genomes and the inverted translated sequences of the target database. viral genomes were downloaded from genbank in / . the performance of spacepharer in figure c was evaluated on a validation dataset of spacers from , bacterial genomes against phage genomes with annotated host taxonomy [ ]. for each phage, we predicted the host based on the host lca.
prediction quality. at fdr = . , spacepharer predicted to times more prokaryote-phage matches than blastn (figure b, figure s ). spacepharer predicted the correct host for more phages than blastn at all taxonomic ranks, while including most of the blastn predictions, at better precision (figure c, figure s ). if the host or a close relative of a phage is absent in the database (either because the host is unidentified or the host lacks a crispr-cas system), the predicted host may be correct only at a higher rank than species.

run time. spacepharer took minutes to process the test dataset on × -core . ghz cpus, times faster than blastn ( minutes).

iv. conclusion
spacepharer is . to × more sensitive than blastn in detecting phage-host pairs, due to searching with protein sequences, optimizing short sequence comparisons, and combining statistical evidence, and it is fast enough to analyze large-scale genomic and metagenomic datasets.

funding
elk is a febs long-term fellowship recipient. the work was supported by the erc’s horizon framework programme [‘virus-x’, project no. ] and the bmbf complifesci project horizontal meta.

conflict of interest: none declared

references
[ ] altschul, s.f. et al ( ). basic local alignment search tool. j. mol. biol., ( ), – .
[ ] benson, d.a. et al ( ). genbank. nucleic acids res., (d ), d –d .
[ ] biswas, a. et al ( ). crisprtarget: bioinformatic prediction and analysis of crrna targets. rna biol., ( ), – .
[ ] biswas, a. et al ( ). crisprdetect: a flexible algorithm to define crispr arrays. bmc genom., ( ), .
[ ] bland, c. et al ( ).
crispr recognition tool (crt): a tool for automatic detection of clustered regularly interspaced palindromic repeats. bmc bioinform., ( ), .
[ ] burstein, d. et al ( ). major bacterial lineages are essentially devoid of crispr-cas viral defence systems. nature communications, ( ), .
[ ] edgar, r.c. ( ). piler-cr: fast and accurate identification of crispr repeats. bmc bioinform., ( ), .
[ ] edwards, r.a. et al ( ). computational approaches to predict bacteriophage–host relationships. fems microbiol. rev., ( ), – .
[ ] mirdita, m. et al ( ). fast and sensitive taxonomic assignment to metagenomic contigs. biorxiv. doi: . / . . . .
[ ] müller, t. et al ( ). estimating amino acid substitution models: a comparison of dayhoff’s estimator, the resolvent approach and a maximum likelihood method. mol. biol. evol., ( ), – .
[ ] paez-espino, d. et al ( ). uncovering earth’s virome. nature, ( ), – .
[ ] shmakov, s.a. et al ( ). the crispr spacer space is dominated by sequences from species-specific mobilomes. mbio, ( ), e – .
[ ] skennerton, c. ( ). minced - mining crisprs in environmental datasets. https://github.com/ctskennerton/minced.
[ ] steinegger, m. and söding, j. ( ). mmseqs enables sensitive protein sequence searching for the analysis of massive data sets. nat. biotechnol., ( ), – .
[ ] stern, a. et al ( ). crispr targeting reveals a reservoir of common phages associated with the human gut microbiome. genome res., ( ), – .
sequence-specific minimizers via polar sets
hongyu zheng, carl kingsford, and guillaume marçais∗
computational biology department, carnegie mellon university, pittsburgh, usa
february ,

abstract
minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. in contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied.
currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. we propose the concept of polar sets, complementary to the existing idea of universal hitting sets. polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. this allows for direct optimization of sketch size. we propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. a reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.

introduction
the minimizer (roberts et al., a,b) methods, also known as winnowing (schleimer et al., ), are methods to sample positions or k-mers (substrings of length k) from a long string. thanks to its versatility, this method is used in many bioinformatics programs to reduce memory requirements and computational resources. read mappers (li and birol, ; jain et al., b,a), k-mer counters (erbert et al., ; deorowicz et al., ), genome assemblers (ye et al., ; chikhi et al., ) and many more (see marçais et al. ( ) for a review) use minimizers.

in most cases, sampling the smallest number of positions, as long as a string is roughly uniformly sampled, is desirable as it leads to sparser data structures or less computation as fewer k-mers need to be processed. minimizers have such a guarantee of approximate uniform sampling: given the parameters w and k, a minimizer is guaranteed to select at least one k-mer in every window of w consecutive k-mers.
it achieves this goal by selecting the smallest k-mer (the “minimizer”) in every w-long window, where smallest is defined by a choice of an order o on the k-mers.

even though every minimizer scheme satisfies the constraint above, depending on the choice of the order o the total number of selected k-mers may vary significantly. consequently, research on minimizers has focused on finding orders o that obtain the lowest possible density, where the density is defined as the number of selected k-mers over the length of the sequence. in particular, most research concentrates on the average case: what is the lowest expected density given a long random input sequence? (marçais et al., , ; ekim et al., ; orenstein et al., ). in practice, many tools use a “random minimizer” where the order is defined by choosing at random a permutation of all the k-mers (e.g., by using a hash function on the k-mers). this choice has the advantage of being simple to implement and providing good performance on the average case.

∗to whom correspondence should be addressed. gmarcais@cs.cmu.edu

here we investigate a different setup that is common in bioinformatics applications. instead of the average density over a random input we try to optimize the density for one particular string or sequence. when applying minimizers in computational genomics, in many scenarios the sequence is known well in advance and it does not change very often.
for example, a read aligner may align reads repeatedly against the same reference genome (e.g., the human reference genome). in such cases, optimizing the density on this specific sequence is more meaningful than on a random sequence. moreover, the human genome has markedly different properties than a random sequence and optimization for the average case may not carry over to this specific sequence. in the read aligner example, a minimizer with lower density leads to a smaller index to save on disk and fewer seeds to consider in the seed-and-extend alignment algorithm, while preserving the same sensitivity thanks to the approximate uniform sampling property.

the idea of constructing sequence sketches tailored to a specific sequence has been explored before (chikhi et al., ; deblasio et al., ; jain et al., b), but it remains less understood than the average case. random sequences have nice properties that allow for simplified probabilistic analysis. consequently, different analytic tools are needed to analyze sequence-specific minimizers. in fact, minimizers designed to have low density in the average case often offer only modest improvements on sequences of interest such as reference genomes (zheng et al., a).

the current theory for minimizers with low density on average is tightly linked to the theory of universal hitting sets (uhs) (orenstein et al., ; marçais et al., ; kempa and kociumaka, ). as the name suggests, a uhs is a set of k-mers that “hits” every w-long window of every possible sequence (hence the universality; it is an unavoidable set of k-mers). universal hitting sets of small size generate minimizers with a provable upper bound on their density. universal hitting sets are less useful in the sequence-specific case as the requirement to hit every window of every sequence is too strong, and uhss are too large to provide a meaningful upper bound on the density in the sequence-specific case.
new theoretical tools are needed to analyze the sequence-specific case.

frequency-based orders are examples of sequence-specific minimizers (chikhi et al., ; jain et al., b). in these constructions, k-mers that occur less frequently in the sequence compare less than k-mers that occur more frequently. the intuition is to select rare k-mers as they should be spread apart in the sequence, hence giving a sparse sampling. this intuition is only partially correct. first, there is no theoretical guarantee that a frequency-based order gives low density minimizers, and there are many theoretical counter-examples. second, in practice, frequency-based orders often give minimizers with lower density, but not always. for example, winnowmap (jain et al., b) uses a two-tier classification (very frequent vs. less frequent k-mers) as it performs better than an order strictly following frequency of occurrence.

another approach to sequence-specific minimizers is to start from a uhs u and to remove as many k-mers from u as possible while it still hits every w-long window of the sequence of interest (deblasio et al., ). because this procedure starts with a uhs that is not related to the sequence, the amount of possible improvement in density is limited. additionally, given the exponential growth in size of the uhs with k, current methods are computationally limited to k ≤ , which is limiting in many applications.

the construction proposed here takes a different approach and introduces polar sets. the polar sets concept can be seen as complementary to the universal hitting sets: while a uhs is a set of k-mers that intersects with every w-long window at least once, a polar set is a set of k-mers that intersects with any window at most once. the name “polar set” is an analogy to a set of polar opposite magnets that cannot be too close to one another.
that is, our construction builds upon sets of k-mers that are sparse in the sequence of interest, and consequently the minimizers derived from these polar sets have provably tight bounds on their density.

our main contribution is theorem that gives an upper bound and a lower bound on the density obtained by a minimizer created from a polar set. these bounds are expressed in terms of the “total link energy” of the polar set on the given sequence. the link energy is a new concept that measures how well spread apart the elements of the polar sets are in the sequence: the higher the energy, the more spread apart the k-mers are. then we show that the link energy is almost exactly the improvement in density one gains from using a minimizer created from the polar set compared to a random minimizer. in the following sections we also show that the problem of finding a polar set with maximum total link energy is, unsurprisingly, np-hard, and we describe a heuristic to create polar sets with high total link energy. finally, we show that our implementation of this heuristic generates minimizers whose specific density on the human reference genome is much lower than that of any previous method, and, for some parameter choices, relatively close to the theoretical minimum.

methods

overview
we set the stage by defining important terms and concepts, then giving an overview of the main results, which are then proved formally in the following sections. the sequence s is a string on the alphabet Σ of size σ = |Σ|. the parameters k and w define respectively the length of the k-mers and the window size.
we assume that s is relatively long compared to these parameters: $|S| \gg w + k$.

definition (minimizer and windows). a minimizer is characterized by (w, k, o) where o is a complete order of $\Sigma^k$. a window is a sequence of length $(w + k - 1)$ consisting of exactly w k-mers. given a window as input, the minimizer outputs the location of the smallest k-mer according to o, breaking ties by preferring the leftmost k-mer.

the minimizer (w, k, o) is applied to the sequence s by finding the position of the smallest k-mer in every window of s. because two consecutive windows in s have a large overlap, the same k-mer is often selected in these two windows, hence the minimizer returns a sampling of positions in the sequence s. the specific density of the minimizer on s is defined as the number of selected positions over the length $|S|$. the density is between $1/w$, because at least one k-mer in every window must be picked, and $1$, because it is a sampling of the positions of s. therefore the goal is to find orders o that have a density as close to $1/w$ as possible. a minimizer with density $1/w$ is a perfect minimizer. for simplicity, when stating the density of a minimizer we ignore any additive term that is $o(1/w)$ (i.e., asymptotically negligible compared to $1/w$).

a random minimizer is defined by choosing at random one of the permutations of all k-mers. the expected density of a random minimizer is $2/(w + 1)$ (schleimer et al., ; roberts et al., b; zheng et al., a). equivalently, the expected distance between adjacent selected k-mers is $(w + 1)/2$. the random minimizers will serve as a baseline to compare to.

defining orders. for practical reasons, we define orders by defining a set u and considering orders that are compatible with u: an order o is compatible with u if for o every element of u compares less than any element not in u.
that is, only the smallest elements for o are specified (the elements of u) and a minimizer using an order compatible with u will preferentially select the elements of u. there exist many orders that are compatible with u as the relative order between the elements within u is not specified.

universal hitting sets. a set u is a universal hitting set if every one of the $\sigma^{w+k-1}$ possible windows (recall σ is the size of the alphabet) contains a k-mer from u. in the average case, minimizers compatible with u have densities upper bounded by $|U|/\sigma^k$, because only k-mers from the universal hitting set can be selected. supplementary section s provides a more detailed discussion of why this bound provided by universal hitting sets does not always apply for sequence-specific minimizer analysis, and why universal hitting sets do not specialize well.

short sequences. on a short random sequence (in a sense made precise by lemma ) most k-mers are unique (i.e., they occur only once in the sequence s). therefore, it is likely that there is a set u of unique k-mers of s that are exactly w bases apart in s, and a minimizer compatible with u is perfect. unfortunately most sequences of interest (e.g., reference genomes) are too long, too repetitive and in general do not satisfy the hypothesis of lemma . for most sequences it is not possible to find a set of “perfect seeds” of k-mers spaced exactly w apart.

polar sets. a polar set is a relaxed version of a perfect set: any pair of k-mers $m_1$ and $m_2$ from a polar set a are always more than $w/2$ bases apart in s (see the more general definition ).
the intuition behind this definition is that for a minimizer compatible with a, any k-mer from a selected by the minimizer is at distance $\ge (w + 1)/2$ from the previous and the next selected k-mer. hence, k-mers selected from a are at least as sparse, and usually more sparse than k-mers selected using a random minimizer in expectation.

section . gives a formal definition of the link energy of a polar set and theorem gives upper and lower bounds using this link energy for the density of a minimizer compatible with a polar set. this theorem shows that the link energy of the polar set a is a measure of how much reduction in density is obtained by using a minimizer compatible with a rather than a random minimizer. hence, designing a polar set with high link energy is a method to find minimizers with provably low density. section . introduces layered polar sets, which are an extension to polar sets, and builds a heuristic method to create such sets.

polar sets and link energy

key definitions

figure : comparing universal hitting sets, perfect seeds (compatible minimizers become perfect minimizers) and polar sets. each block indicates a k-mer, and each segment indicates a window of length (w = ). to provide a better contrast with universal hitting sets, we show polar sets with slackness $s = 0$ (see definition ). panels: k-mers in uhs (all windows contain at least one uhs k-mer; some windows have multiple); polar k-mers, slackness $s = 0$ (all windows contain at most one polar k-mer; some windows have none); perfect seeds (all windows contain exactly one seed k-mer).

we now define polar sets, the key component for our proposed methods.

definition (polar set). given sequence $S$ and parameters $(w, k, s)$ with $0 \le s < 1/2$, a polar set a of slackness $s$ is a set of k-mers such that every two k-mers in a appear at least $(1 - s)w$ bases apart in $S$.

this can be viewed as a complementary idea to the universal hitting sets or a relaxed form of perfect sets.
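as a minimal sketch of the polar set definition, the following python function checks whether a candidate k-mer set is a polar set of slackness s on a given sequence. the names `kmer_positions` and `is_polar_set` are illustrative and are not taken from the reference implementation; distances are measured between the start positions of k-mer occurrences, which is an assumption about how “bases apart” is counted.

```python
from typing import List, Set


def kmer_positions(seq: str, k: int, kmers: Set[str]) -> List[int]:
    """Sorted start positions in seq of every occurrence of a k-mer from kmers."""
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in kmers]


def is_polar_set(seq: str, w: int, k: int, s: float, kmers: Set[str]) -> bool:
    """True if all occurrences of k-mers from kmers are >= (1 - s) * w apart."""
    if not 0 <= s < 0.5:
        raise ValueError("slackness must satisfy 0 <= s < 1/2")
    pos = kmer_positions(seq, k, kmers)
    # Positions are sorted, so checking consecutive occurrences is sufficient.
    return all(b - a >= (1 - s) * w for a, b in zip(pos, pos[1:]))
```

for example, on the toy sequence `ACGTTGCAAC` with w = 4, k = 3 and s = 0, the set {ACG, TGC} passes the check (occurrences at positions 0 and 4), while {ACG, GTT} fails (positions 0 and 2). repeated occurrences of the same k-mer count as separate occurrences, so a k-mer that appears twice within fewer than (1 − s)w bases cannot belong to a polar set.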
as discussed in the introduction, a universal hitting set requires the set to hit every $w$ consecutive k-mers at least once, while a polar set with $s = 0$ requires the set to hit every $w$ consecutive k-mers at most once. a set of perfect seeds, if it exists, is both a polar set with zero slackness and a universal hitting set. see figure for a more concrete example. the condition $s < 1/2$ is critical for our analysis. specifically, this condition is required to obtain a lower bound on the specific density of compatible minimizers, not just an upper bound.

definition (link energy). given sequence $S$, parameters $(w, k)$ and a polar set $A$, if two k-mers on $S$ are $l \le w$ bases apart and are both in $A$, the link energy of the pair is defined as $2l/(w+1) - 1 \ge 0$. the total link energy of $A$ is the sum of link energy across all eligible pairs.

any two k-mers from $A$ in $S$ must be more than $w/2$ bases apart, so two k-mers cannot form a link if there is a third k-mer from $A$ between them. with $s = 0$, the link energy is fixed to be $2w/(w+1) - 1 = 1 - 2/(w+1) \approx 1$ for each eligible pair, and the total link energy is approximately the number of pairs that form a link, which in turn is the number of k-mer pairs in the polar set that are exactly $w$ bases away on $S$.

in the following sections, we introduce and discuss the backbone of the polar set framework, which revolves around closer inspection of how a random minimizer works on a specific sequence, and drawing contrast between sequence-specific minimizers and non-sequence-specific minimizers.
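The two definitions above (polar set and link energy) can be checked directly on a small string. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, occurrences are found by a plain linear scan, and the link energy routine assumes the input already satisfies the polar set condition with slackness below 1/2, so that only adjacent occurrences can form a link.

```python
def polar_positions(seq, k, polar):
    """Sorted start positions in seq whose k-mer belongs to the set `polar`."""
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in polar]


def is_polar_set(seq, w, k, polar, s=0.0):
    """True if all occurrences of polar k-mers are at least (1 - s) * w apart.

    Checking adjacent occurrences is enough: if neighbours are far enough
    apart, every other pair is even farther apart.
    """
    pos = polar_positions(seq, k, polar)
    return all(b - a >= (1 - s) * w for a, b in zip(pos, pos[1:]))


def total_link_energy(seq, w, k, polar):
    """Sum of 2*l/(w+1) - 1 over adjacent polar occurrences that are l <= w apart.

    Assumes `polar` is a valid polar set with slackness < 1/2, so no third
    polar k-mer can sit between two linked ones.
    """
    pos = polar_positions(seq, k, polar)
    return sum(2 * (b - a) / (w + 1) - 1
               for a, b in zip(pos, pos[1:]) if b - a <= w)
```

For example, in `"ACGTTGCA"` with k = 2 and w = 3, the set `{"AC", "TT", "CA"}` occurs at positions 0, 3 and 6, so it is a polar set with zero slackness and each of its two links carries energy 2·3/4 − 1 = 0.5.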
we use the term “non-sequence-specific minimizers” to refer to constructions of minimizers that do not specifically target a certain sequence, but rather aim to minimize density, the expected specific density on a random string.

perfect minimizer for short sequences

a perfect minimizer is a minimizer that achieves density of exactly $1/w$. while the only known examples of perfect minimizers are in the asymptotic case where $w \ll k$ (marçais et al., ), perfect sequence-specific minimizers exist with high probability for short sequences.

lemma . if $|S| < (1-\epsilon)\sqrt{w}\,\sigma^{k/2}$, with at least $\epsilon$ probability a random sequence of length $|S|$ has a perfect minimizer.

proof. the optimal minimizer is constructed with fixed interval sampling. more specifically, we take every $w$-th k-mer in $S$ and denote the resulting k-mer set $U$, then construct a minimizer compatible with $U$. the resulting minimizer is perfect if and only if the k-mers in $U$ only appear in the selected locations. there are $|S|/w$ selected locations and $(1 - 1/w)|S|$ locations not selected, and for each pair of selected and not selected locations, the k-mers at these two locations are identical with probability $\sigma^{-k}$ (see supplementary section s ). by the union bound, the probability that the sequence violates the polar set condition is at most $|S|^2\sigma^{-k}/w < (1-\epsilon)^2$, and the sequence has a perfect minimizer with probability at least $1 - (1-\epsilon)^2 > \epsilon$.

context energy and energy savers

contexts provide an alternative way to measure the density of a minimizer (zheng et al., a). these play a central role in the analysis of polar sets.

definition (charged contexts). a context of $S$ is a substring of length $(w + k)$, or equivalently $(w + 1)$ consecutive k-mers, or equivalently 2 consecutive windows. a context is charged if the minimizer selects a different k-mer in the first window than in the second window.

see top left of figure for examples of charged contexts.
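The notion of a charged context can be made concrete with a small simulation. The sketch below is illustrative only: the function names are hypothetical, the random ordering is modelled as a lazily filled table of random ranks, and ties between identical k-mers are broken toward the leftmost position. Under these assumptions, the number of distinct selected positions is one more than the number of charged contexts, because the position picked by consecutive windows never moves left.

```python
import random


def random_order(seed):
    """A random k-mer ordering, modelled as a lazily filled rank table."""
    rng = random.Random(seed)
    table = {}

    def rank(kmer):
        if kmer not in table:
            table[kmer] = rng.random()
        return table[kmer]

    return rank


def minimizer_positions(seq, w, k, rank):
    """Position picked in each window of w consecutive k-mers.

    The key (rank, position) makes the leftmost occurrence win on ties.
    """
    n_kmers = len(seq) - k + 1
    return [min(range(start, start + w),
                key=lambda i: (rank(seq[i:i + k]), i))
            for start in range(n_kmers - w + 1)]


def count_charged_contexts(picks):
    """A context spans two consecutive windows; it is charged if their picks differ."""
    return sum(1 for a, b in zip(picks, picks[1:]) if a != b)
```

A quick check on any string is `len(set(picks)) == count_charged_contexts(picks) + 1`, where `picks = minimizer_positions(seq, w, k, random_order(0))`.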
intuitively, a charged context corresponds to the event that a new k-mer is picked, and counting picked k-mers is equivalent to counting charged contexts.

lemma (specific density by charged contexts). for a given sequence $S$ and a minimizer, the number of selected locations by the minimizer equals the number of charged contexts plus 1.

given a context $c$, define $E(c)$ as the probability that $c$ will be charged with a random minimizer (one with a random ordering of k-mers), which we call the energy of $c$.

lemma . the expected number of picked k-mers in $S$ under a random minimizer is $1 + E_0(S)$, where $E_0(S) = \sum_c E(c)$ is called the initial energy of $S$ and the summation is over every context of $S$.

this is proved by combining the linearity of expectation and lemma . this implies that the total energy of a sequence is directly related to the specific density of random minimizers, which is the number of picked locations in $S$ divided by the number of k-mers in $S$. $E(c)$ admits a simple formula:

lemma . $E(c) = 2/U(c)$ if the last k-mer in the context is unique, $1/U(c)$ otherwise, where $U(c)$ denotes the number of unique k-mers in $c$.

proof. consider an imaginary minimizer with $w' = w + 1$ and identical $k$. the context of a $(w, k)$-minimizer is a window of the imaginary minimizer, and it is charged if and only if the imaginary minimizer picks either the first or the last k-mer. if the imaginary minimizer does not pick either end, the two constituent windows of the context share the same minimal k-mer, and the context is not charged.
with a random minimizer, the probability that the first k-mer is picked in the imaginary window is $1/U(c)$. the probability that the last k-mer is picked is $1/U(c)$ if the last k-mer is unique, and 0 otherwise, because the minimizer breaks ties by preferring the leftmost k-mer. the two events are mutually exclusive, so $E(c)$ is the sum of these two terms.

if all k-mers in a context are unique, $E(c) = 2/(w+1)$ is guaranteed, which we call the baseline. if this holds for all contexts, a random minimizer will have specific density of $2/(w+1)$, similar to applying a random minimizer to a random sequence. as lower $U(c)$ only increases $E(c)$, $E(c) < 2/(w+1)$ only if the last k-mer in $c$ is not unique and there are over $(w+1)/2$ unique k-mers in the context.

definition . a context $c$ is called an energy saver if $E(c) < 2/(w+1)$, and its energy deficit is defined as $2/(w+1) - E(c)$. the energy deficit of $S$, denoted $D(S)$, is the total energy deficit across all energy savers: $D(S) = \sum_c \max(0,\, 2/(w+1) - E(c))$.

in general, the value of $D(S)$ is very small because energy saver contexts (those with $E(c) < 2/(w+1)$) are rare.

lemma . for a random context, the probability that it is an energy saver is at most $w\sigma^{-k}$.

proof. we bound the probability that the last k-mer in a context is not unique. the probability that the last k-mer equals a specific k-mer in another location is $\sigma^{-k}$ (see supplementary section s ). applying the union bound over the $w$ other k-mers (as each context has $(w+1)$ k-mers) we get the desired result.

there are examples of sequences where energy saver contexts are abundant. an extreme scenario is when the sequence $S$ has a period of $w$ and has $w$ distinct k-mers. in this case, all contexts become energy saver contexts. these scenarios are rare in practice. similarly, we can define energy spenders and energy surplus as follows:

definition . a context $c$ is called an energy spender if $E(c) > 2/(w+1)$, and its surplus is defined as $E(c) - 2/(w+1)$.
the energy surplus of $S$, denoted $X(S)$, is the total energy surplus across all energy spenders: $X(S) = \sum_c \max(0,\, E(c) - 2/(w+1))$.

contexts with energy surpluses are more common than energy savers, but still fairly rare in a random sequence with suitable choice of $w$ and $k$:

lemma . for a random context, the probability that it is an energy spender is at most $w(w+1)\sigma^{-k}/2$.

proof. a context becomes an energy spender if the last k-mer is unique and some k-mer appears twice. we bound the probability that some k-mer in the context appears twice. following previous arguments, any two k-mers in a given context are identical to each other with probability $\sigma^{-k}$, and we apply a union bound of size $w(w+1)/2$ (enumerating over pairs of k-mers) to obtain the desired result.

density bounds with polar sets

with the proper tools, we now state the main theorem of the polar sets.

theorem . given a sequence $S$ and a polar set $A$ on $S$, let $E_0(S)$ be the initial energy of $S$, $D(S)$ be the total energy deficit, $X(S)$ be the total energy surplus, and $L(S, A)$ be the total link energy from the polar set. the number of selected k-mers over $S$ for a random minimizer compatible with $A$ is at most $1 + E_0(S) + D(S) - L(S, A)$, and at least $1 + E_0(S) - X(S) - L(S, A)$.

proof. we first prove the upper bound part. we start by elevating the energy of every energy saver context to the baseline $2/(w+1)$. by definition, this increases the total energy of $S$ by $D(S)$, so the number of selected k-mers is now upper bounded by $1 + E_0(S) + D(S)$. formally, $\sum E(x) \le 1 + E_0(S) + D(S)$.
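The quantities that enter the theorem, namely the initial energy, the energy deficit and the energy surplus, follow mechanically from the formula for the energy of a context. The sketch below computes them for a given string; the function names are hypothetical and the code favours clarity over speed (it re-counts the unique k-mers of every context from scratch).

```python
def context_energy(kmers):
    """E(c) for a context given as its w + 1 consecutive k-mers.

    2 / U(c) if the last k-mer is unique within the context, 1 / U(c) otherwise,
    where U(c) is the number of distinct k-mers in the context.
    """
    unique = len(set(kmers))
    last_is_unique = kmers.count(kmers[-1]) == 1
    return (2 if last_is_unique else 1) / unique


def energy_summary(seq, w, k):
    """Return (E0, D, X): initial energy, total deficit and total surplus of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    baseline = 2 / (w + 1)
    e0 = deficit = surplus = 0.0
    # A context is w + 1 consecutive k-mers, so there are len(kmers) - w of them.
    for start in range(len(kmers) - w):
        e = context_energy(kmers[start:start + w + 1])
        e0 += e
        deficit += max(0.0, baseline - e)
        surplus += max(0.0, e - baseline)
    return e0, deficit, surplus
```

When every k-mer of the string is distinct, each context sits exactly at the baseline, so both the deficit and the surplus are zero and the initial energy is the number of contexts times 2/(w+1).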
[figure : charged contexts and minimizer selection. a context has two windows; it is charged if a different k-mer is selected in these windows, and not charged otherwise. panels: contexts without polar k-mers (always charged, never charged, could be charged); contexts with polar k-mers (singleton; linked polar k-mers).]

[…] $D(S)$. for a ballpark estimate, we assume $S$ is a random sequence, and assume the slackness parameter $s = 0$ in the construction of the polar set. in this setup, each link has exactly $1 - 2/(w+1) \approx 1$ energy. as seen in lemma , a context is an energy saver with probability at most $w\sigma^{-k}$, and its deficit is at most $2/(w+1) - 1/w \approx 1/w$, meaning $D(S) \approx \sigma^{-k}|S|$. this further means we need the number of links to be at least $\sigma^{-k}|S|$ to provably beat a random minimizer. on the other hand, ignoring the effect of $D(S)$, in order to beat the specific density of a random minimizer by $\epsilon/(w+1)$, a total link energy of $\epsilon|S|/(w+1)$ is needed. assuming no slackness, this means the number of links needs to be at least $\epsilon|S|/(w-1)$. intuitively, an $\epsilon$ portion of the sequence needs to be covered by links between close enough k-mers in the polar set.

a proper polar set requires $s < 1/2$ for the main theorem to hold. when $s \ge 1/2$, only the upper bound part of the theorem holds, with an alternative definition of link energy. we will discuss the alternative definition in section . . , and further discuss generalization of polar sets in supplementary section s .

hardness of optimizing polar sets

the link energy formulation of polar sets allows us to cast the problem in a graph theoretical framework. consider an undirected, weighted graph where every unique k-mer is a vertex. an edge connects two k-mers with the following: if these two k-mers ever appear within fewer than $(1-s)w$ bases of each other in $S$, the weight is $-\infty$.
otherwise, the weight of this edge is the total link energy by selecting only these two k-mers, which might establish several links given each k-mer may appear in $S$ multiple times. there can also be self-loops with weights, given a k-mer may appear close to itself on the reference sequence. the problem of finding optimal polar sets becomes the problem of finding an induced subgraph with maximum weight. the general maximum induced subgraph problem is well known to be np-hard via reduction from max-clique. in supplementary section s , we provide an explicit proof that shows optimization of polar sets, even with an alphabet of three, is np-hard.

constructing polar sets

in this section, we propose a practical extension to polar sets, and formally introduce our heuristics.

[figure panels: three layers, showing selected and not-covered locations and covered locations for each layer, and the final coverage (not-covered k-mers).]

figure : examples of layered polar sets, with three layers. without layered polar sets, the k-mers from layer 2 and 3 could not be selected as in the polar set because of self-collision. the whole sequence is covered in this case (every window contains a polar k-mer from one layer). layer 1 is the one with highest priority and our layered heuristics construct it first.

layered polar sets

assume we have already constructed a polar set $A$ that covers some segments of the reference sequence. here, covered means that every window contains a k-mer from the polar set, or equivalently, $A$ acts as a universal hitting set on these segments.
now, to cover the rest of the reference, we shall extend $A$ so more k-mers become polar k-mers. it is natural to consider generating a polar set over the uncovered portion of the reference sequence, then merging this set with $A$. this however leads to problems. let $A'$ be a polar set over the uncovered portion of the reference sequence. $A \cup A'$ might not always be a valid polar set, because a k-mer $m' \in A'$ may appear in the already-covered part of the reference sequence, and appear close to another k-mer $m \in A$, thus violating the polar set condition for $A \cup A'$.

on the other hand, the reason we set up the constraint for polar sets is to ensure that k-mers in the polar set will always be selected by any compatible minimizer. in other words, we want to ensure we know exactly the set of k-mers that will be selected. the issue was that $m' \in A'$ might not always be selected by a compatible minimizer. however, from the perspective of constructing efficient minimizers, we do not need $m'$ to be selected everywhere, as in some places the reference sequence is already covered with k-mers in $A$. by forcing $m < m'$ for any $m \in A$, we ensure that $m'$ will only be selected outside the segments covered by $A$. applying this argument to all k-mers in $A'$, we can essentially ignore the sequence segments already covered by $A$ when constructing $A'$, as long as the ordering is satisfied.

this gives a way to progressively construct the layers of polar sets: at each layer we only need to consider regions of the reference sequence that are not yet covered by previous layers. formally:

definition . a layered polar set is a list of sets of k-mers $\{A_i\}$, for $1 \le i \le m$. with slackness $s$, the layered polar set condition is satisfied if for any k-mer in $A_j$, for each of its appearances at location $t$ in the reference sequence, either of the following holds:
• it is at least $(1-s)w$ bases apart from any k-mer in $\{A_1, A_2, \cdots, A_j\}$.
• it is covered: there are two k-mers in $\{A_1, A_2, \cdots, A_{j-1}\}$ (importantly, this does not include $A_j$), appearing at locations $l$ and $h$, satisfying $l < t < h$ and $h - l \le w$.

similarly, a compatible order for $\{A_i\}$ is an order that places all k-mers from $A_1$ first in arbitrary order, then those in $A_2$, ..., then those in $A_m$, and finally those not in any of $\{A_i\}$ in a random order. the link energy $L(\{A_i\}, S)$ is similarly defined over the pairs of close k-mer appearances that are not covered. more formally:

definition . for a layered polar set, if two k-mers in the layered polar sets, not necessarily from the same layer, appear $l \le w$ bases apart in $S$, and neither is covered, the link energy between them is $2l/(w+1) - 1 > 0$. $L(\{A_i\}, S)$ is the total link energy across all pairs.

these definitions of layered polar sets and link energy have two important properties. first, the link energy is non-decreasing as more layers are added to the set. second, an almost identical argument proves the same bounds for layered polar sets as for polar sets in theorem . see figure for a concrete example of layered polar sets.

polar set heuristic

we consider a simple heuristic to generate a polar set. the core idea is to select as many k-mers as possible from the set of k-mers that appear exactly $w$ bases away from each other. we cannot select all of them, as this may violate the polar set condition due to some k-mers appearing multiple times.
because reference sequences are long strings (in the range of billions of bases for mammalian genomes), we consider algorithms that scale well with the length of the reference sequence, preferably close to linear.

fix an offset $o \in [0, w-1]$. we start by listing all locations $t$ such that $t \equiv o \pmod{w}$ in the reference sequence $S$. we then randomly shuffle the locations, and for each location $t$ in this random order, add the k-mer at location $t$ to the polar set. when we add a k-mer $m$ to the polar set, we also locate and remove all k-mers in the polar set that appear fewer than $(1-s)w$ bases away from $m$. additionally, if a k-mer appears multiple times in the list, it is considered only once, at the first encounter. this is to prioritize k-mers that appear less often; frequent k-mers are expected to be processed early given their multiple occurrences, and are more likely to be absent in the final polar set as they have more chances to be removed due to conflicts. jain et al. ( b) have explored a similar idea in building tiered random minimizers using a biased hash function.

our algorithm also has a variant, which we call “monotonic”. in this variant, we require that adding a new k-mer $m$ and removing the k-mers conflicting with $m$ actually increases the link energy. otherwise, the k-mer is skipped and no conflicting k-mers are removed. this variant is slower but results in more efficient polar sets.

we filter k-mers before they are considered for addition to polar sets. a k-mer that collides with itself (appears fewer than $(1-s)w$ bases away from its own copy) cannot be in the polar set. we also filter out k-mers by their frequency in the reference sequence (see section . . for the threshold value).
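The non-monotonic heuristic described above can be sketched in a few lines for small inputs. This is an illustrative version under stated assumptions, not the authors' code: occurrences come from a plain dictionary rather than a suffix array, conflicts are found by scanning a dictionary of selected positions rather than linked blocks, and the names `polar_set_heuristic`, `max_freq` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def polar_set_heuristic(seq, w, k, s=0.0, max_freq=None, seed=0):
    """Single-layer, non-monotonic polar set heuristic (naive data structures)."""
    rng = random.Random(seed)
    min_gap = (1 - s) * w

    occ = defaultdict(list)  # k-mer -> sorted list of its start positions
    for i in range(len(seq) - k + 1):
        occ[seq[i:i + k]].append(i)

    def filtered(kmer):
        pos = occ[kmer]
        if max_freq is not None and len(pos) > max_freq:
            return True  # too frequent in the reference
        # self-collision: two copies fewer than (1 - s) * w bases apart
        return any(b - a < min_gap for a, b in zip(pos, pos[1:]))

    offset = rng.randrange(w)
    locations = list(range(offset, len(seq) - k + 1, w))
    rng.shuffle(locations)

    selected = {}  # position -> polar k-mer occurring at that position
    polar, processed = set(), set()
    for t in locations:
        m = seq[t:t + k]
        if m in processed or filtered(m):
            continue
        processed.add(m)
        # polar k-mers with an occurrence fewer than min_gap bases from m
        conflicts = {selected[q]
                     for p in occ[m]
                     for q in range(p - w, p + w + 1)
                     if q in selected and abs(q - p) < min_gap}
        for c in conflicts:
            polar.discard(c)
            for q in occ[c]:
                del selected[q]
        polar.add(m)
        for p in occ[m]:
            selected[p] = m
    return polar
```

Because every conflicting k-mer is removed before a new one is added, and self-colliding k-mers are never admitted, the returned set satisfies the polar set condition by construction. The window scan makes this version slower than the linear-time approach the text aims for.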
algorithm : pseudocode for polar set heuristics
function polarset($S$, $w$, $k$)
    start with an empty set $A \leftarrow \{\}$ and a random offset $o$
    shuffle list of locations $t \equiv o \pmod{w}$ for $0 \le t < |S|$
    for each $t$ in the list and the k-mer $m_t$ at location $t$ do
        skip if $m_t$ is filtered, or has been processed previously
        obtain list $L$ of occurrences of $m_t$ via suffix array
        obtain list of conflicting k-mers via linked blocks
        remove all conflicting k-mers and add $m_t$ to the polar set $A$
    end for
    return $A$
end function

algorithm shows the pseudocode for the non-monotonic variant of the heuristic. the monotonic variant is similar. we describe the data structures in section . . , and analyze the time complexity in section . . .

layered heuristics and hyperparameters

we construct layered polar sets with a similar algorithm. the properties of layered polar sets guarantee that new layers cannot decrease the final link energy of the polar set.

we rerun the polar set heuristic multiple times, each time with a new random offset $o$. each round is run with the current layers of polar sets, and the resulting polar set is added as a new layer. the algorithm for each layer is mostly identical to the single-layer version, with a few changes.
• when processing a k-mer, we skip all of its occurrences that are covered by existing layers of the polar set.
• we skip k-mers at non-covered locations $t$ that are fewer than $(1-s)w$ bases away from a k-mer in a previous layer. these k-mers cannot be in the layer without violating the layered polar set condition.
• at the end of each round, we remove all k-mers selected in the current layer that do not form a link with any k-mers.

we also gradually increase the threshold of k-mer frequency at each round to prioritize low frequency k-mers. in our experiments, we use a total of rounds, with the last two rounds being monotonic. the frequency threshold is set at the value to include % of locations of the reference in the first round, gradually increasing to % in the last round.

the slackness $s$ is also a tunable parameter, which determines when a pair of k-mers is considered in collision. a lower value of $s$ ensures that the distance between adjacent polar k-mers is large and that every pair of linked k-mers has higher link energy, but results in a smaller number of k-mers selected, implying fewer links. a higher value of $s$ means larger polar sets covering more of the reference sequence and more links formed, but adjacent polar k-mers may be closer to each other, resulting in lower energy per link. in our experiments, we use a fixed slackness s = . after parameter search. this results in approximately % less efficient links (average link energy compared to theoretical maximum), but higher total link energy due to the inclusion of more links. a more thorough parameter tuning might suggest a gradually increasing value of $s$ between rounds.

supporting data structures

our heuristics require some data structures to operate efficiently both in theory and in practice.

suffix array. in order to quickly index k-mers and obtain the list of occurrences of a k-mer, we precompute the suffix array, the inverse suffix array and the heights table (also known as the lcp array) of the reference sequence. all can be computed in linear time. this allows us to find the list of $T$ locations that share the same k-mer as location $t$, in $O(T)$ time.

linked blocks.
the layered polar set property ensures that in any stretch of $w/2$ bases, at most one k-mer at one location is selected into any layer of the layered polar sets, excluding covered locations. we use a data structure called linked blocks to represent the set of these selected locations of k-mers. let $h = \lfloor w/2 \rfloor$. we divide the locations in the reference sequence into $h$-long blocks, and use an array of length $|S|/h$ to represent these blocks. each value in the array $C[b]$ is either $-1$, meaning there is no selected location within this block spanning locations $[bh, (b+1)h)$, or a nonnegative integer $j$, indicating that the k-mer at location $bh + j$ is selected. with linked blocks we can do the following operation quickly:

definition . peekl($x$) returns the closest selected location to the left of $x$, up to $w$ bases.

this is because we only need to query up to three blocks. adding a location and removing a location also only involves a single block. similarly we can define peekr($x$).

with this data structure, we can implement many critical operations in the aforementioned heuristics. the step of filtering k-mers, more specifically determining whether a k-mer collides with itself, is done using this data structure, in similar fashion to bucket sorting. by maintaining two linked blocks, one for the current layer and one for all previous layers, we can determine whether a location is covered by the previous layers, and list collisions on the current layer.

calculating link energy. in the monotonic variant of our heuristics, we need to calculate the total link energy before and after adding a k-mer.
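The linked blocks structure described above amounts to one array slot per block of $\lfloor w/2 \rfloor$ locations. The class below is a minimal sketch with hypothetical names; it assumes the caller maintains the invariant of at most one selected location per block, treats "to the left of x" as strictly smaller than x, and simply scans every block that overlaps the query range rather than hard-coding the number of blocks to inspect.

```python
class LinkedBlocks:
    """Selected locations stored as one offset per block of h = w // 2 positions."""

    def __init__(self, n, w):
        self.w = w
        self.h = max(1, w // 2)
        # blocks[b] is -1 (empty) or the offset j of the selected location b*h + j
        self.blocks = [-1] * (n // self.h + 1)

    def add(self, x):
        b, j = divmod(x, self.h)
        assert self.blocks[b] == -1, "block already holds a selected location"
        self.blocks[b] = j

    def remove(self, x):
        b, j = divmod(x, self.h)
        if self.blocks[b] == j:
            self.blocks[b] = -1

    def peek_left(self, x):
        """Closest selected location q with x - w <= q < x, or None."""
        lo = max(0, x - self.w)
        for b in range(x // self.h, lo // self.h - 1, -1):
            j = self.blocks[b]
            if j != -1 and lo <= b * self.h + j < x:
                return b * self.h + j
        return None

    def peek_right(self, x):
        """Closest selected location q with x < q <= x + w, or None."""
        hi = min(x + self.w, len(self.blocks) * self.h - 1)
        for b in range(x // self.h, hi // self.h + 1):
            j = self.blocks[b]
            if j != -1 and x < b * self.h + j <= hi:
                return b * self.h + j
        return None
```

Each query touches a constant number of array slots for a fixed window length, and `add` and `remove` touch exactly one, which is what makes the per-occurrence cost of the heuristic constant.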
in our implementation, we update the link energy of the polar set as we add and remove locations to the linked blocks, using the following alternative formula for link energy: $L(\{A_i\}, S) = 2A_{cov}/(w+1) - A_{ele} - A_{seg}$. here, $A_{cov}$ is the number of contexts that contain a k-mer from the polar set, $A_{ele}$ is the number of non-covered locations of selected k-mers, and $A_{seg}$ is the number of continuous segments of windows that contain a k-mer from the polar set. when adding and removing a location to the linked blocks, the changes to these three values are calculated using linked block primitives in constant time, so we can update the link energy with constant overhead. as a sanity check, we see that when adding an isolated k-mer, $A_{cov}$ increases by $(w+1)$ and the other two values increase by 1, resulting in a net link energy gain of zero, consistent with the original definition. we can also compute the link energy of the polar k-mers in bottom part of figure using this formula, where acov = , aele = and aseg = , resulting in the total link energy of / .

time complexity analysis

we now analyze the time complexity of the layered polar sets heuristic, assuming no monotonic rounds for now. let $n$ be the length of the reference sequence, and assume a constant-sized alphabet. we assume a word of constant size can hold an integer in [ ,n], and that accessing an element in an array of length $n$ takes constant time. these conditions hold for genomes and -bit machines. this means the primitive operations on linked blocks take constant time, and operations involving the suffix array also take constant time.

consider a worst case scenario: by iterating over k-mers that appear exactly $w$ bases away from each other, we iterate over all k-mers in the reference sequence. assume a k-mer $m$ occurs $T$ times in the reference sequence.
in the filtering phase, we first fetch the list of $T$ locations in $O(T)$ time using the suffix array, and we want to determine if there are two elements whose difference is less than $(1-s)w$. this can be done using the linked blocks in $O(T)$ time. in the case of layered polar sets, we also want to determine if each of the locations is covered by previous layers, and if it is fewer than $(1-s)w$ bases away from a location in a previous layer. as we use one linked block for all previous layers, this can be done in $O(T)$ time. the filtering phase thus finishes in $O(T)$ time.

the main algorithm is split into three parts: detecting k-mers that are close to $m$ in the reference sequence, removing those k-mers from the polar set, and adding $m$ to the polar set. detecting and listing k-mers that are close to $m$ takes $O(T)$ time, as each location reports only four collisions at most, two to the left and two to the right. removing a k-mer that occurs $T'$ times takes $O(T')$ time, but since each k-mer is only added and removed once in one round, this amortizes to $O(T)$ time. adding $m$ to the polar set also takes $O(T)$ time. the singleton detection step (removing k-mers forming no links) also takes $O(T)$ time for checking if $m$ is a singleton.

as each k-mer is only visited once in the main algorithm, and in the worst case scenario every k-mer in $S$ is visited, we conclude that the layered polar set heuristics runs in $\sum O(T) = O(n)$ time for each layer, and as a special case the (non-layered) polar set heuristics runs in $O(n)$. the monotonic variant of the heuristic can in theory run in o(n ) time, but it is not significantly slower in practice.

results

all the experiments are run using the human reference genome hg . to facilitate the performance comparison across a range of parameter values of $w$ and $k$, we report the density factor (marçais et al., ) instead of the density. the density factor is the density multiplied by $(w+1)$.
regardless of the value of $w$, the random minimizer has an expected density factor of 2 and a perfect minimizer has a density factor of $1 + 1/w \approx 1$.

energy deficit and energy surplus

first, we calculate the average energy deficit $D(S)/|S|$ and average energy surplus $X(S)/|S|$. the results are in figure a.

[figure panels: (a) energy surplus and energy deficit relative to zero, density factor against value of k, for two values of w; (b) density factor against value of k, for two values of w, comparing random minimizers, lower bound, fixed interval sampling, miniception and layered polar sets.]

figure : left: energy surplus and deficit for short (w = ) and for long (w = ) windows, computed on the human reference sequence hg . the difference between the two lines is the difference between the upper and lower bound of theorem . it is very small and the bounds are very good estimates in practice. right: density factor for the proposed methods, for short and long windows, computed on hg . the bottom orange dashed line is the theoretical minimum density (perfect minimizers).

the reference genome is more repetitive than a purely random sequence. however, empirically the energy surplus and deficit are still small, well below . measured in density factor, implying a relative error of at most % when estimating specific density with link energy. thus, when constructing efficient minimizers by (layered) polar sets, using link energy to estimate specific density is efficient and accurate.
for reference, on a random sequence the average energy surplus and deficit are below − in absolute value, for the parameter range we are interested in.

evaluating polar set heuristics

we next evaluate our proposed algorithms for layered polar sets. we implemented the algorithm with python . experiments are run in parallel and the longest ones finish within a day. the peak memory usage stands at gb, which happens at the start, when loading the precomputed suffix array using python pickle. we compare our results against some other candidates:
• random minimizers. these achieve a density factor of 2 in theory and in practice, as indicated in the last section.
• lower bound. this corresponds to the density factor for perfect minimizers. while our theory predicts the existence of perfect minimizers matching the lower bound with a large value of $k$, this rarely happens with practical parameter values.
• fixed interval sampling. this method uses every $w$-th k-mer from $S$ as the set $U$ to define a compatible minimizer.
• the miniception (zheng et al., a), a practical algorithm that provably achieves lower density in many scenarios. the hyperparameter $k_0$ is set to max( ,k −w) for our experiments.

we do not include existing algorithms for constructing compact universal hitting sets because these methods do not scale to values of k > . our heuristics work the best when k-mers do not appear too frequently, or roughly speaking, when $\sigma^k > n$, where $n$ is the length of the reference sequence. this choice of parameter is common in bioinformatics analysis. with the sequence at the size of the human reference genome, our heuristics work well starting at k = . additionally, the miniception achieves comparable performance with leading uhs-based heuristics, so its performance also serves as a viable proxy.

we consider two scenarios, first with short windows (w = ) and second with long windows (w = ). the results are shown in figure b.
our experiments indicate that our simple heuristics yield efficient minimizers, greatly outperforming random minimizers and the miniception, while maintaining a consistent edge over fixed interval sampling methods, in both short-window and long-window settings. the improvement is more pronounced when the windows are long. since our layered polar set heuristics consist of multiple rounds, in supplementary section s . we show the progression of density factors through rounds, demonstrating that the layered heuristics are particularly effective at low values of k.

we next show that in building sequence-specific minimizers using layered anchor sets, we do not sacrifice their performance in the general case measured by (expected) density. in supplementary section s . , we sketch a random sequence using the sequence-specific minimizers we built for hg . as expected, the performance closely matches that of a random minimizer.

discussion

limits and future of polar sets

while the concept of polar sets is interesting and leads to improvements in state-of-the-art sequence-specific minimizer design, we should acknowledge its limitations. first, it cannot be used in designing non-sequence-specific minimizers when w > k. arguably, this means the method is more tailored for sequence-specific minimizers. see supplementary section s for a proof and more discussion on non-sequence-specific polar sets. our experimental results show that the performance of minimizers based on polar sets greatly improves as k grows.
when each k-mer appears many times in the reference sequence, it becomes hard to select many k-mers without violating the polar set condition. for comparison, in supplementary section s . we show the results when we apply the heuristics to the human chromosome sequence only, which is about / as long as the whole human reference genome. improvements across the board are observed for the heuristic algorithms and the fixed interval sampling methods. the repetitiveness of the human reference genome also makes the optimization of specific density much more difficult. in supplementary section s . , we show the results when we apply the heuristics to build sequence-specific minimizers on a random sequence that is as long as the chr sequence. it is significantly easier to reach the theoretical minimum specific density of /w in this setup compared to the previous one.

with better computing power and more efficient algorithms, it is desirable to compute an optimal polar set. thanks to our link energy formulation, the problem of finding an optimal polar set can be formulated as an integer linear program (ilp), with each k-mer being a binary variable. for moderately-sized reference sequences, an optimal polar set can be found. however, no such convenient formulation exists for layered polar sets, and it is an interesting question whether there is a tractable optimization problem for minimizers in general.

practicality of sketches-by-optimization

polar sets can be used wherever universal hitting sets are used, in most cases. given that our heuristics for layered polar sets only produce a small number of layers, the implementation of a compatible minimizer with layered polar sets is not fundamentally different from that with a universal hitting set.

the fixed interval sampling method is very similar to previously proposed methods (khiste and ilie, ; almutairy and torng, ; frith et al., ), where the sketch of a reference sequence is simply the set of k-mers appearing at locations divisible by w.
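to make the fixed interval sampling baseline concrete, here is a minimal sketch under stated assumptions. it collects the k-mers at positions divisible by w into a set u, then runs a minimizer that is "compatible" with u in the assumed sense that every k-mer in u is ordered before every k-mer outside u, with a hash breaking the remaining ties. the function names and the blake2b tie-break are hypothetical and are not taken from the implementation evaluated in this manuscript.

```python
import hashlib
import random


def random_dna(length, seed):
    """Return a reproducible, uniformly random DNA string (illustrative helper)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def hash_order(kmer):
    """Hash-based tie-break between k-mers of the same priority class."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fixed_interval_sample(seq, w, k):
    """The set of k-mers that start at positions divisible by w."""
    return {seq[i:i + k] for i in range(0, len(seq) - k + 1, w)}


def compatible_minimizer_positions(seq, w, k, preferred):
    """Minimizer whose ordering puts k-mers in `preferred` first (assumed
    meaning of 'compatible'), scanning every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # False sorts before True, so members of `preferred` always win a window.
    keys = [(kmer not in preferred, hash_order(kmer)) for kmer in kmers]
    selected = set()
    for start in range(len(keys) - w + 1):
        selected.add(min(range(start, start + w), key=keys.__getitem__))
    return selected


if __name__ == "__main__":
    w, k = 10, 15
    seq = random_dna(5000, seed=1)
    u = fixed_interval_sample(seq, w, k)
    chosen = compatible_minimizer_positions(seq, w, k, u)
    n_kmers = len(seq) - k + 1
    print("density factor:", len(chosen) / n_kmers * (w + 1))
```

every window of w consecutive k-mers contains exactly one position divisible by w, so the selected k-mer always belongs to u. on a sequence with few repeated k-mers the selected positions are essentially the sampled ones, while on a repetitive reference a sampled k-mer can also be selected at its other occurrences.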
polar sets might not be able to directly replace fixed interval sampling; however, a polar set can be readily expanded into a set of seeds that covers the whole reference sequence.

these approaches are currently relatively underused compared to more traditional minimizers such as lexicographic, random, or slight variants of either. a significant reason for their unpopularity is the fact that using these methods requires looking up a table of k-mers, be it a set of polar k-mers or universal hitting k-mers, for every k-mer in the query sequence. in contrast, for a random minimizer implemented using a hash function, no lookup is required during the sequence sketch generation process. since these lookup tables are usually the result of sequence-specific optimization, we say these methods fall into the category of “sketches-by-optimization”. this contrast leads to interesting tradeoffs in efficiency. for example, using a polar-set-compatible minimizer generates a more compact sequence sketch, but might take more time at query compared to using a random minimizer, due to the time spent in loading and querying the set of polar k-mers.

we believe better implementations of k-mer lookup tables and better optimization of sequence sketches, possibly in a joint manner, will popularize sketches-by-optimization. existing methods already take steps towards this goal. jain et al. ( b) use a compact lookup table to index frequent k-mers, and liu et al. ( ) use a bloom filter to perform approximate queries over fixed interval samples.
techniques like k-mer bloom filters (pellow et al., ) might also further help the performance.

alternative measurements of efficiency

throughout this manuscript our goal has been the optimization of specific density. low density results in smaller sequence sketches, and for many applications this is desirable. however, depending on the way one uses the sequence sketch, alternative measurements of efficiency may be desirable (also see the discussion in edgar ( )). for example, in k-mer counting, minimizers are used to place k-mers into buckets. in this case, the specific density is less relevant, and we are more concerned about the number of buckets and the load balance between different buckets (marçais et al., ; nyström-persson et al., ). for read mapping, smaller sequence sketches have their own advantages, while some may prefer reducing the number of matches, or reducing the false positive seed matches in general. we believe many of these objectives are correlated with each other, and we are interested in both further exploring the benefits of a small sequence sketch and optimization techniques for alternative measurements of efficiency.

conclusion

inspired by deficiencies in current theory and practice around sequence-specific minimizers, we propose the concept of polar sets, a new approach to construct sequence-specific minimizers with the ability to directly optimize the specific density of the resulting sequence sketch. we also propose simple and efficient heuristics for constructing (layered) polar sets, and demonstrate via experiments on the human reference genome the superior performance of minimizers constructed by our proposed heuristics. while there are still concerns around the practical utility, we believe the polar set framework will be a valuable asset in the design and analysis of efficient sequence sketches.
funding

this work has been supported in part by the gordon and betty moore foundation’s data-driven discovery initiative through grant gbmf to c.k., by the us national institutes of health (r gm ), and the us national science foundation (dbi- ). this work was partially funded by the shurl and kay curci foundation. this project is funded, in part, under a grant (# ) with the pennsylvania department of health. the department specifically disclaims responsibility for any analyses, interpretations or conclusions.

conflict of interests: c.k. is a co-founder of ocean genomics, inc. g.m. is v.p. of software development at ocean genomics, inc.

references

almutairy, m. and torng, e. ( ). comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. plos one , ( ), e .
blackburn, s. r. ( ). non-overlapping codes. ieee transactions on information theory , ( ), – .
chikhi, r., limasset, a., and medvedev, p. ( ). compacting de bruijn graphs from sequencing data quickly and in low memory. bioinformatics , ( ), i –i .
deblasio, d., gbosibo, f., kingsford, c., and marçais, g. ( ). practical universal k-mer sets for minimizer schemes. in proceedings of the th acm international conference on bioinformatics, computational biology and health informatics , bcb ’ , pages – , new york, ny, usa. acm.
deorowicz, s., kokot, m., grabowski, s., and debudaj-grabysz, a. ( ). kmc : fast and resource-frugal k-mer counting. bioinformatics , ( ), – .
edgar, r. c. ( ). syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences.
biorxiv .
ekim, b., berger, b., and orenstein, y. ( ). a randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. biorxiv: . . . .
erbert, m., rechner, s., and müller-hannemann, m. ( ). gerbil: a fast and memory-efficient k-mer counter with gpu-support. algorithms for molecular biology , ( ), .
frith, m. c., noé, l., and kucherov, g. ( ). minimally-overlapping words for sequence similarity search. biorxiv .
jain, c., rhie, a., hansen, n., koren, s., and phillippy, a. m. ( a). a long read mapping method for highly repetitive reference sequences. biorxiv , page . . . .
jain, c., rhie, a., zhang, h., chu, c., walenz, b. p., koren, s., and phillippy, a. m. ( b). weighted minimizer sampling improves long read mapping. bioinformatics , (supplement ), i –i .
kempa, d. and kociumaka, t. ( ). string synchronizing sets: sublinear-time bwt construction and optimal lce data structure. in proceedings of the st annual acm sigact symposium on theory of computing , pages – .
khiste, n. and ilie, l. ( ). e-mem: efficient computation of maximal exact matches for very large genomes. bioinformatics , ( ), – .
levenshtein, v. i. ( ). maximum number of words in codes without overlaps. problemy peredachi informatsii , ( ), – .
li, h. and birol, i. ( ). minimap : pairwise alignment for nucleotide sequences. bioinformatics , ( ), – .
liu, y., zhang, l. y., and li, j. ( ). fast detection of maximal exact matches via fixed sampling of query k-mers and bloom filtering of index k-mers. bioinformatics , ( ), – .
marçais, g., pellow, d., bork, d., orenstein, y., shamir, r., and kingsford, c. ( ). improving the performance of minimizers and winnowing schemes. bioinformatics , ( ), i –i .
marçais, g., deblasio, d., and kingsford, c. ( ). asymptotically optimal minimizers schemes. bioinformatics , ( ), i –i .
marçais, g., solomon, b., patro, r., and kingsford, c. ( ). sketching and sublinear data structures in genomics.
annual review of biomedical data science, ( ), – .
mykkeltveit, j. ( ). a proof of golomb’s conjecture for the de bruijn graph. journal of combinatorial theory, series b , ( ), – .
nyström-persson, j. t., keeble-gagnère, g., and zawad, n. ( ). compact and evenly distributed k-mer binning for genomic sequences. biorxiv .
orenstein, y., pellow, d., marçais, g., shamir, r., and kingsford, c. ( ). compact universal k-mer hitting sets. in algorithms in bioinformatics , lecture notes in computer science, pages – . springer, cham.
pellow, d., filippova, d., and kingsford, c. ( ). improving bloom filter performance on sequence data using k-mer bloom filters. journal of computational biology , ( ), – .
roberts, m., hunt, b. r., yorke, j. a., bolanos, r. a., and delcher, a. l. ( a). a preprocessor for shotgun assembly of large genomes. journal of computational biology , ( ), – .
roberts, m., hayes, w., hunt, b. r., mount, s. m., and yorke, j. a. ( b). reducing storage requirements for biological sequence comparison. bioinformatics , ( ), – .
schleimer, s., wilkerson, d. s., and aiken, a. ( ). winnowing: local algorithms for document fingerprinting. in proceedings of the acm sigmod international conference on management of data, sigmod ’ , pages – . acm.
ye, c., ma, z. s., cannon, c. h., pop, m., and yu, d. w. ( ). exploiting sparseness in de novo genome assembly. bmc bioinformatics , , s .
zheng, h., kingsford, c., and marçais, g. ( a). improved design and analysis of practical minimizers. bioinformatics , (supplement ), i –i .
zheng, h., kingsford, c., and marçais, g. ( b).
lower density selection schemes via small universal hitting sets with short remaining path length. arxiv preprint arxiv: . .

supplementary materials

s a technical lemma on k-mer repetition

here we prove a technical lemma on the repetitive occurrence of k-mers. similar versions of this can be found in (chikhi et al., ). recall that σ is the size of the alphabet.

lemma s . given a random sequence and a pair of locations i < j, the probability that the k-mer starting at i equals the k-mer starting at j is exactly σ^{−k}.

proof. if j − i ≥ k, the two k-mers do not share bases, so given that they are both random k-mers independent of each other, the probability is σ^{−k}. otherwise, the two k-mers intersect. we let d = j − i, and use m_i to denote the k-mer starting at location i. we use s to denote the substring from the start of m_i to the end of m_j, with length k + d (or equivalently, the union of m_i and m_j). if m_i = m_j, the p-th character of m_i is equal to the p-th character of m_j, meaning s_p = s_{p+d} for all ≤ p < k. this further means s is a repeating sequence of period d, so s is uniquely determined by its first d characters and there are σ^d possible configurations of s. the probability that a random s satisfies m_i = m_j is then σ^d/σ^{k+d} = σ^{−k}.

s universal hitting sets and related analyses

universal hitting sets have been an important component in constructing practical minimizers. in this section, we provide a more formal and technical discussion on universal hitting sets. in section s .
, we formally define uhs and discuss why existing heuristics to construct uhs are not adequate for sequence-specific minimizers. in section s . and section s . , we discuss the two existing methods to analyze compatible minimizers of uhses, and show that these approaches both have issues that make them unfit for our goal. in section s . we discuss how uhses can in fact be treated as special cases of polar sets, which may inspire new developments in this line of research.

s . definitions and inelasticity of uhs

definition s (universal hitting sets). let u be a set of k-mers. if u intersects with every w consecutive k-mers, it is a uhs over k-mers with path length w and relative size |u|/σ^k.

a decycling set is a set of k-mers that intersects with every sufficiently long string. any universal hitting set must be a decycling set, so a lower bound on the size of decycling sets applies to all universal hitting sets.

lemma s (minimal decycling sets). any uhs over k-mers with finite path length has relative size Ω( /k).

with a universal hitting set, it is guaranteed that any compatible minimizer will only select k-mers within the uhs on any sequence. currently, the most popular approach for constructing efficient minimizers is via the construction of a compact universal hitting set, followed by constructing a compatible minimizer. these universal hitting sets are usually constructed by expanding from a minimal decycling set (mds). as we have shown before (zheng et al., b), the mykkeltveit mds (mykkeltveit, ), the mds that is predominantly used as the starting point, already covers all windows of length o(k ). empirically, with larger values of w only a few k-mers need to be added to satisfy the universal hitting condition. as a result, uhses constructed for different references look like each other, and the compatible minimizers do not specialize well.

a related concern about using uhses on specific sequences is the handling of repetitive k-mers.
as we have discussed, repetitive k-mers are prevalent in the human reference genome. any universal hitting set always contains homomers like aaa···a, as it is required to cover a sequence consisting only of a's. this argument also extends to other repetitive k-mers. such homomers, or repetitive k-mers, would then be preferred when using compatible minimizers for sequence sketching. this problem of prioritizing repetitive k-mers is also present in fixed interval sampling. meanwhile, existing literature (li and birol, ; jain et al., b) suggests it is in fact beneficial not to select these k-mers for read mapping, while proposing different remedies for this issue. our proposed methods also have the effect of avoiding repetitive k-mers, as these k-mers likely do not pass the filtering step.

s . analysis via density upper bound

there are two existing ways to analyze the density of compatible minimizers. the first is via the following lemma, as we have mentioned in the main text:

lemma s . if u is a uhs over k-mers, any compatible minimizer has density at most |u|/σ^k.

this lemma is universally applicable and it does not depend on the ordering within u. however, this is an upper bound which becomes non-informative with w > k and sufficiently large k. because any universal hitting set is at least as large as a minimal decycling set (lemma s ), and a random (w,k)-minimizer achieves a density of approximately /(w + ), lemma s at best tells us the compatible minimizer is no worse than a random one.

s .
analysis via probability of single uhs contexts

there is a second approach to the analysis of compatible minimizers from universal hitting sets (marçais et al., ). the key lemma reads as follows (slightly rephrased):

lemma. if u is a uhs over k-mers, let sp(u) be the probability that a context contains only one element in u. under certain assumptions, the expected density of a random minimizer compatible with u is ( − sp(u))/(w + ).

we now show that this lemma relies on assumptions that depend heavily on the structure of u. we start with some notation, slightly different from the original paper. fix a context, and let m_i denote the i-th k-mer in the context. we also let z_i = (m_i ∈ u), let h denote the event that the context is charged, and let Z = ∑_{i= }^{w} z_i. let c(n,k) be the binomial coefficients. the proof involves the following equation (we only list the first term; there are four analogous terms):

p(h | Z = j) = [c(w − , j − ) / c(w + , j)] · p(h | Z = j, z_{ } = , z_w = ) + · · ·

which involves a counting argument: given Z = ∑_{i= }^{w} z_i = j, there are c(w + , j) different configurations of z, and c(w − , j − ) of them satisfy z_{ } = and z_w = . however, by invoking this counting argument, it is implicitly assumed that every configuration satisfying ∑ z_i = j happens with the same probability, as stated (again, we only keep the terms with z_{ } = and z_w = and hide the rest of the terms):

assumption. let p(z) be the probability of generating a random context and observing z_i = (m_i ∈ u). if ∑ z_i = ∑ z′_i, then p(z) = p(z′).

if this is true, we also have p(z | Z = j) = /c(w + , j). we now recover the statement as follows:

p(h | Z = j, z_{ } = , z_w = )
  = ∑_{z : Z = j, z_{ } = , z_w = } p(h | z) · p(z | Z = j, z_{ } = , z_w = )
  = ∑_{z : Z = j, z_{ } = , z_w = } p(h | z) / c(w, j − )

p(h | Z = j)
  = ∑_{z : Z = j} p(h | z) · p(z | Z = j)
  = ∑_{z : Z = j, z_{ } = , z_w = } p(h | z) · p(z | Z = j) + · · ·
  = ∑_{z : Z = j, z_{ } = , z_w = } p(h | z) / c(w + , j) + · · ·
  = [c(w − , j − ) / c(w + , j)] · p(h | Z = j, z_{ } = , z_w = ) + · · ·
the assumption is true in expectation if the uhs itself is a random subset of Σ^k, which is not the case as that set also has to satisfy the uhs condition. for a general set u, the probability that a k-mer is in u is highly dependent on whether the preceding intersecting k-mers are in u, and the assumption is likely not valid in most scenarios.

finally, universal hitting sets may be constructed in a specific way to enable better analysis of compatible minimizers, as seen in (zheng et al., a). we do not discuss these, as they do not apply to other universal hitting sets.

s . uhs as improper polar sets

the alternative formula for link energy, as described in section . . , allows us to define the link energy of any subset of k-mers, not just those satisfying the polar set condition. the main theorem for polar sets still holds, but only the upper bound part. interestingly, if we plug in a universal hitting set, we get a_cov = n, a_ele = |u|, a_seg = and a link energy of n/(w + ) − |u| − , where n is the number of k-mers in the reference sequence and |u| is the total number of times a k-mer in the uhs appears in the reference sequence. plugging this into the main polar set theorem, we recover the specific density upper bound |u|/n for universal hitting sets, up to an error of d(s)/n. in this sense, universal hitting sets can be seen as a specific and extreme case of an improper polar set.

s np-completeness of optimal polar set

in this section, we show a reduction from the problem of maximal independent set to the problem of optimal polar set, with an alphabet of .
let g = (v,e) be the instance for maximal independent set, and without loss of generality, let |v| = d. we use Σ = {x, , } as the alphabet, and for the polar set instance, we let w = d + , k = d and s = . this means that we want to find a subset of d-mers that forms many links exactly d + bases apart, but no two d-mers in the polar set can be fewer than d + bases from each other. with s = , link energy is equivalent to the number of links up to a scaling factor, so we are optimizing the number of links that can be formed. we now construct the query string for the polar set problem, which we divide into three sections.

disqualification gadget. given an arbitrary d-mer z ∈ Σ^d, we let the disqualification gadget be the following string:

dq(z) = x^{d+ } z x^{d} z x^{d+ }

in the presence of dq(z), z cannot appear in the polar set, because it appears twice exactly d bases apart in the disqualification gadget. the x^{d+ } section on both ends of the gadget prevents d-mers within the gadget from forming links with adjacent gadgets or sections, as x^d is not in the polar set.

disqualification section. we append a disqualification gadget to the query string for every d-mer (there are at most k = n . of them), except all d-mers containing only and .

vertex section. for each vertex v in g, let a be its binary representation. we add x^{d+ } a x^{d+ } a x^{d+ } to the query string.

edge section. for each edge (u,v) in g, let a, b be the binary representations of the two ends. we add x^{d+ } a x^{d} b x^{d+ } to the query string.

the final query string is formed by the concatenation of the three sections.

theorem s . the maximal independent set problem can be solved by solving the optimal polar set problem on the aforementioned query string.

proof. we claim any polar set of the query string corresponds to an independent set v′ of g, with |v′| links. all d-mers in the polar set are those representing vertices in g, as other d-mers (those containing x) cannot appear due to the disqualification section. for each d-mer in the polar set, we get one link from the
vertex section of the query string. if (u,v) ∈ e, the two d-mers representing u and v cannot be selected into the polar set at the same time, because in the edge section these two d-mers are exactly d bases apart, violating the polar set condition. on the other hand, every independent set of g can be represented by a polar set, with |v′| total links, using the same argument. we conclude that the optimal polar set of the query string represents a maximal independent set of g, which proves the statement.

this reduction also implies hardness of approximately solving optimal polar sets.

s on non-sequence-specific polar sets

for the sake of simplicity, in this section we only discuss polar sets with s = . the discussion about s > is highly similar.

as we have discussed, the density of a minimizer is the expected specific density over a random sequence. equivalently, it equals the specific density on the de bruijn sequence of order at least w + k. therefore, one may construct polar sets on the de bruijn sequence of sufficient order to build non-sequence-specific minimizers. however, this is impossible with long windows:

lemma s . no non-trivial polar set exists when w > k and s is the de bruijn sequence of order w + k.

proof. we simply show that no k-mer can be in the set. for every k-mer m, the sequence mm exists within s, because s is the de bruijn sequence of order at least k. picking m violates the condition for a polar set because it appears twice, k < w bases apart, in s.

polar sets exist on de bruijn sequences of order w + k, when w ≤ k.
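the argument in the proof above can be checked mechanically on small strings. the sketch below is a minimal illustration under an assumed reading of the polar set condition for the simplest slackness setting discussed in this section: a set of k-mers qualifies if any two occurrences of its members in the sequence start at least w positions apart. the function names `occurrences` and `is_polar_set` are hypothetical, and slackness is not modelled.

```python
def occurrences(seq, kmers):
    """Sorted start positions in seq of every k-mer belonging to `kmers`.
    All k-mers in the set are assumed to have the same length."""
    k = len(next(iter(kmers)))
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in kmers]


def is_polar_set(seq, kmers, w):
    """Assumed polar set condition without slackness: any two occurrences of
    selected k-mers are at least w positions apart."""
    if not kmers:
        return True  # the empty (trivial) set always qualifies
    positions = occurrences(seq, kmers)
    # Positions are sorted, so checking consecutive pairs is sufficient.
    return all(b - a >= w for a, b in zip(positions, positions[1:]))


if __name__ == "__main__":
    m = "ACGT"          # a k-mer with k = 4
    seq = m + m         # the string mm from the proof
    print(is_polar_set(seq, {m}, w=5))  # w > k: the two copies are too close
    print(is_polar_set(seq, {m}, w=4))  # w = k: the two copies are far enough
```

the two calls mirror the case split in the lemma: with w > k the two copies of m inside mm are only k positions apart, while with w = k the same pair is allowed. a k-mer with a shorter period, such as a homopolymer, still fails because its occurrences overlap.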
with w = k, these polar sets become non-overlapping k-mers (levenshtein, ), that is, a set of k-mers where no proper prefix of a k-mer equals a proper suffix of another k-mer. the problem of finding a large set of non-overlapping k-mers is hard in general, although constructive algorithms exist (blackburn, ) for constant factor approximation. with w < k we obtain minimally-overlapping k-mers, a concept that has also been studied in other contexts (frith et al., ). we believe the concept of non-sequence-specific polar sets is of both practical and theoretical interest.

s supplementary experiments and figures

s . density factor of layered polar sets by round

to show that our proposed layered anchor set heuristics are useful, in figure we plot the density factor after each round of optimization on the human reference genome hg . all algorithms are run for a total of rounds, with the last two being monotonic rounds. we select to ensure the resulting sets are not too complicated and can be computed in a reasonable amount of time. with more rounds, many of the results can be further improved.

s . viability of sequence-specific minimizers on non-target sequences

to validate that the optimization of sequence-specific density does not come at the cost of higher (non-sequence-specific) density, we generate the sequence-specific minimizers for the hg reference genome, then apply these minimizers on a random sequence. figure shows the results. we expect these to perform close to random minimizers when σ^k ≫ n, where n is the length of the reference sequence. in these cases, most k-mers in a random sequence are not seen in the reference sequence, and optimized sequence-specific minimizers behave just like random minimizers in most cases. the performance of the miniception is almost identical to that in hg , and is not shown in this plot. the layered polar sets are also arguably more robust at lower values of k, as their density stays close to that of a random minimizer.
[figure, two panels. x-axis: round no. (hg , w = ); y-axis: estimated density factor. legend: start, lower bound, and one series per value of k.]

figure : density factor of layered anchor sets after each round of the optimization, corresponding to the experiments shown in figure b.

[figure, two panels. x-axis: value of k (w = ); y-axis: density factor. legend: random minimizers, lower bound, fixed interval sampling, layered polar sets.]

figure : performance of sequence-specific minimizers on random sequences (optimized on hg ) with w = (left) and w = (right). this is different from figure : here the specific density is measured on an unrelated random sequence.

s . experiments on human chromosome

to show the effect of reference sequence length on the performance of sequence-specific minimizers, in figure we show the performance plot when we build sequence-specific minimizers for chr only. the human chromosome sequence is around % of the whole hg sequence, and consistent with our theory, the time and memory spent to run these experiments on chr are around % of that for the hg ones.

[figure, two panels. x-axis: value of k (w = ); y-axis: density factor. legend: random minimizers, lower bound, fixed interval sampling, miniception, layered polar sets.]

figure : performance of sequence-specific minimizers, optimized and tested on human chromosome with w = (left) and w = (right).

s . building sequence-specific minimizers on random sequences

to further show that the human reference genome is highly repetitive and that the construction of efficient sequence-specific minimizers is hard in such a setup, we run the algorithms to generate sequence-specific minimizers on a random sequence of length , similar to that of chromosome . figure shows the performance of layered polar sets and the fixed interval sampling method. compared with figure , we observe that it is much easier to build efficient minimizers on a random sequence, and to match the theoretical lower bound, even though the reference sequences have similar lengths.

[figure, two panels. x-axis: value of k (w = ); y-axis: density factor. legend: random minimizers, lower bound, fixed interval sampling, miniception, layered polar sets.]

figure : performance of sequence-specific minimizers, optimized and tested on a -long random sequence with w = (left) and w = (right). this is different from figure : here the specific density is measured on the same sequence the minimizers are optimized on.
“single-subject studies”-derived analyses unveil altered biomechanisms between very small cohorts: implications for rare diseases

dillon aberasturi (i), nima pouladi (i), samir rachid zaim, colleen kenost, joanne berghout, walter w. piegorsch*, yves a. lussier*

center for biomedical informatics and biostatistics (cb ), dept. of medicine, graduate interdisciplinary prog. in statistics & data science, ctr for appl. genetics and genomic medic., bio institute; university of arizona, tucson, az, usa.
dept of biomedical informatics; university of utah, ut, usa
* to whom correspondence should be addressed; i: these authors contributed equally

abstract
motivation: identifying altered transcripts between very small human cohorts is particularly challenging and is compounded by the low accrual rate of human subjects in rare diseases or sub-stratified common disorders. yet, single-subject studies (s ) can compare paired transcriptome samples drawn from the same patient under two conditions (e.g., treated vs pre-treatment) and suggest patient-specific responsive biomechanisms based on the overrepresentation of functionally defined gene sets. these improve statistical power by: (i) reducing the total features tested and (ii) relaxing the requirement of within-cohort uniformity at the transcript level. we propose inter-n-of-1, a novel method, to identify meaningful biomechanism differences between very small cohorts by using the effect size of “single-subject-study”-derived responsive biomechanisms.
results: in each subject, inter-n-of-1 requires applying the previously published s -type n-of-1-pathways mixenrich to two paired samples (e.g., diseased vs unaffected tissues) for determining patient-specific enriched gene sets: odds ratios (s -or) and s -variance using gene ontology biological processes. to evaluate small cohorts, we calculated the precision and recall of inter-n-of-1 and that of a control method (glm+egs) when comparing two cohorts of decreasing sizes (from vs to vs ) in a comprehensive six-parameter simulation and in a proof-of-concept clinical dataset. in simulations, the inter-n-of-1 median precision and recall are > % and > % in cohorts of vs distinct subjects (regardless of the parameter values), whereas conventional methods outperform inter-n-of-1 at sample sizes vs and larger. similar results were obtained in the clinical proof-of-concept dataset.
availability: r software is available at lussierlab.net/bssd.
contact: lussier.y@gmail.com, piegorsch@math.arizona.edu

introduction
empirical evidence unveils a methodological gap when comparing transcriptomic differences in biomechanisms within very small human cohorts due to variations in heterogeneity, uncontrolled biology (age, gender, etc.), and diversity of environmental factors (nutrition, sleep, etc.) (griggs, et al., ; liu, et al., ; schurch, et al., ; soneson and delorenzi, ). paradoxically, rare diseases are common: % prevalence in the population (elliott and zurynski, ) and % of children who attend a disability clinic (guillem, et al., ). as timely and sizeable patient accrual for rare or micro-stratified diseases is prohibitive, there lies an opportunity for empowering clinical researchers with feasible statistical designs that enable smaller cohorts.

table . abbreviations
abbreviation | term
deg | differentially expressed gene
inter-n-of-1 | “responsive pathway effect size”-based cross cohort comparison
egs | enriched gene set of responsive pathways between two conditions within a single-subject study (e.g., cancer vs control tissue)
fet | fisher’s exact test
fdr | false discovery rate
geo | gene expression omnibus
glm+egs | generalized linear models with enriched gene sets
go-bp | gene ontology biological processes
gs | gene set (calculated from go-bp)
its | information theory-based similarity between go-bps
log fc | log base transformation of the transcripts fold-change
mle | maximum likelihood estimator
or , s -or | odds ratio: s -prioritized transcripts enriched in go-bp
pca | principal component analysis
pik3ca | phosphatidylinositol- , -bisphosphate -kinase catalytic subunit alpha gene; hgnc:
rsem | rna-seq normalization by expectation maximization
s | single-subject studies
tp53 | tumor protein p53 gene; hgnc:

on the other hand, well-controlled isogenic studies (e.g., cellular models) can yield differentially expressed genes (degs) between two small samples. we and others have applied the power of the isogenic framework through the comparison of two sample transcriptomes from one subject in single-subject studies (s ). while identifying transcript-level differences between two samples remains inaccurate (vitali, et al., ; zaim, et al., ), gene set-level (pathway/biosystem) s have been shown to accurately discover altered biomechanisms from paired transcriptome samples drawn from the same patient under two conditions (e.g., tumor-normal, treated-untreated) (ozturk, et al., ; vitali, et al., ). the results of the s gene set analyses have been validated in various contexts such as cellular/tissular models (balli, et al., ; gardeux, et al., ; gardeux, et al., ), retrospectively in predicting cancer survival (li, et al., ; schissler, et al., ; schissler, et al., ), circulating tumor cells (schissler, et al., ), biomarker discovery simulations (zaim, et al., ), and therapeutic response (li, et al., ). despite the success of these models in deriving effect sizes and statistical significance in single-subject studies of transcriptomes, these samples are isogenic or quasi-isogenic, and thus the results do not necessarily generalize to a group of subjects (cohort-level signal).
to address the latter, we reported that determining single cohort-level significance by combining gene set signal (e.g., pathways) from s analyses can be more accurate than conventional deg analyses followed by gene set enrichment analysis (gsea) (subramanian, et al., ) in small cohort simulations (zaim, et al., ) and in previously published datasets (li, et al., ). however, these methods still used simplistic cohort-level assumptions of centrality (median) and did not explore comparing signal divergence between two cohorts.

to address the methodological gap, we therefore hypothesized that single-subject transcriptomic studies of gene sets increase the transcriptomic signal-to-noise ratio within subject and lead to an improved signal between small patient cohorts, as small as vs subjects per group. while technically different from the analysis of the standard two-factor interactions in conventional cohort statistics, the proposed framework is conceptually related to a statistical interaction in that a within-single-subject analysis (subject-specific transcriptome dynamics) is followed by within-group agreement for characterizing factor (e.g., cancer vs paired normal tissue) and between-group comparisons (factor ; e.g., responsive vs unresponsive to therapy). the strategy improves the statistical power by: (i) reducing the total features tested (gene set-level rather than transcript-level), (ii) relaxing the requirement of within-cohort uniformity at the transcript level as the coordination is conducted at the gene set-level, and (iii) reducing confounding factors through the paired sample design of s -analyses within subject. the novel bioinformatic method identifies meaningful biomechanism differences between very small cohorts by using single-subject-study-derived effect sizes for gene sets.
additionally, we show through both a simulation and a real data case example that within cohorts of varying sizes ( to subjects) this method outperforms traditional methods, which are based on generalized linear modeling followed by common gene set enrichment or overlap analysis. we then apply this novel method to the effect sizes of two different single-subject analyses to illustrate the flexibility and utility of the proposed method for a variety of inputs.

methods

fig. . overview of the gene set analyses (inter-n-of-1) that leverage effect sizes and variances from single-subject studies to conduct subsequent group comparisons. single-subject studies details are provided in figure .

table defines abbreviations and figure provides an overview of the proposed new method (inter-n-of-1). to motivate the development of transcriptome analytics between very small human samples, which are heterogeneous by nature, we first demonstrate the limitations of a generalized linear model for identifying degs between tp53 and pik3ca breast cancer samples. next, we describe two new methods, inter-n-of-1 (mixenrich) and inter-n-of-1 (noiseq), and compare them to a generalized linear model (implemented in limma) (i) in simulation studies with parameters estimated from empirical analyses of real datasets and (ii) in a proof-of-concept study of breast cancer subjects. also, the evaluation of the proposed new methods is conservative as it is conducted against a reference standard built with a distinct generalized linear model (edger) using all samples.

datasets
we obtained , gene sets from gene ontology biological processes (go-bp) (downloaded on / / ). to determine realistic simulation parameters, we used two datasets (i and ii) that are composed of paired samples. (i) we downloaded estrogen-stimulated and unstimulated mcf7 breast cancer cell sample replicates provided by liu, et al. ( ) from the gene expression omnibus (geo) (edgar, et al., ) on / / . the sequences within the sequence read archive files for the m reads of mcf7 cells were aligned using hg as the reference genome and the resulting rna-seq counts were processed into fpkm units (fragments per kilobase of transcript per million mapped reads). (ii) we obtained samples of paired breast cancer tumor and tissue-matched normal rna-seq expression profiles (factor ) from the same subjects (n= ) from the cancer genome atlas (tcga) breast invasive carcinoma data collection (cancer genome atlas, ; ciriello, et al., ) using the genomic data commons tools (grossman, et al., ) (obtained / / ). as a proof-of-concept application of the proposed methods, we sampled small groups of subjects from a subset of the tcga breast cancer dataset comprising subjects with somatic (tumor) mutations in either tp53 (n = ) or pik3ca (n = ), but not both. tp53 and pik3ca (factor ) have been reported as the two most commonly mutated genes observed in breast cancer and differ in (i) expression patterns (cancer genome atlas, ), (ii) cancer subtypes (van keymeulen, et al., ), (iii) clinical outcomes (kim, et al., ), and (iv) responsiveness to specific therapies (andre, et al., ). these data were downloaded using the r package tcga2stat (n= cases; files) (wan, et al., ).
data access and preparation: (a) for the single-subject studies, we applied a three-stage filtering of the transcripts in which, within each sample pair, (i) we removed all transcripts with mean expression less than counts, (ii) found the union of all genes remaining amongst all pairs, and (iii) excluded all genes not present in the union of these two steps ( , genes remaining). we added to expression counts to eliminate “zeros”. (b) for the generalized linear model-based analyses, we applied a different filtering process to the raw data where we eliminated all the transcripts with counts for each subject and then calculated the coefficient of variation (cv) for each transcript. we selected the transcripts with cvs within the top percentile of those remaining ( , genes remaining).

proposed s -anchored responsive pathway effect size methods for comparing very small human cohorts
the following paragraphs develop the methodology by which we conduct single-subject studies prior to cross-cohort comparison to discover the effect size of responsive pathways in each subject and increase the feature signal-to-noise ratio. table summarizes the variables.

identification of overrepresented gene sets for each subject: as illustrated in panel a of figure , we applied to each of the tumor-normal pairs the n-of-1-pathways mixenrich method that we had previously developed and validated (berghout, et al., ; li, et al., ; zaim, et al., ). briefly, this method models the absolute value of the log transformed fold change (fc) for each gene across the two paired transcriptomes being studied and uses a probabilistic gaussian mixture to assign a posterior probability that the gene is differentially expressed between tumor and normal conditions. within the simulation, prioritized transcripts were defined as those with a posterior probability of being differentially expressed higher than . .
within the tcga breast cancer dataset, said definition included having both a posterior probability of being differentially expressed higher than . and an absolute-valued log fc higher than log ( . ). genes were assigned to gene sets using the gene ontology (ashburner, et al., ) biological process (go-bp) hierarchy, filtered to those terms with gene set size between - genes, with subsumption to maximize interpretability. these degs were used to determine the overrepresented, or enriched, gene sets of interest using a two-sided fisher’s exact test (fet) (fisher, ) with a false discovery rate (fdr) of %. the output of this analysis generated lists of gene sets, with each list representing a single subject’s tumor-normal pair and comprising go-bp terms accompanied by contingency table counts which were used to calculate an odds ratio (s -or) as the effect size.

table . variable definitions
variable | definition
$g_{gs,k_j}$ | the number of degs within gene set gs for subject $k_j$ in cohort $K$
$g'_{gs,k_j}$ | the number of genes not differentially expressed in gene set gs for subject $k_j$ in cohort $K$
$h_{gs,k_j}$ | the number of degs not in gene set gs for subject $k_j$ in cohort $K$
$h'_{gs,k_j}$ | the number of genes neither differentially expressed nor in gene set gs for subject $k_j$ in cohort $K$
$N$ | number of gene sets
$P(\cdot)$ | probability of event $(\cdot)$ occurring
$p_{gs,\Delta}$ | p-value for gene set gs produced by the inter-n-of-1
$Q_{gs,k_j}$ | continuity-corrected log s -or corresponding to gene set gs for subject $k_j$ in cohort $K$
$\bar{Q}_{gs,K}$ | the mean continuity-corrected log s -or of gene set gs across the subjects in cohort $K$
$S_K$ | the number of subjects in a cohort $K$ (e.g. those with a pik3ca or with a tp53 somatic mutation)
$\theta_K$ | expected value of the continuity-corrected log s -or for the molecular-defined cohort $K$
$\operatorname{var}(Q_{gs,k_j})$ | variance associated with the continuity-corrected log s -or corresponding to gene set gs for subject $k_j$ in cohort $K$
$W_{gs,\Delta}$ | the test statistic for the inter-n-of-1 for gene set gs
$Z$ | a standard normal random variable
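the per-subject enrichment step described above (a two-sided fisher’s exact test on each gene set’s 2 x 2 table, followed by an fdr adjustment) can be sketched as follows. this is a minimal python illustration rather than the authors’ r software: the function names and the example gene set labels are hypothetical, the fdr adjustment is assumed to be benjamini-hochberg, and the two-sided p-value uses the common convention of summing all tables no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(g, g_not, h, h_not):
    """Two-sided Fisher's exact test for the table [[g, g_not], [h, h_not]].

    g     : DEGs inside the gene set
    g_not : non-DEGs inside the gene set
    h     : DEGs outside the gene set
    h_not : non-DEGs outside the gene set
    """
    set_size = g + g_not           # genes in the gene set
    n_deg = g + h                  # all DEGs
    n = g + g_not + h + h_not      # all genes tested

    def prob(x):
        # Hypergeometric probability of x DEGs in the gene set, margins fixed.
        return comb(n_deg, x) * comb(n - n_deg, set_size - x) / comb(n, set_size)

    lo = max(0, set_size - (n - n_deg))
    hi = min(set_size, n_deg)
    p_obs = prob(g)
    # Sum every table that is no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):   # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical contingency counts (g, g', h, h') for two gene sets in one subject.
tables = {"gene_set_a": (3, 1, 1, 3), "gene_set_b": (2, 2, 2, 2)}
raw_p = {name: fisher_exact_two_sided(*counts) for name, counts in tables.items()}
adj_p = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
```

only the standard library is used so that the sketch stays self-contained; a real analysis over thousands of go-bp terms would more likely rely on an established statistics package.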
we also applied noiseq to each of the tumor-normal pairs (tarazona, et al., ) as shown in panel b of figure . for these applications of noiseq with no replicates, the “pnr” and “v” parameters were set to . and . to prevent the method from producing any errors related to setting the size of the inherent multinomial distributions to an integer too large for r to handle. the criteria for identifying genes as differentially expressed for noiseq were the same as those used for n-of-1-mixenrich. as shown in panel c of figure (next page), we subsequently used this information to construct contingency tables and calculate the natural log odds ratio for inter-n-of-1. this process generated two different applications of inter-n-of-1, which use n-of-1-mixenrich and noiseq respectively to conduct the single-subject analyses preceding the cohort comparison.

comparing enriched gene sets across distinct cohorts: we first combined the data within two distinct cohorts into single statistics whose null reference distributions were at least approximately normal. these within-cohort statistics were contrasted via scaled subtraction in a manner reminiscent of the two-sample t-test to establish the difference in gene set enrichment between the two cohorts. let $gs \in \{1, \dots, N\}$ index the specific gene set being studied, where $N$ is the total number of gene sets; let $k_j$ index a specific subject in cohort $K$, which is composed of $S_K$ individuals with subjects numbered $j \in \{1, \dots, S_K\}$; and let $K \in \{A, B\}$ index a specific cohort. let $\Delta$ signify quantities relating to the difference between the two cohorts. the inter-n-of-1 analytics for combining information within a cohort considers the abstract contingency table shown as table , where the cell counts are representative for the gene set indexed by $gs$ and the subject indexed by $k_j$.
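the within-cohort averaging and scaled subtraction described above can be sketched in a few lines of python. this is an illustration under stated assumptions, not the authors’ implementation: it assumes a continuity correction of 0.5 per cell, the usual woolf variance for a log odds ratio, and a normal reference distribution for the contrast; all function names are hypothetical.

```python
from math import erfc, log, sqrt

CONTINUITY = 0.5  # assumed continuity correction added to every cell


def log_odds_ratio(g, g_not, h, h_not, c=CONTINUITY):
    """Continuity-corrected natural log odds ratio and its Woolf variance."""
    q = log(((g + c) * (h_not + c)) / ((h + c) * (g_not + c)))
    var = 1 / (g + c) + 1 / (g_not + c) + 1 / (h + c) + 1 / (h_not + c)
    return q, var


def cohort_mean(tables):
    """Mean log odds ratio over subjects and the variance of that mean."""
    stats = [log_odds_ratio(*t) for t in tables]
    s = len(stats)
    mean_q = sum(q for q, _ in stats) / s
    var_mean = sum(v for _, v in stats) / s ** 2
    return mean_q, var_mean


def inter_cohort_test(tables_a, tables_b):
    """Scaled difference of cohort means and its two-sided normal p-value."""
    mean_a, var_a = cohort_mean(tables_a)
    mean_b, var_b = cohort_mean(tables_b)
    w = (mean_a - mean_b) / sqrt(var_a + var_b)
    p = erfc(abs(w) / sqrt(2))  # equals 2 * P[Z > |w|] for standard normal Z
    return w, p


# Hypothetical counts (g, g', h, h') for one gene set, three subjects per cohort.
cohort_a = [(10, 10, 50, 930)] * 3   # gene set enriched in each subject
cohort_b = [(1, 19, 59, 921)] * 3    # gene set not enriched
w_stat, p_value = inter_cohort_test(cohort_a, cohort_b)
```

each subject contributes its own variance, so the cohort comparison does not need a between-subject variance estimate, which is what makes the contrast usable at very small cohort sizes.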
we obtain degs from the application of a chosen single-subject analysis method (either n-of-1-mixenrich or n-of-1-noiseq) for a specific gene set $gs$ in individual $k_j$ of cohort $K$ to fill out the contingency table with counts in the format shown in table .

table : notation for 2 x 2 contingency table cross-classifying deg status with gene set status
 | deg | not deg
gene set gs | $g_{gs,k_j}$ | $g'_{gs,k_j}$
not gene set gs | $h_{gs,k_j}$ | $h'_{gs,k_j}$

fig. . overview of two single-subject study methods conducted from one sample per condition without replicate generating effect sizes and variance for each gene set. we apply single-subject studies to each subject to identify either prioritized transcripts (panel a) or degs (panel b) between paired tumor-normal samples. we identify patient-specific enriched gene sets and associated effect sizes in the form of natural log odds ratios through a fet (panel c). each effect size is approximately normally distributed with known variance and mean, simplifying subsequent analyses between cohorts. the gene set-level variance enables the extraction of more information from each individual subject than typical variance estimators that work across subjects and thereby leads to increased statistical power. the n-of-1-mixenrich method was previously described and validated (berghout, et al., ; li, et al., ; zaim, et al., ). noiseq is also considered as an alternative meriting evaluation because of its performance in prior single-subject studies evaluations (zaim, et al., ).

we apply a continuity correction by adding 0.5 to each of the cells in the contingency table to provide a small-sample adjustment in the odds ratio (agresti and kateri, ). the natural log s -or, denoted as $Q_{gs,k_j}$, equation (1), is approximately normally distributed with variance $\operatorname{var}(Q_{gs,k_j})$ given in equation (2) (woolf, ).

$$Q_{gs,k_j} = \ln\left(\frac{(g_{gs,k_j} + 0.5)\,(h'_{gs,k_j} + 0.5)}{(h_{gs,k_j} + 0.5)\,(g'_{gs,k_j} + 0.5)}\right) \qquad (1)$$

$$\operatorname{var}(Q_{gs,k_j}) = \frac{1}{g_{gs,k_j} + 0.5} + \frac{1}{g'_{gs,k_j} + 0.5} + \frac{1}{h_{gs,k_j} + 0.5} + \frac{1}{h'_{gs,k_j} + 0.5} \qquad (2)$$

we average the $Q_{gs,k_j}$ values within their respective cohorts to obtain the average ln ors

$$\bar{Q}_{gs,K} = \frac{1}{S_K}\sum_{j=1}^{S_K} Q_{gs,k_j} \sim N\left(\theta_K,\ \sum_{j=1}^{S_K}\frac{\operatorname{var}(Q_{gs,k_j})}{S_K^{2}}\right) \qquad (3)$$

when the null hypothesis $H_0: \theta_A = E[\ln(\mathrm{OR}_A)] = E[\ln(\mathrm{OR}_B)] = \theta_B$ is true, then

$$W_{gs,\Delta} = \frac{\bar{Q}_{gs,A} - \bar{Q}_{gs,B}}{\sqrt{\operatorname{var}(\bar{Q}_{gs,A}) + \operatorname{var}(\bar{Q}_{gs,B})}} \sim N(0, 1) \qquad (4)$$

at least approximately. the corresponding two-sided p-value for gene set $gs$ is

$$p_{gs,\Delta} = 2 \cdot P\left[Z > |W_{gs,\Delta}|\right] \qquad (5)$$

where $Z$ represents a standard normal random variable. an fdr adjustment via the benjamini-hochberg method (benjamini and hochberg, ) is then applied to the $p_{gs,\Delta}$ across all the go terms tested in the particular application. to ensure that the method positively identifies gene sets that are enriched in at least one of the cohorts, we set all fdr-adjusted p-values to . if both cohort means of the log odds ratios are negative. this step ensures interpretable results since impoverished go terms with significantly fewer-than-expected degs are not well understood from a biological context.

description of the generalized linear models and application of inter-n-of-1 methods for small cohort comparison and their evaluation in the breast cancer data

table . three experimental designs used for the generalized linear models. in the analysis of subsets of the tcga breast cancer data, genes were declared differentially expressed if their abs(log fc) > log ( . ) and their fdr-adjusted p-value < . . within the simulation, genes were declared differentially expressed if their fdr-adjusted p-values < . .
name | level | what is compared | results
simple | transcript | tp53_tumoral – pik3ca_tumoral | fig. panel a
interaction | transcript | (tp53_tumoral – tp53_normal) – (pik3ca_tumoral – pik3ca_normal) | fig. panel b
glm+egs | gene set | 1) find degs using interaction contrast; 2) enrichment via fet | fig. -

generalized linear model (glm) designs: for the cohort analyses, we applied a generalized linear model as implemented in limma (smyth, et al., ). preceding application of the generalized linear model, we performed trimmed mean of m values (tmm) normalization (robinson and oshlack, ) on the data pre-processed for cohort analysis. we applied the voom normalization (law, et al., ) via the limma function voomwithqualityweights in r. we used the three different designs described in table for these generalized linear model-based analyses, which were called the simple design, the interaction design, and glm+egs respectively. we blocked by subject for each of these glm designs, and all fdr adjustments of p-values were done using the benjamini-hochberg false discovery rate (fdr) method (benjamini and hochberg, ).

reference standard construction of enriched pathways using edger generalized linear model followed by gene set enrichment: after pre-processing for cohort analyses, we applied generalized linear models as implemented in the r software package edger (robinson, et al., ) at fdr< % to the entire tcga breast cancer dataset to construct three reference standards corresponding to the three designs discussed in table .
each reference standard evaluated the analyses of the tcga breast cancer cohorts (tp53 vs pik3ca) and used the same filter thresholds for classifying transcripts as differentially expressed. in the glm followed by enrichment of gene set (glm+egs), the prioritized interacting transcripts are followed by a fet at fdr< %.

subsampling of the tcga breast cancer cohort and application of glm and inter-n-of-1 methods: for each of the values $S_A = S_B = S$ ∈ { , , , , , , } we ran subsamples of the total cohorts where we randomly selected without replacement $S$ subjects with tp53 and $S$ subjects with pik3ca, without requiring non-redundancy of the random samplings. we applied the glm+egs method and the n-of-1-mixenrich and noiseq versions of the inter-n-of-1 method to each of the selected cohorts (tp53 vs pik3ca). for each of the three methods, fdr< % adjustment of the p-values was done with respect to all , go terms tested. for random subsamples of size $S_A = S_B = S$ ∈ { , , ,… } of subjects, we applied the two transcript-level analyses using generalized linear models as implemented in limma. the performance of these transcript-level applications of limma was assessed and illustrated in figure to demonstrate the necessity and benefit of transforming from transcript-level to gene set-level analyses.

accuracy measures within tcga breast cancer dataset: for each method, we calculated the precision and recall using the following functions. when a method produced no positive predictions for the gene sets, we assigned values of zero to the precision and recall of the given method. otherwise, we calculated the precision and recall using powers' calculations with adjustments of adding . to numerators and . to denominators to avoid divisions by zero (powers, ). in addition, we have previously published extensions to conventional accuracy scores that we termed "similarity venn diagrams" and "similarity contingency tables" (gardeux, et al., ).
in these approaches, identical as well as highly similar go-bp terms between the prediction set and the reference standard account for true positive results. we calculated the precision and recall of the gene set level analyses using information theoretic similarity (its) (tao, et al., ). for precision, we included in the intersection those predicted go-bps which had an its similarity of . or higher with any of the go terms in the reference standards, while the denominator remained as all predicted go-bps. similarly, for recall we included in the intersection the reference standard go-bps which had an its similarity score of . or higher with any of the predicted go terms, while the denominator remained as the total positive reference standard go-bp terms. of note, we previously reported that this its> . similarity criterion is highly conservative since ~ . pairs of go-bp terms are similar at its> . ( , pairs among , , non-identical combinations of go-bps) (gardeux, et al., ).

simulation of small cohort comparisons to compare glms to inter-n-of-1 methods

table . simulation parameter values. only the balanced cohort size and the proportion of subjects with coordinated degs were varied. all other parameters were held constant. datasets were generated for each parameter configuration leading to a total of datasets.
parameters | how estimated | values
control samples | randomly sample without replacement from tcga breast cancer normal samples | na
log fc distribution of non-differentially expressed genes | 1) calculate log fcs of randomly paired mcf7 unstimulated breast cancer samples; 2) split log fcs into deciles by baseline expression (a: all deciles containing are combined into one category); 3) sample with replacement from decile containing transcript name in first random pair | na
gamma parameters of log fcs of degs | 1) run n-of-1-mixenrich (fig. ) on within-subject tumor-normal pairs in tp53 and pik3ca cohorts to identify degs; 2) mles for gamma parameters fit to absolute log fcs of degs (a: used egamma function in envstats r package (millard, et al., )) | scale parameter = . ; shape parameter = .
proportion of degs in enriched go-bps | 1) split enriched go terms from edger reference standard into deciles based on size; 2) calculated degs median proportion for deciles containing go-bps (size: , ) | (go size ): . ; (go size ): .
proportion of subjects with coordinated degs | 1) split log fcs of degs within edger reference standard into categories a) > . , b) < - . , or c) neither; 2) assign the maximum proportion of subjects per categories (a) or (b) for each transcript; 3) find the median proportion of subjects across all transcripts | . , . , .
balanced cohort size | na | , , , , ,
go-bp terms | 1) enriched: go: ( genes); 2) enriched: go: ( genes); 3) control: go: ( genes); 4) control: go: ( genes) | na

data generation for simulation: the overall scheme for the simulation began by constructing two cohorts of paired tumor-normal rna-seq expression profiles. we calculated simulation parameters to most realistically create these expression values as described below (table ). to calculate statistical interactions between two factors, we had to design two cohorts of subjects, with each subject having two sample conditions. we sought to recreate the tcga breast cancer conditions with these parameters, using the observed median values in the tcga dataset as the medians of the simulation parameters and varying the parameters around said medians. the tcga dataset did not comprise repeated samples in the same condition, and thus we utilized the unstimulated mcf7 cell lines with seven replicates to estimate the variation expected between two paired normal tissues.
in our previous pathway expression studies (yang, et al., , and data not shown) where we compared two cohorts, about two-thirds of the observed responsive gene set patterns, as shown in figure , consisted of a gene set responsive in one subject cohort and unresponsive in the other cohort.

these paired tumor-normal samples, which represent within-subject samples, were constructed to have a proportion of the transcripts with altered expression between the tumor and normal states. by randomly sampling without replacement, we generated the normal tissue samples for these pairs after filtering out all genes in the tcga breast cancer normal tissues that were not present within the mcf7 breast cancer dataset (leaving , genes). for each sampled normal breast tissue sample, we generated transcript expression for a paired breast cancer sample of that subject rather than sampling the corresponding breast cancer sample from the tcga data. to produce a paired tumor expression value for a non-differentially expressed gene, we first followed the steps outlined in table to randomly generate empirical log fold changes (log fc) and then we set the gene’s expression as the product of the gene’s paired normal expression and the log base raised to the exponent of the log fc value. to generate the expression value for an altered transcript in a tumor sample, we randomly sampled a log fc from a gamma distribution with parameters described in table and set said gene’s expression to the product of the gene’s normal expression and the log base raised to the exponent of the log fc value.
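the generation of a paired tumor profile described above can be sketched as follows. this is a simplified python illustration and not the authors’ simulation code: it assumes a log base of 2, draws non-deg log fold changes from a single empirical pool instead of per-decile pools, and uses placeholder gamma shape and scale values because the fitted values are not legible in the text; all names are hypothetical.

```python
import random


def simulate_tumor_expression(normal, is_deg, null_log_fcs, shape, scale,
                              rng, base=2.0):
    """Generate a paired tumor profile from a normal expression profile.

    normal       : normal-tissue expression values, one per gene
    is_deg       : booleans marking genes seeded as differentially expressed
    null_log_fcs : empirical log fold changes for non-DEGs (single pool here)
    shape, scale : gamma parameters for the positive log fold changes of DEGs
    rng          : random.Random instance, passed in for reproducibility
    base         : logarithm base of the fold changes (assumed to be 2)
    """
    tumor = []
    for expr, deg in zip(normal, is_deg):
        if deg:
            # DEGs get a strictly positive log fold change from a gamma draw.
            log_fc = rng.gammavariate(shape, scale)
        else:
            # Non-DEGs get an empirical log fold change, sampled with replacement.
            log_fc = rng.choice(null_log_fcs)
        tumor.append(expr * base ** log_fc)
    return tumor


# Placeholder inputs: three genes, the first and third seeded as DEGs.
rng = random.Random(0)
normal_profile = [100.0, 200.0, 50.0]
deg_flags = [True, False, True]
tumor_profile = simulate_tumor_expression(
    normal_profile, deg_flags, null_log_fcs=[0.0], shape=2.0, scale=0.5, rng=rng
)
```

passing the random generator explicitly keeps each simulated dataset reproducible, which matters when many datasets are generated per parameter combination.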
we generated only positive log fcs for the degs to improve the glm's ability to detect them as differentially expressed across subjects. we specified a gamma distribution for these positive log fcs since all the absolute-valued log fc distributions we examined possessed significant right-skew.

we chose to evaluate the methods using the go terms described in table . in simulation cohort a, two of these go-bps would be seeded with altered transcripts, thus enriched, and two would serve as controls. in cohort b, none of the go terms were enriched, thereby setting up an interaction effect between the within-subject and between-subject factors. within the two enriched go terms in cohort a, we randomly selected the proportions of genes specified in table to have altered expression. we used bernoulli random variables with probabilities of success outlined in table to designate subjects within cohort a that would share all their randomly selected degs. the remaining subjects within cohort a had all their degs randomly vary across subjects. we hypothesized that the percentage of subjects with shared altered transcripts would strongly influence the performance of the glm+egs method since limma assumes the presence of coordination of gene expression across subjects. thus, we varied the expected proportion of subjects with shared degs within cohort a ( . , . , . ) along with the sizes of the two cohorts ( , , , , , ) while holding all other parameters constant. we consequently generated datasets for each parameter combination leading to a total of datasets for our downstream simulations.

data preprocessing within simulation: (a) for the generalized linear model analyses, we preprocessed the simulated data by removing all genes with mean expression values less than across all the simulated transcripts and subsequently added to each of the expression counts.
(b) for the single-subject analyses, we applied a two-stage pre-processing method in which we (i) removed all the transcripts with mean expression less than within each sample-pair and (ii) found the union across all pairs of genes remaining and eliminated any genes not contained within. the remaining genes for the single-subject analyses then had added to their expression counts to eliminate any remaining zeroes.

application of methods to simulated data: the glm+egs and the two versions of the inter-n-of- method were applied to each of the generated datasets as described previously. the benjamini-hochberg false discovery rate (fdr) (benjamini and hochberg, ) adjustments of the p-values generated for each technique were performed with respect to only the selected go terms that were tested for each combination of dataset and method. go-bps were declared positive for a method if their associated fdr adjusted p-values for said method were below . .

accuracy measures within the simulation: to estimate the overall performance of each method within the simulation, we calculated the number of true positives, true negatives, false positives, and false negatives occurring within the enriched and control go terms across all resamplings of each combination of parameters. when any of the methods made no positive predictions for the gene sets, we artificially assigned values of to the precision and recall of the given method. otherwise, we calculated the precision and recall through the use of their traditional formulae (powers, ). accuracy scores are thus available for each combination of parameters for each go term size ( and ).

results

fig. . at the transcript level, limited accuracies of generalized linear models for calculating conventional simple contrast or interactions in small heterogenic breast cancer cohorts.
while glms can deliver degs in small cohorts for isogenic cellular and animal models, we recapitulate in the tcga datasets that small human cohorts are underpowered statistically. we calculated the precision and recall scores associated with each of the random sub-samplings of cohort sizes vs , vs , …, to vs for tp vs pik ca and report median accuracies. the left panel used a simple linear contrast of the tumor levels on the molecular subtypes. the right panel used a linear contrast corresponding to the interaction between the molecular subtypes (tp vs pik ca) and tumor status (breast cancer vs normal breast). discoveries were performed with limma while the reference standard was constructed with edger.

we showed that a two-step process, in which we first enrich the signal-to-noise ratio by applying s -analyses to paired data in single subjects before combining across subjects, can capture stable signal and yield results comparable to those in the reference standard, even as cohort size decreases. by contrast, traditional techniques for identification of gene set-level biomechanisms that differentiate between two cohorts rapidly lose power and yield unreliable results as the sample size decreases below subjects per cohort. the transcriptomic analyses of tcga data in figure recapitulate that small human cohorts are particularly difficult to analyze using glms due to their heterogenic conditions and lack of controlled environment. thus, small human cohorts present a stark contrast to isogenic, controlled experiments in cell lines or animal models, where the high signal-to-noise ratio makes transcriptomic analyses possible for very small sample sizes. these unsurprising results provide the justification for the development of the proposed glm+egs and inter-n-of- methods conducted at the gene set level. they also attest to the intrinsic lack of signal within the tcga breast cancer data for such transcriptomic analyses.
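the accuracy measures used here can be sketched as follows: precision and recall of one set of discoveries against a reference standard, and their medians across random sub-samplings. this is a minimal sketch rather than the authors' code. the function names are hypothetical, and the value assigned when a method makes no positive prediction is left as a required argument (`no_positive_value`) because the value used in the study is not legible in the methods text.

```python
from statistics import median


def precision_recall(predicted, reference, no_positive_value):
    """Precision and recall of predicted positives against a reference standard.

    predicted         : iterable of items declared positive by a method
    reference         : iterable of items positive in the reference standard
    no_positive_value : score assigned to both measures when nothing is
                        predicted positive (study-specific; an assumption here)
    """
    predicted, reference = set(predicted), set(reference)
    if not reference:
        raise ValueError("reference standard must contain at least one positive")
    if not predicted:
        return no_positive_value, no_positive_value
    tp = len(predicted & reference)   # true positives
    fp = len(predicted - reference)   # false positives
    fn = len(reference - predicted)   # false negatives
    return tp / (tp + fp), tp / (tp + fn)


def median_scores(runs, reference, no_positive_value):
    """Median precision and median recall across random sub-samplings."""
    scores = [precision_recall(r, reference, no_positive_value) for r in runs]
    return median(s[0] for s in scores), median(s[1] for s in scores)
```

for example, `median_scores([{"a", "b"}, {"a", "x"}, set()], {"a", "b", "c"}, 0.0)` summarizes three sub-samplings, the last of which made no discoveries and therefore receives the assigned value for both measures.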
the performance results for subsets of the tcga breast cancer data shown in figure establish that the two versions of the proposed inter-n-of- method degrade more gracefully in performance with decreasing cohort size than traditional generalized linear model-based methods, thereby allowing them to outperform for smaller cohort sizes. figure shows that the niche where the inter-n-of- methods outperform in terms of median precision and recall extends to all cohort sizes below vs , with the glm+egs method achieving higher median performance scores for vs and above. the sizes of the crosses suggest a further boon for the developed methods beyond this better ‘on average’ performance. the inter-n-of- methods tend to have very small tight crosses suggesting low variation in performance and greater consistency. the glm+egs method on the other hand possesses very large crosses until cohort size vs , suggesting wild swings in performance across the different subsets evaluated. in addition, even the gene set-level glm+egs method outperforms transcript-level glm analyses (fig. vs fig. ). figure also establishes that the n-of- -mixenrich version of the inter-n-of- method outperforms the noiseq version in terms of consistency and median precision and recall. although these differences remain small for larger cohort sizes of vs and above, they increase gradually with decreasing cohort sizes.

fig. . at the gene set–level, two inter-n-of- methods outperform a glm followed by enrichment in small heterogenic human cohorts. while inter-n-of- methods (inter-n-of- (noiseq) and inter-n-of- (mixenrich)) outperform the glm followed by enrichment in gene sets for sample sizes of vs and smaller, the glm+egs shows better accuracy at sample sizes vs and above. of note, glm+egs shows large variations in performance measures within the samples of size vs suggesting that despite its improved median accuracy it remains unreliable at that level. in all cases, the discovery of differentially responsive gene sets (inter-n-of- methods) or enriched gene sets (glm+egs) substantially outperforms the accuracies of transcript-level analyses shown in fig. . while the inter-n-of- and glm+egs methods identify related signals, the reference standard designed by a distinct glm+egs approach favors the accuracies of the latter. in addition, inter-n-of- methods can assess the effect size of responsive gene sets in each subject, which can be illustrated as box plots of gene set response. in contrast, glm+egs methods are limited to a single description of over-representation calculated on interacting transcripts of the entire study. we calculated the precision and recall scores associated with each of the random sub-samplings of cohort sizes vs , vs , vs , vs , vs , vs , vs for tp and pik ca subjects with the glm+egs and inter-n-of- methods: (i) inter-n-of- (noiseq), and (ii) inter-n-of- (mixenrich). the arms extend from the lower quartile to the upper quartile of the respective performance measure, and the two arms cross at the median for the precision and recall for that technique at the indicated cohort size.

the simulations indicate that the proposed inter-n-of- methods outperform glm+egs for small sample sizes within parameters derived from cancer datasets and extended to investigate other conditions. fig.
shows that the two inter-n-of- methods are unaffected by changes in the expected proportion of subjects within cohorts with shared degs since their performance scores typically oscillate randomly around a fixed point given a fixed cohort size. these fixed points come closer to the perfect score of . precision and . recall with increasing cohort size, suggesting that mainly the cohort size affects the inter-n-of- method. the n-of- -mixenrich version of the inter-n-of- method generally performs the best out of all three methods, with its precision always staying % or higher and its recall staying % or above for all parameter configurations. the noiseq version of the inter-n-of- method suffers from a higher rate of false negatives for the two smallest tested cohort sizes of and and so displays significantly less recall than the n-of- -mixenrich version of the inter-n-of- method, although it does display similar levels of precision. thus, this simulation also reveals why inter-n-of- (noiseq) did not perform as well. both cohort size and the expected proportion of subjects within groups with coordinated degs affect the performance of the glm+egs method. increasing either of these parameters significantly improves the performance of the glm+egs method, with the single exception of the vs cohort size where glm+egs produces precision and recall for all specifications of the proportion of subjects within group with coordinated degs. at the anti-conservative levels for these parameters, the glm+egs method matches the performance of the two versions of the inter-n-of- method. however, decreasing either parameter quickly leads the glm+egs method to underperform. for cohort sizes of vs and lower, the glm+egs method fails to match the performance of the two versions of the inter-n-of- method, which supports the superiority of inter-n-of- in such small sample sizes for breast cancer-like data.
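the precision and recall values reported in these results depend on which gene sets each method declares positive, which in turn rests on the benjamini-hochberg adjustment applied only to the tested go terms. the step can be sketched as below. this is a minimal illustration and not the authors' code: the function names are hypothetical, and the significance threshold is a required argument because the cutoff used in the study is not legible in the methods text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def positive_gene_sets(p_by_term, threshold):
    """Gene sets whose adjusted p-value falls below the threshold.

    p_by_term : dict mapping each tested GO term to its raw p-value
    threshold : FDR cutoff (study-specific; an assumption in any example)
    """
    terms = list(p_by_term)
    adjusted = benjamini_hochberg([p_by_term[t] for t in terms])
    return {t for t, q in zip(terms, adjusted) if q < threshold}
```

restricting the adjustment to the handful of tested go terms, rather than to all gene sets in the ontology, keeps the multiplicity penalty proportional to the hypotheses actually evaluated.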
discussion

as stated in the introduction, empirical evidence suggests the existence of a methodological gap when comparing transcriptomic differences in biomechanisms within very small human cohorts due to variations of heterogenicity, uncontrolled biology (age, gender, etc.), and diversity of environmental factors (nutrition, sleep, etc.). as expected, state-of-the-art generalized linear models decline in performance with sample sizes less than (soneson and delorenzi, ). smaller datasets require variances to be as low as those observed between technical replicates or with the isogenic conditions of cellular and animal models. yet, even in such isogenic conditions, two studies have recommended at least biological replicates for applying generalized linear models (liu, et al., ; schurch, et al., ). examining two-factor interactions in transcriptomes (cohorts × tumor status) further inflates the required sample size by a factor of (brookes, et al., ; fleiss, ; leon and heo, ). traditional cohort-based methods impose sample size requirements which simply cannot be met within the framework imposed by rare diseases, prompting the need to develop new methods.

on the other hand, we and others have shown it is possible to obtain statistical significance of gene set-level effect size measures from single samples without replicates taken in two conditions, namely single-subject studies (s ) (li, et al., ; li, et al., ; schissler, et al., ; vitali, et al., ). we have shown evidence from breast cancer studies and simulations that the s -anchored inter-n-of- methods address this methodological gap. their slow decay in performance when contrasted with the abrupt decay of glm+egs establishes the superiority of these methods for sample sizes of 𝑆/ = 𝑆 ∈ { , , , , } when applied to our tcga breast cancer dataset.
comparison of the median precision and recall of the three considered techniques shows that on average our methods exhibit greater power and importantly less variable performance than glm+egs at these low cohort sizes. furthermore, our simulation study confirmed that both versions of the inter-n-of- provide substantially improved recall over the glm+egs method at small cohort sizes while still maintaining equivalent levels of precision. the simulation results also establish that the expected proportion of subjects with coordinated degs within cohorts plays a critical role in determining the range of cohort sizes in which the developed methods outperform traditional generalized linear model-based techniques. in datasets where the proportion of subjects within cohorts sharing their degs is lower than %, the inter-n-of- methods continue to outperform the glm+egs method for cohort sizes larger than .

figure . comparison of accuracy of glm+egs and inter-n-of- methods within the simulation. we generated subject tumor-normal pairs for a variety of cohort sizes ( vs , vs , vs , vs , vs , vs ) and expected proportion of subjects with shared degs in cohort a ( . , . , . ). we simulated datasets for each parameter configuration and applied the proposed inter-n-of- methods and glm+egs method to each. we calculated the total number of true positives, false positives, false negatives, and true negatives across all iterations and used them to calculate the precision and recall for each combination of method, parameter configuration, and go term size. separate graphs are made for each parameter configuration and plot the resulting precision and recall measures for each method for the gene sets of size . the results for gene sets of size were very similar to the above results and so were excluded. the n-of- -mixenrich version of the inter-n-of- method performs excellently and achieves near perfect scores for cohort sizes above . the noiseq version of the inter-n-of- method often fails to identify positive signal for cohort sizes of or smaller, but otherwise achieves performance scores near those of the n-of- -mixenrich version of the inter-n-of- method. the two versions of inter-n-of- appear to be unaffected by changes in the expected proportion of subjects with shared degs since their performance scores within each graph oscillate around the same general area and show no overall trend. the glm+egs method often struggles to identify positive signal for smaller cohort sizes, although increasing the expected proportion of subjects within cohorts with coordinated degs improves the recall of the method and decreases the minimum sample size needed for it to perform near perfectly. the glm+egs method always shows excellent precision and control of the overall fdr for all except the cohort sizes of .

several limitations were observed. ( ) this study focuses on parameters related to cancers, where there are substantial differences between paired normal tissue and cancer tissue. while single-subject studies have been shown to be effective in viral response (gardeux, et al., ; gardeux, et al., ) or response to therapy (li, et al., ), it remains to be demonstrated that the downstream inter-n-of- methods can outperform transcript-level methods in those biological conditions. ( ) the simulation does present some inconsistencies with observations made within the tcga breast cancer subsets.
this can probably be explained by the fact that the breast cancer analyses used a reference standard that favored glm+egs over inter-n-of- methods by design. ( ) we explored only one type of difference within gene set response between cohorts in the simulations: a cohort responsive vs unresponsive. we are thus undertaking the complementary analysis to compare the more general paradigm of gene sets more responsive in one cohort than in the other. ( ) finally, although the developed methods allow for a more accurate testing of interactions in datasets with small sample sizes, the importance of balancing confounders between the two cohorts should not be understated. the small samples used within these analyses prevent randomization from balancing key covariates and confounders between cohorts. future studies could model unbalanced covariates through data or knowledge fusion with external datasets. ( ) transcript independence assumptions in the calculation of the single-subject odds ratio and its variance (inter-n-of- methods) may be violated. however, many such assumptions are routinely overlooked in related analyses, such as bh-fdr (benjamini and hochberg, ), whose similar limitations were later rectified by the by-fdr (benjamini and yekutieli, ). when viewed under that perspective, computational biology may progress by first proving new models and then addressing their biases in subsequent studies. ( ) other unbiased approaches to generating gene sets could have been utilized (e.g., co-expression network from independent datasets, protein interaction networks, etc.). ( ) of note, few datasets are available with two measures in different conditions per subject and more than one clinical cohort of subjects.
similar to physics, where experiment and theory influence one another, our work improves the analysis of an experimental design that is infrequently used and merits more consideration for increasing the signal-to-noise ratio in the study of rare and infrequent diseases. ( ) prospective biologic validation of results is also required in future studies as we have done with single-subject studies in the past (gardeux, et al., ).

another consideration is that glm+egs and inter-n-of- evaluate different phenomena. the glm+egs method primarily discovers go terms enriched for transcripts, which requires the coordination of signals at the transcript level across subjects belonging to the same class before the enrichment. the inter-n-of- , on the other hand, assesses whether the proportion of responsive transcripts within a given go term measured in each subject significantly differs across cohorts at the gene set-level. in other words, in the inter-n-of- , the transcripts' contribution to the gene set signal may be different between subjects, while in the glm+egs methods a transcript-level coordination is required. the inter-n-of- favors clinical applications where gene set mechanisms are causal to the disease. cancer is one such condition where numerous genetic and transcriptomic root causes may differ between subjects and yet converge to comparable cellular and clinical phenotypes.

in conclusion, the proposed s -anchored effect size-methods demonstrate the utility of within-subject paired sample designs for better controlling within-patient background genetic variation and thereby identifying clearer signal with small numbers of subjects. these approaches first simplify the heterogenicity between subjects with better controlled single-subject studies reminiscent of experimental isogenic models (e.g., cell lines or mouse models).
these results motivate further studies of new experimental designs, where paired within-subject samples allow analyses of datasets previously considered too small. the new design not only presents opportunities in terms of performance within small subject cohorts, but also in terms of utility. the use of single-subject methods within the inter-n-of- creates an avenue for examining subject variability within cohorts. by examining the single-subject results one can directly see the degree of concordance and discordance amongst subjects and answer questions pertaining to whether specific subjects possess the overall observed signal. thus, the inter-n-of- presented here represents not just a new method that performs better within small sample sizes, but also an example for how to borrow knowledge from gene sets for more powerful measures of dispersion in a single subject to conduct studies of rare or infrequent diseases and analyses on patient variability within and across cohorts. in addition, precision therapies designed for increasingly sub-stratified common disorders can benefit from the proposed methods. the strategies and methods presented here open a new frontier that may greatly enrich our understanding of the genetic foundations of rare diseases.

acknowledgements

we acknowledge branden lau for performing alignment of the mcf sra files.

funding

this work was supported in part by the university of arizona health sciences center for biomedical informatics and biostatistics, the bio institute, and the nih (u ai and r ai ).
this article did not receive sponsor- ship for publication. conflict of interest: none declared. references agresti, a. and kateri, m. categorical data analysis. springer berlin heidelberg; . andre, f., et al. alpelisib for pik ca-mutated, hormone receptor-positive advanced breast cancer. n engl j med ; ( ): - . ashburner, m., et al. gene ontology: tool for the unification of biology. nature genetics ; ( ): . balli, m., et al. autologous micrograft accelerates endogenous wound healing response through erk-induced cell migration. cell death & differentiation : - . benjamini, y. and hochberg, y. controlling the false discovery rate - a practical and powerful approach to multiple testing. j r stat soc b ; ( ): - . benjamini, y. and yekutieli, d. the control of the false discovery rate in multiple testing under dependency. annals of statistics : - . berghout, j., et al. single subject transcriptome analysis to identify functionally signed gene set or pathway activity. in, psb. world scientific; . p. - . berghout, j., et al. single subject transcriptome analysis to identify functionally signed gene set or pathway activity. pac symp biocomput ; : - . brookes, s.t., et al. subgroup analyses in randomized trials: risks of subgroup- specific analyses;: power and sample size for the interaction test. journal of clinical epidemiology ; ( ): - . cancer genome atlas, n. comprehensive molecular portraits of human breast tumours. nature ; ( ): - . ciriello, g., et al. comprehensive molecular portraits of invasive lobular breast cancer. cell ; ( ): - . edgar, r., domrachev, m. and lash, a.e. gene expression omnibus: ncbi gene expression and hybridization array data repository. nucleic acids research ; ( ): - . elliott, e.j. and zurynski, y.a. rare diseases are a'common'problem for clinicians. australian family physician ; ( ): . fisher, r.a. the logic of inductive inference. journal of the royal statistical society ; ( ): - . fleiss, j. 
the design and analysis of clinical experiments. . new york, john wiley& sons . gardeux, v., et al. concordance of deregulated mechanisms unveiled in underpowered experiments: ptbp knockdown case study. bmc medical genomics ; ( ): - . gardeux, v., et al. concordance of deregulated mechanisms unveiled in underpowered experiments: ptbp knockdown case study. bmc med genomics ; suppl (s ):s . gardeux, v., et al. a genome-by-environment interaction classifier for precision medicine: personal transcriptome response to rhinovirus identifies children prone to asthma exacerbations. journal of the american medical informatics association ; ( ): - . gardeux, v., et al. towards a pbmc “virogram assay” for precision medicine: concordance between ex vivo and in vivo viral infection transcriptomes. journal of biomedical informatics ; : - . griggs, r.c., et al. clinical research for rare disease: opportunities, challenges, and solutions. mol genet metab ; ( ): - . grossman, r.l., et al. toward a shared vision for cancer genomic data. n engl j med ; ( ): - . guillem, p., et al. rare diseases in disabled children: an epidemiological survey. arch dis child ; ( ): - . kim, j.y., et al. clinical implications of genomic profiles in metastatic breast cancer with a focus on tp and pik ca, the most frequently mutated genes. oncotarget ; ( ): - . law, c.w., et al. voom: precision weights unlock linear model analysis tools for rna-seq read counts. genome biology ; ( ):r . leon, a.c. and heo, m. sample sizes required to detect interactions between two binary fixed-effects in a mixed-effects linear regression model. computational statistics & data analysis ; ( ): - . li, q., et al. n-of- -pathways mixenrich: advancing precision medicine via single- subject analysis in discovering dynamic changes of transcriptomes. bmc med genomics ; (suppl ): . li, q., et al. kmen: analyzing noisy and bidirectional transcriptional pathway responses in single subjects. j biomed inform ; : - . liu, y., zhou, j. 
and white, k.p. rna-seq differential expression studies: more sequence or more replication? bioinformatics ; ( ): - . millard, s.p., kowarik, a. and kowarik, m.a. package ‘envstats’. . ozturk, k., et al. the emerging potential for network analysis to inform precision cancer medicine. j mol biol ; ( pt a): - . powers, d.m. evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. arxiv preprint arxiv: . . robinson, m.d., mccarthy, d.j. and smyth, g.k. edger: a bioconductor package for differential expression analysis of digital gene expression data. bioinformatics ; ( ): - . robinson, m.d. and oshlack, a. a scaling normalization method for differential expression analysis of rna-seq data. genome biology ; ( ):r . schissler, a.g., et al. dynamic changes of rna-sequencing expression for precision medicine: n-of- -pathways mahalanobis distance within pathways of single subjects predicts breast cancer survival. bioinformatics ; ( ): - . schissler, a.g., et al. analysis of aggregated cell–cell statistical distances within pathways unveils therapeutic-resistance mechanisms in circulating tumor cells. bioinformatics ; ( ):i -i . schissler, a.g., piegorsch, w.w. and lussier, y.a. testing for differentially expressed genetic pathways with single-subject n-of- data in the presence of inter-gene correlation. stat methods med res ; ( ): - . schurch, n.j., et al. how many biological replicates are needed in an rna-seq experiment and which differential expression tool should you use? rna ; ( ): - . smyth, g.k., et al. limma: linear models for microarray data. in bioinformatics and computational biology solutions using r and bioconductor. statistics for biology and health. . soneson, c. and delorenzi, m. a comparison of methods for differential expression analysis of rna-seq data. bmc bioinformatics ; ( ): . subramanian, a., et al. gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. 
proc natl acad sci u s a ; ( ): - . tao, y., et al. information theory applied to the sparse gene ontology annotation network to predict novel gene function. bioinformatics ; ( ):i - . tarazona, s., et al. data quality aware analysis of differential expression in rna-seq with noiseq r/bioc package. nucleic acids research ; ( ):e - e . van keymeulen, a., et al. reactivation of multipotency by oncogenic pik ca induces breast tumour heterogeneity. nature ; ( ): - . vitali, f., et al. developing a ‘personalome’ for precision medicine: emerging methods that compute interpretable effect sizes from single-subject transcriptomes. briefings in bioinformatics ; ( ): - . wan, y.w., allen, g.i. and liu, z. tcga stat: simple tcga data access for integrated statistical analysis in r. bioinformatics ; ( ): - . woolf, b. on estimating the relation between blood group and disease. ann hum genet ; ( ): - . yang, x., et al. single sample expression-anchored mechanisms predict survival in head and neck cancer. plos comput biol ; ( ):e . zaim, s.r., et al. evaluating single-subject study methods for personal transcriptomic interpretations to advance precision medicine. bmc medical genomics ; ( ): . zaim, s.r., et al. emergence of pathway-level composite biomarkers from converging gene set signals of heterogeneous transcriptomic responses. pac symp biocomput ; : - .
in-silico structural and molecular docking-based drug discovery against viral protein (vp ) of marburg virus: a potent agent of mavd

sameer quazi , , javed malik , arnaud martino capuzzo , , kamal singh suman , , zeshan haider

. gen-lab biosolutions private limited, bangalore, karnataka, india
. department of genetics, indian academy degree college, bangalore, karnataka, india.
. department of zoology, guru ghasidas vishwavidyalaya, bilaspur, chhattisgarh, india.
. department of veterinary sciences, university of milan, italy.
. centre of agricultural biochemistry and biotechnology (cabb), university of agriculture faisalabad, pakistan

.cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . /

abstract

the marburg virus (marv) is an etiological agent of hemorrhagic fever in humans. marv has spread across the world, including america, australia, europe, and different asian countries. however, there is no approved vaccine to combat marv; combined with its high mortality rate, this makes antiviral drugs against marv an urgent need. the viral protein (vp ) is a core protein of marv that is involved in multiple functions of the infection cycle. this research used an in-silico drug design technique to discover new drug-like small molecules that inhibit vp replication. first, ~ compounds showing structure-based similarity above % were retrieved from the online "pubchem" database. molecular docking was performed using autodock .
, and ligands were selected based on docking / s score lower than reference cid_ and rmsd value between - . finally, about compounds showed greater bonding producing hydrogen, van der waals, and polar interactions with vp . after evaluating their binding energy strength and admet analysis, only cid_ and cid_ were finalized, which showed the most favorable binding energy and a strong inhibitory effect with marv's vp . the higher binding energy, suitable admet, and drug similarity parameters suggest that these "cid_ and cid_ " candidates have strong potential to inhibit marv replication and could therefore support the treatment of mavd.

keywords: marburg virus, vp , database screening, molecular docking, admet profiling, in-silico drug discovery

. introduction

marburg virus (marv) is an enveloped virus that is a member of the filoviridae family. marv has a non-segmented, single-stranded, negative-sense rna genome of kb to kb in size (bausch et al. ; carroll et al. ). marv has caused intermittent disease in limited numbers of people in africa for ten years following its first discovery in . two significant outbreaks in the democratic republic of the congo showed approximately % mortality (towner et al. ; zhu et al. ). marv causes the disease commonly known as hemorrhagic fever in humans and animals (mehedi et al. ). no vaccines or therapeutics are presently approved to manage marv infections, and work on marv is therefore urgently required (anthony and bradfute ). the marv genome contains seven genes.
each gene contains an open reading frame (orf) with flanking regions ranging from - nucleotides in length (wang et al. ). the five structural proteins, including the nucleoprotein (np), the viral proteins (vp) and , the glycoprotein, and the rna-dependent rna polymerase (l), play an essential role in the infectivity of marv (biacchesi et al. ). np plays a pivotal role in the formation of virions. it combines with other viral proteins, particularly vp , vp , vp , and vp , as an essential component of the virus assembly machinery to coordinate the replication process (bamberg et al. ; becker et al. ), and it serves as the scaffold on which the nucleocapsid develops into a helical tubular structure. sequence homology reveals that np includes a conserved n-terminal region required for self-assembly and for binding single-stranded rna (ssrna), and a mostly disordered c-terminal region containing a segment necessary for virion budding (dolnik et al. ; kolb et al. ). further, marv counteracts the immune response through the multipurpose vp , which plays an essential role in viral rna synthesis, assembly, and virus structure. marv vp interacts with many components of innate antiviral defense, particularly the rig-i (retinoic acid-inducible gene-i)-like receptor pathways that lead to ifn production (ramanan et al. ). fgi- ( -( -( -(aminomethyl)- -benzofuran- -yl) vinyl)- h-benzimidazole- -carboximidamide) is a small drug-like compound that has previously been characterized and reported as an effective drug against vp of marv (warren et al. ). vp is considered a vital target for antiviral drug design because of its important role in the transcription of marv. fgi- was used to screen small molecules from the pubchem database by structure similarity-based filtration (more than % similarity) to find novel compounds. computer-aided drug design (cadd) and virtual high-throughput screening play a critical role in drug discovery (lyne ). bioinformatics techniques, including structure-based screening of drug-like compounds from online databases, molecular docking, and molecular dynamics simulation, can be used to block the p active site of vp . the current research was designed to identify novel drug-like substances with greater contact, binding energy, and inhibitory effect at the p site of marv vp by using computational strategies. the final drug-like small molecules would have greater potential to stop the replication of marv in the host, which could ultimately help to design and develop new drugs against mavd.

. materials and methods

. . amino acid sequence retrieval and analysis

the amino acid sequence of the vp protein was retrieved from the national center for biotechnology information (ncbi) database (https://www.ncbi.nlm.nih.gov/). ncbi is a leading public biomedical database and contains different tools for analyzing genomic and molecular information in computational biology (jenuth ). the protein's primary sequence was then analyzed using the online tool expasy-protparam (https://web.expasy.org/protparam/). protparam was used to compute different physical and chemical parameters of the protein, including the molecular weight, isoelectric point, atomic composition, estimated half-life, amino acid composition, aliphatic index, and grand average of hydropathicity (garg et al. ).

. . structure prediction, evaluation, and validation of protein

the sequence of the vp protein was used to identify a template with high sequence similarity. the protein sequence was searched against the protein data bank (pdb) with the basic local alignment search tool (blast), and the structure with the highest sequence similarity was selected. the three-dimensional ( d) structure of the vp protein was built using modeller v . software. modeller is a desktop-based computational tool used to predict the homology-based d structure of a protein. the most favorable and accurate d model was selected based on the dope score (eswar et al. ). the quality of the d structure of vp was assessed and validated using the freely available online procheck tool (https://servicesn.mbi.ucla.edu/procheck/). procheck reports the stereochemical quality of a protein structure (laskowski et al. ).

. . formation of coordinate files

the d structure of the vp protein was prepared using discovery studio visualizer and autodock . . the structure was optimized by removing water molecules from vp , adding hydrogen and polar hydrogen atoms, adding kollman charges, and fixing the receptor atoms. finally, the vp structure was saved in "pdbqt" format (haider et al. ; quazi et al. ).

. . selection of ligand and database virtual screening

the ".sdf" file of the fgi- antiviral drug was downloaded from the pubchem database. the active site (p ) of vp was identified using the online dogsite tool (https://proteins.plus/help/dogsite).
the residues of p ("ala , , lys , leu- , phe- , ile- , gln- , val- , ser- , lys- , val- , pro- , ile- , and cys- ") were selected for molecular docking with the fgi- antiviral drug using autodock . software. fgi- was then used as the query to screen other drug-like compounds from the pubchem database. pfizer's rule of five (lipinski's rule) was used to evaluate the drug-like properties of each compound. the parameters of lipinski's rule, m.w. < da, logp < , hbd < , and hba < , were used to screen the drug-like small molecules (chen et al. ; lipinski ). compounds that passed were nominated for further analysis. each selected drug-like compound was energy-minimized using autodock . software and saved separately as a "pdbqt" file for molecular docking.

. . molecular docking

the finally selected drug-like compounds were docked with the p site of marv vp using the desktop autodock . software. the molecular docking was carried out on a computer running windows with an x operating system. autodock . and mgl . . with python . were used for these experiments (france, scotti, and scotti ). the protein-ligand interactions were analyzed using discovery studio visualizer and pymol, respectively (d studio and ; inwood et al. ). for molecular docking, the receptor and ligand were used after energy minimization, and both structures were saved as ".pdbqt" files. grid maps of the energy of each atom type were generated using the autogrid algorithm of autodock . . for every docking run, a grid box was drawn based on the p site of marv vp using a grid map of × × points, × × grid spacing points, and . Å and . Å, respectively. the docking was completed with the parameters described in our previous studies (haider et al. ; quazi et al. ). the "s" value is the docking score between the respective receptor and ligand; a more negative "s" indicates a stronger binding affinity of the ligand for the receptor. the rmsd (root mean square deviation) value was used to compare the docked conformations of the ligand and receptor. the docking score and rmsd value of each ligand were calculated using the default scoring parameters in autodock . (haider et al. ). the "s" score and binding energy of the finalized compounds were compared with the values of fgi- . small molecules with binding energy similar to or stronger than that of fgi- were selected and considered for further analysis.

. . admet profiling

the admet properties of the finally selected drug-like compounds were checked using the available admetsar tool (immd.ecust.edu.cn/admetsar ). admetsar predicts multiple toxic effects, including mutagenicity, irritant behaviour, and competitiveness. the drug-like properties in the admet profile help to select antiviral medicines that are safe for humans (fatima et al. ).

. results

. . amino acid sequence retrieval and analysis

the vp primary sequence of amino acids (a.a.) was obtained from the ncbi database. the stability of a protein structure depends on its three-dimensional conformation.
the sequence of the target protein was characterized based on its physical and chemical properties. the physicochemical properties estimated using expasy-protparam showed a molecular weight (m.w.) of . , an isoelectric point (pi) of . , and a grand average of hydropathicity of - . . all the physical and chemical properties of the vp protein are shown in table .

. . structure prediction, evaluation, and validation of protein

the d structure of the vp protein was predicted by homology modelling. the template "c gh a", downloaded from pdb, showed % sequence similarity. the homology modelling was done using the modeller desktop software. the best d structure of vp out of ten structures was chosen based on the dope and g.a. scores (- . and ) (figure ). the geometric analysis of the predicted d structure of vp was performed using the procheck tool, which showed that most of the a.a. (approximately out of ) were situated in the favorable region of the protein, that is, . % out of %. the d structure of vp was therefore considered reliable and stable for further study (figure ).

figure . the d structure of vp built using template "c gh a", shown in pymol.

table . physicochemical characteristics of marv vp protein

physical and chemical properties (vp ):
mw: .
a.a #:
isoelectric point: .
instability index: .
total number of negatively charged residues:
total number of positively charged residues:
aliphatic index: .
grand average of hydropathicity: - .
half-life: h
atomic composition:
carbon atoms:
hydrogen atoms:
nitrogen atoms:
oxygen atoms:
sulphur atoms:
molecular formula: c h n o s
total number of atoms:

amino acid composition (%):
alanine: .
arginine: .
asparagine: .
aspartic acid: .
cysteine: .
glutamine: .
glutamic acid: .
glycine: .
histidine: .
isoleucine: .
leucine: .
lysine: .
methionine: .
phenylalanine: .
proline: .
serine: .
threonine: .
tryptophan: .
tyrosine: .
valine: .

figure . ramachandran plot assessment of marv vp , showing that . % of a.a. are present in favorable regions, about . % of a.a. in allowed regions, and . % of a.a. in an outlier region.

. . selection of ligand and database virtual screening

the vp protein was docked with an antiviral drug named fgi- .
the results indicated that fgi- binds to marv vp . this analysis showed that fgi- formed a complex with vp with an "s" score of - . , an rmsd of . , and a binding energy of - . (table ). the interaction analysis showed that gln formed a strong contact with fgi- by hydrogen bonding, and gln made a strong connection through polar interaction, while lys , val , and ile were involved in van der waals interactions (figure ). in our research, compounds with > % structural similarity to fgi- were chosen through virtual screening of the pubchem online database. the compounds were cross-validated by applying pfizer's rule of five to all of them. the drug-like small molecules that passed ( out of ) were placed into a separate database for docking with the vp protein after energy minimization.

figure . (a) d structure of marv vp (shown in blue) in complex with fgi- (shown in yellow). (b) d ligand complex between "fgi- " and the p site of marv vp . (c) d ligand complex between "fgi- " and p of marv vp .

. . molecular docking

molecular docking plays an important role in creating modern drugs against various lethal illnesses (ursic-bedoya et al. ). all of the hits were docked against the p site of marv vp using autodock. two compounds were recorded with a lower s/docking score than fgi- .
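the selection criterion described in the methods (a docking score below that of the reference ligand, a binding energy similar to or stronger than the reference, and an rmsd inside a fixed window) can be sketched as a simple filter. this is only an illustrative sketch, not the authors' pipeline: the record type `DockingResult`, the function `select_hits`, and the rmsd window arguments are hypothetical names introduced here, and the rmsd bounds are left as parameters because they are not recoverable from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DockingResult:
    """One docked ligand (hypothetical record; field names are illustrative)."""
    compound_id: str
    s_score: float         # docking score; more negative = stronger predicted binding
    rmsd: float            # rmsd of the docked pose, in angstroms
    binding_energy: float  # kcal/mol; more negative = stronger predicted binding


def select_hits(results: List[DockingResult], reference: DockingResult,
                rmsd_min: float, rmsd_max: float) -> List[DockingResult]:
    """Keep ligands that beat the reference score and lie inside the rmsd window."""
    hits = [
        r for r in results
        if r.s_score < reference.s_score                   # lower score than reference
        and r.binding_energy <= reference.binding_energy   # similar or stronger binding
        and rmsd_min <= r.rmsd <= rmsd_max                 # assumed inclusive window
    ]
    # Most negative (best) docking score first.
    return sorted(hits, key=lambda r: r.s_score)


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are not values from this study.
    reference = DockingResult("reference", -10.0, 1.5, -8.0)
    candidates = [
        DockingResult("ligand_a", -12.5, 1.2, -9.1),
        DockingResult("ligand_b", -9.0, 1.0, -7.5),
        DockingResult("ligand_c", -11.0, 3.4, -8.6),
        DockingResult("ligand_d", -13.0, 0.8, -9.4),
    ]
    for hit in select_hits(candidates, reference, rmsd_min=0.0, rmsd_max=2.0):
        print(hit.compound_id, hit.s_score, hit.rmsd, hit.binding_energy)
```

treating "similar to or stronger than" as a less-than-or-equal comparison is an assumption; a tolerance around the reference energy would be an equally reasonable reading.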
the successfully docked top compounds with lower s-score and rmsd values were selected for further evaluation. the binding interactions of these two hit compounds with vp were determined using discovery studio tools. the best poses were ranked by the minimum binding energy in the largest cluster and the number of hydrogen bonds formed with the a.a. residues of p (table ). this was done to ensure that the compounds were bound in the correct binding position. the successful inhibitors showed strong interaction with the p site of marv vp (figure ).

figure . (a) d complex of vp (shown in blue) with the novel inhibitor cid_ (shown in orange). (b) d complex of vp (shown in blue) with the novel inhibitor cid_ (shown in tinted white).

. . admet profiling

the molinspiration server was used to cross-check the drug-like parameters of the suggested small molecules against marv vp . the selected compounds showed zero violations of pfizer's rule of five for the drug properties m.w., hbd, hba, logp, and tpsa (table ). the safety of the candidate drugs in a living organism was then evaluated. admet is the abbreviation of absorption, distribution, metabolism, excretion, and toxicity. the admet analysis was performed using the admetsar server. the adme analysis of all the finally selected compounds showed zero violations for use in a living organism (table ).
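the rule-of-five check mentioned above can be illustrated with a short sketch. it assumes the standard lipinski cutoffs (molecular weight at most 500 da, logp at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors), because the thresholds are not recoverable from the text; the container `Descriptors` and the function names are hypothetical, and the descriptor values are expected to be precomputed by a separate tool.

```python
from dataclasses import dataclass
from typing import List

# Standard Lipinski rule-of-five cutoffs (assumed here; not taken from the text).
MAX_MW = 500.0   # molecular weight, Da
MAX_LOGP = 5.0   # octanol-water partition coefficient
MAX_HBD = 5      # hydrogen bond donors
MAX_HBA = 10     # hydrogen bond acceptors


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    name: str
    mw: float
    logp: float
    hbd: int
    hba: int


def lipinski_violations(d: Descriptors) -> List[str]:
    """Return the names of the rule-of-five criteria that the compound violates."""
    checks = {
        "mw": d.mw <= MAX_MW,
        "logp": d.logp <= MAX_LOGP,
        "hbd": d.hbd <= MAX_HBD,
        "hba": d.hba <= MAX_HBA,
    }
    return [rule for rule, ok in checks.items() if not ok]


def passes_rule_of_five(d: Descriptors, max_violations: int = 0) -> bool:
    """True if the compound has at most `max_violations` violations."""
    return len(lipinski_violations(d)) <= max_violations


if __name__ == "__main__":
    # Made-up descriptor values for illustration only; not values from table .
    examples = [
        Descriptors("example_a", mw=320.4, logp=2.1, hbd=3, hba=5),
        Descriptors("example_b", mw=612.7, logp=6.3, hbd=2, hba=11),
    ]
    for d in examples:
        print(d.name, lipinski_violations(d), passes_rule_of_five(d))
```

the default of zero allowed violations mirrors the "zero violation" wording above; lipinski's original formulation tolerates one violation, which the `max_violations` argument can express.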
table : predicted favorable docking results and pfizer's properties of the finalized drug compounds against vp .

pubchem #: cid_
name: h-benzimidazole- -carboximidamide, -( -( -(aminoiminomethyl)- -benzofuranyl) ethenyl)
docking score: - . ; rmsd: . ; binding energy (kcal/mol): - .
pfizer's properties: mw= . , logp= - . , hbd= , hba= , tpsa= .

pubchem #: cid_
name: -[ -[ -( -aminophenyl)- -furyl] phenyl]-n-isopropyl- h-benzimidazole- -carboxamidine; h-benzimidazole- -carboximidamide, -[ -[ -( -aminophenyl)- -furanyl] phenyl]-n-( -methylethyl)-
docking score: - . ; rmsd: . ; binding energy (kcal/mol): - .
pfizer's properties: mw= . , logp= . , hbd= , hba= , tpsa= .

pubchem #: cid_
name: -dibenzofuran- -yl- h-benzoimidazole- -carboxylic acid amide
docking score: - . ; rmsd: . ; binding energy (kcal/mol): - .
pfizer's properties: mw= . , logp= . , hbd= , hba= , tpsa= .

*hbd = hydrogen bond donor, hba = hydrogen bond acceptor

table : adme analysis of the finalized drug compounds against vp . values are listed for cid_ , cid_ , and cid_ , in that order.

absorption:
blood-brain barrier: positive (+), positive (+), positive (+)
human intestinal absorption: positive (+), positive (+), positive (+)
caco permeability: negative (-), negative (-), negative (-)
p-glycoprotein inhibitor: non-inhibitor, inhibitor, non-inhibitor
renal organic cation transporter: non-inhibitor, non-inhibitor, non-inhibitor

metabolism:
cyp a : non-inhibitor, inhibitor, inhibitor
cyp c : non-inhibitor, non-inhibitor, non-inhibitor
cyp d : non-inhibitor, non-inhibitor, non-inhibitor
cyp c : non-inhibitor, non-inhibitor, inhibitor
cyp a : non-inhibitor, non-inhibitor, non-inhibitor

toxicity:
ames analysis: non-poisonous, non-poisonous, non-poisonous
carcinogenic analysis: non-dangerous, non-dangerous, non-dangerous

. . analysis of receptor-ligand interaction

the s/docking score measures the contact strength between vp and the drug compounds; the drug-like small molecules were therefore chosen based on their s-score and binding energy. the compounds cid_ and cid_ bind strongly to the p active site of vp . the receptor-ligand complexes of cid_ and cid_ with vp showed that hydrogen bonding and van der waals interactions formed a stable complex. lys made a hydrogen bond with the cid_ ligand, while other amino acids of the p site formed van der waals interactions with it (figure a).
val formed a hydrogen bond with the cid_ ligand, while other amino acids of the p site formed van der waals interactions with it (figure b). the d and d pocket configurations of these drug-like small molecules are shown in figure .

figure . d and d representation of the receptor binding interaction analysis. (a) d interaction of ligand cid_ with vp . (b) d pocket of vp with ligand cid_ . (c) d interaction of ligand cid_ with vp . (d) d pocket of vp with ligand cid_ .

figure . d molecular structures of the selected drug-like compounds. (a) d structure of h-benzimidazole- -carboximidamide, -( -( -(aminoiminomethyl)- -benzofuranyl) ethenyl). (b) d structure of -[ -[ -( -aminophenyl)- -furyl] phenyl]-n-isopropyl- h-benzimidazole- -carboxamidine; h-benzimidazole- -carboximidamide, -[ -[ -( -aminophenyl)- -furanyl] phenyl]-n-( -methylethyl). (c) d structure of -dibenzofuran- -yl- h-benzimidazole- -carboxylic acid amide.

. discussion

many studies have been conducted to discover effective therapeutic vaccines against marv, but an effective treatment for marv is not yet available. marv is now considered a global problem, and a less expensive and effective antiviral drug against marv is still needed (brown et al. ). traditional drug development approaches are largely costly and unproductive for solving evolving public health challenges (velmurugan, mythily & rao ). approaches that can cope with this situation should therefore be implemented. in silico drug design strategies are becoming popular in the pharmaceutical industry because they are fast, less expensive, and time-saving in identifying new drugs (geisbert, bausch, and feldmann ). marv's vp viral protein is a promising candidate for vaccine design against marv infection. for these reasons, the current research proposes small drug-like molecules that may inhibit marv replication by binding firmly to the p site of marv vp and could be considered pharmacological compounds. the three small drug-like molecules were also analyzed for admet properties with the admetsar server. all the chosen compounds passed the admet criteria. blood-brain barrier cells are endothelial cells that function as a barrier and restrict the uptake of medicines by the brain. the blood-brain barrier is therefore considered an integral feature in drug design (alavijeh et al. ; cheng et al. ; stamatovic, keep and andjelkovic ). oral bioavailability is a significant factor in the drug-likeness of an active compound as a curative agent (varma et al. ).
the admet properties of candidate drug-like small molecules strongly influence their suitability as treatments. they include p-glycoprotein substrate status (inhibitor / non-inhibitor), blood-brain barrier penetration (positive/negative), human intestinal absorption (positive/negative), renal transport of organic cations (inhibitor / non-inhibitor), and caco permeability (positive/negative). cytochrome p (cyp) comprises isoenzymes that are active in the catabolism of several chemicals, including hormones, medicines, bile acids, and carcinogens. the admet test is useful and efficient for screening drug compounds and consisted of the following parameters: ( ) blood-brain barrier penetration, ( ) human intestinal absorption, ( ) caco permeability absorption, ( ) non-toxic, ( ) non-carcinogenic, and ( ) non-inhibitor of the cyp enzyme. these admet parameters were satisfied by the two compounds cid and cid (table ).

. conclusion

the current research focused on structure-based virtual screening using the pubchem online database, pfizer/lipinski's analysis, molecular docking, admet analysis, and evaluation of the interaction between ligands and the p site of marv vp . the drug-like compounds cid and cid showed strong binding to the p active site of marv vp , forming hydrogen bonds, van der waals interactions, and polar interactions. the results suggest that they could hypothetically be applied against marv as drugs. these compounds may serve as novel, structurally distinct, and potentially active pharmaceutical compounds against marv vp .
the molecule structure of three drug-like compounds is shown in figure . our in-silico research found that two drugs as small molecules have the potential of a drug that can be guided as therapeutic drugs against marv by skillfully directing the p of vp through marv. consequently, concerning two small drug-like molecules cid_ and cid_ , the work we performed requires further investigations and future in vitro and in vivo experiments before a possible verification with the competent authorities. .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / references anthony, scott m, and steven b bradfute. . "filoviruses: one of these things is (not) like the other." viruses ( ): – . bamberg, sandra et al. . "vp of marburg virus influences formation of infectious particles." journal of virology ( ): – . bausch, daniel g et al. . "marburg hemorrhagic fever associated with multiple genetic lineages of virus." new england journal of medicine ( ): – . becker, s et al. . "interactions of marburg virus nucleocapsid proteins." virology ( ): – . biacchesi, stéphane et al. . "genetic diversity between human metapneumovirus subgroups." virology ( ): – . carroll, serena a et al. . "molecular evolution of viruses of the family filoviridae based on whole-genome sequences." journal of virology ( ): – . chen, xiaoxia et al. . "analysis of the physicochemical properties of acaricides based on lipinski's rule of five." journal of computational biology ( ): – . d studio, and . . "discovery studio life science modeling and simulations." researchgate.net : – . dolnik, olga, larissa kolesnikova, lea stevermann, and stephan becker. . 
"tsg is recruited by a late domain of the nucleocapsid protein to support budding of marburg virus-like particles." journal of virology ( ): – . eswar, narayanan et al. . "comparative protein structure modeling using modeller." current protocols in bioinformatics ( ): – . fatima, shehnaz et al. . "admet profiling of geographically diverse phytochemical using chemoinformatic tools." future medicinal chemistry ( ): – . .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / france, alex, marcus scotti, and luciana scotti. . molecular docking of fructose-derived nucleoside analogs against reverse transcriptase of hiv- . garg, vijay kumar et al. . "mfppi–multi fasta protparam interface." bioinformation ( ): . haider, zeshan et al. . "in-silico pharmacophoric and molecular docking-based drug discovery against the main protease (m pro) of sars-cov- , a causative agent covid- ." pak. j. pharm. sci ( ): – . inwood, william b et al. . "genetic evidence for an essential oscillation of transmembrane-spanning segment in the escherichia coli ammonium channel amtb." genetics ( ): – . jenuth, jack p. . "the ncbi." in bioinformatics methods and protocols, springer, – . kolb, ryan et al. . "inflammasomes in cancer: a double-edged sword." protein & cell ( ): – . laskowski, roman a, malcolm w macarthur, david s moss, and janet m thornton. . "procheck: a program to check the stereochemical quality of protein structures." journal of applied crystallography ( ): – . lipinski, christopher a. . "lead-and drug-like compounds: the rule-of-five revolution." drug discovery today: technologies ( ): – . lyne, paul d. . 
"structure-based virtual screening: an overview." drug discovery today ( ): – . mehedi, masfique, allison groseth, heinz feldmann, and hideki ebihara. . "clinical aspects of marburg hemorrhagic fever." future virology ( ): – . https://pubmed.ncbi.nlm.nih.gov/ . quazi, sameer et al. . "in-silico structural and molecular docking-based drug discovery .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / against viral protein (vp ) of marburg virus: a causative agent of mavd." biorxiv. ramanan, parameshwaran et al. . "structural basis for marburg virus vp –mediated immune evasion mechanisms." proceedings of the national academy of sciences ( ): – . towner, jonathan s et al. . "marburgvirus genomics and association with a large hemorrhagic fever outbreak in angola." journal of virology ( ): – . ursic-bedoya, raul et al. . "protection against lethal marburg virus infection mediated by lipid encapsulated small interfering rna." the journal of infectious diseases ( ): – . https://doi.org/ . /infdis/jit . wang, lin-fa et al. . "molecular biology of hendra and nipah viruses." microbes and infection ( ): – . warren, travis k et al. . "antiviral activity of a small-molecule inhibitor of filovirus infection." antimicrobial agents and chemotherapy ( ): – . zhu, tengfei et al. . "crystal structure of the marburg virus nucleoprotein core domain chaperoned by a vp peptide reveals a conserved drug target for filovirus." journal of virology ( ). .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. 
genome warehouse: a public repository housing genome-scale data
meili chen#, yingke ma#, song wu, xinchang zheng, hongen kang, jian sang†, xingjian xu††, lili hao, zhaohua li, zheng gong, jingfa xiao, zhang zhang, wenming zhao, yiming bao*
national genomics data center, beijing institute of genomics, chinese academy of sciences / china national center for bioinformation, beijing, china
cas key laboratory of genome sciences and information, beijing institute of genomics, chinese academy of sciences, beijing, china
university of chinese academy of sciences, beijing, china
# equal contribution.
* corresponding author. e-mail: baoym@big.ac.cn (bao y).
† current address: division of cancer epidemiology and genetics, national cancer institute, national institutes of health, bethesda, maryland, usa
†† current address: college of computer science technology, inner mongolia normal university, hohhot, inner mongolia, china
running title: chen m et al / genome assembly data repository
abstract
the genome warehouse (gwh) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. as one of the core resources in the national genomics data center (ngdc), part of the china national center for bioinformation (cncb, https://bigd.big.ac.cn/), gwh accepts both full genome and partial genome (chloroplast, mitochondrion, and plasmid) sequences with different assembly levels, as well as updates of existing genome assemblies. for each assembly, gwh collects detailed genome-related metadata, including biological project, sample, and genome assembly information, in addition to genome sequence and annotation. to archive high-quality genome sequences and annotations, gwh is equipped with a uniform and standardized procedure for quality control. besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with jbrowse. by december , gwh has received , direct submissions covering a diversity of species, and has released of them.
collectively, gwh serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. gwh is publicly accessible at https://bigd.big.ac.cn/gwh/.
keywords: genome submission; genome sequence; genome annotation; genome warehouse; quality control
introduction
genome sequences and annotations are fundamental information for a wide range of genome-related studies, including the analysis of various omics data such as genome [ ], transcriptome [ ], epigenome [ , ], and genome variation [ , ]. china, as one of the most biodiverse countries in the world, harbors more than % of the world’s known species [ ]. in the past decades, the genomes of a large number of featured and important animals and crops in china have been sequenced and assembled [ , – ], and most of these assemblies were submitted to international nucleotide sequence database collaboration (insdc) members (national center for biotechnology information (ncbi), european bioinformatics institute (ebi), and dna data bank of japan (ddbj)) [ ]. with the rapid growth of genome assembly data in china, for example, large data sizes, slow transfer rates caused by limited international network bandwidth, and language barriers in communicating technical issues have kept researchers from efficiently submitting their data to insdc members. all these call for a centralized genomic data repository within china to complement the insdc.
here, we report the genome warehouse (gwh, https://bigd.big.ac.cn/gwh/), a centralized resource housing genome assembly data and delivering a series of genome data services. as one of the core resources in the national genomics data center (ngdc), part of the china national center for bioinformation (cncb, https://bigd.big.ac.cn/) [ ], gwh aims to accept data submissions worldwide and to provide an important resource for genome data quality control, archiving, rapid release, and public sharing (e.g., with insdc) in support of research activities all over the world. to date, gwh has received a total of , genome submissions (including international submissions), demonstrating its increasingly important role in global genome data management and sharing.
data model
designed for compatibility with the insdc data model, each genome assembly in gwh is linked to a bioproject (https://bigd.big.ac.cn/bioproject) and a biosample (https://bigd.big.ac.cn/biosample), which are two fundamental resources for metadata description in cncb-ngdc. full or partial (chloroplast, mitochondrion, and plasmid) genome assemblies at any assembly level (complete, draft in chromosome, scaffold, and contig) are accepted, and existing genome assemblies can be updated.
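The accession numbers assigned under this data model follow a fixed pattern: a "GWH" prefix, an optional record-type letter, four capital letters, and a zero-padded counter. The Python sketch below shows one way such accessions could be composed. It is illustrative only and is not GWH's own implementation: the function names are hypothetical, the upper-case rendering is assumed from the phrase "capital letters", and counters are assumed to be positive integers.

```python
import re
from typing import Optional

# Letter inserted after the "GWH" prefix for each derived record type.
TYPE_LETTER = {"gene": "G", "transcript": "T", "protein": "P"}


def _check_letters(letters: str) -> None:
    # The data model specifies exactly four capital letters.
    if not re.fullmatch(r"[A-Z]{4}", letters):
        raise ValueError(f"expected four capital letters, got {letters!r}")


def assembly_accession(letters: str, version: Optional[int] = None) -> str:
    """Assembly accession: 'GWH' + four capital letters + eight zeros.

    An update of an existing submission gets a dot and a version number.
    """
    _check_letters(letters)
    accession = f"GWH{letters}{'0' * 8}"
    return accession if version is None else f"{accession}.{version}"


def sequence_accession(letters: str, index: int) -> str:
    """Genome sequence accession: same layout with an eight-digit counter."""
    _check_letters(letters)
    if not 0 < index < 10**8:  # assumption: counters are positive
        raise ValueError("index must be positive and fit in eight digits")
    return f"GWH{letters}{index:08d}"


def feature_accession(kind: str, letters: str, index: int) -> str:
    """Gene, transcript, or protein accession: type letter, six-digit counter."""
    _check_letters(letters)
    if not 0 < index < 10**6:  # assumption: counters are positive
        raise ValueError("index must be positive and fit in six digits")
    return f"GWH{TYPE_LETTER[kind]}{letters}{index:06d}"
```

Under these assumptions, `assembly_accession("AAAA")` gives `GWHAAAA00000000`, `sequence_accession("AAAA", 1)` gives `GWHAAAA00000001`, and `feature_accession("protein", "AAAA", 1)` gives `GWHPAAAA000001`.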
accession numbers are assigned with the following rules (figure ):
( ) each genome assembly has an accession number prefixed with "gwh", followed by four capital letters and eight zeros (e.g., gwhaaaa );
( ) genome sequences have the same accession number format as their corresponding genome assembly, except that the eight digits start from and increase in order (e.g., gwhaaaa );
( ) genes have an accession pattern similar to that of genome sequences, with the letter “g” added between the gwh prefix and the four capital letters, and six digits at the end instead of eight (e.g., gwhgaaaa );
( ) transcripts use the letter “t” in place of “g” in the gene accession pattern (e.g., gwhtaaaa );
( ) proteins use the letter “p” in place of “g” in the gene accession pattern (e.g., gwhpaaaa );
( ) if a submission updates an existing submission in gwh, a dot and an incremental number are appended to represent the version (e.g., gwhaaaa . ).
database components
gwh is a centralized resource housing genome-scale data, with the purpose of archiving high-quality genome sequences and annotation information. gwh is equipped with a series of web services for genome data submission, release, and sharing, organized into three major components: data submission, quality control, and archive and release (figure ).
data submission
gwh accepts genome assembly data through an online submission system and also allows offline batch submissions. users need to register first and then provide a complete description of the submitted genome sequences. biological project and sample information should be provided (through bioproject and biosample, respectively) together with the genome assembly sequence, annotation, and associated metadata. metadata mainly consist of a variety of information about
submitter, general assembly, file(s), sequence assignment, and publication (if available). after submission, gwh runs an automated quality control pipeline to check the validity and consistency of the submitted genome sequence and genome annotation files. accession numbers are assigned to assemblies and sequences once they pass quality control. updated assembly data can also be submitted to gwh. it should be noted that, in line with the practice of insdc members (e.g., ncbi genbank), it is the responsibility of the submitters to ensure data quality, completeness, and consistency; gwh does not warrant or assume any legal liability or responsibility for data accuracy.
quality control
after metadata and file(s) are received, gwh automatically runs standardized quality control (qc) to check for different types of errors in the submitted genome sequences and annotations and, if needed, to scan for contaminated genome sequences (see details at https://bigd.big.ac.cn/gwh/documents) (figure ). the procedure roughly falls into qc steps:
( ) the component checks the consistency of file(s) according to filename and md5 code.
( ) for genome sequences, the component checks the validity of genome sequence ids and sequence content, e.g., unique sequence id, sequence composition (a/t/c/g or degenerate base), and sequence length (≥ bp).
( ) for genome annotations, the component checks gene structure completeness and consistency, e.g., unique ids, exon/cds/utr coordinates falling within the corresponding gene coordinates, strand consistency for all features (including gene/transcript/exon/cds/utr), and codon validity (e.g., valid start/stop codons, no internal stop codons).
( ) finally, it checks the internal consistency of genome sequence and annotation, e.g., each sequence id in the genome annotation must match a genome sequence id, and each feature coordinate must fall within the range of the corresponding genome sequence.
( ) genome sequences are also scanned for vectors, adaptors, primers, and indices (collected from the univec database, ftp://ftp.ncbi.nlm.nih.gov/pub/univec/) using ncbi’s vecscreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/).
if there is an error, a report is automatically sent to the submitter by email. to complete a submission, the submitter needs to fix all errors and resubmit the files until they pass the qc process.
archive and release
once a submitted genome assembly passes quality control, gwh assigns it a unique accession number, allots accession numbers to each genome sequence, gene, transcript, and protein, and generates and backs up downloadable files of genome sequence and annotation in fasta, gff3, and tsv formats. data generation is performed with in-house scripts based on the submitted genome sequence and annotation files. to ensure the security of submitted data, a backup copy is stored on a physically separate disk.
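The file-level and sequence-level checks in the quality control steps above (checksum consistency, unique sequence ids, allowed sequence composition, minimum length) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GWH pipeline: the function names are hypothetical, the degenerate bases are taken to be the IUPAC nucleotide codes, and the minimum length is a required parameter because the threshold is not recoverable from the text.

```python
import hashlib
from typing import Iterator, List, Tuple

# Assumed alphabet: A/C/G/T plus the IUPAC degenerate nucleotide codes.
IUPAC_BASES = set("ACGTRYSWKMBDHVN")


def md5_matches(data: bytes, expected_md5: str) -> bool:
    """Compare the MD5 digest of the received bytes with the submitted one."""
    return hashlib.md5(data).hexdigest() == expected_md5.strip().lower()


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted text."""
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def check_sequences(fasta_text: str, min_length: int) -> List[str]:
    """Return one error message per violated sequence-level rule."""
    errors: List[str] = []
    seen = set()
    for seq_id, seq in parse_fasta(fasta_text):
        if seq_id in seen:
            errors.append(f"duplicate sequence id: {seq_id}")
        seen.add(seq_id)
        invalid = set(seq.upper()) - IUPAC_BASES
        if invalid:
            errors.append(f"{seq_id}: invalid characters {sorted(invalid)}")
        if len(seq) < min_length:
            errors.append(f"{seq_id}: length {len(seq)} is below {min_length}")
    return errors
```

An empty list from `check_sequences` means the file passed these particular checks; in a real pipeline the collected messages would feed the error report sent to the submitter.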
gwh releases sequence data on a user-specified date, unless a paper citing the sequence or accession number is published before that date, in which case the sequence is released immediately. for released data, gwh generates web pages containing two primary tables: genome and assembly. the former shows species taxonomy information and genome assemblies, and the latter contains general information about the assembly (including external links to other related resources) and statistics of the genome assembly and its corresponding annotation. all released data are publicly available at the gwh ftp site (ftp://download.big.ac.cn/gwh/). gwh provides data visualization for both genome sequence and genome annotation using jbrowse [ ]. it offers statistics and charts on total holdings, assembly levels, genome representations, citing articles, submitting organizations, sequencing platforms, assembly methods, and downloads. gwh provides user-friendly web interfaces for data browsing and querying using big search [ ] to help users find any released data of interest. for a released genome assembly, gwh also provides machine-readable apis (application programming interfaces) for publicly sharing and automatically obtaining information on its associated bioproject, biosample, genome, and assembly metadata and file paths.
global sharing of sars-cov-2 and coronavirus genomes
during the covid-19 outbreak, gwh, in support of the 2019 novel coronavirus resource (2019ncovr) [ , ], has received worldwide submissions of more than a thousand sars-cov-2 genome assemblies with standardized genome annotations [ ], and has released of them. to expand the international reach of the data, of the released sequences have been shared, with the submitters’ permission, in genbank [ ] through a data exchange mechanism established with ncbi. in this model, gwh accessions are represented as secondary accessions in ncbi genbank records, which are retrievable through the ncbi entrez system. this model sets a good example for data sharing among different data centers. in addition, gwh offers sequences of the coronaviridae family so that researchers can conveniently access the data and study the relationship between sars-cov-2 and other coronaviruses. to promote data sharing and make all relevant information on the coronaviridae readily available, gwh integrates genomic and proteomic sequences as well as their metadata from ncbi [ ], china national genebank database (cngbdb) [ ], national microbiology data center (nmdc) [ ] and cncb-ngdc. duplicated records from different sources are identified and removed to obtain a non-redundant dataset. as of december , , the dataset has , nucleotide and , protein sequences of the coronaviridae. filters are implemented to narrow down the coronaviridae sequences using multiple conditions, including country/region, host, isolation source, length, and collection date. both the metadata and sequences of the filtered results can be selected and downloaded as a separate file. the daily updated sequences and all sequences can also be downloaded by ftp (ftp://download.big.ac.cn/genome/viruses/coronaviridae/).
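The two operations described above, removing duplicated records gathered from several sources and filtering the remaining records on metadata, can be sketched as follows. The text does not say how duplicates are identified, so this Python sketch assumes, purely for illustration, that two records are duplicates when their accessions match case-insensitively; the record fields, function names, and parameters are likewise hypothetical and cover only some of the listed filter conditions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Hashable, Iterable, List, Optional


@dataclass(frozen=True)
class SequenceRecord:
    accession: str
    source: str            # originating database, e.g. "ncbi" or "cngbdb"
    country: str
    host: str
    length: int
    collection_date: date


def deduplicate(
    records: Iterable[SequenceRecord],
    key: Callable[[SequenceRecord], Hashable] = lambda r: r.accession.upper(),
) -> List[SequenceRecord]:
    """Keep the first record seen for each key (assumed: the accession)."""
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


def filter_records(
    records: Iterable[SequenceRecord],
    country: Optional[str] = None,
    host: Optional[str] = None,
    min_length: Optional[int] = None,
    collected_from: Optional[date] = None,
) -> List[SequenceRecord]:
    """Apply every condition that is not None; a record must satisfy all."""
    selected = []
    for r in records:
        if country is not None and r.country != country:
            continue
        if host is not None and r.host != host:
            continue
        if min_length is not None and r.length < min_length:
            continue
        if collected_from is not None and r.collection_date < collected_from:
            continue
        selected.append(r)
    return selected
```

Passing a different `key` function (for example, one based on a hash of the sequence itself) changes what counts as a duplicate without touching the rest of the code.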
data statistics
by december, , gwh has received , direct submissions covering a broad diversity of species (table ) with different assembly levels (figure ). these genome assemblies link to bioprojects and , biosamples, and are submitted by submitters from institutions (including international submitters from countries). there are a total of released submissions, which were reported in articles from journals. gwh has over , visits from countries/regions, with ~ , downloads. the amount of data, visits, and downloads in gwh has increased dramatically over the past years, clearly showing its utility in genome-scale data management.
summary and future directions
collectively, gwh is a user-friendly portal for genome data submission, release, and sharing, supported by a matched series of services. the rapid growth of genome assembly submissions demonstrates the great potential of gwh as an important resource for accelerating genomic research worldwide. with the aim of fully realizing the findability, accessibility, interoperability, and reusability (fair) of genome data [ ], gwh has made ongoing efforts, including but not limited to improvement of web interfaces for data submission, presentation, and visualization, continuous integration of newly sequenced genomes, and development of useful online tools to help users analyse genome data (such as blast [ ]).
we will therefore put more effort into providing genome annotation services, especially for bacterial and archaeal genomes, given that uniform, standardized annotation determines the accuracy of downstream data analysis. in addition, we will expand the coronaviridae dataset to other important pathogens to improve the capacity for public health emergency response. finally, we plan to share and exchange all public genome assembly data with the insdc members to provide comprehensive data for researchers globally.
credit author statement
meili chen: methodology, software, investigation, data curation, writing - original draft, project administration. yingke ma: software, writing - original draft. song wu: software, data curation. xinchang zheng: data curation. hongen kang: software. jian sang: investigation, data curation. xingjian xu: software. lili hao: investigation. zhaohua li: data curation. zheng gong: data curation. jingfa xiao: writing - review & editing. zhang zhang: writing - review & editing. wenming zhao: writing - review & editing. yiming bao: conceptualization, writing - review & editing, supervision.
competing interests
the authors have declared no competing interests.
acknowledgments
we thank profs. jingchu luo and weimin zhu for their helpful suggestions and a number of users for reporting bugs and sending comments. we also thank the ncbi genbank group, especially ilene mizrachi, karen clark, mark cavanaugh, and linda yankie, for their valuable advice on sequence contamination scanning and sars-cov-2 sequence exchange.
this work was supported by strategic priority research program of chinese academy of sciences [xdb and xdb to yb; xdb to wz; xdb to jx; xda to zz]; national key research and development program of china [ yfe to yb; yfc , yfd , yfc , and yfc to wz; yfc to zz]; the th five-year informatization plan of chinese academy of sciences [xxh - to yb]; genomics data center construction of chinese academy of sciences [xxh- - to yb]; open biodiversity and health big data initiative of iubs [to yb]; the professional association of the alliance of international science organizations [anso-pa- - to yb]; national natural science foundation of china [ and to zz]; international partnership program of the chinese academy of sciences [ f kysb to zz].
orcid
orcid: - - - (chen meili)
orcid: - - - (ma yingke)
orcid: - - - x (wu song)
orcid: - - - x (zheng xinchang)
orcid: - - - (kang hongen)
orcid: - - - (sang jian)
orcid: - - - (xu xingjian)
orcid: - - - (hao lili)
orcid: - - - (li zhaohua)
orcid: - - - (gong zheng)
orcid: - - - (xiao jingfa)
orcid: - - - (zhang zhang)
orcid: - - - (zhao wenming)
orcid: - - - (bao yiming)
references
[ ] liu y, du h, li p, shen y, peng h, liu s, et al.
pan-genome of wild and cultivated soybeans. cell ; : - .e .
[ ] guan y, chen m, ma y, du z, yuan n, li y, et al. whole-genome and time-course dual rna-seq analyses reveal chronic pathogenicity-related gene dynamics in the ginseng rusty root rot pathogen ilyonectria robusta. sci rep ; : .
[ ] li r, liang f, li m, zou d, sun s, zhao y, et al. methbank . : a database of dna methylomes across a variety of species. nucleic acids res ; :d –d .
[ ] xiong z, li m, yang f, ma y, sang j, li r, et al. ewas data hub: a resource of dna methylation array data and metadata. nucleic acids res ; :d –d .
[ ] song s, tian d, li c, tang b, dong l, xiao j, et al. genome variation map: a data repository of genome variations in big data center. nucleic acids res ; :d –d .
[ ] tang b, zhou q, dong l, li w, zhang x, lan l, et al. idog: an integrated resource for domestic dogs and wild canids. nucleic acids res ; :d –d .
[ ] mcbeath j, mcbeath jh. biodiversity conservation in china: policies and practice. journal of international wildlife law & policy ; : – .
[ ] fan h, wu q, wei f, yang f, ng bl, hu y. chromosome-level genome assembly for giant panda provides novel insights into carnivora chromosome evolution. genome biol ; : .
[ ] xia q, zhou z, lu c, cheng d, dai f, li b, et al. a draft sequence for the genome of the domesticated silkworm (bombyx mori). science ; : – .
[ ] lin t, xu x, ruan j, liu sz, wu sg, shao xj, et al. genome analysis of taraxacum kok-saghyz rodin provides new insights into rubber biosynthesis. natl sci rev ; : – .
[ ] li c, song w, luo y, gao s, zhang r, shi z, et al. the huangzaosi maize genome provides insights into genomic variation and improvement history of maize. mol plant ; : – .
[ ] arita m, karsch-mizrachi i, cochrane g. the international nucleotide sequence database collaboration. nucleic acids res ; :d –d .
[ ] cncb-ngdc members and partners. database resources of the national genomics data center, china national center for bioinformation in .
nucleic acids res ; :d –d .
[ ] buels r, yao e, diesh cm, hayes rd, munoz-torres m, helt g, et al. jbrowse: a dynamic web platform for genome visualization and analysis. genome biol ; : .
[ ] zhao wm, song sh, chen ml, zou d, ma ln, ma yk, et al. the 2019 novel coronavirus resource. yi chuan ; : – .
[ ] song s, ma l, zou d, tian d, li c, zhu j, et al. the global landscape of sars-cov-2 genomes, variants, and haplotypes in 2019ncovr. genomics, proteomics & bioinformatics . [doi: https://doi.org/ . /j.gpb. . . ]
[ ] shean rc, makhsous n, stoddard gd, lin mj, greninger al. vapid: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to ncbi genbank. bmc bioinformatics ; : .
[ ] sayers ew, cavanaugh m, clark k, ostell j, pruitt kd, karsch-mizrachi i. genbank. nucleic acids res ; :d –d .
[ ] sayers ew, beck j, bolton ee, bourexis d, brister jr, canese k, et al. database resources of the national center for biotechnology information. nucleic acids res ; :d –d .
[ ] chen fz, you lj, yang f, wang ln, guo xq, gao f, et al. cngbdb: china national genebank database. yi chuan ; : – .
[ ] wu l, sun q, desmeth p, sugawara h, xu z, mccluskey k, et al. world data centre for microorganisms: an information infrastructure to explore and utilize preserved microbial strains worldwide. nucleic acids res ; :d –d .
[ ] zhang z, song s, yu j, zhao w, xiao j, bao y. the elements of data sharing. genomics proteomics bioinformatics ; : – .
[ ] altschul sf, madden tl, schaffer aa, zhang j, zhang z, miller w, et al.
gapped blast and psi-blast: a new generation of protein database search programs. nucleic acids res ; : – .
figure legends
figure data model in gwh
genome assembly accession number is prefixed with "gwh", followed by four capital letters (represented by xxxx) and zeros. for genome sequence accessions, eight digits increase in order. for gene sequence, transcript sequence, and protein sequence accessions, g, t, and p follow the gwh prefix, respectively, with six digits at the end that increase in order.
figure major components in gwh data processing workflow
figure statistics of genome assembly in gwh (as of december , )
tables
table total data holdings in gwh
columns: status; type; animals; plants; fungi; bacteria; archaea; viruses; metagenomes; others; total
rows: released (assembly, species); unpublic (assembly, species); total (assembly, species)
genome-wide prediction and integrative functional characterization of alzheimer’s disease-associated genes
cui-xiang lin#, hong-dong li#, chao deng, weisheng liu, shannon erhardt, fang-xiang wu, xing-ming zhao, jun wang, daifeng wang, bin hu*, jianxin wang*
hunan provincial key lab on bioinformatics, school of computer science and engineering, central south university, changsha, hunan, p.r. china
department of pediatrics, mcgovern medical school, the university of texas health science center at houston, houston, tx, usa
division of biomedical engineering, university of saskatchewan, saskatoon, sk, canada
institute of science and technology for brain-inspired intelligence, fudan university, shanghai, china
key laboratory of computational neuroscience and brain-inspired intelligence, ministry of education, china
department of biostatistics and medical informatics, university of wisconsin-madison, madison, wi, usa
waisman center, university of wisconsin-madison, madison, wi, usa
institute of engineering medicine, beijing institute of technology, beijing, china
# authors contributing equally
*correspondence: bh@bit.edu.cn; jxwang@mail.csu.edu.cn
abstract
the mechanism of alzheimer’s disease (ad) remains elusive, partly due to the incomplete identification of risk genes. we developed an approach to predict ad-associated genes by learning the functional pattern of curated ad-associated genes from brain gene networks.
we created a pipeline to evaluate disease-gene association by interrogating heterogeneous biological networks at different molecular levels. our analysis showed that top-ranked genes were functionally related to ad. we identified gene modules associated with ad pathways, and found that top-ranked genes were correlated with both neuropathological and clinical phenotypes of ad on independent datasets. we also identified potential causal variants for genes such as fyn and prkar a by integrating brain eqtl and atac-seq data. lastly, we created the alzlink web interface, enabling users to exploit the functional relevance of predicted genes to ad. the predictions and pipeline could become a valuable resource to advance the identification of therapeutic targets for ad.

keywords: alzheimer’s disease; disease gene prediction; functional gene networks

introduction

alzheimer’s disease (ad) is a complex and progressive neurodegenerative disorder that accounts for the majority of all dementia cases. its clinical symptoms include progressive memory loss, personality change, and impairments in thinking, judgment, language, problem-solving, and movement. the two neuropathological hallmarks of ad are extracellular amyloid-β (aβ) plaques and intracellular neurofibrillary tangles (nfts), which are known to contribute to the degradation and death of neurons in the brain. the number of patients with ad worldwide is currently rising.
specifically, it is estimated that approximately million people are currently living with ad or other forms of dementia, and this number is expected to increase to over million by . ad not only causes suffering in both patients and their families but also places a severe burden on society. however, drug development for ad is progressing slowly, partly due to an incomplete understanding of the neuropathological mechanisms.

ad is partly caused by genetic mutations. its two subtypes, i.e., early-onset ad (eoad, onset age before years) and late-onset ad (load, onset age later than years), have different genetic risk factors. in eoad, rare mutations in app, psen and psen have been identified. load is markedly more complex, with apoe being a well-known risk gene for this subtype. most known or putative ad-associated genes were discovered through genome-wide association studies (gwas). previously, gwas identified clu, cr , and picalm, along with approximately more genes. in addition, network approaches are used to identify ad-associated molecular networks or pathways. for example, a module-trait network approach was proposed and applied to identify gene coexpression modules that were associated with cognitive decline, while a large-scale proteomic analysis identified an energy metabolism-linked protein module that was strongly associated with ad pathology. however, a large proportion of the phenotypic variance in ad cannot be explained by known risk genes, which suggests that additional ad-associated genes remain to be discovered. since experimental approaches are often time consuming and expensive, computational approaches provide a promising alternative for discovering ad-associated genes.
previous studies have shown that functional gene networks (fgns) are promising for predicting disease-associated genes. in an fgn, a node represents a gene and the edge between two genes represents the co-functional probability (cfp) that the two genes participate in the same biological process or pathway. for example, guan et al. constructed a global (i.e., non-tissue specific) fgn for mice, and identified timp and abcg as two novel genes associated with bone-mineral density. using the same network, recla et al. discovered hydin as a new thermal pain gene. because gene interactions might be rewired in different tissues, global networks cannot reveal the differences of gene networks among tissues. to address this limitation, tissue-specific networks have been proposed to more accurately capture gene interactions in tissues. greene et al. established human tissue-specific networks and investigated these networks for the interpretation of gene functions and diseases. using the brain-specific network, krishnan et al. predicted disease genes for autism spectrum disorder. by leveraging the functional genomic data of model species with similar genetic backgrounds, including mice and rats, a human brain-specific network was constructed, and its application to the identification of brain disorder-associated genes was illustrated in our previous work.

because ad is a brain disorder with genetic contributions, we hypothesized that brain-specific fgns are informative for predicting ad-associated genes. it should be pointed out that our predictions of ad-associated genes do not indicate any causality, that is, the predicted genes may be either directly or indirectly associated with ad. to build models for ad-associated gene prediction, we first compiled ad-associated genes from multiple resources.
these genes were used as positives for training models. we proposed a functional enrichment-based approach to identify negative genes that are not likely associated with ad. next, we obtained ten brain-specific fgns from the giant and baihui databases. after assessing the predictivity of each network by cross-validation of state-of-the-art machine learning models, we built a final model for predicting ad-associated genes through an optimal selection of networks and machine learning methods. we scored all the other human genes that were not used in model training for their association with ad. we created a pipeline to evaluate top-ranked novel candidate genes by interrogating multiple biological networks. we then identified gene modules from an ad-related network. we assessed the association of these modules and top-ranked genes with ad-related phenotypes, including consortium to establish a registry for alzheimer’s disease (cerad) score, braak stage, and clinical dementia rating (cdr) on an independent dataset. we next identified a set of genes by combining our predictions and seven types of genomic evidence. we further identified potential variants that may affect the expression of prioritized genes. lastly, we developed the alzlink web interface to enable the exploitation of predicted ad-associated genes. the resulting predictions and pipeline could be valuable to advance the identification of risk genes for ad.

results

prediction of ad-associated genes

our approach leverages machine learning and a brain fgn to predict ad-associated genes.
the approach consists of three main components: compilation of ad-associated (positive) and non-ad (negative) genes, construction of a feature matrix based on a brain fgn, and prediction of ad-associated genes using machine learning models (fig. ).

we first compiled a set of ad-associated genes and non-ad genes to train models (see the methods section; supplementary note ). we showed that the negative genes were superior to those selected by the random sampling approach (supplementary fig. ) and that the negative genes were poorly associated with ad (supplementary fig. ). in addition, we tested their enrichment in three ad-related gene sets associated with cognitive decline (the m module with genes), amyloid-beta ( genes), and tau pathology ( genes), respectively, from two recent studies. the results showed that the negative genes were not enriched in any of the three modules or pathways (p-values = . , , respectively).

next, we extracted a feature matrix for the positive and negative genes based on fgns. for each gene (positives, negatives, or the other genes), its cfps with the positive genes in the network were collected into a -dimensional feature vector. we considered the collected brain fgns (nine from giant and one from baihui) and evaluated their ability to predict ad-associated genes using state-of-the-art machine learning methods, including lr, svm, rf, and extratrees, which were shown to be promising in a previous study.
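the feature construction described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the authors' implementation: the network is a toy dictionary of edge weights, the gene names `GENE_X` and `GENE_Y` are hypothetical, and treating a missing edge (including a gene's edge to itself) as a cfp of 0.0 is an assumption that the text does not specify.

```python
from typing import Dict, FrozenSet, List

# Toy network (hypothetical values): each undirected edge maps to a
# co-functional probability (CFP) between two genes.
Network = Dict[FrozenSet[str], float]


def cfp(network: Network, a: str, b: str, default: float = 0.0) -> float:
    """Return the CFP between genes a and b; missing edges get `default` (an assumption)."""
    return network.get(frozenset((a, b)), default)


def feature_vector(network: Network, gene: str, positives: List[str]) -> List[float]:
    """One feature per positive gene: the CFP between `gene` and that positive gene."""
    return [cfp(network, gene, p) for p in positives]


def feature_matrix(network: Network, genes: List[str], positives: List[str]) -> Dict[str, List[float]]:
    """Build the gene -> feature vector mapping for any list of genes."""
    return {g: feature_vector(network, g, positives) for g in genes}


toy_network: Network = {
    frozenset(("APP", "GENE_X")): 0.9,
    frozenset(("APOE", "GENE_X")): 0.4,
    frozenset(("APP", "APOE")): 0.7,
    frozenset(("APP", "GENE_Y")): 0.1,
}
positives = ["APP", "APOE"]
matrix = feature_matrix(toy_network, ["GENE_X", "GENE_Y", "APP"], positives)
for gene, vector in matrix.items():
    print(gene, vector)
```

the vector length equals the number of positive genes, so every gene, whether positive, negative, or unlabeled, is described in the same feature space.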
we found that the network in the baihui database achieved the best performance based on the four methods tested and that extratrees performed better than the other methods in terms of both the area under the receiver operating characteristic curve (auroc) and the area under the precision-recall curve (auprc) (fig. a; supplementary fig. - ). finally, we selected this network in combination with extratrees to construct the model for predicting ad-associated genes.

we performed five-fold cross-validation with extratrees. each of the five models established during cross-validation was used to score all other human genes that were not included in the training dataset. to achieve robust predictions, we repeated the cross-validation times and calculated an average score for each gene. the average auroc and auprc based on cross-validation are . and . , respectively, suggesting the model is accurate. a higher score indicates that a gene is more likely to be associated with ad. the scores for predicted genes are provided in our developed web interface (www.alzlink.com). our literature search showed that of the top-ranked genes were likely associated with ad with some evidence (supplementary table ), suggesting that our model has captured a molecular signature of ad and makes confident predictions. note that our prediction of ad-associated genes was based only on the machine learning model; the subsequent analyses, such as enrichment, coexpression, and ppi relatedness, were used separately to evaluate the association of predicted genes with ad.
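the scoring scheme above (train one model per fold, let each fold model score every unlabeled gene, repeat, then average) can be sketched as follows. this is a hedged illustration only: to stay runnable with the standard library, the classifier is a nearest-centroid stand-in rather than extratrees, the folds are not stratified, the held-out fold is not used to compute auroc or auprc, and `repeats=3`, the toy vectors, and the gene names are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
# A fit_predict callable trains on labelled vectors and returns one score per query vector.
FitPredict = Callable[[List[Vector], List[int], List[Vector]], List[float]]


def centroid_scorer(train_x: List[Vector], train_y: List[int], query_x: List[Vector]) -> List[float]:
    """Stand-in model (NOT ExtraTrees): distance to the negative centroid minus
    distance to the positive centroid, so larger means more positive-like."""
    def centroid(label: int) -> List[float]:
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        return [mean(col) for col in zip(*rows)]

    def dist(a: Vector, b: Vector) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    pos, neg = centroid(1), centroid(0)
    return [dist(q, neg) - dist(q, pos) for q in query_x]


def repeated_cv_scores(x: List[Vector], y: List[int], unlabeled: Dict[str, Vector],
                       fit_predict: FitPredict, k: int = 5, repeats: int = 3,
                       seed: int = 0) -> Dict[str, float]:
    """Average, over all k * repeats fold models, the score given to each unlabeled gene."""
    rng = random.Random(seed)
    genes = list(unlabeled)
    collected: Dict[str, List[float]] = {g: [] for g in genes}
    for _ in range(repeats):
        order = list(range(len(x)))
        rng.shuffle(order)                      # new random fold assignment per repeat
        for fold in range(k):
            held_out = set(order[fold::k])      # indices left out of this fold's training set
            train_idx = [i for i in order if i not in held_out]
            scores = fit_predict([x[i] for i in train_idx],
                                 [y[i] for i in train_idx],
                                 [unlabeled[g] for g in genes])
            for g, s in zip(genes, scores):
                collected[g].append(s)
    return {g: mean(v) for g, v in collected.items()}


# Toy data: five positive-like and five negative-like training genes.
x = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.95, 0.9], [0.7, 0.8],
     [0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.05, 0.1], [0.3, 0.2]]
y = [1] * 5 + [0] * 5
unlabeled = {"GENE_X": [0.8, 0.8], "GENE_Y": [0.2, 0.2]}
avg = repeated_cv_scores(x, y, unlabeled, centroid_scorer)
print(avg)
```

because the model is passed in as a callable, a real classifier could be swapped in without changing the fold and averaging logic.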
the top-ranked genes are functionally related to ad based on multiple lines of genomic evidence

the top-ranked genes are enriched in ad-associated functions and phenotypes

we hypothesized that genes with higher scores are more likely to be enriched in ad phenotype-related gene sets. to test this hypothesis, we excluded all genes in the training dataset, ranked the remaining ones based on their scores, and tested their enrichment in ad-related gene sets. we collected four gene sets associated with ad pathology. the first gene set was collected from alzgene, which contained genes. the other three gene sets, namely, the learning or memory pathway ( genes), the cognition pathway ( genes), and the amyloid-beta related pathway ( genes), were collected from the gene ontology (go) database. using the decile enrichment test (see the methods section), we observed that the top-ranked genes were significantly enriched in the four gene sets: alzgene (p-value = . × - ), learning or memory pathway (p-value = . × - ), cognition pathway (p-value = . × - ), and amyloid-beta pathway (p-value = . × - ) (fig. b).

we next tested whether the top-ranked genes were functionally similar to ad-associated genes. from the ranked genes, we selected the same number of top-ranked genes as the curated positive genes (n= ). we then performed go enrichment analysis of both the curated positive genes and the top-ranked genes using panther.
the known positive genes and our predicted ad-associated genes were enriched in and terms, respectively, with of these terms being shared, which was significant compared with the baseline in that no more than pathway was shared (p-value < . ). the most significant shared terms are listed in supplementary table . we found that many known ad-related functions, including learning or memory, cognition, regulation of endocytosis, regulation of immune system process, regulation of cell death, and regulation of amyloid-beta formation, were shared pathways, implying that our predicted genes might be involved in ad pathology. specifically, we tested whether the top-scored genes (score > . ) were involved in neuron development. based on go enrichment analysis, we found that they were enriched in both neuron development (go: ) (fdr = . × - ) and central nervous system neuron development (go: ) (fdr = . × - ).

we further tested whether the top-ranked genes overlap with gene modules that were associated with ad in published studies. a recent study identified gene coexpression modules that were related to ad. module (m ) containing genes was most strongly associated with cognitive decline. genes overlapped with the brain fgn used in our work and therefore had predicted scores. we found that genes in m were among the top-scored genes (score > . ), which was significant compared to the random baseline (p < . ). we also obtained two gene sets from another recently published network association study on ad.
for protein phosphorylation events in ad, the study derived kinases which were possibly implicated in ad, with kinases having scores > . . among the genes in the amyloid-beta correlated cascade reported by the authors (after removing clu because it is in the training set), nine had scores > . . these results provide additional evidence that our predicted genes are associated with ad.

the top-ranked genes show higher sequence similarity with ad-associated genes

we evaluated whether the sequences of the top-ranked genes were similar to those of ad-associated genes using the sequence similarity method (see the methods section). let k∊[ , , ] denote the number of top-ranked genes for testing. we found that the top-ranked genes had significantly higher sequence similarity with ad-associated genes than randomly selected genes (p-value < . , supplementary fig. ). taking the top-ranked genes as an example (fig. c), the standardized seqsim-score was . , which was significantly higher than that of the randomly selected genes (seqsim-score = - . ). the sequence similarity implies functional similarity between predicted and known ad-associated genes.

the top-ranked genes are coexpressed with ad-associated genes

for the top-ranked k∊[ , , ] genes, we showed that they were coexpressed with more ad-associated genes than the random baseline on the independent mayo rna-seq dataset (p-value < . ) (supplementary fig. ; see methods).
for example, the number of coexpressed gene pairs between the top-ranked genes and the ad-associated genes was significantly higher than that of randomly selected genes (p-value < . , fig. c), suggesting an association of our top predicted genes with ad.

the top-ranked genes interact strongly with ad-associated genes in ppi networks

we hypothesized that the top-ranked k genes were more likely to interact with ad-associated genes if the prediction is accurate. we obtained ppi networks from two databases: huri and string (see methods). to avoid circularity, we removed from the two databases those interactions that were used to construct the brain fgn. we found that the top-ranked k∊[ , , ] genes showed significantly more interactions with ad-associated genes (p-value < . , supplementary fig. ). taking the top-ranked genes as an example, the total number of interactions with ad-associated genes was in huri, whereas only interactions were found for the randomly selected genes (p-value < . , fig. c).

the top-ranked genes are associated with ad based on mirna-target networks

mirnas are important post-transcriptional regulators and have been implicated in ad. we investigated whether top-ranked genes were functionally related to ad-associated genes or mirnas. first, we observed that they shared more mirnas with ad-associated genes than randomly selected genes (supplementary fig. ; methods). for instance, the top-ranked genes shared a significant number of mirnas with ad-associated genes (fig. c, p-value < . ).
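the comparisons against randomly selected genes used in the sections above share one pattern: compute a statistic for the top-ranked genes, recompute it for many random gene sets of the same size, and report how often the random sets match or exceed it. the sketch below illustrates that pattern for the ppi interaction count. it is an assumption-laden illustration, not the procedure from the methods section: the background list, the toy edges, `n_random=1000`, and the add-one p-value correction are all hypothetical choices.

```python
import random
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def count_interactions(genes: Iterable[str], ad_genes: Set[str], edges: List[Edge]) -> int:
    """Number of edges that link a gene in `genes` to an AD-associated gene."""
    gene_set = set(genes)
    return sum(1 for a, b in edges
               if (a in gene_set and b in ad_genes) or (b in gene_set and a in ad_genes))


def empirical_p_value(top_genes: List[str], background: List[str], ad_genes: Set[str],
                      edges: List[Edge], n_random: int = 1000, seed: int = 0) -> Tuple[int, float]:
    """Return the observed count and the fraction of random gene sets reaching it."""
    rng = random.Random(seed)
    observed = count_interactions(top_genes, ad_genes, edges)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(background, len(top_genes))   # same size as the top-ranked set
        if count_interactions(sample, ad_genes, edges) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_random + 1)


# Toy data: G0..G2 are the "top-ranked" genes and are densely linked to AD genes.
ad_genes = {"APP", "MAPT"}
background = [f"G{i}" for i in range(50)]
edges = [(g, ad) for g in ("G0", "G1", "G2") for ad in ("APP", "MAPT")] + [("G10", "APP")]
observed, p_value = empirical_p_value(["G0", "G1", "G2"], background, ad_genes, edges)
print(observed, p_value)
```

the same function shape applies to the coexpression and mirna-sharing comparisons if `count_interactions` is replaced by the corresponding statistic.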
second, we found that the top-ranked genes interacted with a significant number of ad-associated mirnas (fig. c; supplementary fig. ). these results imply that top-ranked genes are likely to be involved in post-transcriptional regulatory pathways associated with ad.

ad-related regulatory networks reveal hub genes and hub mirnas associated with ad

we constructed two regulatory networks. one is a transcriptional regulatory network (trn) extracted from the trrust database (version . ) that included only known and top-ranked ad-associated genes (fig. a and the methods section). from this network, we identified hub genes based on outdegrees and indegrees. genes with outgoing edges (outdegree) represent transcription factors (tfs), and genes with incoming edges (indegree) represent target genes. the other regulatory network is a mirna-target interaction network (fig. b) extracted from mirtarbase (version . ) by considering only ad-associated genes and mirnas (methods).

we found that the hub genes in the ad-related trn were supported by the literature and interaction evidence (table ). for example, rela regulates ad-associated genes including apoe and bace , interacts with ad-associated genes in ppi networks, and is coexpressed with ad-associated genes. furthermore, rela was shown to be associated with neuroprotection, learning, and memory. another hub gene is jun. it regulates known ad-associated genes such as app, bcl , relb, and plau, and interacts with the proteins encoded by ad-associated genes such as ms a and
in addition, jun is responsible for aβ-induced neuroinflammation through a signaling pathway.

we identified genes such as ccnd and cdkn a as hubs in the mirna-based regulatory network (fig. b). although some studies have reported their associations with ad, the mechanisms underlying these associations are not well understood. these genes might contribute to ad by perturbing the post-transcriptional regulatory network mediated by mirnas (table and fig. b). for example, ccnd was associated with mirnas that also bind to known ad-associated genes, including six mirnas (mir- - p, mir- b- p, mir- a- p, mir- a- p, mir- - p and mir- - p) that bind to app and four mirnas (mir- b- p, mir- - p, mir- c- p and mir- - p) that bind to bace . in addition, knockout experiments of ccnd showed its protective role in neurodegeneration in the hippocampus. comparing the two networks focusing on only predicted (fig. b) and known (fig. c) ad-associated genes, we observed hub mirnas such as mir- b- p, mir- b- p, mir- - p, mir- - p, and mir- b- p that were shared between them, indicating that the shared mirnas might play roles in the pathology of ad.

gene modules in the integrated gene interaction network are associated with ad-related functions, neuropathological and clinical phenotypes in independent data

we constructed an integrated gene interaction network by aggregating multiple lines of genomic evidence and identified four gene modules with a community cluster algorithm (methods). the modules (denoted by m , m , m , and m ) are shown in fig. (the genes in each module are provided in supplementary table ). for each module, we performed
enrichment analysis using panther and identified the significantly enriched biological process terms (fdr < . ). as many of the enriched terms were redundant, we selected representative go terms with revigo. all four modules were enriched in ad-associated biological processes (fig. ). for example, m was enriched in regulation of cell death and regulation of neurogenesis; m was enriched in functions including response to amyloid-beta; m was enriched in learning or memory and regulation of synaptic plasticity; m was enriched in functions such as regulation of lipid transport and cholesterol efflux. these enrichments imply that the gene modules are not only biologically meaningful but also related to ad.

next, we tested whether the modules were correlated with ad-related traits using a well-established method. for each module, we extracted the gene expression matrix containing only the genes in that module. we then computed the eigengene (i.e., the first principal component) of the expression matrix and correlated the eigengene with the ad-related traits of interest. we performed this analysis on the independent msbb rna-seq dataset with data available for three traits: the cerad, braak and cdr scores. we conducted a total of twelve correlation tests resulting from all combinations of the four modules and the three traits. we found that the results of all correlation tests were significant (fdr < . ), suggesting that our identified modules were associated with ad traits. taking the eigengene of m as an example, it was significantly correlated with the cerad (r = - . , fdr = . × - ), braak (r = - . , fdr = . × - ), and cdr scores (r = - . , fdr = . × - ) (figure b). another example was m , whose eigengene was significantly correlated with the three traits (figure b). the correlations of m and m with the ad-related traits are provided in supplementary fig. .
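the module-trait test described above rests on two steps: summarizing a module's expression matrix by its first principal component (the eigengene) and correlating that summary with a trait. the following is a minimal standard-library sketch of those two steps, assuming a samples-by-genes matrix with no constant genes. power iteration is used here purely for self-containment and is not claimed to be the method used in the paper; the toy expression values and trait scores are hypothetical, and the sign of a principal component is arbitrary, so only the magnitude of the correlation is meaningful in this toy.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def standardize_columns(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Z-score each gene (column); a constant gene would divide by zero and should be dropped first."""
    z_cols = []
    for col in zip(*matrix):
        m, s = mean(col), pstdev(col)
        z_cols.append([(v - m) / s for v in col])
    return [list(row) for row in zip(*z_cols)]


def eigengene(matrix: Sequence[Sequence[float]], iterations: int = 200) -> List[float]:
    """Per-sample score on the first principal component, found by power iteration."""
    x = standardize_columns(matrix)
    n_samples, n_genes = len(x), len(x[0])
    v = [1.0 / math.sqrt(n_genes)] * n_genes            # starting loading vector
    for _ in range(iterations):
        scores = [sum(x[i][j] * v[j] for j in range(n_genes)) for i in range(n_samples)]   # X v
        w = [sum(x[i][j] * scores[i] for i in range(n_samples)) for j in range(n_genes)]   # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return [sum(x[i][j] * v[j] for j in range(n_genes)) for i in range(n_samples)]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))


# Toy module: six samples (rows) by three genes (columns), plus one trait score per sample.
expr = [
    [1.0, 2.0, 0.5],
    [2.0, 2.5, 1.0],
    [3.0, 3.9, 1.4],
    [4.0, 4.1, 2.2],
    [5.0, 5.2, 2.4],
    [6.0, 6.1, 3.1],
]
trait = [0, 1, 2, 3, 4, 5]
module_eigengene = eigengene(expr)
r = pearson(module_eigengene, trait)
print(round(r, 3))
```

with four modules and three traits, this correlation would be computed twelve times, which is why the text reports fdr-adjusted significance.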
individual top-ranked genes are associated with neuropathological and clinical phenotypes on independent datasets

we hypothesized that the top-ranked genes were more likely to be associated with ad-related phenotypes if our prediction was accurate. we tested this hypothesis using the independent msbb rna-seq dataset described above. for each gene, we calculated its pcc with the cerad, braak and cdr scores (see the methods section). to better investigate the trends between our prediction and the gene’s absolute correlation with ad-related phenotypes, we ranked all the predicted genes, divided them into groups, and calculated the mean pcc for each group. we found that higher ranks (higher predicted scores) were associated with higher mean pcc values for all three phenotypes. the predicted ranks were well correlated with the cerad (r = . ), braak (r = . ) and cdr (r = . ) scores. the eigengenes for the top-ranked , and genes were all significantly correlated with cerad, braak and cdr scores (supplementary fig. ).

we then examined the correlations of individual top-ranked genes (those not included in the training set) with ad-related phenotypes. among the top-ranked genes, we identified , and genes that were significantly correlated with cerad, braak and cdr scores, respectively (fdr < . ). of them, were correlated with all three phenotypes (supplementary table ). for fyn, the correlations with cerad, braak and cdr scores were . , . and . , while prkar a had pearson correlation coefficients of - . , - . and - . for the three traits, respectively.
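the trend analysis described above (rank the predicted genes, split the ranking into groups, and average the absolute correlation within each group) can be sketched as follows. this is an illustrative sketch only: the number of groups, the gene names, and the correlation values are hypothetical, and equal-sized consecutive groups are an assumption about how the ranking was divided.

```python
import math
from statistics import mean
from typing import Dict, List


def mean_abs_corr_by_bin(ranked_genes: List[str], abs_corr: Dict[str, float], n_bins: int) -> List[float]:
    """Split a best-first ranking into consecutive bins and average |PCC| within each bin."""
    size = math.ceil(len(ranked_genes) / n_bins)
    bins = [ranked_genes[i:i + size] for i in range(0, len(ranked_genes), size)]
    return [mean(abs_corr[g] for g in b) for b in bins]


# Toy ranking (best first) with hypothetical absolute correlations to one phenotype.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
abs_corr = {"G1": 0.5, "G2": 0.4, "G3": 0.3, "G4": 0.3, "G5": 0.1, "G6": 0.1}
bin_means = mean_abs_corr_by_bin(ranked, abs_corr, n_bins=3)
print(bin_means)
```

a decreasing sequence of bin means from the first bin to the last corresponds to the reported trend that higher-ranked genes show stronger correlation with the phenotype.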
these results indicate that our top-ranked genes were likely candidate genes for ad.

multiple evidence-supported ad-associated genes and their regulatory variants

in the above sections, we have shown that the top-ranked genes are associated with ad based on multiple lines of functional genomic evidence. here we performed further screening for ad-associated genes by aggregating this evidence, which is divided into two categories: ( ) molecular interaction evidence reflecting the interaction of predicted genes with compiled ad-associated genes, and ( ) phenotypic correlation evidence supported by correlation of predicted genes with ad traits. the former includes three types of evidence: protein interaction, mrna coexpression, and mirna sharing with ad-associated genes. the latter includes four types of evidence: the correlation with cerad, braak and cdr scores based on the msbb dataset, and differential expression based on the rosmap dataset. to narrow down the predicted candidates, we focused on the top-ranked genes (after excluding the compiled ad-associated genes). the seven types of genomic evidence for these genes are visualized as a circos plot (figure ), from which the evidence for each gene can be easily identified. we also obtained their enriched go biological process terms and showed the functional annotation of these genes (figure ).

we then applied strict criteria on functional evidence to screen for potentially confident ad-associated genes.
that is, for each gene, at most one type of molecular interaction evidence and at most one type of phenotypic correlation evidence were allowed to be missing. from this, out of the top-ranked genes were retained (supplementary table ), providing a set of multiple evidence-based candidate genes to the community for further functional experiments. as the function of a gene is directly related to the cell type it is expressed in, we further investigated the cell type specificity of their expression. zhang et al. provide a set of genes that show cell type-specific expression in five major brain cell types including astrocytes, microglia, endothelial cells, oligodendrocytes and neurons. using this dataset, we found that of the genes showed specific expression in cell types such as astrocytes and microglia (supplementary table ), while the others are expressed in two or more cell types.

taking fyn as an example, it encodes a membrane-associated tyrosine kinase that is implicated in the control of cell growth and shows specific expression in astrocytes (supplementary table ). it interacts with proteins encoded by ad-associated genes such as app and mapt in ppi networks, shows significant coexpression with ad-associated genes like clu, and interacts with ad-associated mirnas like hsa-mir- b. its expression was up-regulated based on the rosmap dataset (posterior error probability (pep) = . ). its up-regulation in ad patients was further supported by the positive correlation with cerad (pcc = . ), braak (pcc = . ) and cdr (pcc = . ) scores (fdr < . ) on the msbb dataset.
the expression of fyn for the sample groups partitioned based on cerad, braak and cdr scores is shown in figure a. prkar a encodes a regulatory subunit of the camp-dependent protein kinases involved in the camp signaling pathway. it is functionally related to ad-associated genes through ppi, coexpression and mirna-target networks, and its expression is negatively correlated with the above three neuropathological traits (figure a). altered expression of prkar a in ad patients was also identified, providing independent evidence supporting our prediction.

having shown that the expression level of the above genes was correlated with ad traits, we next explored which genetic variants (snps) might causally regulate the expression of these genes by integrating genetic and regulatory data. a snp is likely causal if it is not only an eqtl but also resides in a transcription factor binding site (tfbs) within the promoter of the target gene. by integrating eqtl and atac-seq data, we identified seven genes (fyn, prkar a, ppp r , bmpr a, lmna, egfr and kras) whose eqtls are also located in a tfbs (supplementary table ). for instance, the snp rs is an eqtl for the expression of a fyn isoform. further, we found that this snp also resided in the tfbs of multiple transcription factors within the promoter region of fyn, thus likely affecting the binding affinity of the transcription factors and therefore the expression level. as an illustration, rfx _human.h mo. .b, which is a motif representing the tfbs of the transcription factor rfx , harbors the snp rs (figure b).
this evidence suggests that rs is likely a variant causally affecting the expression of fyn. for prkar a, one tfbs in its promoter region harbors its eqtl (rs ) (figure b), indicating that rs is likely a causal variant that regulates the expression of prkar a. to summarize, our integrated analysis of eqtl and tfbs in active promoters suggests potential genetic variants that may be associated with ad through regulating the expression of their corresponding target gene. these results may be valuable to prioritize genes for further experimental studies.

alzlink: a web resource for interrogating ad-associated genes

to facilitate the interrogation of ad-associated genes and the use of the statistical evaluation pipeline developed in this work, we created the interactive web resource alzlink (available at: www.alzlink.com). this site provides the predicted genes along with their predicted scores and functional genomic evidence, helping experts in the field of ad select candidates for further experimental testing. also, the statistical methods to evaluate the association of an individual gene or a gene set with ad are implemented and available as an online pipeline. for an individual gene, users can query its interactions with known ad-associated genes in heterogeneous interaction networks and its correlation with ad-related traits including cerad, cdr and braak scores. for a gene set, users can statistically test its association with ad using the sequence- or network-based methods, which output the distribution of the test metric along with a p-value measuring the significance.
for each interaction network such as ppi, the local network involving the queried gene or gene set and the known ad-associated genes is visualized on the web. the data and pipelines on alzlink could serve as a valuable resource for experts to prioritize ad-associated genes for further testing.

discussion

ad is a neurodegenerative disease with heterogeneous pathologies , , , . however, predicting ad-associated genes is challenging because ad, as a complex disease, is caused mainly by common variants of multiple genes and the disruption of related pathways. fgns are an important model for characterizing complex functional relationships between genes and have been successfully applied to predict candidate genes for complex diseases, including autism and parkinson’s disease . since ad is caused by gene dysregulation in the brain, we considered brain fgns as the basis for predicting ad-associated genes. the key idea of our approach was to discover the pattern of ad-associated genes from a brain fgn using machine learning methods. using our model, we were able to predict novel candidate genes for ad. we evaluated the association of top-ranked genes with ad by investigating their enrichment in ad-related functions and phenotypes along with examining their association with ad through multiple heterogeneous biological networks. we found that the top-ranked genes were associated with ad.
based on the analyses of the independent msbb data, we observed that the top-ranked genes were correlated with ad-related neuropathological (cerad and braak scores) and clinical (cdr) phenotypes, suggesting that they were likely associated with ad. we also explored gene modules from the ad-related network. we found that these modules were enriched in many ad-related pathways and phenotypes and were also correlated with three ad-related phenotypes, indicating their biological relevance. combining the genomic data and our predictions, we identified a set of genes whose association with ad was supported by multiple lines of evidence, marking these genes as promising candidates. we further identified potential causal variants for of the genes by integrating brain eqtl and atac-seq data. our contributions are mainly three-fold. first, we compiled a set of genes that were likely related to ad by performing an intensive, stringent hand curation of multiple resources, providing a potential resource for the community. for negative gene selection, we proposed a pathway-based approach that works by removing any gene that was likely to be associated with ad. thus, the retained negative genes can be expected to have little or no association with ad. we illustrated that this approach helped improve the accuracies of models in terms of both auroc and auprc. our model for predicting ad-associated genes depends on the non-ad (negative) genes. different ways of negative gene selection could lead to bias in the model and thus the prediction. as our method selects negative genes by removing any gene that has a potential association with ad, a possible bias is that the predicted genes are more likely to be functionally related to and share go terms with the compiled ad-associated genes. second, we predicted novel candidate genes and showed that the top-ranked genes exhibit significant associations with ad through functional enrichment analysis and the investigation of multiple biological networks. moreover, the genes were found to be correlated with ad-related phenotypes on independent datasets. taking advantage of the functional genomic data, we identified a set of ad-associated genes supported by multiple lines of evidence, indicating promising candidates. third, we developed alzlink, a web interface to facilitate the use of the data and pipeline developed in this study. it should be pointed out that the pipeline to evaluate the relevance of the predicted genes to ad is generic and can be applied to other diseases. although our predictions are promising, as supported by our systematic analysis, our model for predicting ad-associated genes could be improved in several ways. first, our predictions were made at the gene level without differentiating the splice isoforms generated from the same gene through alternative splicing , . this factor is essential because isoforms of the same gene might have different or even opposite functions. isoforms have been implicated in diseases such as ovarian cancers . the prediction of ad-associated genes at the isoform level could have the potential to promote our understanding of ad. second, the human brain consists of multiple heterogeneous structures, each of which contains many different cell types. the association of the predicted genes with ad in different cell types remains to be resolved. integrating single-cell genomic data , , with our predicted genes could be helpful for addressing this question. lastly, our predictions do not imply causality; the genes predicted using our method are only statistically associated with ad. in summary, we predicted novel ad-associated genes and provided evidence for their association with ad. however, further studies are needed to test the validity of our predictions. this pipeline of prediction and validation is generic and can be readily used for other diseases, such as parkinson’s disease, cancers and heart diseases. we expect that the predicted genes might become a useful resource for experimental testing by the community and that our proposed pipeline could be used in other diseases.

methods

compilation of ad-associated and non-ad genes

ad-associated (positives) and non-ad (negatives) genes are needed to build a machine learning model. first, we performed intensive hand-curation to identify confident ad-associated genes from various disease gene resources, including alzgene , alzbase , omim , disgenet , distild , uniprot , open targets , gwas catalog , differentially expressed genes (degs) in rosmap and published literature. the curated genes from each resource as well as the corresponding criteria are provided in supplementary note . as the ad-associated genes and their reliability vary across these resources, we applied a voting strategy and selected only those that were present in at least two resources to ensure higher reliability (see details in supplementary note ). in this way, we obtained ad-associated genes. second, we selected a set of non-ad genes, which had no or minimal association with ad.
the main idea of our method for non-ad gene selection was to remove any genes that exhibit potential associations with ad. we removed genes that (i) were annotated to the same gene ontology (go) term enriched for the ad-associated genes or (ii) showed any association with ad based on the above-described resources (see details in supplementary note ). in this way, we identified non-ad genes.

model development for predicting ad-associated genes

we first constructed the feature matrix for all human genes based on the brain-specific fgn. this fgn was built by integrating heterogeneous functional genomic data, including gene expression, protein-protein interaction (ppi), protein docking and gene-to-phenotype annotation using the well-established bayesian framework . the bayesian network model predicts a co-functional probability (cfp) for every pair of genes by using the following formula:

$$P(y = 1 \mid F_1, F_2, \ldots, F_n) = \frac{1}{C}\, P(y = 1) \prod_{i=1}^{n} P(F_i \mid y = 1)$$

where $P(y = 1)$ is the prior probability for a sample (i.e. a gene pair in this study) to be positive, $P(F_i \mid y = 1)$, $i = 1, 2, \ldots, n$, is the probability of observing the value of the i-th feature under the condition that the gene pair is functionally related, and $C$ is a constant normalization factor. in the resulting network, a node is a gene, and an edge represents the cfp that the two linked genes participate in the same biological process or pathway. for each gene, we extracted its cfp with the compiled ad-associated genes ( genes) from the network as features based on a previously proposed method . as a result, each gene is characterized by a -dimensional vector.
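as a rough illustration of the co-functional probability formula above, the following python sketch evaluates the naive bayes posterior for a single gene pair. it is a minimal sketch under stated assumptions, not the authors' implementation: the function name `cofunctional_probability`, the use of class-conditional likelihoods for unrelated pairs to obtain the normalization constant, and all numbers are hypothetical.

```python
from math import prod


def cofunctional_probability(prior_pos, lik_pos, lik_neg):
    """Naive Bayes posterior P(y = 1 | F_1, ..., F_n) for one gene pair.

    prior_pos: P(y = 1), the prior that a gene pair is functionally related.
    lik_pos:   P(F_i | y = 1) for the observed value of each feature i.
    lik_neg:   P(F_i | y = 0) for the same observed values. Assumption: the
               normalization constant C is obtained by summing both classes.
    """
    if len(lik_pos) != len(lik_neg):
        raise ValueError("lik_pos and lik_neg must have one entry per feature")
    joint_pos = prior_pos * prod(lik_pos)          # P(y=1) * prod_i P(F_i | y=1)
    joint_neg = (1.0 - prior_pos) * prod(lik_neg)  # same term for y=0
    c = joint_pos + joint_neg                      # normalization constant C
    if c == 0.0:
        raise ValueError("observed features have zero probability under both classes")
    return joint_pos / c


# Hypothetical numbers for a single gene pair with two evidence types.
cfp = cofunctional_probability(prior_pos=0.1, lik_pos=[0.8, 0.6], lik_neg=[0.2, 0.3])
print(round(cfp, 3))  # prints 0.471
```

even with a low prior, two pieces of evidence that are each more likely under the "related" class raise the posterior considerably, which is the behaviour the integration framework relies on.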
the feature data for the training set ( positives and negatives, resulting in a total of genes) are represented by a × matrix x. the label ( for positives and for negatives) of each gene is stored in a vector y. the feature matrix of all other genes not in the training set was extracted in the same way. to develop a model for predicting ad-associated genes, we compared the different combinations of fgns and machine learning models. to identify optimal fgns for feature matrix construction, we obtained ten networks for the whole brain or brain regions, including the brain, forebrain, frontal lobe, temporal lobe, hippocampus, thalamus, amygdala, glia and astrocytes from the giant database and the baihui database . we considered these ten regions because they have been implicated in ad , . as ad-associated genes are likely to operate in immune cells , , we investigated how well immune cells were represented in these networks. as microglia are the dominant immune cells in the brain and cell type-specific genes are indicators of the cell type of interest, we analyzed how microglia-specific genes were represented in these networks. we obtained a set of microglia-specific genes from previous work . we found that more than % of them existed in each of these networks, suggesting that immune cells are well represented in these networks. for the machine learning models, we considered logistic regression (lr), support vector machine (svm), random forest (rf) and extremely randomized trees (extratrees) for their promising accuracy shown in our previous work .
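the construction of the feature matrix and label vector described in this section can be sketched as follows. this is only an illustration: the gene names, the cfp values, the dictionary layout of the network and the 1/0 label coding are assumptions made for the example and are not taken from the study.

```python
def build_feature_matrix(cfp, genes, ad_genes):
    """One row per gene; column j is the CFP between that gene and ad_genes[j].

    cfp: dict mapping frozenset({gene_a, gene_b}) to a co-functional probability.
    Assumption: a missing edge (including a gene paired with itself) counts as 0.0.
    """
    return [[cfp.get(frozenset((g, a)), 0.0) for a in ad_genes] for g in genes]


# Hypothetical toy network: P1 and P2 play the role of compiled AD-associated
# genes, N1 is a non-AD gene, and Q1 is an unlabeled gene to be ranked later.
positives = ["P1", "P2"]
negatives = ["N1"]
cfp = {
    frozenset(("P1", "P2")): 0.9,
    frozenset(("N1", "P1")): 0.1,
    frozenset(("Q1", "P2")): 0.7,
}

X = build_feature_matrix(cfp, positives + negatives, positives)  # training features
y = [1] * len(positives) + [0] * len(negatives)                  # assumed 1/0 labels
X_rest = build_feature_matrix(cfp, ["Q1"], positives)            # genes to be ranked
```

the resulting `X` and `y` have the shape expected by common classifier implementations, so any of the four model families listed above could in principle be fitted on them and then applied to `X_rest`.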
statistical assessment of the relevance of top-ranked genes to ad

we evaluated the relevance of the top-ranked genes to ad using the following methods (the genes in the training set were excluded). these methods are based on the sequence, pathway and various biological networks, as described below.

decile enrichment test for ad pathways and phenotypes

if the prediction is accurate, known ad-related genes are expected to be enriched among the top-ranked genes. using the decile enrichment test proposed in the previous study , we statistically assessed whether a larger proportion of a given ad-related gene set falls into the first decile of the ranked genes. to do so, we excluded the genes in the training set, ranked the remaining genes, and split genes into evenly binned deciles. let p_net and p_random denote the proportion of a given gene set that falls into the first decile based on our prediction and random chance, respectively. we tested whether p_net was significantly larger than p_random by using the binomial test (see details in the previous work ).

evaluation based on sequence similarity

genes with similar sequences are likely to carry out similar functions. for a set of k predicted genes denoted by $G_k$, we evaluate its functional relationship with ad-associated genes using a sequence similarity-based score (seqsim-score), which measures the average similarity between predicted and known ad-associated genes. it is calculated as:

$$\text{seqsim-score}(G_k) = \frac{1}{k} \sum_{i=1}^{k} \max_{g_j \in G_P} \text{score}(g_i, g_j)$$

where $G_P$ denotes the set of compiled positive genes and $\text{score}(g_i, g_j)$ is the sequence identity between a predicted gene $g_i$ and the ad-associated gene $g_j$ calculated using blast . the higher the seqsim-score is, the more similar the predicted genes are to ad-associated genes. seqsim-score was standardized to have zero mean and unit variance using z-transform. for the top-ranked k∊[ , , ] genes, their score is denoted by seqsim-score_observed. in the same way, we also calculated seqsim-score_random for a set of k randomly selected genes. we calculated , such scores from , randomly sampled gene sets. let n_sig denote the number of random scores that are higher than seqsim-score_observed. we computed the p-value as n_sig/ .

evaluation based on coexpression with ad-associated genes

compared to randomly selected genes, reliably predicted genes are more likely to be coregulated with ad-associated genes. based on this hypothesis, we calculated the number of coexpressed gene pairs between top-ranked k genes and known ad-associated genes using independent gene expression data. that is, in each pair, one is a predicted gene and the other is a known ad-associated gene. the coexpression was measured with pearson correlation coefficient (pcc). a gene pair was considered to be coexpressed if the pcc ≥ . . to test whether the coexpression is significant, we generated , gene lists, each containing k randomly sampled genes.
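the seqsim-score and its empirical p-value from the sequence similarity evaluation can be sketched as below. this is a simplified illustration: `identity` is a hypothetical stand-in for a blast-derived sequence identity, the z-transform step is omitted, the number of random gene sets is left as a free parameter, and the toy identities are invented.

```python
import random


def seqsim_score(predicted, positives, identity):
    """Mean over predicted genes of the best identity to any known positive gene."""
    return sum(max(identity(g, p) for p in positives) for g in predicted) / len(predicted)


def empirical_p_value(observed, random_scores):
    """Fraction of random scores that are strictly higher than the observed score."""
    n_sig = sum(1 for s in random_scores if s > observed)
    return n_sig / len(random_scores)


def seqsim_test(top_k, positives, background, identity, n_random, seed=0):
    """Observed seqsim-score and its p-value against random gene sets of equal size."""
    rng = random.Random(seed)
    observed = seqsim_score(top_k, positives, identity)
    random_scores = [
        seqsim_score(rng.sample(background, len(top_k)), positives, identity)
        for _ in range(n_random)
    ]
    return observed, empirical_p_value(observed, random_scores)


# Hypothetical percent identities between candidate genes a..d and positives p, q.
TABLE = {
    ("a", "p"): 50, ("a", "q"): 80,
    ("b", "p"): 20, ("b", "q"): 10,
    ("c", "p"): 5,  ("c", "q"): 5,
    ("d", "p"): 10, ("d", "q"): 0,
}


def identity(gene, positive):
    return TABLE[(gene, positive)]


observed, p_value = seqsim_test(["a", "b"], ["p", "q"], ["a", "b", "c", "d"],
                                identity, n_random=100)
```

in this toy example no random pair of genes can exceed the observed score, so the empirical p-value is zero; with real data the resolution of the p-value is limited by the number of random gene sets drawn.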
we calculated the number of coexpressed gene pairs for the top-ranked genes and for the randomly selected genes, denoted by e_observed and e_random. we calculated the p-value to measure whether e_observed is significantly higher than e_random. we used the mayo rna-seq dataset generated from the accelerating medicines partnership-alzheimer’s disease (amp-ad) project (publicly available at https://www.synapse.org/#!synapse: syn ) for coexpression evaluation. note that this dataset was not used for constructing the brain fgn that was used to build the model for predicting ad-associated genes, so circularity was avoided. this dataset contains gene expression data of the temporal cortex obtained from cases and controls. the log -transformed fragments per kilobase of transcript per million mapped reads (fpkm) values were used for this analysis.

evaluation based on ppi networks

we tested whether the top-ranked k genes were more likely to interact with ad-associated genes in ppi networks. we used the ppi data from human reference interactome (huri) and search tool for the retrieval of interacting genes/proteins (string) . because some ppi data were integrated to build the brain fgn, such ppis were first removed from the two databases to avoid circularity. the interaction data in huri were experimentally identified. in string, a score is used to measure the interaction strength between two proteins; a score > indicates an interaction with high confidence. only the confident interactions were considered. we tested k values in [ , , ].
for a given k value, we computed the number of genes in the top-ranked k genes that interacted with at least one ad-associated gene, denoted by n_observed. similarly, we also calculated n_random, which represents the number of genes in k randomly sampled genes that interacted with at least one ad-associated gene. with the same method described in the previous section, a p-value was calculated to measure the significance.

evaluation based on mirna-target interaction networks

this analysis was motivated by the assumption that top-ranked genes were more likely related to ad-associated genes or mirnas based on mirna-target interaction networks. first, we tested whether top-ranked genes and ad-associated genes share more mirnas. we downloaded mirna-target interaction data from mirtarbase , a high-quality database of validated interactions. we computed the number of shared mirnas of the top-ranked k∊[ , , ] genes with ad-associated genes. based on randomly sampled genes, we calculated a p-value to test whether the number of shared mirnas was significant. second, we tested top-ranked genes for their binding to ad-associated mirnas. we retrieved ad-associated mirnas from the human microrna disease database (hmdd) (v . ). similarly, for the top-ranked k genes, we calculated a p-value to measure their significance of binding to ad-associated mirnas.
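the network-based statistic used for the ppi evaluation (the number of top-ranked genes with at least one ad-associated interaction partner, compared with random gene sets) can be sketched as follows. the adjacency representation, the function names and the toy network are assumptions made for illustration; the same counting scheme would apply to a mirna-target network if `neighbors` mapped genes to their binding mirnas.

```python
import random


def count_interacting(genes, ad_items, neighbors):
    """Number of genes that have at least one neighbor in ad_items.

    neighbors: dict mapping a gene to the set of its interaction partners.
    Genes absent from the network are treated as having no partners.
    """
    ad = set(ad_items)
    return sum(1 for g in genes if neighbors.get(g, set()) & ad)


def interaction_test(top_k, ad_items, neighbors, background, n_random, seed=0):
    """n_observed and the fraction of random gene sets with a strictly higher count."""
    rng = random.Random(seed)
    n_observed = count_interacting(top_k, ad_items, neighbors)
    n_higher = 0
    for _ in range(n_random):
        sample = rng.sample(background, len(top_k))  # k randomly sampled genes
        if count_interacting(sample, ad_items, neighbors) > n_observed:
            n_higher += 1
    return n_observed, n_higher / n_random


# Hypothetical toy network: g1 and g3 each touch an AD-associated gene (a1, a2).
neighbors = {"g1": {"a1"}, "g2": {"x"}, "g3": {"a2", "x"}}
n_obs, p_value = interaction_test(
    ["g1", "g2", "g3"], ["a1", "a2"], neighbors,
    background=["b1", "b2", "b3", "b4"], n_random=50,
)
```

a gene is counted once no matter how many ad-associated partners it has, which matches the "at least one" wording of the statistic.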
construction of ad-related regulatory networks

to analyze the regulatory relationship between the predicted candidates and ad-associated genes and obtain hub genes , , we constructed two ad-related regulatory networks: one was a transcriptional regulation network, the other was a mirna-target interaction network. the human transcriptional regulatory network was downloaded from the transcriptional regulatory relationships unraveled by sentence-based text mining (trrust) database . the full network contains transcription factors (tfs) and target genes. first, we extracted an ad-related transcriptional regulatory network by retaining only the tf-target gene pairs in which one node is a known or predicted ad-associated gene (among the top-ranked ). we identified hub genes according to the outdegree or indegree. for constructing the ad-related mirna-target interaction network, we first collected ad-associated mirnas from an up-to-date review . then from the above-described mirtarbase (version . ), we extracted two networks. one contains only the interactions between ad-associated mirnas and ad-associated genes, and the other contains only the interactions between ad-associated mirnas and predicted ad-associated genes.

identification of gene modules in the integrated network

to better understand the functions of the predicted genes, we constructed an integrated network by aggregating evidence from the brain fgn, ppi, coexpression network, mirna-target network and transcriptional regulatory network. this network included the top-ranked genes and the compiled ad-associated genes.
two genes were connected with an edge if they were direct neighbors in any of the networks above. in detail, all tf-target interactions that satisfy the above condition were extracted from the transcriptional regulatory network in the trrust database . we also included the genes with a cfp ≥ . , and then expanded the resulting network by including other genes that have a cfp ≥ . with at least one known ad-associated gene. from the gene coexpression network, we retained only edges with pccs higher than . . from the ppi network, we included gene pairs whose encoded proteins show interaction in huri or string. for the mirna-target interaction data, we computed a network in which the weight of the edge between two genes was calculated as w = n_share/n_max, where n_share represents the number of mirnas shared by the two genes and n_max = max(n_1, n_2), with n_1 and n_2 denoting the number of mirnas binding to the two genes, respectively. the range of w is from 0 to 1. only interactions with w ≥ . were considered. by applying the glay algorithm implemented in cytoscape to the integrated network, we identified gene modules within which genes were closely connected.

the independent mountain sinai brain bank (msbb) dataset with ad-related neuropathological and clinical traits

we obtained an independent dataset with ad-related neuropathological and clinical traits from the msbb study . we used the data from brodmann area (parahippocampal gyrus), which is one of the most vulnerable regions to ad . this dataset contains gene expression data from donors for which ad-related phenotypes are also available.
these phenotypes include the neuritic plaque density assessed by cerad score, neurofibrillary tangle severity by braak score, and severity of dementia by cdr score. the dataset contains genes measured for the individuals and is available at the amp-ad portal (https://www.synapse.org/#!synapse:syn ). for each gene, its pcc with the cerad, braak and cdr scores was calculated. based on the cerad score, we extracted control and ad samples using the criteria provided on https://www.synapse.org/#!synapse:syn ; based on the braak score, we followed the practice in and divided samples into three groups in the ranges of [ , ], [ , ] and [ , ], representing different levels of tau pathology; based on cdr, the samples were partitioned into three groups in the range of [ ], [ . , ] and [ , ] in the same way as used in , representing different degrees of severity of clinical dementia.

brain eqtl and atac-seq data

we identified potentially causal regulatory variants by testing whether an eqtl for a target gene also resides in a transcription factor binding site (tfbs) in its promoter through the integration of eqtl and atac-seq data. both gene- and isoform-expression eqtls were considered. we obtained brain gene eqtls from gtex (version: v ), psychencode (http://resource.psychencode.org/) and the commonmind consortium (https://www.synapse.org/#!synapse:syn ). the latter two resources contain isoform eqtls, which were also used. we used active promoters from the human brain atac-seq peak data in the boca database .
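the core filtering step of this analysis, keeping an eqtl only when it falls inside a tfbs that lies within an active promoter of its target gene, can be sketched as a simple interval check. this is a hypothetical illustration that assumes promoter and tfbs intervals are already available as half-open coordinates; the identifiers and positions are invented, and the record layouts are not those of any of the databases named above.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def contains(interval, chrom, pos):
    """True if the position lies inside the half-open interval on the same chromosome."""
    return interval.chrom == chrom and interval.start <= pos < interval.end


def candidate_causal_snps(eqtls, tfbs, promoters):
    """Return (snp_id, gene, motif_names) for eQTLs inside a promoter TFBS.

    eqtls:     iterable of (snp_id, chrom, pos, target_gene)
    tfbs:      list of Interval, one per predicted binding site (name = motif)
    promoters: dict mapping a gene to the list of its active promoter Intervals
    """
    hits = []
    for snp_id, chrom, pos, gene in eqtls:
        if not any(contains(p, chrom, pos) for p in promoters.get(gene, [])):
            continue  # the SNP is not in an active promoter of its target gene
        motifs = [t.name for t in tfbs if contains(t, chrom, pos)]
        if motifs:
            hits.append((snp_id, gene, motifs))
    return hits


# Hypothetical coordinates for illustration only.
promoters = {"geneA": [Interval("chr1", 100, 200, "geneA_promoter")]}
tfbs = [Interval("chr1", 100, 110, "motifX"), Interval("chr1", 490, 510, "motifY")]
eqtls = [
    ("snp1", "chr1", 105, "geneA"),  # in promoter and in motifX
    ("snp2", "chr1", 500, "geneA"),  # in motifY but outside the promoter
    ("snp3", "chr2", 105, "geneA"),  # wrong chromosome
]
hits = candidate_causal_snps(eqtls, tfbs, promoters)
```

the linear scan over all binding sites is adequate for a handful of genes; a genome-wide run would more likely rely on sorted intervals or an interval tree.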
we identified tfbss in these promoters using the fimo tool , with the transcription factor binding motifs in the hocomoco database (version ) as reference.

data availability

all accession codes, unique identifiers, or web links for publicly available datasets are described in the paper. all data supporting the findings of the current study are listed in supplementary tables - , supplementary figures - , and our web interface (www.alzlink.com).

code availability

the code for model development is publicly available at https://github.com/genemine/alzlink.

references

. calsolaro v, antognoli r, okoye c, monzani f. the use of antipsychotic drugs for treating behavioral symptoms in alzheimer's disease. front pharmacol , ( ). . fredericks ca, et al. early affective changes and increased connectivity in preclinical alzheimer's disease. alzheimers dement (amst) , – ( ). . giri m, shah a, upreti b, rai jc. unraveling the genes implicated in alzheimer's disease. biomed rep , – ( ). . sims r, hill m, williams j. the multiplex model of the genetics of alzheimer’s disease. nature neuroscience , - ( ). . mostafavi s, et al. a molecular network of the aging human brain provides insights into the pathology and cognitive decline of alzheimer’s disease. nat neurosci , - ( ). . johnson ecb, et al. large-scale proteomic analysis of alzheimer’s disease brain and cerebrospinal fluid reveals early changes in energy metabolism associated with microglia and astrocyte activation. nat med , - ( ). . ridge pg, mukherjee, s., crane, p. k., kauwe, j. s. & alzheimer's disease genetics consortium.
alzheimer's disease: analyzing the missing heritability. plos one , e ( ). . cuyvers e, sleegers k. genetic variations underlying alzheimer's disease: evidence from genome-wide association studies and beyond. lancet neurol , – ( ). . ridge pg, et al. assessment of the genetic variance of late-onset alzheimer's disease. neurobiol aging , .e – .e ( ). . guan y, myers cl, lu r, lemischka ir, bult cj, troyanskaya og. a genomewide functional network for the laboratory mouse. plos comput biol , e ( ). . krishnan a, et al. genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. nat neurosci , – ( ). . troyanskaya og, dolinski k, owen ab, altman rb, botstein d. a bayesian framework for combining heterogeneous data sources for gene function prediction (in saccharomyces cerevisiae). proc natl acad sci usa , – ( ). . guan y, ackert-bicknell cl, kell b, troyanskaya og, hibbs ma. functional genomics complements quantitative genetics in identifying disease-gene associations. plos comput biol , e ( ). . recla jm, robledo rf, gatti dm, bult cj, churchill ga, chesler ej. precise genetic mapping and integrative bioinformatics in diversity outbred mice reveals hydin as a novel pain gene. mamm genome , – ( ). . greene cs, et al. understanding multicellular function and disease with human tissue- specific networks. nat genet , – ( ). . li h-d, bai t, sandford e, burmeister m, guan y. baihui: cross-species brain-specific network built with hundreds of hand-curated datasets. bioinformatics , – ( ). . bai b, et al. deep multilayer brain proteomics identifies molecular networks in alzheimer's disease progression. neuron , - .e ( ). . duda m, zhang h, li hd, wall dp, burmeister m, guan y. brain-specific functional relationship networks inform autism spectrum disorder gene prediction. transl psychiatry , ( ). . mi h, muruganujan a, ebert d, huang x, thomas pd. 
panther version : more genomes, a new panther go-slim and improvements in enrichment analysis tools. nucleic acids res , d –d ( ). . allen m, et al. human whole genome genotype and transcriptome data for alzheimer’s and other neurodegenerative diseases. sci data , ( ). . wang m, qin l, tang b. micrornas in alzheimer's disease. front genet , ( ). . han h, et al. trrust v : an expanded reference database of human and mouse transcriptional regulatory interactions. nucleic acids res , d –d ( ). . chou ch, et al. mirtarbase update : a resource for experimentally validated microrna-target interactions. nucleic acids res , d –d ( ). . kaltschmidt b, kaltschmidt c. nf-kappab in the nervous system. cold spring harbor perspectives in biology , a -a ( ). . pizzi m, et al. nf-kappab factor c-rel mediates neuroprotection elicited by mglu receptor agonists against amyloid beta-peptide toxicity. cell death differ , - ( ). . vukic v, et al. expression of inflammatory genes induced by beta-amyloid peptides in human brain endothelial cells and in alzheimer's brain is mediated by the jnk-ap signaling pathway. neurobiol dis , – ( ). . kim h, et al. overexpression of cell cycle proteins of peripheral lymphocytes in patients with alzheimer's disease. psychiatry investig , – ( ). . scacchi r, gambina g, moretto g, corbo rm. p gene variation and late-onset alzheimer's disease in the italian population. dementia and geriatric cognitive disorders , – ( ). . marathe s, liu s, brai e, kaczarowski m, alberi l.
notch signaling in response to excitotoxicity induces neurodegeneration via erroneous cell cycle reentry. cell death differ , - ( ). . supek f, bosnjak m, skunca n, smuc t. revigo summarizes and visualizes long lists of gene ontology terms. plos one , e ( ). . langfelder p, horvath s. wgcna: an r package for weighted correlation network analysis. bmc bioinformatics , ( ). . canchi s, et al. integrating gene and protein expression reveals perturbed functional networks in alzheimer’s disease. cell rep , - .e ( ). . mckenzie at, et al. brain cell type specific gene expression and co-expression network architectures. sci rep , ( ). . liang ws, et al. altered neuronal gene expression in brain regions differentially affected by alzheimer's disease: a reference data set. physiol genomics , - ( ). .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / . liu j, li m, lan w, wu f, pan y, wang j. classification of alzheimer's disease using whole brain hierarchical network. ieee/acm trans comput biol bioinform , – ( ). . cummings j, feldman hh, scheltens p. the “rights” of precision drug development for alzheimer’s disease. alzheimer's res ther , ( ). . lambert j-c, et al. genome-wide association study identifies variants at clu and cr associated with alzheimer’s disease. nat genet , ( ). . yao v, et al. an integrative tissue-network approach to identify and test human disease genes. nat biotechnol , – ( ). . li h-d, menon r, omenn gs, guan y. the emerging era of genomic data integration for analyzing splice isoform function. trends genet , – ( ). . baralle fe, giudice j. alternative splicing as a regulator of development and tissue identity. 
nat rev mol cell biol , – ( ). . barrett cl, deboever c, jepsen k, saenz cc, carson da, frazer ka. systematic transcriptome analysis reveals tumor-specific isoforms for ovarian cancer diagnosis and therapy. proc natl acad sci usa , e –e ( ). . tian t, wan j, song q, wei z. clustering single-cell rna-seq data with a model-based deep learning approach. nat mach intell , – ( ). . zheng r, li m, liang z, wu f-x, pan y, wang j. sinnlrr: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. bioinformatics , – ( ). . cao j, et al. the single-cell transcriptional landscape of mammalian organogenesis. nature , – ( ). . bertram l, mcqueen mb, mullin k, blacker d, tanzi re. systematic meta-analyses of alzheimer disease genetic association studies: the alzgene database. nat genet , – ( ). . bai z, et al. alzbase: an integrative database for gene dysregulation in alzheimer’s disease. mol neurobiol , – ( ). . hamosh a, scott af, amberger js, bocchini ca, mckusick va. online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders. nucleic acids res , d –d ( ). . pinero j, et al. disgenet: a discovery platform for the dynamical exploration of human diseases and their genes. database (oxford) , bav ( ). . palleja a, horn h, eliasson s, jensen lj. distild database: diseases and traits in linkage disequilibrium blocks. nucleic acids res , d –d ( ). .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / . wu ch, et al. the universal protein resource (uniprot): an expanding universe of protein information. nucleic acids res , d –d ( ). . carvalho-silva d, et al. 
open targets platform: new developments and updates two years on. nucleic acids res , d -d ( ). . buniello a, et al. the nhgri-ebi gwas catalog of published genome-wide association studies, targeted arrays and summary statistics . nucleic acids res , d - d ( ). . xie a, gao j, xu l, meng d. shared mechanisms of neurodegeneration in alzheimer's disease and parkinson's disease. biomed res int , ( ). . dubois b. the emergence of a new conceptual framework for alzheimer's disease. j alzheimers dis , – ( ). . young amh, et al. a map of transcriptional heterogeneity and regulatory variation in human microglia. biorxiv doi: https://doi.org/ . / . . . , ( ). . tansey ke, cameron d, hill mj. genetic risk for alzheimer's disease is concentrated in specific macrophage and microglial transcriptional networks. genome med , - ( ). . mcginnis s, madden tl. blast: at the core of a powerful and diverse set of sequence analysis tools. nucleic acids res , w –w ( ). . luck k, et al. a reference map of the human binary protein interactome. nature , - ( ). . szklarczyk d, et al. string v : protein-protein interaction networks, integrated over the tree of life. nucleic acids res , d –d ( ). . wang m, et al. molecular networks and key regulators of the dysregulated neuronal system in alzheimer’s disease. biorxiv doi: https://doi.org/ . / , ( ). . scelsi ma, napolioni v, greicius md, altmann a. network propagation of rare mutations in alzheimer’s disease reveals tissue-specific hub genes and communities. biorxiv doi: https://doi.org/ . / , ( ). . wang m, et al. the mount sinai cohort of large-scale genomic, transcriptomic and proteomic data in alzheimer's disease. sci data , ( ). . wang m, et al. integrative network analysis of nineteen brain regions identifies molecular signatures and networks underlying selective regional vulnerability to alzheimer's disease. genome med , - ( ). . fullard jf, et al. an atlas of chromatin accessibility in the adult human brain. genome research , - ( ). . 
grant ce, bailey tl, noble ws. fimo: scanning for occurrences of a given motif. bioinformatics (oxford, england) , - ( ). .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / . hooper c, meimaridou e, tavassoli m, melino g, lovestone s, killick r. p is upregulated in alzheimer's disease and induces tau phosphorylation in hek a cells. neurosci lett , – ( ). . qin w, et al. neuronal sirt activation as a novel mechanism underlying the prevention of alzheimer disease amyloid neuropathology by calorie restriction. j biol chem , - ( ). . feio dos santos ac, et al. decrease of pten expression levels among normal, symptomatic and asymptomatic alzheimer's disease (ad) subjects, measured in hippocampus, temporal and entorhinal cortices. alzheimer's & dementia : the journal of the alzheimer's association , s ( ). . sonoda y, et al. accumulation of tumor-suppressor pten in alzheimer neurofibrillary tangles. neurosci lett , – ( ). acknowledgments this work is supported by the national key r&d program of china (no. yfc ), the national natural science foundation of china (no. u , , ), project (no. b ), and hunan provincial science and technology program ( wk ). the results published here are in part based on data obtained from the amp-ad knowledge portal (https://adknowledgeportal.synapse.org/). the mayo rna-seq data were provided by the following sources: the mayo clinic alzheimer's disease genetic studies, led by dr. nilufer ertekin-taner and dr. steven g. younkin, mayo clinic, jacksonville, fl using samples from the mayo clinic study of aging, the mayo clinic alzheimer’s disease research center, and the mayo clinic brain bank. 
data collection was supported through funding by nia grants p ag , r ag , u ag , r ag , u ag , u ag , r ag , r ag , r ag , ninds grant r ns , curepsp foundation, and support from mayo foundation. study data includes samples collected through the sun health research institute brain and body donation program of sun city, arizona. the brain and body donation program is supported by the national institute of neurological disorders and stroke (u ns national brain and tissue resource for parkinson’s disease and related disorders), the national institute on aging (p ag arizona alzheimer’s disease core center), the arizona department of health services (contract , arizona alzheimer’s research center), the arizona biomedical research commission (contracts , , - and to the arizona parkinson's disease consortium) and the michael j. fox foundation for parkinson’s research. the msbb data were generated from postmortem brain tissue collected through the mount sinai va medical center brain bank and were provided by dr. eric schadt from mount sinai school of medicine.

author contributions
c.x.l., h.d.l. and w.s.l. developed the statistical method, performed the analysis, and wrote the manuscript. d.c. and c.x.l. developed the web interface. x.m.z., j.w., f.x.w. and d.w. provided instructions on the analysis. j.x.w. conceived and supervised the research and contributed to the manuscript.

additional information
supplementary information accompanies this paper at http://www.nature.com/ nature communications.
competing financial interests: the authors declare no competing financial interests.
supplementary information

supplementary notes
supplementary note . description for compiling ad-associated genes.

supplementary figures
supplementary fig. . comparison in model performance of two methods in negative non-ad gene selection.
supplementary fig. . comparison of the negative controls and randomly selected genes based on their association with ad.
supplementary fig. . performances of different brain-region networks based on random forest (rf).
supplementary fig. . performances of different brain-region networks based on support vector machines (svm).
supplementary fig. . performance of different brain-region networks based on logistic regression (lr).
supplementary fig. . validation of the top-ranked genes based on sequence similarity with ad-associated genes.
supplementary fig. . validation of the top-ranked genes based on their coexpression with known ad-associated genes.
supplementary fig. . validation of the top-ranked genes based on protein-protein interaction networks in the string and huri databases.
supplementary fig. . validation of the top-ranked genes based on mirna-target binding networks.
supplementary fig. . the correlation with three ad traits of the eigengenes of modules and .
supplementary fig. . the correlation with three ad traits of the eigengenes of the top-ranked genes.

supplementary tables
supplementary table . the top-ranked genes (excluding training set) that are likely associated with ad based on literature.
supplementary table . the top ten shared go terms of the ad-associated genes with the top predicted genes.
supplementary table . gene modules identified from the integrated gene interaction network.
supplementary table . the correlation of genes with cerad, braak score and cdr on the msbb data.
supplementary table . the seven types of functional evidence for the selected genes.
supplementary table . the genes with cell type specific expression.
supplementary table . the seven genes with eqtls located in the transcription factor binding site in the promoter region.

figure captions

fig. overview of the method for genome-wide prediction of ad-associated genes and their functional characterization. a selection of ad-associated genes. ad-associated genes were compiled from various resources, including ad-associated genes from omim, disgenet, uniprot, distild, alzbase, alzgene, literature, open targets, rosmap-deg and gwas-catalog. genes present in at least two resources were selected. the ad-associated genes as well as potential positive genes inferred with a functional enrichment method were then removed from the full set of all human genes. the remaining genes were treated as non-ad genes (negatives). b brain-specific functional gene networks (fgns) were used for feature matrix construction. for each gene, its cofunction probabilities with the positive genes in the network were extracted as features. thus, each gene was characterized by a -dimensional vector. c selection of brain fgns. we compared the ten networks collected for their predictivity of ad-associated genes with machine learning approaches. an optimal network was selected. d validation. predicted ad-associated genes were validated by ad-related pathways and various gene networks, including coexpression networks, protein-protein interaction networks, mirna-target binding networks and transcriptional regulatory networks. e functional implication in ad. the associations of the top predicted genes with ad-related phenotypes were evaluated. gene modules from an ad-related network were identified.

fig. model performance and statistical evaluation based on ad-related pathways and various gene networks. a comparison of extratrees models built from different functional gene networks in terms of auroc and auprc based on cross-validation. b enrichment of the genes ranked in the first decile in the four ad-associated gene sets or pathways with the decile enrichment test (described in methods). c validation of the top-ranked genes based on their sequence similarity, the number of shared mirnas, the number of ad-associated mirnas they can bind to, the number of coexpressed gene pairs, and the number of interactions with ad-associated genes in huri and string. in all the subplots, the red vertical line and the distribution in yellow indicate the results for our top-ranked genes and randomly selected genes, respectively.

fig. ad-related regulatory networks. a transcriptional regulatory network including our compiled ad-associated genes and the top-ranked genes. b the interaction network between predicted genes and ad-relevant mirnas. c the interaction network between the compiled ad-associated genes and ad-relevant mirnas.

fig. gene modules and their association with ad traits. the network was built by aggregating the evidence from the protein-protein interaction network, coexpression network, mirna-gene binding network, transcriptional regulatory network and the brain fgn. this network contains the top-ranked genes and the compiled ad-associated genes. a four gene modules, denoted by m , m , m and m , were identified by applying the glay algorithm to the integrated network in cytoscape. b the association of m and m with the three ad-related phenotypes (the cerad, braak and cdr score) was assessed. the results for all the tests were significant (fdr < . ).

fig. visualization of functional evidence supporting the association of the top-ranked genes with ad. the seven circles show the strength of the seven types of evidence, including the three types of molecular interaction evidence (the number of interacting ad-associated genes in ppi, coexpression network and mirna-target binding network, respectively) and the four types of phenotypic correlation evidence (the pearson correlation with cerad, braak and cdr on the msbb dataset, and the log -transformed fold change of expression obtained from the rosmap study). the darker the purple color is, the stronger the functional association is. the section corresponding to the blue arc shows the enriched go biological process terms, where each curve points to the gene annotated to the term.

fig.
illustration of the association of the top-ranked individual genes with ad-related phenotypes and the potential regulatory variant of the gene. a comparison of the expression of individual genes in different sample groups. the samples were divided into groups based on the cerad, braak or cdr score. the comparison for fyn and prkar a is shown. b potential regulatory snps that may regulate the expression. for fyn, the snp rs not only resides in the tfbs within its promoter region but also is an eqtl (upper); the snp rs is located in the tfbs and also an eqtl for prkar a.

tables and figures

table . hub genes (after excluding known ad-associated genes) measured with the outdegree and indegree in ad-related transcriptional regulatory network (trn) and with the degree in mirna-based regulatory networks (mrn).
columns: hub gene; gene type; outdegree, indegree in ad-related trn; degree in ad-related mrn; association with ad
rela|nfkb; oncogenic tf; , ; ; rela is associated with learning and memory
jun|ap- ; oncogenic tf; , ; ; ap signaling pathway is responsible for aβ-induced neuroinflammation
tp |p ; tf, tumor suppressor gene; , ; ; tp was overexpressed in ad and involved in tau phosphorylation
sirt ; tf; , ; ; sirt is associated with the production of aβ
ccnd ; oncogene; , ; ; ccnd knockout protects against neurodegeneration in hippocampus
cdkn a|p ; oncogene; , ; ; increased expression
pten; tumor suppressor gene; , ; ; recruitment of pten into synapses contributed to synaptic depression in ad
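The outdegree and indegree columns of the hub gene table are counts over a directed network of regulator-target edges. The following is a minimal sketch of that counting, not the authors' implementation: the function names `degree_table` and `rank_hubs` and the toy edge list are hypothetical, and the node labels are placeholders, not genes or interactions from the study.

```python
from collections import Counter


def degree_table(edges):
    """Return {node: (outdegree, indegree)} for a directed edge list.

    Each edge is a (regulator, target) pair; duplicate pairs count once.
    """
    unique_edges = set(edges)
    out_deg = Counter(src for src, _ in unique_edges)
    in_deg = Counter(dst for _, dst in unique_edges)
    nodes = set(out_deg) | set(in_deg)
    return {n: (out_deg[n], in_deg[n]) for n in nodes}


def rank_hubs(edges, exclude=()):
    """Rank nodes by total degree, skipping any node listed in `exclude`."""
    table = degree_table(edges)
    skipped = set(exclude)
    ranked = [(n, o, i) for n, (o, i) in table.items() if n not in skipped]
    # Sort by total degree (descending), then by name for a stable order.
    ranked.sort(key=lambda row: (-(row[1] + row[2]), row[0]))
    return ranked


# Hypothetical toy network; labels are placeholders, not data from the paper.
toy_edges = [
    ("tf_a", "gene_1"), ("tf_a", "gene_2"), ("tf_a", "tf_b"),
    ("tf_b", "gene_1"), ("tf_c", "tf_a"), ("tf_a", "gene_1"),
]
# Exclude an already-known node, mirroring the table's exclusion of known genes.
hubs = rank_hubs(toy_edges, exclude={"gene_1"})
```

Here `exclude` plays the role of the known ad-associated genes that the table leaves out. Ranking by the sum of outdegree and indegree is an assumption made for this sketch; the table reports the two values separately.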
fig. gene modules and their association with ad traits.
a gene modules identified from the network integrated from the brain-specific functional gene network, protein-protein interaction network, coexpression network, mirna-gene binding network and transcriptional regulatory network. this network contains the top predicted genes and the compiled ad genes. then seven gene modules were identified by applying the glay algorithm in cytoscape. b the association with ad traits of modules.
[figure, panel a: module network with modules labelled m, annotated with the go biological process terms learning or memory; regulation of ion transmembrane transport; cognition; regulation of synaptic plasticity; regulation of neuron death; regulation of lipid transport; cholesterol efflux; regulation of amyloid-beta formation; positive regulation of cytokine production; regulation of peptidyl-lysine acetylation; response to amyloid-beta; regulation of phosphorylation; regulation of cell death; immune system process; regulation of neurogenesis; inflammatory response; gliogenesis.]
[figure, panel b: scatter plots for each module of the module eigengene (vertical axis) against the cerad, braak and cdr scores (horizontal axis).]
international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / fig. illustration of the association of the top-ranked individual genes with ad-related phenotypes and the potential regulatory variant of the gene. a comparison of the expression of individual genes in different sample groups. the samples were divided into groups based on the cerad, braak or cdr score. the expression for fyn and prkar a is shown. b potential regulatory snps that may regulate the expression. for fyn, the snp rs not only resides in the tfbs within its promoter region but also is an eqtl (upper); the snp rs is located in the tfbs and also an eqtl for prkar a. .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / accommodating site variation in neuroimaging data using hierarchical and bayesian models accommodating site variation in neuroimaging data using hierarchical and bayesian models a preprint johanna m. m. bayer orygen centre for youth mental health, melbourne, australia the university of melbourne, melbourne, australia bayerj@student.unimelb.edu.au richard dinga donders institute, radboud university, nijmegen, the netherlands radboud university medical centre, nijmegen, the netherlands seyed mostafa kia donders institute, radboud university, nijmegen, the netherlands radboud university medical centre, nijmegen, the netherlands akhil r. 
kottaram orygen centre for youth mental health, melbourne, australia thomas wolfers radboud university medical centre, nijmegen, the netherlands department of psychology, university of oslo, norway jinglei lv school of biomedical engineering brain and mind center, university of sydney, sydney, australia andrew zalesky melbourne neuropsychiatry centre, the university of melbourne melbourne health, melbourne, australia department of biomedical engineering, the university of melbourne, australia lianne schmaal ∗ orygen centre for youth mental health,melbourne, australia the university of melbourne, melbourne, australia andre marquand * donders institute, radboud university, nijmegen, the netherlands radboud university medical centre, nijmegen, the netherlands institute of psychiatry, kings college london, london, uk february , ∗shared last author (which was not certified by peer review) is the author/funder. all rights reserved. no reuse allowed without permission. the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . a preprint - february , abstract the potential of normative modeling to make individualized predictions has led to structural neu- roimaging results that go beyond the case-control approach. however, site effects, often con- founded with variables of interest in a complex manner, induce a bias in estimates of normative models, which has impeded the application of normative models to large multi-site neuroimag- ing data sets. in this study, we suggest accommodating for these site effects by including them as random effects in a hierarchical bayesian model. we compare the performance of a linear and a non-linear hierarchical bayesian model in modeling the effect of age on cortical thickness. we used data of healthy individuals from the abide (autism brain imaging data exchange, http://preprocessed-connectomes-project.org/abide/) data set in our experiments. 
we compare the proposed method to several harmonization techniques commonly used to deal with additive and multiplicative site effects, including regressing out site and harmonizing for site with combat, both with and without explicitly preserving variance related to age and sex as biological variation of interest. in addition, we make predictions from raw data, in which site has not been accommodated for. the proposed hierarchical bayesian method shows the best performance according to multiple metrics. performance is particularly poor for the regression model and the combat model when age and sex are not explicitly modeled. in addition, the predictions of those models are noticeably poorly calibrated, suffering from a loss of more than % of the original variance. from these results we conclude that harmonization techniques like regressing out site and combat do not sufficiently accommodate for multi-site effects in pooled neuroimaging data sets. our results show that the complex interaction between site and variables of interest is likely to be underestimated by those tools. one consequence is that harmonization techniques remove too much variance, which is undesirable and may have unpredictable consequences for subsequent analysis. our results also show that this can be mostly avoided by explicitly modeling site as part of a hierarchical bayesian model. we discuss the potential of z-scores derived from normative models to be used as site-corrected variables and of our method as a site correction tool.

keywords neuroimaging · normative modeling · site effects · hierarchical bayesian modeling

introduction

the most prominent paradigm in clinical neuroimaging research has long been the case-control approach, which compares averages of groups of individuals on brain imaging measures. case-control inferences can be clinically meaningful under some circumstances when the group mean is a good representation of each individual in the group.
however, this pre-condition has been challenged recently by work demonstrating that the biological heterogeneity within clinical groups can be substantial [marquand et al., ]. for example, the structure and morphology of the brain have been found to vary between individuals in dynamic phases like adolescence [foulkes and blakemore, ] and within clinical groups, such as bipolar disorder and schizophrenia [wolfers et al., a] and attention deficit disorder [wolfers et al., ]. in addition, inter-individual differences have been shown not to be necessarily in line with results obtained via the group comparison approach [wolfers et al., ]. such heterogeneity has been considered a potential cause for the lack of differences between clinical groups and controls within the standard group comparison approach [feczko et al., ] and the failure to replicate findings between studies [fried, ]. as a consequence, there has been a shift in focus towards taking into account variation at the individual level [marquand et al., ]. this is in line with a trend towards personalized medicine or "precision medicine" [mirnezami et al., ], where characteristics of the individual are used to guide the treatment of mental disorders. this shift has been accompanied by a trend towards approaches that go beyond comparing averages of distinctly labeled groups [insel et al., , insel, ] (for an overview of methods see [marquand et al., ]). among them, normative modeling has been successfully used to capture inter-individual variability and make predictions at the individual level. the strength of normative modeling lies in the ability to map variation along one dimension (e.g., brain volume) onto a second co-varying variable (e.g., age), redefining the variation in the first dimension as explained by this new covariate of interest.
this concept makes it possible to describe the normative variation, that is, the range containing e.g., % of all individuals, as a function of the covariate, and considers each individual's score in relation to the variation in the reference group defined by the covariate score. the concept is similar to the use of growth charts in pediatric medicine, in which height and weight are expressed as a function of age. hence, in this setting, an individual's height or weight is not considered by its absolute value, but expressed as a percentile score of deviation fluctuating with age, with the median line corresponding to the 50th percentile and defining the norm, or average height.

in neuroimaging, normative models have been applied to clinical and non-clinical problems using various covariates, statistical modeling approaches (for an overview see [marquand et al., ]) and targeting a variety of response variables. in general, any variable can be used as a covariate in a normative model targeting neuroimaging measures, as long as the variation along the co-varying dimension is not zero. however, normative models with age and sex as covariates and brain volume as response variable are currently more frequently found in the literature [wolfers et al., , wolfers et al., b, zabihi et al., , kessler et al., ]. these implement the growth charting idea applied to high dimensional brain imaging data. for example, a normative model of a brain structure can be created based on the variation of individuals in population based cohorts.
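the growth-chart idea can be sketched in a few lines of python. this is a minimal illustration under the assumption of a gaussian predictive distribution; the function names and the linear `normative_mean` with its numbers are hypothetical and not taken from any of the cited studies.

```python
import math


def deviation_score(observed, predicted_mean, predicted_sd):
    """z-score of an observation relative to the normative prediction."""
    return (observed - predicted_mean) / predicted_sd


def centile(z):
    """percentile of a z-score under a standard normal distribution."""
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normative_mean(age):
    """hypothetical norm: mean cortical thickness declining linearly with age."""
    return 3.0 - 0.01 * age  # illustrative numbers only


# an individual aged 30 with an observed thickness of 2.5 (arbitrary units)
z = deviation_score(observed=2.5, predicted_mean=normative_mean(30.0), predicted_sd=0.1)
print(f"z = {z:.2f}, centile = {centile(z):.2f}")
```

an individual lying exactly on the norm has z = 0 and therefore sits on the median line of the chart; negative z-scores correspond to centiles below the median.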
the estimated norm can be used to infer where individuals with clinical symptoms can be placed with respect to the reference defined by the normative model. this has been the recipe of many recently published studies using the normative modeling framework [wolfers et al., , bethlehem et al., , wolfers et al., , lv et al., ]. underlying this approach is the assumption that the individually derived patterns of deviation uncover associations to clinical/behavioral variables that would be obscured by averaging across groups of individuals.

however, the amount of data necessary to create normative models poses a challenge to normative modeling in neuroimaging, as the cost and time associated with neuroimaging impede the collection of large neuroimaging samples in a harmonized way. one exceptional example where large scale data collection succeeded and included both harmonized scanners and scanning protocols is the uk biobank initiative, which, when launched in , aimed to scan , individuals at four different scanning locations [https://www.ukbiobank.ac.uk/explore-your-participation/contribute-further/imaging-study][miller et al., ]. other neuroimaging initiatives have also taken on the challenge of collecting neuroimaging data in large quantities and have relied on harmonized scanning protocols, but did not collect the data using harmonized scanners (e.g., adni [mueller et al., ], abcd study [volkow et al., ]). nonetheless, the restricted age ranges (e.g., - years in uk biobank [miller et al., ]) or the focus on a particular (clinical) cohort (e.g., alzheimer's in adni [mueller et al., ]) limit their utility for estimating normative models mapping the normative association between, for example, age and brain structure or function. an alternative way to obtain large neuroimaging data sets and assess data from a large number of subjects is by pooling or sharing data that has already been collected.
one example is the enhancing neuroimaging and genetics through meta-analysis (enigma) consortium [thompson et al., ]. enigma succeeded in pooling neuroimaging and genetics data of thousands of individuals, including healthy individuals and individuals with psychiatric or neurological disorders. the strategy of data sharing initiatives like enigma is to gather data that have already been collected in different cohorts and at different scanning sites, and to harmonize preprocessing and statistical analysis with standardized protocols. however, a major disadvantage is the presence of confounding "scanner effects" [fortin et al., ] (e.g., differences in field strength, scanner manufacturer etc. [han et al., ]). these confounding effects present as site-correlated biases that cannot be explained by biological heterogeneity between samples. an example of those effects on derived measures of cortical thickness can be found in fig. a. they result from a complex interaction between site and variables of interest, manifesting in biases on lower and higher order properties of the distribution of interest, such as differences in mean and standard deviations, skewness and spatial biases (fig. a, b), and cannot be explained by e.g., differences in age or sex (fig. c). as the origin of these effects might not only be related to the scanner per se, but extend to various factors related to a single acquisition site [gronenschild et al., ], we will refer to them as site effects from here on.

as outlined in the previous paragraph, the effort to create large samples to capture between-subject variability often induces site-driven variability. this issue of site-driven variability in shared neuroimaging data has been acknowledged and has led to the development of harmonization methods at a statistical level. a common approach to deal with site effects is through "harmonizing" by, e.g., confound regression.
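as a minimal sketch of confound regression (not the implementation used in any of the cited studies), the following python snippet regresses a measure on dummy-coded site labels and keeps the residuals. the function name `regress_out_site` and the toy data are hypothetical. only additive site offsets are removed here; if a covariate such as age differs between sites, variance related to that covariate is removed together with the site offset.

```python
import numpy as np


def regress_out_site(y, site):
    """remove additive site offsets from y by ordinary least squares.

    y    : (n,) array of measurements, e.g. cortical thickness
    site : (n,) array of integer site labels
    returns the residuals shifted back to the grand mean of y.
    """
    sites = np.unique(site)
    # one-hot (dummy) coding of site: n x q design matrix
    design = (site[:, None] == sites[None, :]).astype(float)
    # with a one-hot design the least-squares coefficients are the site means
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef + y.mean()


# toy data: two sites that differ by a constant offset of 10
y = np.array([1.0, 2.0, 3.0, 11.0, 12.0, 13.0])
site = np.array([0, 0, 0, 1, 1, 1])
harmonized = regress_out_site(y, site)
print(harmonized)
```

after this step both sites share the same mean, while the within-site spread of the measurements is left unchanged.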
one example of this approach is a set of algorithms summarized under the name "combat" [fortin et al., ]. the method had originally been developed by [johnson et al., ], who used empirical bayes to estimate "batch effects", that is, non-biological variation that the handling of petri dishes in micro-array experiments adds to the results of gene expression data. fortin and colleagues adapted the framework to apply to neuroimaging data [fortin et al., ]. in combat, additive and multiplicative site effects on a particular target unit (e.g., a particular brain voxel for one participant) are estimated using empirical bayes and by placing a prior distribution over estimates for these units. the estimate of the scanner effect is then used to adjust the prediction. newer versions also allow preserving variance of interest in the model, for example for age, sex or diagnosis [fortin et al., , fortin et al., ]. combat has been applied to several types of neuroimaging data, including diffusion tensor imaging data (dti, [fortin et al., ]) and structural magnetic resonance imaging data [fortin et al., ].

however, the reliability of harmonization strategies rests on the condition that site effects are orthogonal to the effect of interest and uncorrelated with other covariates in the model [chen et al., ]. in reality, however, data pooled from several sites is often confounded with co-linear effects. many individual neuroimaging samples,
for example, are restricted to a specific age range, leading to age being correlated with site effects. in this scenario, removing an estimate of the scanner effect can lead to excluding (biological) variation that would be of interest.

with this paper we suggest an alternative approach to deal with site effects in neuroimaging, which is fairly generally applicable. however, here we focus in particular on normative modeling. we propose a hierarchical bayesian approach in which we include site as a covariate in the model, avoiding the exclusion of meaningful variance correlated with site by predicting site effects as part of the model instead of removing them from the data. this approach is similar to the approach by [kia et al., ], who used hierarchical bayesian regression (hbr) for multi-site modeling in a pooled neuroimaging data set, which contained participants that were scanned with different scanners. the estimate of site variation in [kia et al., ] is based on a partial pooling approach, in which the variation between site-specific parameters is bound by a shared prior. the approach showed better performance when evaluated with respect to metrics accounting for the quality of the predictive mean and variance compared to a complete pooling of site parameters and to combat harmonization, and similar performance to a no-pooling approach, with the benefit of reduced risk of over-fitting due to the shared site variance. moreover, [kia et al., ] showed that the posterior distribution of site parameters from the training set can also be used as an informed prior to make predictions in an unseen, new test set, outperforming predictions from complete pooling and uninformed priors, and overcoming a weakness of combat. the method was also able to display heterogeneity between individuals with varying clinical diagnoses in associated brain regions of clinical patients of the study.
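the effect of partial pooling can be illustrated with a simple normal-normal model. this is a simplified sketch with fixed hyperparameters, not the model of [kia et al., ]; the function name and the arguments `noise_var`, `prior_mean` and `prior_var` are hypothetical. each site mean is pulled towards the shared prior mean, and sites with few subjects are pulled more strongly.

```python
import numpy as np


def partial_pooling_means(site_means, site_sizes, noise_var, prior_mean, prior_var):
    """posterior mean of each site parameter under a shared normal prior.

    model assumed for this sketch:
        y_is    ~ normal(theta_s, noise_var)      observations at site s
        theta_s ~ normal(prior_mean, prior_var)   shared prior across sites
    """
    site_means = np.asarray(site_means, dtype=float)
    site_sizes = np.asarray(site_sizes, dtype=float)
    data_precision = site_sizes / noise_var   # information contributed by the data
    prior_precision = 1.0 / prior_var         # information contributed by the prior
    return (data_precision * site_means + prior_precision * prior_mean) / (
        data_precision + prior_precision
    )


# a large site (100 subjects) and a small site (4 subjects)
estimates = partial_pooling_means(
    site_means=[2.0, 3.0], site_sizes=[100, 4],
    noise_var=1.0, prior_mean=2.5, prior_var=0.25,
)
print(estimates)
```

in this toy setting the estimate for the large site stays close to its observed mean of 2.0, whereas the estimate for the small site moves halfway from 3.0 towards the shared mean of 2.5. complete pooling would assign both sites the same value, and no pooling would leave both observed means untouched.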
the present paper is a replication and extension of the approach by [kia et al., ]. based on several successful attempts of using gaussian process regression to map non-linearity in normative models [kia and marquand, , marquand et al., , marquand et al., ], we extend the normative model with the capacity to account for site effects by adding a gaussian process to model non-linear effects between age and the brain structure. in addition, our model is fully bayesian and entails a hierarchical structure, including priors and hyper priors for each parameter. we use data from the abide (autism brain imaging data exchange, http://preprocessed-connectomes-project.org/abide/) data set to compare a non-linear, gaussian process version of the model to a linear hierarchical bayesian version accounting for site effects that does not include the gaussian process term. we show that the hierarchical bayesian models including a site parameter perform better than existing methods for dealing with additive and multiplicative site effects, including combat and regressing out site. we discuss the normative hierarchical bayesian methods with regard to their implications for neuroimaging data-sharing initiatives and their use as a general technique to correct for site effects.

methods

in this section we introduce the data used in this study and the pre-processing steps applied, followed by a conceptual and mathematical description of our approach to include site as a predictor in a normative hierarchical bayesian model. we also describe the other methods of accommodating site effects against which our approach is validated. lastly, we outline which measures are used for model comparison.

data

the following sub-section gives a description of the abide data set, including a study on the scope of site effects in the data.
abide data set

the abide consortium (http://preprocessed-connectomes-project.org/abide/) was founded to facilitate research and collaboration on autism spectrum disorders by data aggregation and sharing. the consortium provides a publicly available structural magnetic resonance imaging (mri) data set and corresponding phenotypic information of individuals with autism spectrum disorder and age-matched typical controls. for this study, only data from healthy individuals were included. as those healthy controls are meant to be complementary to the autism branch in the data set, out of subjects in this study were male. the data was processed using a standardized protocol [craddock et al., ] of the freesurfer standard pipeline (desikan-killiany atlas) as part of the preprocessed connectomes project [craddock et al., ] and has been made available for download on the preprocessed section of the abide initiative. for the current study we focused on cortical thickness measures of the bilateral regions of the desikan-killiany atlas parcellation [desikan et al., ] as a part of the freesurfer [fischl et al., ] output and the average cortical thickness across all regions. we chose to include cortical thickness measures since they show a strong (negative) association with age (unlike measures of surface area, which remain more stable across the life span [storsve et al., ]).
(a) distribution of average cortical thickness measures of individuals, grouped by the acquisition sites the data were collected at (each boxplot describes the distribution of one site).
(b) average cortical thickness of individuals regressed onto age, grouped by site (each regression line describes one site).
(c) thickness measures of all cortical regions average cortical thickness grouped by individual, colored by site, sorted by age (each boxplot represents one individual). displayed are out of sites from the abide data set.
(d) distribution of all cortical regions average cortical thickness per individual, summarized as boxplot (each boxplot represents one individual). boxplots are coloured by site and ordered by age within site.
figure : site effects in healthy individuals from the abide data set.

site effects in the abide data set

the abide data set has been obtained by aggregating data from independent samples collected at different scanning locations [di martino et al., ]. although all data has been collected with tesla scanners and preprocessed in a harmonized way [craddock et al., ], sequence parameters for anatomical and functional data, as well as type of scanner varied across sites [di martino et al., ]. in addition, sites differ in distribution of age and sex and in sample size. an overview of site-specific data is provided in table and in [di martino et al., ]. the abide data set is affected by site specific effects that are unlikely to be explained by biological variation. they manifest as linear and non-linear interactions between scanning site, covariates (for example age and sex), and cortical measures. similar to batch effects in genomics [leek et al., ], those effects lead to a clustering of the data caused by external factors related to the scanning and analysis process. with the aim to estimate to which extent the abide data set is affected by site effects, we calculated an ancova with age as covariate.
it revealed that average cortical thickness differed between sites (main effect site: f( , ) = . , p < . × − , sum contrast). in addition we tested for differences in variance between sites. bartlett's test of homogeneity of variances [bartlett, ] showed a difference in variance between sites even after regressing out variance that could be explained by age and sex (p < . ). the site effects in the abide data set are visualized in fig. .

splitting the abide data set into training and test sets

to evaluate the performance of the models, we split the data into a training set ( % of data) and a test set ( % of data) using the r packages caret and splitstackshape, while the distribution of age, sex and site was preserved between sets. thus, training and test sets contained individuals from the same sites ("within-site-split"). an overview of the distribution of age and sex for the training and test sets can be found in fig. . subsequently, the training and test sets were standardized region-wise based on location and scale parameters of the training set. for the model estimation process, only complete pairs of observations (per region) were used.

figure : overview of phenotypic information in the abide data set. age of male subjects: m = . , sd = . ; age of female subjects: m = . , sd = . ; range = . - .

site as a predictor in a hierarchical bayesian model

with the aim to create reliable normative models in multi-site neuroimaging data, we developed and compared two versions of a hierarchical bayesian model that include site as a predictor.
in a hierarchical linear version of the model, site is modeled hierarchically, resulting in a random effect for site (hierarchical bayesian linear model, hblm). in a non-linear version of the model, a gaussian process for age is added to test whether performance is increased if the model is also able to capture non-linear effects between age and thickness of the cortical region (hierarchical bayesian gaussian process model, hbgpm). both hierarchical bayesian models were trained and tested in a within-site split (see section . on splitting the multi-site abide data set).

comparison models

to get a better understanding of the performance of our approach, we performed a second analysis, comparing the hierarchical bayesian approach with site as predictor to predictions made from data to which other methods for managing site effects had been applied. in the following, those alternative models will be summarized under the term comparison models. of note, the approach used to accommodate for site effects in the comparison models is fundamentally different from the approach used in the hierarchical bayesian models. in the hierarchical bayesian approach, multi-level modeling is used to account for site variance without removing it, whereas in the comparison models different methods of harmonization are applied to the data to remove variance related to site. in detail, the comparison model approach entailed a two-step procedure, in which site effects are first harmonized by three different common models of site harmonization, and then a simple bayesian linear algorithm, with an additive term for age and sex, but without site as a predictor, is used to make predictions in stan [stan development team, b].
the harmonization procedures include i) regressing out site effects from the cortical thickness measures using linear regression and using the residuals as input to the simple bayesian linear model (thus, removing additive variance components of site), ii) using combat [johnson et al., , fortin et al., ] to clear the data from site effects (thus, harmonizing for additive and multiplicative effects of site), and iii) using combat as above, but explicitly preserving the variance associated with sex and age, an approach which will be referred to as modified combat in the following. predictions made from raw data (thus, without any treatment of site effects) were used as a baseline model. an overview of the pipelines for all models can be found in fig. .

figure : pipelines for hierarchical bayesian and comparison models

performance measures

measures of model performance

model performance is assessed using several common performance metrics. the pearson's correlation coefficient ρ indicates the linear association between true and predicted values of the cortical thickness measures. however, correlations are not a sensitive error measure and cannot capture the "miss" between true and predicted value. hence, we also calculate the standardized version of the root mean squared error (srmse) and the point-wise log-likelihood at each data point in the test set as a metric indicating deviance from the true value. however, these measures only take into account the estimate of the mean, and do not account for variations in the estimate of the variance.
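to illustrate why a correlation alone cannot capture the miss between true and predicted values, the following python sketch computes ρ and an srmse for predictions that are perfectly correlated with the truth but shifted by a constant. dividing the rmse by the standard deviation of the true values is one possible standardization and is an assumption of this sketch; the function names are hypothetical.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """pearson correlation between true and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def srmse(y_true, y_pred):
    """root mean squared error divided by the standard deviation of y_true.

    the choice of standardization is an assumption of this sketch.
    """
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / np.std(y_true))


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0  # perfectly correlated, but every prediction is off by 1

rho = pearson_r(y_true, y_pred)
err = srmse(y_true, y_pred)
print(rho, err)
```

the correlation is at its maximum although every prediction misses the true value, whereas the srmse is clearly above zero.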
thus, we also compute the proportion of variance explained (ev) by the predicted values and a standardized version of the log-loss (mean standardized log-loss, msll [rasmussen and williams, ]). the latter not only takes into account the variance of the test set, but also standardizes it by the variance of the training set, making a comparison between the models possible. this step is necessary as the various methods of correcting for site might also have an impact on the variance remaining in the data.

measures of goodness of the simulation in stan

parameters indicating the goodness of the model simulation process in stan itself, like convergence, effective sample size, and trace plots, can be found in the supplementary material.

model specification

in this section we show how normative models describing the association between age, sex, and cortical thickness measures can be fitted to data comprising site effects using a hierarchical bayesian linear mixed model with a gaussian process term, which allows modeling of non-linear associations between age and cortical thickness measures. following the notation of [gelman, , rasmussen and williams, ], we model a target vector $y \in \mathbb{R}^{n \times 1}$ containing the individual responses $y_i$ for each subject $i = 1, \dots, n$ and each region, using a latent function $f = f(x)$.
$f_i = f(x_i)$ is the evaluation of the latent function for an input vector $x_i$ containing all $p$ input variables of subject $i$, and is considered to differ from the true response variable by additive gaussian noise $\epsilon_i$ with variance $\sigma^2$ along the diagonal of the noise covariance, with $I$ being an $n \times n$ identity matrix:

$$y \sim \mathcal{N}(f, \sigma^2 I), \tag{1}$$

or, for the individual case:

$$y_i = f(x_i) + \epsilon_i, \tag{2}$$

with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. the ability of the model to deal with site effects is obtained by introducing a random effect for site $s = 1, 2, \dots, q$, so that the prediction for the $i$th subject is a combination of fixed and varying effects:

$$f = X\beta + Zu + \gamma, \tag{3}$$

where $\gamma$ is an additional non-linear component (defined in (5) below) and the estimate for one particular subject $i$ is calculated as follows:

$$f_i = \sum_{j=1}^{p} x_{ij}\beta_j + \sum_{s=1}^{q} z_{is}u_s + \gamma_i \tag{4}$$

with

$$\beta \sim \mathcal{N}(0, \Sigma_j), \qquad u \sim \mathcal{N}(0, \Sigma_s).$$

here, $\beta$ is a $p$-dimensional vector containing the fixed regression weights corresponding to an $n \times p$ input matrix $X$ with columns $j = 1, \dots, p$. for non-centered data, a column of ones has to be added to $X$ to model an intercept offset. similarly, $u$ is a $q$-dimensional vector containing the weights for random effects across subjects, corresponding to a dummy-coded $n \times q$ matrix $Z$ modeling site. for all linear models, we assume $\gamma_i = 0$ in (4). for the non-linear models we assume that $\gamma$ is a gaussian process with mean function $m(x)$ and covariance function $k(x, x')$ to allow for non-linear dependencies between the predictors and the target variable:

$$\gamma \sim \mathcal{GP}(m(x), k(x, x')). \tag{5}$$

in our case, we set $m(x) = 0$ and define $k(x, x')$ as the additional non-linear component in the following squared-exponential form:

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2l^2}\,\lVert x - x' \rVert^2\right), \tag{6}$$

with free parameters for the signal variance term $\sigma_f^2$ and the length scale $l$. note that this allows two sources of variance to be specified: the signal variance $\sigma_f^2$ and the noise variance $\sigma^2$ as modeled in (1). from a hierarchical bayesian point of view, random effects are equivalent to a hierarchical structure of sources of variation.
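as an illustration of the model structure above, the following python sketch builds a squared-exponential covariance over age and draws one latent function f = Xβ + Zu + γ. it is not the authors' stan implementation: the function names, the simulated ages, sexes and sites, and all parameter values are hypothetical, and the kernel is assumed to take the standard one-dimensional form σ_f² exp(−(x − x′)² / (2l²)).

```python
import numpy as np


def squared_exponential(a, b, sigma_f, length_scale):
    """Squared-exponential covariance between two 1-D input vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq_dist / (2.0 * length_scale**2))


def draw_latent(X, site, beta, u, age, sigma_f, length_scale, rng):
    """Draw f = X @ beta + Z @ u + gamma, with gamma ~ GP(0, k) over age."""
    n = len(age)
    Z = np.eye(len(u))[site]  # dummy-coded n x q site matrix
    K = squared_exponential(age, age, sigma_f, length_scale)
    # A small jitter keeps the Cholesky factorisation numerically stable.
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    gamma = L @ rng.standard_normal(n)  # one sample from the GP prior
    return X @ beta + Z @ u + gamma


# Hypothetical toy data: 50 subjects from 3 sites.
rng = np.random.default_rng(0)
n, q = 50, 3
age = rng.uniform(6.0, 40.0, size=n)
sex = rng.integers(0, 2, size=n)
site = rng.integers(0, q, size=n)
X = np.column_stack([np.ones(n), age, sex])  # intercept, age, sex
beta = np.array([3.0, -0.01, 0.05])          # assumed fixed effects
u = rng.normal(0.0, 0.1, size=q)             # assumed site offsets
f = draw_latent(X, site, beta, u, age, sigma_f=0.05, length_scale=5.0, rng=rng)
y = f + rng.normal(0.0, 0.1, size=n)         # add observation noise
```

setting `sigma_f` close to zero makes γ vanish, which corresponds to the purely linear variant with γ_i = 0.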
for modeling site effects, introducing a hierarchical structure has the benefit that structural dependencies between sites can be included via partial pooling. thus, instead of modeling site effects as an effect shared between sites or as effects that are independent of each other, a semi-independent association between sites can be obtained by assuming that all site parameters originate from a shared first-order prior distribution. this concept has been used elsewhere [kia et al., , gelman et al., , mathys et al., ]. we hence induce shared priors and hyper priors $\theta_0$ for site $s$, i.e. $\forall s,\ u_s \sim \mathrm{Inv}\Gamma( , )$, and a uniform prior for the length scale $l \sim U( , )$. we use stan [carpenter et al., , stan development team, b] to estimate all free parameters $\theta = (\beta^T, u^T, l^T, \sigma, \sigma_f)$ by performing bayesian inference:

$$p(\theta \mid X, y, \theta_0) = \frac{p(\theta, X, y, \theta_0)}{p(X, y, \theta_0)} = \frac{p(X, y, \theta_0 \mid \theta)\, p(\theta)}{p(X, y, \theta_0)}, \tag{7}$$

where $p(X, y, \theta_0) = \int p(\theta)\, p(X, y, \theta_0 \mid \theta)\, d\theta$.

posterior predictive distribution

we obtain the posterior predictive distribution of $y_*$ for a new sample $x_*$ via:

$$p(y_* \mid y) = \int p(y_*, \theta \mid x_*, X, y, \theta_0)\, d\theta = \int p(y_* \mid \theta, x_*, X, y, \theta_0)\, p(\theta \mid X, y, \theta_0)\, d\theta = \int p(y_* \mid \theta)\, p(\theta \mid X, y, \theta_0)\, d\theta \tag{8}$$

as $y$ and $y_*$ are considered to be conditionally independent given $\theta$ [gelman et al., ]. further, the predictive distribution can be computed exactly by writing the joint distribution of the known data $y$, $X$ and the new sample $x_*$, with the variance being determined by the sample variance $\sigma^2$ and the gaussian kernel $k(x, x')$:

$$K(x, x') = \begin{bmatrix} K + \sigma^2 I & K_* \\ K_*^T & K_{**} \end{bmatrix} \tag{9}$$

here, $K$ is the $n \times n$ covariance matrix of the training data, $K_{**}$ denotes the variance at the test sample points and $K_*$ is the covariance between $y_*$ and the known data.
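the exact predictive distribution implied by the block covariance matrix above can be sketched as follows. this is a minimal numpy illustration for a zero-mean gaussian process with a squared-exponential kernel and fixed hyperparameters, following the standard cholesky-based computation for gaussian process regression; it omits the linear and site terms of the full model and is not the authors' stan code. the names `se_kernel` and `gp_predict` and the toy data are hypothetical.

```python
import numpy as np


def se_kernel(a, b, sigma_f, length_scale):
    """Squared-exponential covariance between two 1-D input vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq_dist / (2.0 * length_scale**2))


def gp_predict(x_train, y_train, x_test, sigma_f, length_scale, sigma):
    """Predictive mean and covariance of the latent function at x_test."""
    n = len(x_train)
    K = se_kernel(x_train, x_train, sigma_f, length_scale)    # K
    K_s = se_kernel(x_train, x_test, sigma_f, length_scale)   # K_*
    K_ss = se_kernel(x_test, x_test, sigma_f, length_scale)   # K_**
    # Factorise K + sigma^2 I once and reuse it for mean and covariance.
    L = np.linalg.cholesky(K + sigma**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, np.asarray(y_train, dtype=float)))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov


# Hypothetical toy example: noisy-free sine values observed at 8 inputs.
x_train = np.linspace(0.0, 5.0, 8)
y_train = np.sin(x_train)
x_test = np.array([2.5, 100.0])
mean, cov = gp_predict(x_train, y_train, x_test,
                       sigma_f=1.0, length_scale=1.0, sigma=0.1)
```

the returned covariance refers to the latent function; the predictive variance of a new observation y∗ additionally includes the noise variance σ². far away from the training inputs, the prediction falls back to the prior mean with variance σ_f².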
comparison models

we compare the hierarchical bayesian approach to normative modeling with commonly used harmonization techniques in which site is controlled for by subtracting an estimate of the site effect from the data prior to fitting the normative model. these methods included: i) removing additive effects of site by regressing out site effects via linear regression and using the residuals as input for the simple bayesian linear model to obtain the normative scores, ii) harmonizing for additive and multiplicative effects of site using combat [johnson et al., , fortin et al., ], and iii) modified combat, thus using combat as before, but preserving biological variance of interest, i.e., sex and age. all these methods involve removing site effects prior to estimating the normative scores, in contrast to our method, in which we explicitly model site within the normative modeling framework. the harmonized data obtained as output from the harmonization techniques are subsequently used for normative modeling in a simple bayesian linear model that takes into account neither site effects nor non-linear dependencies between age and measures of cortical thickness. thus, the model $f = X\beta + Zu + \gamma$ is reduced to $f = X\beta$ with $\beta \sim \mathcal{N}(0, \Sigma_j)$. in addition, we use this simple bayesian linear model to make one set of predictions for each region from data that was not harmonized for site in any way (raw data model). r [r core team, ] was used for preprocessing of all data, to create the data set where site was regressed out, and for preprocessing the data with combat [johnson et al., , fortin et al., ].

implementation: normative modeling in stan

both the hierarchical bayesian and the comparison model versions of the normative models were implemented in stan [carpenter et al., , stan development team, b], a probabilistic c++ based programming language for bayesian inference, and analyzed in r [r core team, ] using the package rstan [stan development team, a].
stan allows direct computation of the log posterior density of a model given the known variables x and y. it uses the no-u-turn sampler (nuts) [hoffman and gelman, ], a variation of hamiltonian monte carlo sampling [duane et al., , neal et al., , neal, ], to generate representative samples from the posterior distribution of parameters and hyper parameters θ, each of which has the marginal distribution p(θ | y,x). this is achieved by first approximating the distribution of the data to a defined threshold in a warm-up period and then randomly sampling from the model, generating new draws of parameters for each iteration and calculating the response of the model. this approach of sampling instead of fitting allows for the simulation of complex models for which the derivation of an analytical solution of the posterior is computationally costly or not possible. the bayesian framework provides access to the full posterior distribution and to the distribution of all parameters. this makes it possible to derive a variance estimate for each parameter, leading to a parameter estimate that is described not only by its mean, but also by the (un)certainty around the mean estimate, providing information on its accuracy and reliability. moreover, we can use the posterior distribution of each site-specific parameter from the training set as a prior for the test set, which makes it possible to make predictions for unfamiliar sites. the stan code for the hblm, the hbgpm and the simple bayesian linear model without site as predictor can be found at https://github.com/likeajumprope/bayesian_normative_models.
model simulation process in stan

parameters indicating the goodness of the model simulation process in stan [carpenter et al., , stan development team, b] itself, like convergence, effective sample size and trace plots, can be found in the supplementary material.

results

both the hblm and the hbgpm outperformed all comparison models with respect to all performance measures considered in this study. in detail, the hblm and the hbgpm showed higher average values of pearson's correlation coefficient ρ (table ), lower average srmses (table ), smaller average ll (table ) and higher average proportions of ev (table ) than all comparison models (p < . for all comparisons). for none of these comparisons did the non-linear hbgpm outperform the linear hblm. in addition to the mean comparisons reported in table - , the distribution of all performance measures across all regions and for average cortical thickness across the entire cortex per model can be found in fig. . a detailed comparison of all models with respect to ρ, srmse, ev and ll can be found in the supplementary material.

mean standardized log loss

to also account for the second-order statistics of the posterior distributions created by each model, we calculated the mean standardized log loss (msll). this measure can only be calculated for the test set, as it is the log loss standardized by the mean loss of the training data set [rasmussen and williams, ]. hence, the msll gives an indication of whether a model is able to predict the data better than the mean of the training set (with more negative values being better). an overview of the msll for all cortical thickness measures of all regions for all models is given in fig. a.
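the performance measures used above can be sketched in a few lines of python. this is an illustration under stated assumptions rather than the code used for the reported results: the srmse is assumed to be the rmse divided by the standard deviation of the true values, the ev follows the usual definition 1 − var(y − ŷ)/var(y), and the msll subtracts the log loss of a trivial gaussian model that uses the mean and variance of the training set. the function names and the toy data are hypothetical.

```python
import numpy as np


def gaussian_log_loss(y_true, mean, var):
    """Point-wise negative log density of y_true under N(mean, var)."""
    return 0.5 * np.log(2.0 * np.pi * var) + (y_true - mean) ** 2 / (2.0 * var)


def mean_standardized_log_loss(y_true, y_pred, var_pred, y_train):
    """Log loss of the model minus the log loss of the training-set Gaussian."""
    y_true = np.asarray(y_true, dtype=float)
    model_loss = gaussian_log_loss(y_true, np.asarray(y_pred, dtype=float),
                                   np.asarray(var_pred, dtype=float))
    trivial_loss = gaussian_log_loss(y_true, np.mean(y_train), np.var(y_train))
    return float(np.mean(model_loss - trivial_loss))


def performance_metrics(y_true, y_pred, var_pred, y_train):
    """Return rho, SRMSE, EV, mean log-likelihood and MSLL for one region."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    var_pred = np.asarray(var_pred, dtype=float)
    error = y_true - y_pred
    return {
        "rho": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "srmse": float(np.sqrt(np.mean(error**2)) / np.std(y_true)),
        "ev": float(1.0 - np.var(error) / np.var(y_true)),
        "ll": float(-np.mean(gaussian_log_loss(y_true, y_pred, var_pred))),
        "msll": mean_standardized_log_loss(y_true, y_pred, var_pred, y_train),
    }


# Hypothetical toy data standing in for one cortical region.
rng = np.random.default_rng(0)
y_train = rng.normal(2.5, 0.2, size=200)
y_true = rng.normal(2.5, 0.2, size=100)
y_pred = y_true + rng.normal(0.0, 0.05, size=100)  # a fairly accurate model
metrics = performance_metrics(y_true, y_pred, np.full(100, 0.05**2), y_train)
```

a model that simply predicts the training mean with the training variance has an msll of exactly zero under this definition, which is why negative values indicate an improvement over the training-set mean.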
the only models that perform better than the mean of the training data set for most regions are the hierarchical bayesian models ($\mathrm{msll}_{\mathrm{hbgpm}} < 0$ for all regions; $\mathrm{msll}_{\mathrm{hblm}} < 0$ for all but one region), in contrast to predictions from the residuals and the combat model, where none of the predictions perform better than the mean of the training data set ($\mathrm{msll}_{\mathrm{residuals}} > 0$ for all regions; $\mathrm{msll}_{\mathrm{combat}} > 0$ for all regions, see fig. a). the msll for the modified combat model and the raw data model was region-dependent, with % of regions ( out of ) for the modified combat model and % of regions (six out of ) for the raw data model performing better than predictions from the mean of the training set. it should also be mentioned that for some individual regions the comparison models performed very poorly (max msll combat = , max msll mod. combat = , max msll raw = ; max msll residuals = ) and showed values that exceeded the plotted range of fig. a. in contrast, the maximum msll for the hierarchical bayesian models was - . for the hbgpm and . for the hblm.

predictive variance

we also observed that the models differ in the variance of predicted values, as visualized in fig. b for average cortical thickness. for the combat, the raw data and the residuals model, the range of predicted values was severely restricted (range of predicted values raw data, test set: [ . - . ]; range of predicted values residuals, test set: [ . - . ]; range of predicted values combat, test set: [ . - . ]). these intervals cover . %, . % and . % of the original test set variance, respectively. the modified combat model retained . % of the original test set variance (range of predicted values modified combat: [ . - . ]). in other words, all harmonization techniques had a reduced predictive variance and were instead biased toward predicting the mean, sometimes severely. in contrast, this bias was substantially reduced in the hierarchical bayesian models, which retained . % (hblm) and . % (hbgpm) of the original test variance (range of predicted values hblm, test set: [ . - . ]; range of predicted values hbgpm, test set: [ . - . ]).

table : the scanner parameters and sample specifications of the abide data set.

site (abbreviation) | manufacturer | platform | voxel size [mm] | tr [ms] | te [ms] | n | males [%] | age range [years]
california i. of technology (caltech) | siemens | tim trio | | | | | |
carnegie mellon u. (cmu) | siemens | tim trio | | | | | |
kennedy krieger i. (kki) | philips | achieva | | | | | |
ludwig maximilians u. munich (lmu) | siemens | verio | | | | | |
nyu langone medical center (nyu) | siemens | allegra | | | | | |
olin i. of living at hartford hospital (olin) | siemens | allegra | | | | | |
oregon health and science u. (ohsu) | siemens | tim trio | | | | | |
san diego state u. (sdsu) | ge | mr | | | | | |
social brain lab (sbl) | philips | intera | | | | | |
stanford u. (stanford) | ge | signa | | | | | |
trinity centre for health sciences (trinity) | philips | achieva | | | | | |
u. of california los angeles (ucla ) | siemens | tim trio | | | | | |
u. of california los angeles (ucla ) | siemens | tim trio | | | | | |
u. of leuven (leuven ) | philips | achieva | | | | | |
u. of leuven (leuven ) | philips | achieva | | | | | |
u. of michigan (um ) | ge | signa | | | | | |
u. of michigan (um ) | ge | signa | | | | | |
u. of pittsburgh school of medicine (pitt) | siemens | allegra | | | | | |
u. of utah school of medicine (usm) | siemens | tim trio | | | | | |
yale child study center (yale) | siemens | tim trio | | | | | |

table : post-hoc tests of correlations between true and predicted values. cell values indicate post-hoc comparison significance values (adjusted by the tukey method for comparing a family of estimates). signif. codes: ‘***’ . ‘**’ . ‘*’ . ‘.’ . ‘ ’ ns. blue: test set. yellow: training set.

ρ | mean correlation (std), training set | mean correlation (std), test set | post-hoc comparison (hblm, hbgpm, mod. combat, combat, residuals, raw data)
hblm | . ( . ) | . ( . ) | ns *** *** *** ***
hbgpm | . ( . ) | . ( . ) | ns *** *** *** ***
mod. combat | . ( . ) | . ( . ) | *** *** *** *** ***
combat | . ( . ) | . ( . ) | *** *** *** ns ***
residuals | . ( . ) | . ( . ) | *** *** *** ns ***
raw data | . ( . ) | . ( . ) | *** *** *** * **

table : post-hoc tests of srmse between true and predicted values. cell values indicate post-hoc comparison significance values (adjusted by the tukey method for comparing a family of estimates). signif. codes: ‘***’ . ‘**’ . ‘*’ . ‘.’ . ‘ ’ ns.

srmse | mean srmse (std), training set | mean srmse (std), test set | post-hoc comparison (hblm, hbgpm, mod. combat, combat, residuals, raw data)
hblm | . ( . ) | . ( . ) | ns *** *** *** ***
hbgpm | . ( . ) | . ( . ) | ns *** *** *** ***
mod. combat | . ( . ) | . ( . ) | *** *** *** ns ns
combat | . ( . ) | . ( . ) | *** *** *** *** ***
residuals | . ( . ) | . ( . ) | *** *** ns *** ns
raw data | . ( . ) | . ( . ) | *** *** *** *** ***
blue: test set. yellow: training set.

table : averaged log loss for training and test set.

ll | training set | test set
hblm | − . | − .
hbgpm | − . | − .
mod. combat | − . | − .
combat | − . | − .
residuals | − . | − .
raw | − . | − .

table : averaged explained variance for training and test set.

ev | training set | test set
hblm | . | .
hbgpm | . | .
mod. combat | . | .
combat | . | .
residuals | . | .
raw | . | .

discussion

in this work, we aim to provide a method that allows the application of normative modeling to neuroimaging data sets that are affected by site effects resulting from pooling data between sites. in contrast to other methods that harmonize additive and multiplicative site effects in the data prior to the normative modeling (e.g., regressing out site effects, harmonization with combat), our approach is based on modeling site as a predictor within the normative modeling framework. the benefit of this approach is that it does not entail removing variance and thus cannot lead to an overestimation of site variance and accidental removal of meaningful variation in case the latter is confounded with site variation.

using a hierarchical bayesian approach, we propose two versions of normative models that were able to control for site effects. in both versions, site is modeled via a random intercept offset, but one version only models linear effects of age on cortical thickness (hierarchical bayesian linear model, hblm), whereas the other version also includes a gaussian process term in order to allow potential non-linear relationships between age and cortical thickness measures (hierarchical bayesian gaussian process model, hbgpm). the normative models are trained on a training set consisting of healthy individuals from the abide data set ( % of the data from different sites, within-site split, preserving the distribution of age and sex across training and test set) and we present results from generalization to a test set (the remaining % of the data from the same sites).

we compare the performance of our hierarchical bayesian normative models, which explicitly model site effects and are applied to cortical thickness measures derived from freesurfer [fischl et al., ], to other commonly used methods to deal with site effects. these alternative methods included: i) regressing out site via linear regression and using the residuals, removing additive site variation, ii) applying combat [fortin et al., , fortin et al., ] to harmonize additive and multiplicative site effects in the data, and iii) modified combat, hence applying combat while preserving age and sex effects in the data. cortical thickness measures cleared from site effects using these alternative methods are used as dependent variables in a normative model with age and sex as predictors but excluding site. for comparison, we also include a fourth model where we made predictions from raw data uncorrected for any site effects.

figure : performance measures. (a) distribution of pearson's correlation coefficient ρ for cortical regions, indicating the correlation between true and predicted values, training and test set. (b) srmse for cortical regions, indicating the deviation between true and predicted values of six different models for the training and the test set. (c) explained variance for cortical regions, training and test set. (d) log likelihood distribution for cortical regions, training and test set.
we report three main findings: (1) our normative hierarchical bayesian models (both the linear hblm and the non-linear hbgpm version), explicitly modeling site effects within the normative modeling framework, outperform all alternative harmonization models with respect to model fit, including correlations between true and predicted values (ρ), standardized root mean square error (srmse), explained variance scores (ev), log-likelihood (ll) and the mean standardized log loss (msll); (2) the non-linear model did not significantly improve prediction of cortical thickness based on age, sex and site compared to the linear model; (3) all methods, but in particular the harmonization methods, lead to an undesirable shrinking of the variance in the predictions.

figure : mean standardized log loss and predicted variance for cortical regions. (a) msll distribution for cortical regions, test set. (b) predicted variance vs. actual variance for average cortical thickness for each model derived from predictions of individuals.

we showed that when using neuroimaging structural data sets pooled across different sites and scanners for estimating normative models, better predictive performance can be achieved by including site as a predictor than by using a two-step approach of first harmonizing the data with respect to site and subsequently creating a normative model using these “cleared” data. this conclusion is based on results showing that the hierarchical bayesian models outperformed the harmonizing comparison models on all of the performance metrics we examined.
this includes the predictions derived from data that was cleared from site effects by a version of combat [fortin et al., , fortin et al., ] in which variation associated with age and sex was preserved, which was the best performing method across all harmonizing models. we observed a higher correlation between true and predicted values and ll values closer to zero for our hierarchical bayesian models explicitly modeling site effects with a random intercept offset, indicating better model fit. as a key feature of normative models is that they not only estimate the predictive mean, but also give an estimate of the predictive variance and variation around the mean [marquand et al., , marquand et al., ], we also included explained variance scores and the msll as performance metrics. our hblm and hbgpm models showed higher explained variance than the alternative models. in addition, the hblm and hbgpm showed a negative msll in the test set, a metric which contrasts the log loss between the true and predicted values with the loss that would be achieved using the mean and the variance of the training set [rasmussen and williams, ], thus capturing differences in variance in the data sets. this benefit in performance for the hierarchical bayesian models is in line with previous literature using a similar paradigm [kia et al., ]. [kia et al., ] showed that a hierarchical bayesian regression approach using site as a batch effect led to better performance than complete pooling, no pooling and combat. in detail, our findings match the findings of [kia et al., ] with respect to the comparison between a normative model created from hierarchical bayesian regression (hbr) and a modified combat version in a data set with the same sites in training and test set. their findings are in line with ours with respect to ρ ([kia et al., ]: hbr range: . - . , modified combat range: . - . ), smse ([kia et al., ]: hbr range: . - . , modified combat range: . - .
) and msll ([kia et al., ]: hbr range: - . - - . , modified combat range: - . - . ), except that the msll for the modified combat model was worse in our study (see figs. a, b, a). therefore, our findings replicate the findings of [kia et al., ] using an independent data set and a separate implementation, and extend that method to model non-linear functions using a gaussian process term. we anticipated that the non-linear version of the normative model, which included a gaussian process for age, would perform better than the linear version, as studies have shown that the association between age and regions of cortical thickness can be non-linear, especially for older age ranges [storsve et al., ]. however, our results showed similar performance in predicting cortical thickness based on age, sex and site for both linear and non-linear models. this might be due to the fact that the age range in our sample was restricted, ranging from - years, thus likely capturing an age range where the association between age and cortical thickness is still mostly linear [wierenga et al., ]. as a consequence, the non-linear version of the model was not able to improve the overall performance. nonetheless, since other structural brain measures, including sub-cortical volumes and cortical surface area [wierenga et al., , raznahan et al., ], have shown stronger non-linear associations with age, non-linear normative models may outperform a linear model for other types of structural brain imaging measures.
despite an overall good performance of our models, it should also be mentioned that the performance showed substantial variation between regions, as reflected in the variation in ρ values, srmse, ev, ll and msll within models. we assume that this is due to the fact that, although average cortical thickness shows a strong association with age, different cortical brain regions differ in their association with age, and the magnitude of this correlation also changes across the life span [storsve et al., ]. all models, but in particular the comparison models, have a significant shrinkage effect on the variance of the predicted values, indicating that harmonization techniques remove variance that is useful in predicting the response variable. this is most extreme for regressing out site effects and leads to poor performance across all performance metrics. we also observe that the performance of the residuals model is similar to that of the combat model without the preservation of age and sex, which is particularly reflected in the similarities of predicted variance in fig. b and in the srmse. both models suffer a loss of more than % of their original test variance. in contrast, the performance improves when variables like age and sex are preserved, as demonstrated by an increase in performance measures when using the version of combat in which variation associated with age and sex was preserved. we argue that the similarity in performance between combat and the residuals model is an indicator of the same underlying process, showing a weakness of the harmonization approach: merely regressing out site effects leads to the removal of meaningful variation correlated with the predictors of interest (in this case age and sex), especially when these predictors of interest are correlated with the site effects, which subsequently leads to worse predictions of cortical thickness based on age and sex.
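the confounding argument above can be illustrated with a small simulation. the sketch below uses ordinary least squares rather than the bayesian models of this study, and all quantities (two sites with non-overlapping age ranges, the age slope, the site offset and the noise level) are hypothetical values chosen only to make the effect visible. it contrasts the two-step approach (regress out site, then fit age on the residuals) with a joint model that contains age and site at the same time.

```python
import numpy as np


def ols(design, target):
    """Ordinary least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


rng = np.random.default_rng(1)
n = 200  # subjects per site (hypothetical)
site = np.repeat([0, 1], n)
# Site is confounded with age: site 0 scans children, site 1 scans adults.
age = np.concatenate([rng.uniform(6, 15, n), rng.uniform(25, 40, n)])
true_slope, site_offset = -0.01, 0.1
thickness = (3.0 + true_slope * age + site_offset * site
             + rng.normal(0.0, 0.05, 2 * n))

ones = np.ones(2 * n)

# Two-step approach: remove the site means first, then fit age on the residuals.
site_design = np.column_stack([ones, site])
residuals = thickness - site_design @ ols(site_design, thickness)
slope_two_step = ols(np.column_stack([ones, age]), residuals)[1]

# Joint approach: age and site are estimated in the same model.
slope_joint = ols(np.column_stack([ones, age, site]), thickness)[1]

print(f"true slope:      {true_slope:.4f}")
print(f"two-step slope:  {slope_two_step:.4f}")
print(f"joint slope:     {slope_joint:.4f}")
```

because most of the age variation lies between the two sites in this toy setting, removing the site means also removes most of the age-related signal, so the two-step slope is strongly attenuated, whereas the joint model can still recover the slope from the within-site age variation.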
this can be partially prevented by preserving important sources of variation when regressing out site effects, as shown for the modified combat model, where specified sources of variance were preserved when regressing out site effects. however, our results show two additional flaws of the harmonization approach: 1) as already pointed out by [kia et al., ], in order to specify sources of variance that should be retained, all those sources of variance have to be known, which is not always the case; 2) even with age and sex preserved, the modified combat model only retains % of the original variance. our hierarchical bayesian models, including the prediction-based approach, in contrast preserve known and unknown interactions between site and biological covariates by specifically modeling site, thus overcoming this requirement. the result is reflected in larger proportions of variance retained (see fig. b). the advantage of the hierarchical bayesian approach becomes particularly clear when considering that the scores derived from normative models are relative scores describing the deviation from a predicted normative mean. thus, the normative deviation score is not affected by the absolute value of the predicted mean, and the number of predictors in the model does not influence the normative score. previous attempts to estimate the centiles of normative models have included polynomial regression [kessler et al., ], support vector regression [erus et al., ], quantile regression [huizinga et al., , lv et al., ] and gaussian process regression [wolfers et al., b], which differ in their ability to separate sources of variance and to make individual predictions (for an overview see [marquand et al., ]). we chose a hierarchical bayesian framework for the implementation of our normative model as it has several advantages.
the distribution-based structure based on posteriors allows for the separation and integration of different sources of variance, including epistemic variation (uncertainty in the model parameters), aleatoric variation (inherent variability in the data) and prior variation, which are all considered when predicting cortical thickness based on age, sex and site. this allows both for the integration of already known information in the form of priors into the predictions, and for an adjustment of the precision of the estimate based on the uncertainty at each data point. in addition, the bayesian framework, as implemented in software packages like stan [carpenter et al., , stan development team, b], makes it possible to draw samples from the full posterior distribution at the level of individual participants, which leads to an exact estimate of all parameters instead of an approximation. in particular in comparison to quantile regression, the distributional assumption entailed in the hierarchical bayesian approach also yields more precise estimates of the underlying centiles, particularly of the outer centiles, which are usually of primary interest and where the data are sparsest. the proposed bayesian framework also offers an elegant way to integrate site effects into normative models. site effects can be modeled via a hierarchical random effect structure, in which different sites are modeled semi-independently, sharing variation via a combined prior of higher order. this approach, also known as partial pooling, allows for including site-specific variance into the prediction for a site, while at the same time constraining the amount of between-site variation to a maximum. whilst the primary aim of this study was to develop a novel method for dealing with site effects specifically within a normative modeling framework, the method can be used as a general approach to clear neuroimaging data from site, age and sex effects.
this is due to the fact that a normative score describes an individual's cortical thickness in relation to the variance explained by the predictor variables in the normative model (age, sex and site). hence, normative scores can be seen as “cleaned” cortical thickness measures that can be the basis for further analysis, for example to establish the association between cortical thickness measures and clinical or demographic information. another potential clinical use of a normative model based on healthy controls is that, once established, it can be used to derive individualized deviation scores for individuals with a psychiatric or neurological disorder. their individual deviation scores can be considered as the degree of deviation from the normative variation and be used for further analysis, for example to predict clinically useful information.

our proposed method has two potential disadvantages. the first one is related to the computational cost associated with estimating the covariance matrix within the gaussian process for the non-linear models, which in our analysis amounted to hours per model per region and could only be handled via parallel processing on a cluster. this is due to the fact that using the non-linear gaussian process term becomes very time and memory expensive with growing $n$ ($O(n^3)$). thus, in cases in which the relationship between the predictor and the outcome is estimated to be close to linear, the need for the more complex non-linear model should be carefully considered.
secondly, the between-site split and the model in its current state only allow generalization to a test set that includes individuals from the same sites as the training set, that is, where the site variation is known. however, especially in clinical settings, generalizing the model and making predictions for data from new sites is an important additional goal. although we cannot use the posterior distribution of one particular site as a prior when applying the model to a new, unknown site, the hierarchical bayesian framework still allows the posterior parameter distributions of all sites, as derived from the training data set, to be used as priors for the site parameters when applying the model to a new site. this approach has already been demonstrated successfully in [kia et al., ], where the posterior parameter distribution of site derived from the training data was used as an informative prior for the site predictor in a normative model applied to test data consisting of new (unknown) sites. the use of such so-called informed priors leads to more accurate and precise predictions than the broad, unspecific prior that would have to be used in cases where the distribution of the data is unknown [kia et al., ]. thus, despite some loss in precision, the bayesian framework can, in contrast to all other methods examined in this paper, be adapted to make predictions for new, unknown sites.

conclusion

we proposed an extended version of a normative modeling approach that is able to accommodate site effects in neuroimaging data. the method is superior to previous approaches, including regressing out site and versions of combat [fortin et al., , johnson et al., ], and facilitates the estimation of normative models based on neuroimaging data pooled across many different scan sites. a further extension of the model to generalize to new sites and its application to clinical data will be the objective of future work.
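as a rough illustration of the partial pooling and informed-prior ideas from the discussion, the sketch below uses the textbook conjugate normal-normal model rather than the full stan models of this work. the function name, the assumption of known within-site and between-site standard deviations, and the example numbers are all hypothetical. a site estimate is a precision-weighted average of the site's own mean and the mean pooled across training sites, so a new site with no data falls back entirely on the pooled prior.

```python
def shrunken_site_mean(site_mean, n_obs, pooled_mean, within_sd, between_sd):
    """Partial-pooling estimate of a site effect (conjugate normal model).

    site_mean   : sample mean observed at this site (ignored if n_obs == 0)
    n_obs       : number of observations at this site
    pooled_mean : mean of the prior learned from all training sites
    within_sd   : assumed known within-site standard deviation
    between_sd  : assumed known between-site standard deviation (prior sd)
    """
    data_precision = n_obs / within_sd ** 2
    prior_precision = 1.0 / between_sd ** 2
    weighted_sum = data_precision * site_mean + prior_precision * pooled_mean
    return weighted_sum / (data_precision + prior_precision)


# Hypothetical numbers: a new site with few scans is pulled strongly
# towards the pooled mean; a large site is dominated by its own data.
small_site = shrunken_site_mean(3.0, 4, 2.5, within_sd=0.2, between_sd=0.1)
large_site = shrunken_site_mean(3.0, 10000, 2.5, within_sd=0.2, between_sd=0.1)
```

in the hierarchical models discussed here the variances are themselves estimated and carry uncertainty, so this closed form only conveys the direction and relative strength of the shrinkage.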
online material

the supplementary material and the stan code for the hblm, hbgpm and simple bayesian linear model can be found at https://github.com/likeajumprope/bayesian_normative_models.

references

[bartlett, ] bartlett, m. s. ( ). properties of sufficiency and statistical tests. proceedings of the royal society of london. series a-mathematical and physical sciences, ( ): – . [bethlehem et al., ] bethlehem, r., seidlitz, j., romero-garcia, r., and lombardo, m. ( ). using normative age modelling to isolate subsets of individuals with autism expressing highly age-atypical cortical thickness features. biorxiv, page . [carpenter et al., ] carpenter, b., gelman, a., hoffman, m. d., lee, d., goodrich, b., betancourt, m., brubaker, m., guo, j., li, p., and riddell, a. ( ). stan: a probabilistic programming language. journal of statistical software, ( ). [chen et al., ] chen, j., liu, j., calhoun, v. d., vasquez, a. a., zwiers, m. p., gupta, c. n., franke, b., and turner, j. a. ( ). exploration of scanning effects in multi-site structural mri studies. journal of neuroscience methods, ( ): – . [craddock et al., ] craddock, c., benhajali, y., chu, c., chouinard, f., evans, a., jakab, a., khundrakpam, b. s., lewis, j. d., li, q., milham, m., et al. ( ). the neuro bureau preprocessing initiative: open sharing of preprocessed neuroimaging data and derivatives. neuroinformatics, . [desikan et al., ] desikan, r. s., ségonne, f., fischl, b., quinn, b. t., dickerson, b. c., blacker, d., buckner, r. l., dale, a. m., maguire, r. p., hyman, b. t., albert, m. s., and killiany, r. j. ( ).
an automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest. neuroimage, ( ): – . [di martino et al., ] di martino, a., yan, c. g., li, q., denio, e., castellanos, f. x., alaerts, k., anderson, j. s., assaf, m., bookheimer, s. y., dapretto, m., deen, b., delmonte, s., dinstein, i., ertl-wagner, b., fair, d. a., gallagher, l., kennedy, d. p., keown, c. l., keysers, c., lainhart, j. e., lord, c., luna, b., menon, v., minshew, n. j., monk, c. s., mueller, s., müller, r. a., nebel, m. b., nigg, j. t., o’hearn, k., pelphrey, k. a., peltier, s. j., rudie, j. d., sunaert, s., thioux, m., tyszka, j. m., uddin, l. q., verhoeven, j. s., wenderoth, n., wiggins, j. l., mostofsky, s. h., and milham, m. p. ( ). the autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. molecular psychiatry, ( ): – . [duane et al., ] duane, s., kennedy, a. d., pendleton, b. j., and roweth, d. ( ). hybrid monte carlo. physics letters b, ( ): – . [erus et al., ] erus, g., battapady, h., satterthwaite, t. d., hakonarson, h., gur, r. e., davatzikos, c., and gur, r. c. ( ). imaging patterns of brain development and their relationship to cognition. cerebral cortex, ( ): – . [feczko et al., ] feczko, e., miranda-dominguez, o., marr, m., graham, a. m., nigg, j. t., and fair, d. a. ( ). the heterogeneity problem: approaches to identify psychiatric subtypes. trends in cognitive sciences, ( ): – . [fischl et al., ] fischl, b., van der kouwe, a., destrieux, c., halgren, e., ségonne, f., salat, d. h., busa, e., seidman, l. j., goldstein, j., kennedy, d., caviness, v., makris, n., rosen, b., and dale, a. m. ( ). automatically parcellating the human cerebral cortex. cerebral cortex, ( ): – . [fortin et al., ] fortin, j. p., cullen, n., sheline, y. i., taylor, w. d., aselcioglu, i., cook, p. a., adams, p., cooper, c., fava, m., mcgrath, p. j., mcinnis, m., phillips, m. l., trivedi, m. 
h., weissman, m. m., and shinohara, r. t. ( ). harmonization of cortical thickness measurements across scanners and sites. neuroimage, (june ): – . [fortin et al., ] fortin, j. p., parker, d., tunç, b., watanabe, t., elliott, m. a., ruparel, k., roalf, d. r., satterthwaite, t. d., gur, r. c., gur, r. e., schultz, r. t., verma, r., and shinohara, r. t. ( ). harmonization of multi-site diffusion tensor imaging data. neuroimage, : – . [foulkes and blakemore, ] foulkes, l. and blakemore, s.-j. ( ). studying individual differences in human adolescent brain development. nature neuroscience, ( ): – . [fried, ] fried, e. ( ). moving forward: how depression heterogeneity hinders progress in treatment and research. expert review of neurotherapeutics, ( ): – . [gelman, ] gelman, a. ( ). data analysis using regression and multilevel/hierarchical models. cambridge university press. [gelman et al., ] gelman, a., carlin, j. b., stern, h. s., dunson, d. b., vehtari, a., and rubin, d. b. ( ). bayesian data analysis. crc press. [gronenschild et al., ] gronenschild, e. h. b. m., habets, p., jacobs, h. i. l., mengelers, r., rozendaal, n., van os, j., and marcelis, m. ( ). the effects of freesurfer version, workstation type, and macintosh operating system version on anatomical volume and cortical thickness measurements. plos one, ( ):e . [han et al., ] han, x., jovicich, j., salat, d., van der kouwe, a., quinn, b., czanner, s., busa, e., pacheco, j., albert, m., killiany, r., maguire, p., rosas, d., makris, n., dale, a., dickerson, b., and fischl, b. ( ). reliability of mri-derived measurements of human cerebral cortical thickness: the effects of field strength, scanner upgrade and manufacturer.
neuroimage, ( ): – . [hoffman and gelman, ] hoffman, m. d. and gelman, a. ( ). the no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. j. mach. learn. res., ( ): – . [huizinga et al., ] huizinga, w., poot, d., vernooij, m., roshchupkin, g., bron, e., ikram, m., rueckert, d., niessen, w., and klein, s. ( ). a spatio-temporal reference model of the aging brain. neuroimage, : – . [insel et al., ] insel, t., cuthbert, b., garvey, m., heinssen, r., pine, d. s., quinn, k., sanislow, c., and wang, p. ( ). research domain criteria (rdoc): toward a new classification framework for research on mental disorders. [insel, ] insel, t. r. ( ). the nimh research domain criteria (rdoc) project: precision medicine for psychiatry. american journal of psychiatry, ( ): – . [johnson et al., ] johnson, w. e., li, c., and rabinovic, a. ( ). adjusting batch effects in microarray expression data using empirical bayes methods. biostatistics, ( ): – . [kessler et al., ] kessler, d., angstadt, m., and sripada, c. ( ). growth charting of brain connectivity networks and the identification of attention impairment in youth. jama psychiatry, ( ): – . [kia et al., ] kia, s. m., huijsdens, h., dinga, r., wolfers, t., mennes, m., andreassen, o. a., westlye, l. t., beckmann, c. f., and marquand, a. f. ( ). hierarchical bayesian regression for multi-site normative modeling of neuroimaging data. in martel, a. l., abolmaesumi, p., stoyanov, d., mateus, d., zuluaga, m. a., zhou, s. k., racoceanu, d., and joskowicz, l., editors, medical image computing and computer assisted intervention – miccai , pages – , cham. springer international publishing. [kia and marquand, ] kia, s. m. and marquand, a. ( ). normative modeling of neuroimaging data using scalable multi-task gaussian processes. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), volume lncs, pages – . [leek et al., ] leek, j. 
t., scharpf, r. b., bravo, h. c., simcha, d., langmead, b., johnson, w. e., geman, d., baggerly, k., and irizarry, r. a. ( ). tackling the widespread and critical impact of batch effects in high-throughput data. nature reviews genetics, ( ): – . [lv et al., ] lv, j., biase, m. d., cash, r. f., cocchi, l., cropley, v., klauser, p., tian, y., bayer, j., schmaal, l., cetin-karayumak, s., rathi, y., pasternak, o., bousman, c., pantelis, c., calamante, f., and zalesky, a. ( ). individual deviations from normative models of brain structure in a large cross-sectional schizophrenia cohort. biorxiv, page . . . . [marquand et al., ] marquand, a. f., brammer, m., williams, s. c., and doyle, o. m. ( ). bayesian multi-task learning for decoding multi-subject neuroimaging data. neuroimage, : – . [marquand et al., ] marquand, a. f., kia, s. m., zabihi, m., wolfers, t., buitelaar, j. k., and beckmann, c. f. ( ). conceptualizing mental disorders as deviations from normative functioning. molecular psychiatry, ( ): – . [marquand et al., ] marquand, a. f., rezek, i., buitelaar, j., and beckmann, c. f. ( ). understanding heterogeneity in clinical cohorts using normative models: beyond case-control studies. biological psychiatry, ( ): – . [mathys et al., ] mathys, c. d., prüssmann, k., stephan, k. e., and behrens, t. ( ). hierarchical gaussian filtering construction and variational inversion of a generic bayesian model of individual learning under uncertainty. [miller et al., ] miller, k. l., alfaro-almagro, f., bangerter, n. k., thomas, d. l., yacoub, e., xu, j., bartsch, a. j., jbabdi, s., sotiropoulos, s. n., andersson, j. l., et al. ( ). multimodal population brain imaging in the uk biobank prospective epidemiological study. nature neuroscience, ( ): . [mirnezami et al., ] mirnezami, r., nicholson, j., and darzi, a. ( ). preparing for precision medicine. new england journal of medicine, ( ): – . [mueller et al., ] mueller, s. g., weiner, m. w., thal, l. j., petersen, r. c., jack, c. 
r., jagust, w., trojanowski, j. q., toga, a. w., and beckett, l. ( ). ways toward an early diagnosis in alzheimer’s disease: the alzheimer’s disease neuroimaging initiative (adni). alzheimer’s & dementia, ( ): – . [neal, ] neal, r. m. ( ). an improved acceptance procedure for the hybrid monte carlo algorithm. journal of computational physics, ( ): – . [neal et al., ] neal, r. m. et al. ( ). mcmc using hamiltonian dynamics. handbook of markov chain monte carlo, ( ): . [r core team, ] r core team ( ). r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria. [rasmussen and williams, ] rasmussen, c. e. and williams, c. k. i. ( ). gaussian processes for machine learning. mit press, cambridge. [raznahan et al., ] raznahan, a., shaw, p., lalonde, f., stockman, m., wallace, g. l., greenstein, d., clasen, l., gogtay, n., and giedd, j. n. ( ). how does your cortex grow? journal of neuroscience, ( ): – . [stan development team, a] stan development team ( a). rstan: the r interface to stan. r package version . . . [stan development team, b] stan development team ( b). stan modeling language users guide and reference manual, version . . [storsve et al., ] storsve, a. b., fjell, a. m., tamnes, c. k., westlye, l. t., overbye, k., aasland, h. w., and walhovd, k. b. ( ). differential longitudinal changes in cortical thickness, surface area and volume across the adult life span: regions of accelerating and decelerating change. journal of neuroscience, ( ): – . [thompson et al., ] thompson, p. m., jahanshad, n., ching, c. r. k., salminen, l. e., thomopoulos, s. i., bright, j., baune, b. t., bertolín, s., bralten, j., bruin, w.
b., bülow, r., chen, j., chye, y., dannlowski, u., de kovel, c. g. f., donohoe, g., eyler, l. t., faraone, s. v., favre, p., filippi, c. a., frodl, t., garijo, d., gil, y., grabe, h. j., grasby, k. l., hajek, t., han, l. k. m., hatton, s. n., hilbert, k., ho, t. c., holleran, l., homuth, g., hosten, n., houenou, j., ivanov, i., jia, t., kelly, s., klein, m., kwon, j. s., laansma, m. a., leerssen, j., lueken, u., nunes, a., neill, j. o., opel, n., piras, f., piras, f., postema, m. c., pozzi, e., shatokhina, n., soriano-mas, c., spalletta, g., sun, d., teumer, a., tilot, a. k., tozzi, l., van der merwe, c., van someren, e. j. w., van wingen, g. a., völzke, h., walton, e., wang, l., winkler, a. m., wittfeld, k., wright, m. j., yun, j.-y., zhang, g., zhang-james, y., adhikari, b. m., agartz, i., aghajani, m., aleman, a., althoff, r. r., altmann, a., andreassen, o. a., baron, d. a., bartnik-olson, b. l., marie bas-hoogendam, j., baskin-sommers, a. r., bearden, c. e., berner, l. a., boedhoe, p. s. w., brouwer, r. m., buitelaar, j. k., caeyenberghs, k., cecil, c. a. m., cohen, r. a., cole, j. h., conrod, p. j., de brito, s. a., de zwarte, s. m. c., dennis, e. l., desrivieres, s., dima, d., ehrlich, s., esopenko, c., fairchild, g., fisher, s. e., fouche, j.-p., francks, c., frangou, s., franke, b., garavan, h. p., glahn, d. c., groenewold, n. a., gurholt, t. p., gutman, b. a., hahn, t., harding, i. h., hernaus, d., hibar, d. p., hillary, f. g., hoogman, m., hulshoff pol, h. e., jalbrzikowski, m., karkashadze, g. a., klapwijk, e. t., knickmeyer, r. c., kochunov, p., koerte, i. k., kong, x.-z., liew, s.-l., lin, a. p., logue, m. w., luders, e., macciardi, f., mackey, s., mayer, a. r., mcdonald, c. r., mcmahon, a. b., medland, s. e., modinos, g., morey, r. a., mueller, s. c., mukherjee, p., namazova-baranova, l., nir, t. m., olsen, a., paschou, p., pine, d. s., pizzagalli, f., rentería, m. e., rohrer, j. d., sämann, p. g., schmaal, l., schumann, g., shiroishi, m. 
s., sisodiya, s. m., smit, d. j. a., sønderby, i. e., stein, d. j., stein, j. l., tahmasian, m., tate, d. f., turner, j. a., van den heuvel, o. a., van der wee, n. j. a., van der werf, y. d., van erp, t. g. m., van haren, n. e. m., van rooij, d., van velzen, l. s., veer, i. m., veltman, d. j., villalon-reina, j. e., walter, h., whelan, c. d., wilde, e. a., zarei, m., and zelman, v. ( ). enigma and global neuroscience: a decade of large-scale studies of the brain in health and disease across more than countries. translational psychiatry, ( ): . [volkow et al., ] volkow, n. d., koob, g. f., croyle, r. t., bianchi, d. w., gordon, j. a., koroshetz, w. j., pérez-stable, e. j., riley, w. t., bloch, m. h., conway, k., et al. ( ). the conception of the abcd study: from substance use to a broad nih collaboration. developmental cognitive neuroscience, : – . [wierenga et al., ] wierenga, l. m., langen, m., oranje, b., and durston, s. ( ). unique developmental trajectories of cortical thickness and surface area. neuroimage, : – . [wolfers et al., ] wolfers, t., beckmann, c. f., hoogman, m., buitelaar, j. k., franke, b., and marquand, a. f. ( ). individual differences v. the average patient: mapping the heterogeneity in adhd using normative models. psychological medicine, pages – . [wolfers et al., ] wolfers, t., beckmann, c. f., hoogman, m., buitelaar, j. k., franke, b., and marquand, a. f. ( ). individual differences v. the average patient: mapping the heterogeneity in adhd using normative models. psychological medicine, ( ): – . [wolfers et al., a] wolfers, t., doan, n. t., kaufmann, t., alnæs, d., moberget, t., agartz, i., buitelaar, j. k., ueland, t., melle, i., franke, b., andreassen, o. a., beckmann, c. f., westlye, l. t., and marquand, a. f. ( a). mapping the heterogeneous phenotype of schizophrenia and bipolar disorder using normative models. jama psychiatry, ( ): . [wolfers et al., b] wolfers, t., doan, n. t., kaufmann, t., alnæs, d., moberget, t., agartz, i., buitelaar, j. k., ueland, t., melle, i., franke, b., et al. ( b). mapping the heterogeneous phenotype of schizophrenia and bipolar disorder using normative models. jama psychiatry, ( ): – . [zabihi et al., ] zabihi, m., oldehinkel, m., wolfers, t., frouin, v., goyard, d., loth, e., charman, t., tillmann, j., banaschewski, t., dumas, g., et al. ( ). dissecting the heterogeneous cortical anatomy of autism spectrum disorder using normative models. biological psychiatry: cognitive neuroscience and neuroimaging, ( ): – .
sequence neighborhoods enable reliable prediction of pathogenic mutations in cancer genomes

shayantan banerjee , , , karthik raman , , * & balaraman ravindran , , *

robert bosch centre for data science and artificial intelligence (rbcdsai), indian institute of technology (iit) madras, chennai - initiative for biological systems engineering, iit madras, chennai - bhupat and jyoti mehta school of biosciences, department of biotechnology, iit madras, chennai - department of computer science and engineering, iit madras, chennai - *corresponding author

abstract

identifying cancer-causing mutations from sequenced cancer genomes holds much promise for targeted therapy and precision medicine. “driver” mutations are primarily responsible for cancer progression, while “passengers” are functionally neutral. although several computational approaches have been developed for distinguishing between driver and passenger mutations, very few have concentrated on utilizing the raw nucleotide sequences surrounding a particular mutation as potential features for building predictive models. using experimentally validated cancer mutation data in this study, we explored various string-based feature representation techniques to incorporate information on the neighborhood bases immediately 5ʼ and 3ʼ from each mutated position.
density estimation methods showed significant distributional differences between the neighborhood bases surrounding driver and passenger mutations. binary classification models derived using repeated cross-validation experiments gave comparable performances across all window sizes. integrating sequence features derived from raw nucleotide sequences with other genomic, structural and evolutionary features resulted in the development of a pan-cancer mutation effect prediction tool, nbdriver, which was highly efficient in identifying pathogenic variants from five independent validation datasets. an ensemble predictor obtained by combining the predictions from nbdriver with two other commonly used driver prediction tools (condel and mutation taster) outperformed existing pan-cancer models in prioritizing a literature-curated list of driver and passenger mutations. using the list of true positive mutation predictions derived from nbdriver, we identified a list of known driver genes with functional evidence from various sources. overall, our study underscores the efficacy of utilizing raw nucleotide sequences as features to distinguish between driver and passenger mutations from sequenced cancer genomes.

the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nc . international license (http://creativecommons.org/licenses/by-nc/ . /). this version posted february , . ; biorxiv preprint doi: https://doi.org/ . / . . .

introduction

cancer is caused by the accumulation of somatic mutations during an individualʼs lifetime [ ]. these mutations arise from both endogenous factors, such as errors during dna replication, and exogenous factors, such as substantial exposure to mutagens like tobacco smoke, uv light, and radon gas [ ]–[ ].
these somatic mutations can be of different types, ranging from single-nucleotide variants (snvs), to insertions and deletions of a few nucleotides, copy-number aberrations (cnas), and large-scale rearrangements known as structural variants (svs) [ ]. with the advent of high-throughput sequencing, the identification of somatic mutations from sequenced cancer genomes has become easier. international cancer genomics projects have resulted in the development of large mutational databases such as the catalogue of somatic mutations in cancer (cosmic) [ ], the international cancer genome consortium (icgc) [ ], and the cancer genome atlas (tcga) [ ]. several open-access resources to analyze and visualize large cancer genomics datasets, such as the cbio cancer genomics portal [ ] and the database of curated mutations in cancer (docm) [ ], have also been developed. these resources aggregate functionally relevant cancer variants from different studies and help researchers gain easy access to expert-curated lists of pathogenic somatic variants. however, not all somatic mutations present in the cancer genome are equally responsible for developing the disease. a small fraction of somatic variants known as “driver mutations” provide a growth advantage and are positively selected for, during cancer cell development [ ]. on the other hand, “passenger mutations” provide no growth advantage and do not contribute to cancer progression [ ]. identifying the complete set of cancer-causing genes that harbor driver mutations, also known as driver genes, holds much promise for precision medicine, where a specific therapeutic intervention is tailored towards a patientʼs mutational profile [ ]. distinguishing between driver and passenger mutations from sequenced cancer genomes is a non-trivial task. doing so solely based on the substitution type (a->t, g->c, etc.) is very difficult. 
hence, several computational methods that draw on other factors to identify driver mutations have been developed over the years. recurrence-based driver prioritization tools such as mutsigcv [ ] and music [ ] for single-nucleotide variants, and gistic [ ] for copy number aberrations, have been developed to identify variants that occur more often than expected by chance, a baseline known as the “background mutation rate”. other methods such as sift [ ], provean [ ], polyphen- [ ], chasm [ ], and fathmm [ ] are based on predicting the functional impact of mutations on the protein encoded by the gene. expert-curated databases such as the oncokb database [ ] contain information regarding the functional impact of over cancer-causing alterations belonging to over genes. pathway analysis based tools such as netbox [ ] and hotnet [ ] work by identifying mutations affecting large-scale gene regulatory or protein–protein interaction networks. machine learning-based methods have also been recently developed to predict deleterious missense mutations [ ]–[ ].

genome instability, demonstrated by a higher than average rate of substitution, insertion, and deletion of one or more nucleotides, is a hallmark of most cancer cells. there is considerable variation in the rates of snps across the human genome. sequence context plays a significant role in the variability of the substitution rate, as illustrated by the cpg dinucleotides, which exhibit an elevated c->t substitution rate by almost folds relative to the average rate observed in mammals [ ]. mutational hotspots such as the cpg dinucleotides in breast and colorectal cancer [ ] and tpc dinucleotides in lung cancer, melanoma, and ovarian cancer [ ] are some examples of “signatures” that promote mutagenesis. there have been several efforts to utilize the sequence context to measure the human genomeʼs substitution rates. aggarwala et al. [ ] used the local sequence context of snps to explain the observed variability in substitution rates. zhao et al. [ ] studied the neighboring nucleotide biases and their effect on the mutational and evolutionary processes for over two million snps. recent studies have identified specific signatures or patterns of mutations in different cancer types that shed light on the underlying mechanisms responsible for cancer progression [ ], [ ]. alexandrov et al. [ ] identified distinct mutational signatures in human cancers by considering the substitution class and the sequence context immediately 5ʼ and 3ʼ of the mutated base.
several studies have demonstrated that certain factors such as tobacco smoking, uv light, or the inactivation of tumor suppressor genes involved in dna repair can result in the development of mutational hotspots [ ], [ ], [ ]. to the best of our knowledge, two recently published studies have used nucleotide context to separate drivers from passengers. dietlein et al. [ ] hypothesized that driver mutations occur more frequently in “unusual” nucleotide positions than passenger mutations and built probabilistic models to identify driver genes that had mutations in those “unusual” contexts. agajanian et al. [ ] integrated classical machine learning and deep learning approaches to model raw nucleotide sequences to differentiate between driver and passenger mutations.

in this study, our overall aim is to build models utilizing machine learning and natural language processing techniques to differentiate between driver and passenger mutations solely based on the raw nucleotide context. using missense mutation data with experimentally validated functional impacts compiled from various studies, we show that the underlying probability distributions of driver and passenger mutationsʼ neighborhoods are significantly different from one another. we extracted features from the neighborhood nucleotide sequences and built robust binary classification models to distinguish between the two classes of mutations. we achieved good classification performances during our repeated cross-validation experiments and against an independent hold-out set of literature-curated mutations. integrating neighborhood features with other features such as protein physicochemical properties and evolutionary conservation scores significantly improved our algorithmʼs overall predictive power in identifying pathogenic variants from five separate independent test sets, and gave performance comparable to some of the existing state-of-the-art mutation effect prediction tools. overall, this study establishes that we can leverage efficient feature representation of the neighborhood sequences of cancer-causing mutations to differentiate between known driver and passenger mutations with sufficient discriminative power.

methods

mutation datasets for building and evaluating the models

our training data consisted of the list of missense mutations whose effects were determined from experimental assays and were compiled in the study conducted by brown et al. [ ]. in this study, missense mutations from genes that were pan-cancer-based were combined from five different datasets [ ], [ ]–[ ] (supplementary table ). these mutations were presented as amino acid substitutions based on their protein coordinates (e.g., f l, l q, etc.). since we were interested in studying the effects of neighboring dna nucleotide sequences, we mapped them to their corresponding genomic coordinates (gdna) for further analysis. we used the publicly available transvar web-interface [ ] for this purpose. the final training set was made up of single nucleotide variants ( passengers and drivers).

for external validation, we collected somatic mutation data from five different sources.
first, we considered a literature-curated list of passenger and driver mutations, categorized based on functional evidence, published by martelotto et al. [ ] as part of a benchmarking study to rank various mutation effect prediction algorithms.

second, we used a subset of mutations from the recently released cancer mutation census (cmc) [ ], a database that integrates all coding somatic mutation data from the cosmic database to prioritize variants driving different forms of cancer. it contains functional evidence obtained using both manual curation and computational predictions from multiple sources. for our validation experiments, we chose only single nucleotide variants classified as missense and derived from the cgc-classified list of tumor suppressor genes and oncogenes. based on the database's various evidence criteria, we considered only mutations categorized as tier , , and for our study. from this list, we removed all mutations that overlapped with our training set and derived a final set of mutations for further analysis.

third, the catalog of validated oncogenic mutations from the cancer genome interpreter [ ] database contains a high-confidence list of pathogenic alterations compiled from several sources, such as the docm [ ], clinvar [ ], oncokb [ ], and the cancer biomarkers database [ ]. we extracted only missense somatic mutations flagged as “cancer” for our validation experiments. after removing all mutations that overlapped with our training set, we obtained a final list of driver mutations. this constituted our third validation set.

the fourth validation dataset consisted of the list of top hotspot mutations reported in the comprehensive study by rheinbay et al. [ ]. in that study, mutation data was accumulated from the pan-cancer analysis of whole genomes (pcawg) consortium and involved analyzing more than cancer genomes derived from more than patients. a total of coding missense mutations from five well-known cancer genes (tp , pik ca, nras, kras, idh ) were extracted from this study.

finally, mao et al. [ ] published mutation datasets to judge the performance of their driver prediction tool (candra) in predicting rare driver mutations. they were constructed using the following criteria:

1. gbm and ovc mutations reported in the cosmic database only once.
2. the reported mutations had no other mutations within bp of their position and were not part of either the training or test datasets used to build the machine learning model (candra).

we used the same datasets to judge our model's ability to predict rare driver mutations based solely on the neighborhood sequences. after removing all mutations that overlapped with the training set, we obtained gbm mutations and ovc mutations.

a summary of all the mutational datasets used in our study is available in table . all our predictions were derived using the forward strand and were based on the grch (ensembl release ) build of the human genome.
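the overlap removal applied to each validation set above can be sketched as an anti-join on a mutation key. this is only an illustration under stated assumptions: the column names (chromosome, position, substitution), the helper name remove_overlap, and the toy rows are hypothetical and are not taken from the study.

```python
import pandas as pd


def remove_overlap(validation, training,
                   key=("chromosome", "position", "substitution")):
    """Return validation rows whose mutation key is absent from the training set.

    The key columns are an assumption made for this sketch.
    """
    key = list(key)
    train_keys = training[key].drop_duplicates()
    # A left merge with indicator=True marks rows found only in `validation`.
    merged = validation.merge(train_keys, on=key, how="left", indicator=True)
    keep = merged["_merge"] == "left_only"
    return merged.loc[keep, validation.columns].reset_index(drop=True)


# Toy data, invented for illustration only.
training = pd.DataFrame({
    "chromosome": ["1", "7", "12"],
    "position": [100, 200, 300],
    "substitution": ["a>t", "g>a", "c>t"],
    "label": ["driver", "passenger", "driver"],
})
validation = pd.DataFrame({
    "chromosome": ["7", "7", "17"],
    "position": [200, 200, 400],
    "substitution": ["g>a", "g>c", "c>t"],
})

filtered = remove_overlap(validation, training)
print(filtered)
```

matching on the full key, rather than on position alone, keeps a validation mutation that shares a site with a training mutation but carries a different substitution.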
feature extraction

sequence-based features

we used the raw nucleotide sequences surrounding a mutation as features for our analysis. each unique mutation was represented as a triplet (chromosome, position, type), where “type” refers to one of the types of point substitution (a>t, a>g, a>c, t>a, t>g, t>c, g>a, g>c, g>t, c>t, c>a, c>g). we then extracted the surrounding raw nucleotide sequences from the reference genome for a given mutation position using the bedtools getfasta command. the “window size” for a particular mutation is the number of nucleotides considered upstream and downstream of the mutated position. hence, considering all possible window sizes between and , and including the wild-type nucleotide at the mutated position, we obtained nucleotide strings of length , , , , , , , , , and , respectively. we also considered the chromosome number and the type of point substitution as features for our analysis. for a particular window size, we mapped the nucleotide strings to a numerical format using the following two widely used feature transformation approaches (figure ):

1. one-hot encoding: each neighboring nucleotide was represented as a binary vector of size 4 containing all zero values except at the nucleotide index, which was marked as 1. thus “a” was encoded as [ , , , ], “g” as [ , , , ] and so on. this feature representation resulted in a feature space of size , where n = , , ... represents the window sizes. we used the pandas get_dummies() function to perform this task.

2. overlapping k-mers: in this type of feature representation, the neighboring nucleotide sequence for a given window size was represented as overlapping k-mers of lengths 2, 3 and 4. for instance, an arbitrary sequence of window size , {atttgga}, where the central 't' is the wild-type base at the mutated position, can be decomposed into the overlapping 2-mers {at, tt, tt, tg, gg, ga}, 3-mers {att, ttt, ttg, tgg, gga} and 4-mers {attt, tttg, ttgg, tgga}, respectively. to map these overlapping k-mers to a numerical format, we applied two commonly used encoding techniques, CountVectorizer and TfidfVectorizer. the CountVectorizer returns a vector encoding whose length is equal to that of the vocabulary (the total number of unique k-mers in the dataset) and contains an integer count of the number of times a given k-mer has appeared in our dataset. a term frequency–inverse document frequency (tf-idf) vectorizer assigns a score to each k-mer based on i) how often the k-mer appears in the dataset and ii) how much information the k-mer provides, i.e., whether it is common or rare in our dataset. mathematically, for a given term i present in a document j, the tf-idf score is given by

$$\mathrm{tfidf}_{i,j} = \mathrm{freq}_{i,j} \times \log\frac{N}{d_i}$$

where $\mathrm{freq}_{i,j}$ is the number of occurrences of i in j, $d_i$ is the number of documents containing i, and $N$ is the total number of documents. these techniques were implemented in python using the feature_extraction module from scikit-learn.

the final processed training set used to build the machine learning models was represented as a matrix of size m × n, where m is the total number of coding point mutations and n is the size of the vocabulary. the matrix entries were the tf-idf or the CountVectorizer scores. the number of one-hot encoded features, the number of k-mers, and the size of the vocabulary possible for each window size are shown in table .
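the k-mer representation described above can be sketched as follows. this is a minimal illustration, not the study's code: the helper names overlapping_kmers and to_document and the three toy sequences are hypothetical, and the custom token_pattern is an assumption made so that every whitespace-separated k-mer counts as a token. note also that scikit-learn's TfidfVectorizer applies a smoothed idf and l2 normalization by default, so its scores differ numerically from the plain formula given in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def overlapping_kmers(sequence, k):
    """Split a nucleotide string into its overlapping k-mers, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def to_document(sequence, k):
    """Join the k-mers with spaces so a text vectorizer can tokenize them."""
    return " ".join(overlapping_kmers(sequence, k))


# Toy neighborhood strings (invented); the middle base is the mutated position.
sequences = ["atttgga", "ccgtaca", "atttcca"]
docs = [to_document(s, 3) for s in sequences]

# token_pattern=r"\S+" treats every whitespace-separated k-mer as one token.
count_vec = CountVectorizer(token_pattern=r"\S+")
counts = count_vec.fit_transform(docs)      # shape: (mutations, vocabulary)

tfidf_vec = TfidfVectorizer(token_pattern=r"\S+")
tfidf = tfidf_vec.fit_transform(docs)       # same shape, tf-idf weighted

print(overlapping_kmers("atttgga", 2))
print(sorted(count_vec.vocabulary_))
print(counts.shape, tfidf.shape)
```

each row of the resulting sparse matrix corresponds to one mutation and each column to one k-mer of the learned vocabulary, which matches the m × n layout described above.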
descriptive genomic features

in addition to the neighborhood features, a set of features (supplementary table ) previously used to train the cancer-specific missense mutation annotation tool candra [ ] were extracted from the following three data portals: chasm's snvbox [ ], mutation assessor [ ] and annovar [ ]. among them were conservation scores (such as 'gerp' scores, 'hmmphc' scores and others), amino acid substitution features (such as 'predrsae', 'predbfactors', and others), exon features (such as 'exonsnpdensity', 'exonconservation' and others), features indicative of protein domain knowledge (such as 'uniprotdom_postmodenz', 'uniprotregions' and others) and functional impact scores computed by algorithms such as vest [ ] and chasm [ ]. a tiny fraction ( . %) of the uniprotkb annotations were not available from the snvbox database for our training data. we used a k-nearest neighbors-based imputation technique to substitute the missing features with those of the same gene's nearest mutations. our external validation datasets were free from any missing information.

density estimation

a kernel density estimator (or kde) takes an n-dimensional dataset as an input and outputs an estimate of the underlying n-dimensional probability distribution.
a gaussian kde uses a mixture of n-dimensional gaussian probability distributions to represent the density being estimated. it centers one gaussian component on each data point, resulting in a non-parametric estimate of the density. one of the hyperparameters of a kernel density estimator is the bandwidth, which controls the kernel's size at each data point and thereby affects the “smoothness” of the resulting curve.

we estimated the underlying probability distributions for the driver and passenger neighborhoods using a gaussian kernel density estimator. the schematic workflow for a single run of the kernel density estimation experiment is shown in figure (a-f). first, for a single run of the kernel density estimation algorithm and a particular window size, we randomly selected, with replacement, an equal number (n) of driver and passenger mutations from our training data (figure a). then, we tuned the bandwidth hyperparameter for each class of mutations using a -fold cross-validation approach and used the best parameters to derive the kernel density estimates (figure b). finally, we used the jensen-shannon (js) distance metric to quantify the dissimilarity between the two class-wise density estimates (figure c). the js distance between two probability distributions is based on the kullback-leibler (kl) divergence, but unlike the kl divergence, it is bounded and symmetric. for two probability vectors, p and q, it is given by

$$\mathrm{JS}(p, q) = \sqrt{\frac{D(p \,\|\, m) + D(q \,\|\, m)}{2}}$$

where $m = \frac{1}{2}(p + q)$ and $D$ is the kl divergence.

the significance of the estimated distances between the probability estimates was calculated using a randomized bootstrapping approach. specifically, we randomly sampled with replacement twice the number (2n) of mutations from the same training set, irrespective of the labels. we then split the dataset in half, randomly assigning each half to driver and passenger mutations, respectively (figure d).
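the class-wise density estimation and the js distance between the two estimates (steps a to c above) can be sketched as follows. this is a hedged illustration on synthetic data: the helper names, the bandwidth grid, the 5-fold setting, and the choice to compare the two densities on the pooled sample points are all assumptions of this sketch, since the text does not state how the density estimates were discretized before computing the distance.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import logsumexp
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def fit_kde(X, bandwidths, cv=5):
    """Fit a gaussian KDE, picking the bandwidth by cross-validated log-likelihood."""
    search = GridSearchCV(KernelDensity(kernel="gaussian"),
                          {"bandwidth": bandwidths}, cv=cv)
    search.fit(X)
    return search.best_estimator_


def js_distance(kde_p, kde_q, points):
    """JS distance between two KDEs evaluated on a shared set of points.

    Evaluating on shared points is an assumption of this sketch.
    """
    log_p = kde_p.score_samples(points)
    log_q = kde_q.score_samples(points)
    # Normalize the log-densities into probability vectors over `points`.
    p = np.exp(log_p - logsumexp(log_p))
    q = np.exp(log_q - logsumexp(log_q))
    return float(jensenshannon(p, q))


# Synthetic stand-ins for the driver and passenger feature matrices.
rng = np.random.default_rng(0)
drivers = rng.normal(0.0, 1.0, size=(60, 2))
passengers = rng.normal(1.5, 1.0, size=(60, 2))

bandwidths = np.logspace(-1, 1, 10)          # hypothetical search grid
kde_driver = fit_kde(drivers, bandwidths)
kde_passenger = fit_kde(passengers, bandwidths)

points = np.vstack([drivers, passengers])
observed = js_distance(kde_driver, kde_passenger, points)
print(f"js distance between class-wise densities: {observed:.3f}")
```

with scipy's default natural logarithm, jensenshannon() is bounded above by sqrt(ln 2); passing base=2 rescales the distance to the unit interval.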
the random split was followed by the same process of tuning the hyperparameters and deriving the class-wise density estimates (figure e). finally, we reported the js distance between the density estimates (figure f). we experimented with the following seven neighborhood-based feature representations:

● one-hot encoding
● count vectorizer (k-mer sizes of , and )
● tf-idf vectorizer (k-mer sizes of , and )

the aforementioned kde experiments were repeated times for all possible window sizes between and and all seven feature representations. next, the best median js distance estimate from the original experiments was reported for each window size. the percentage of runs of the randomized experiments for which the estimated distance was greater than the original estimate was reported as the p-value. the KernelDensity() function from the scikit-learn neighbors module was used to derive the density estimates, and jensenshannon() from the scipy spatial.distance submodule was used to calculate the distance metric.

classification models

to build our binary classification models, we implemented three classifiers: the random forest classifier, the extra trees classifier (extreme random forests), and the generative kde classifier. the overall approach for the kde-based classification was as follows (figure a):

1. the dataset was split using the cross-validation strategy.
2. the training data was then split by label (driver/passenger).
3. for each class, we fit a generative model using the kernel density estimation method described in the previous section. this gave us the likelihoods p(x|driver) and p(x|passenger), respectively, for a particular data point x.
4. next, the class priors p(driver) and p(passenger), given by the number of examples of each class, were calculated.
5. for a test data point x, the posterior probabilities were given by p(driver|x) ∝ p(x|driver) p(driver) and p(passenger|x) ∝ p(x|passenger) p(passenger). the label that maximized the posterior probability was the one assigned to x.

in contrast, both tree-based classifiers are discriminative. they are composed of a large collection of decision trees, and the final output is derived by combining the individual trees' predictions through a majority voting scheme. the main difference between the two tree-based classifiers lies in how the cut points used to split the individual nodes are selected. random forest chooses an optimal split for each feature under consideration, whereas extra trees chooses it randomly. all the classification models were written using the predefined functions available in the scikit-learn (v. . ) [ ] module.

model selection and tuning

repeated cross-validation experiments

owing to the relatively small sample size ( mutations) of the training set, we adopted a repeated 10-fold cross-validation approach to build our model. first, we split the dataset into ten equal subsets in a stratified fashion, which maintains the same proportion of mutations in each class as observed in the original data. nine of the ten subsets were combined into one training set (figure a). in each training phase, we performed feature selection using the extra trees classifier, tuned the parameters with a cross-validated grid search, trained the classifiers using the best parameters, and obtained the corresponding prediction scores on the hold-out test set (figure b). for a given window size, we experimented with a total of seven feature representations (one-hot encoding, count vectorizer (k-mer size= , and ), and tf-idf vectorizer (k-mer size= , and )) and three binary classifiers (random forests, extra trees, and kernel density estimation). so overall, we had 21 distinct feature-classifier pairs (seven representations × three classifiers). we ran the 10-fold cross-validation experiments (figure (a-b)) three times for each such pair, thereby obtaining 30 values for each classification metric: sensitivity, specificity, auc, and mcc. the best overall median value, the % ci for each of the above metrics, and the corresponding feature-classifier pair were reported. to study the variation in classification performance with the addition of more nucleotides (i.e., an increase in window size), we repeated the wilcoxon signed-rank test on the generated performance metrics for all pairs of window sizes (x, y), where x < y and x, y ∈ [ , , ..., ]. the ci() function from the gmodels package [ ] in r was used to calculate the % cis for the various classification metrics.

derivation of the binary classification model to distinguish between driver and passenger mutations

to derive the final machine learning model, all mutations shared between the training set (brown et al.) and the validation set (martelotto et al.) were discarded, and the classifier was retrained on the reduced training set ( mutations: drivers and passengers). the set of mutations published by martelotto et al. [ ] formed our independent test set.
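the generative kde classifier and the repeated stratified 10-fold cross-validation described in the preceding sections can be sketched together as follows. this is a minimal sketch under stated assumptions, not the authors' implementation: the class name KDEClassifier, the fixed bandwidth, and the synthetic data are hypothetical, and feature selection and grid search are omitted for brevity.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KernelDensity


class KDEClassifier:
    """Generative classifier: one gaussian KDE per class plus class priors."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One density model per class gives log p(x | class).
        self.models_ = [
            KernelDensity(kernel="gaussian", bandwidth=self.bandwidth).fit(X[y == c])
            for c in self.classes_
        ]
        # Class priors are estimated from the class frequencies.
        self.log_priors_ = np.array([np.log(np.mean(y == c)) for c in self.classes_])
        return self

    def log_posterior(self, X):
        """Unnormalized log posterior: log p(x | class) + log p(class)."""
        log_lik = np.column_stack([m.score_samples(X) for m in self.models_])
        return log_lik + self.log_priors_

    def predict(self, X):
        return self.classes_[np.argmax(self.log_posterior(X), axis=1)]


# Synthetic two-class data (1 = driver, 0 = passenger), invented for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 3)), rng.normal(2.0, 1.0, (60, 3))])
y = np.array([1] * 40 + [0] * 60)

# Ten stratified folds repeated three times, as in the text.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    clf = KDEClassifier(bandwidth=1.0).fit(X[train_idx], y[train_idx])
    scores.append(matthews_corrcoef(y[test_idx], clf.predict(X[test_idx])))

median_mcc = float(np.median(scores))
print(f"{len(scores)} mcc values, median = {median_mcc:.3f}")
```

working in log space avoids numerical underflow when the class-conditional densities become very small in high-dimensional feature spaces.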
due to the inherent imbalance in the dataset, we implemented an undersampling technique known as repeated edited nearest neighbors [ ] to downsize the majority class and obtain a balanced dataset for subsequent training. predictions were obtained using two separate feature sets: 1) only neighborhood features based on the raw nucleotide sequences (the neighborhood-only model) and 2) neighborhood features plus the descriptive genomic features (nbdriver). in addition to random forests, extra trees, and the kde classifier, we also experimented with a fourth classifier, a linear kernel svm, to obtain these predictions. various combinations of these classifiers were implemented as ensemble models using the VotingClassifier() of the ensemble module in scikit-learn.

feature selection

we adopted an impurity-based feature selection technique, using the extra trees classifier to derive a ranked list of the top predictive features for our analysis. for the repeated cross-validation experiments, the features within the top percentile of the most important features were selected and subsequently used to train our models. however, for deriving nbdriver, we built several classification models based on the top n (n= , , , , ) features and chose the one that gave the best overall classification performance.

the tf-idf and CountVectorizer scores, used as features for our analysis, were implemented using the feature_extraction module in scikit-learn.
in both cases, a new vocabulary dictionary of all the k-mers was first learnt from the training data using the fit_transform() routine, which also returned the corresponding term-document matrix. using this vocabulary, the scores of the k-mers from the test data were obtained using the transform() routine and were subsequently used in our analysis.

hyperparameter tuning and classifier threshold selection

hyperparameter tuning was done using a cross-validation-based grid search over a parameter grid. the GridSearchCV() function from the model_selection module in scikit-learn was used for this purpose. to further fine-tune the classifiers, we experimented with various classification thresholds from to with a step size of . and chose the one that gave the best auroc. for an imbalanced classification problem, using the default threshold of . is not a viable option and often results in incorrect predictions for the minority class examples.

performance metrics

for the repeated cross-validation experiments, we assessed our classifiers' performance using four commonly used performance metrics: sensitivity, specificity, matthews correlation coefficient (mcc), and area under the roc curve (auroc). the matthews correlation coefficient is a balanced metric and is very useful in imbalanced classification problems. it is bounded between −1 and +1, with −1 representing perfect misclassification, 0 representing average classification, and +1 representing ideal classification. it is given by the following expression:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where tp stands for true positives, tn for true negatives, fp for false positives and fn for false negatives. mcc is a more robust alternative to accuracy and f-score, which can sometimes show overoptimistic classification performance for imbalanced data and were therefore not used for this analysis. after deriving the binary classifier, we used additional classification performance metrics outlined by martelotto et al. to compare our algorithm's performance with other state-of-the-art mutation effect prediction tools. they were positive predictive value (ppv), negative predictive value (npv), and a composite score, defined as the sum of sensitivity, specificity, ppv, and npv.

comparison with other pan-cancer mutation effect predictors

similar to the benchmarking study conducted by martelotto et al., we compared the generated binary classifiers with nine pan-cancer mutation effect prediction tools: mutation taster [ ], fathmm (cancer) [ ], condel [ ], fathmm (missense) [ ], provean (v . . ) [ ], sift (ensembl ) [ ], polyphen [ ], mutation assessor [ ] and vest [ ], using the set of literature-curated mutations. for each of these predictors, we used the prediction labels based on predefined score cutoffs published as part of the martelotto et al. [ ] study. two new prediction algorithms (chasmplus (pan-cancer) [ ] and candra+ (cancer-in general) [ ]) were also added to the list, and the score cutoffs were decided in the following manner. for chasmplus, which has no default threshold, we tested all possible thresholds between and with step sizes of . and chose the threshold with the highest composite score. all mutations with predicted scores greater than this optimal threshold were labeled as drivers, and the rest as passengers. for candra+, we used the default prediction categories [ ]. predictions for chasmplus and candra+ were obtained from the opencravat web server [ ] and the executable packages published by mao et al. [ ], respectively. to obtain better predictive power, different mutation effect predictors were combined into ensemble models using the majority voting rule.

when comparing two algorithms, we adopted the same strategy as martelotto et al. to derive the significance of the difference between any two classification metrics. briefly, we derived the % ci for each of these classification metrics by repeated sampling with replacement with iterations. if the generated cis touched or did not overlap, the difference was considered significant (p < ) based on the results of the analysis done by ng et al. [ ].

results

we report a pan-cancer machine learning tool, nbdriver (neighborhood driver), which uses neighborhood sequences as features to classify missense mutations as either drivers or passengers. our key results are three-fold. first, we use generative models to derive the distances between the underlying probability estimates of the two classes of mutations. then, we build robust classification models using repeated cross-validation experiments and report the median values of the metrics used to estimate classification performance. finally, we demonstrate our models' ability to predict unseen coding mutations from independent test datasets derived from large mutational databases.

neighborhood sequences of driver and passenger mutations show markedly different distributions

we estimated the driver and passenger neighborhood sequences' underlying probability distributions using kernel density estimation.
we computed the jensen–shannon (js) distance metric to understand how “distinguishable” they are from one another. the js metric is bounded between (maximally similar) and (maximally dissimilar). table shows the results of the kde experiments for various window sizes. we observed that, for the brown et al. dataset [ ], the maximum significant (p < ) median js distance between passenger and driver neighborhood distributions, calculated across runs of bootstrapping experiments, was . (for a window size of ), and the minimum was . (for window sizes - ). figure shows the variation in the js distances between the original and the randomized kde experiments for window sizes between and . as evident from figure , all window sizes except had a significant js distance value (p < ). out of the seven feature representations, we reported the one that gave the maximum median js distance for each window size. from table , we observed that a tf-idf vectorizer with k-mer sizes , and was the preferred form of feature representation for six window sizes ( , , , , and ), whereas a count vectorizer with k-mer sizes and was chosen for three window sizes ( , , and ).
the only exception was a window size of , where the one-hot encoding-based feature representation gave the maximum median js distance. these results indicated that the tf-idf based feature representation was the most efficient at delineating the differences in the distributions between the driver and passenger neighborhoods.

repeated cross-validation using neighborhood features generates robust classification models

the results of the repeated cross-validation experiments using only the neighborhood sequences as features are shown in supplementary table a. from these results, we observed that the best median sensitivity of . ( % ci . - . ) was obtained using features derived from a count vectorizer and subsequent training using a random forest classifier for window sizes , , and . however, the best median specificity of . ( % ci . - . ), auc of . ( % ci . - . ), and mcc of . ( % ci . - . ) were obtained using a tf-idf based feature representation trained using a kde classifier for a window size of . the variation in classification performance across window sizes during the repeated cross-validation experiments using the initial training set of mutations is shown in figure . this figure shows that, except for window sizes and , a tf-idf vectorizer gave the maximum median auc, specificity, and mcc. however, for all window sizes, the maximum median sensitivities were obtained using the count vectorizer based feature representation. classification metrics such as auc and mcc are used to measure the quality of binary classifications. similar to our observations from the kde experiments (table ), the tf-idf vectorizer performed consistently well in terms of both the overall auc and mcc, indicating that this feature representation was the most efficient at separating the two classes of mutations.
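the threshold-dependent metrics reported above follow directly from the confusion matrix. the sketch below computes sensitivity, specificity, ppv, npv, the composite score (their sum, as defined in the methods) and mcc, and cross-checks the hand-computed mcc against scikit-learn. the function name binary_metrics and the toy labels are hypothetical; drivers are assumed to be coded as 1 and passengers as 0.

```python
import math

from sklearn.metrics import confusion_matrix, matthews_corrcoef


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels coded 1 (driver) and 0 (passenger)."""
    tn, fp, fn, tp = (int(v) for v in
                      confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel())
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "composite": sensitivity + specificity + ppv + npv,
        "mcc": mcc,
    }


# Toy labels, invented for illustration: tp=3, fn=1, tn=4, fp=2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# The hand-computed mcc should agree with scikit-learn's implementation.
assert abs(metrics["mcc"] - matthews_corrcoef(y_true, y_pred)) < 1e-9
```

this simple version does not guard against empty rows or columns of the confusion matrix (for example, no predicted drivers), which would cause a division by zero in ppv.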
the variation in classification performance with increasing window size is shown in supplementary table b. from this table, we observed that out of the unique pairs of window sizes (methods: repeated cross-validation experiments), had a significant (p < ; wilcoxon signed-rank test) increase in specificity and auc, while had a significant (p < ; wilcoxon signed-rank test) increase in mcc with the addition of more nucleotides. however, for sensitivity, a significant increase was observed only when the window size was increased from to and from to , respectively. these results indicated that adding more nucleotides to a particular window does not always guarantee an increase in the classifier's performance in distinguishing between driver and passenger mutations.

classification models give performances comparable with other state-of-the-art mutation effect predictors

using only the neighborhood nucleotide sequences as features, the best results (table a) on the independent test set [ ] were obtained using an extra trees classifier. this neighborhood-only model was trained on features extracted using the count vectorizer technique on a window size of . we trained nbdriver by combining the neighborhood features and the descriptive genomic features. out of the various classifiers implemented, an ensemble model consisting of a linear kernel svm and a kde classifier gave the best results (table a). compared to the neighborhood-only model, there was a significant increase (p < ) in accuracy (= . ), sensitivity (= . ), npv (= . ), composite score (= . ), and mcc (= . ).
however, this was accompanied by a significant ( ) drop in specificity (= . ). there was no . p < significant change in ppv, though. a ranked list of the features used to train nbdriver is shown in supplementary table . out of those features, were neighborhood-based features or the tf-idf scores of the overlapping -mers extracted from a window size of . the plot displaying the variation in the auroc with various classification thresholds is shown in figure . the best results were obtained using a threshold of . . consequently, all mutations with the prediction scores above this threshold were classified as drivers and vice versa. overall, on this benchmarking dataset, nbdriver ranked fourth in terms of the composite score, fifth in terms of specificity, and second in npv, ppv, sensitivity, and accuracy. by contrast, although the neighborhood-only-model was the top-ranking tool in terms of specificity and ppv, it did not perform well in terms of the other metrics. owing to nbdriverʼs superior performance, all subsequent external validations were performed using this model. voting ensemble of prediction algorithms gives better classification performances we also assessed the effect of combining multiple top-ranked single predictors into an ensemble model. we evaluated nbdriverʼs contribution to the overall ensemble by obtaining predictions without the tool. the top-performing ensemble consisting of nbdriver, chasmplus, fathmm (cancer), mutation taster, and condel resulted in a composite score of . , accuracy of . , and an npv of . , significantly higher ( than every single . ) p < predictor evaluated in the study (table b; supplementary table ). the composite score and .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . 
accuracy obtained using this ensemble were also the highest among all the different combinations of single predictors tested in this study (supplementary table ). removing nbdriver from the ensemble resulted in a significant decrease (p < . ) in the composite score, npv, mcc, accuracy, and sensitivity. however, it was accompanied by a significant increase in specificity and no significant ppv change for the smaller ensemble (table b). another ensemble model consisting of nbdriver, mutation taster, and condel gave similar results (composite score= . ) to the previous one (table b; supplementary table ). compared to the previous ensemble (table b), there was no significant difference in mcc, composite score, ppv, sensitivity, and accuracy. however, there was a significant increase in the npv and a significant decrease in the specificity. a complete set of all the different combinations of the single predictors evaluated in this study is present in supplementary table . from this table, we observed that the maximum sensitivity (= . ) and npv (= . ) were obtained by the ensemble (mutation taster, fathmm (cancer), and condel), which did not include nbdriver. however, the maximum specificity (= . ) and ppv (= . ) were obtained using the ensemble (nbdriver, chasmplus, mutation taster, and condel).

driver and passenger mutations' features used to train nbdriver are significantly different

our feature selection results illustrate the differences in the underlying biological processes governing driver and passenger mutations, similar to mao et al. [ ]. using the training data used to build nbdriver, we found that driver mutations tend to occur on amino acid residues that have stiff backbones and have less solvent accessibility as denoted by the significantly lower (wilcoxon test; p < . × − ) 'predrsae' probability measure (figure a) and the
significantly higher (wilcoxon test; p < . × − ) 'predbfactors' probability measure (figure b), respectively. we also observed that a mutation is more likely to be a driver if it occurs in genomic regions that are evolutionarily conserved. the mean gerp score for driver mutations was significantly higher (wilcoxon test; p < . × − ) than that of passengers (figure c). similarly, driver mutations were more common in genomic sites that had a significantly higher (wilcoxon test; p < . × − ) positional hidden markov model (hmm) conservation score (or hmmphc) as compared to passengers (figure d). among the other features, we observed similar class-wise distributional differences among features indicative of protein domain knowledge. 'uniprotdom_postmodenz' denotes the presence or absence of a mutation in a site within an enzymatic domain responsible for post-translational modification (or ptm). ptm-related mutations are often accountable for changes in protein functions and alterations of regulatory pathways, eventually leading to carcinogenesis. 'uniprotregions' is another binary feature that tells us whether a mutation occurred in an experimentally defined region of interest in the protein sequence, such as those associated with protein-protein interactions and regulation of biological processes. our analysis pointed out that a considerable portion ( %) of driver mutations clustered around ptm sites, contrasted by around . % of passengers (figure e).
similarly, about % of driver mutations were located in protein domains that were experimentally defined as regions of interest compared to around % of passengers (figure f). in our approach, the tf-idf algorithm was used to weigh a k-mer and assign importance to it in the given set of neighborhood sequences; a higher tf-idf score indicates greater relevance of that k-mer. our feature selection results indicated that for the neighborhood sequence-based features, the mean tf-idf scores for drivers were significantly higher (wilcoxon test; p < . ) than those of passengers (figure ). this result suggested that nbdriver's top neighborhood features are more specific to the driver neighborhoods than to the passenger neighborhoods.

evaluation using previously unseen coding mutation data

to assess nbdriver's capability at identifying previously unseen driver mutations, we evaluated it using missense mutation data compiled from the following four databases.

cancer mutation census

based on the various evidence criteria set forth by the cancer mutation census database, a particular mutation can be classified into tier , , or , with tier mutations having the highest level of evidence of being a driver and so on. from the list of missense mutations in the cmc not present in our training data, nbdriver could accurately predict all tier , out of tier , and out of tier mutations, achieving an overall accuracy of %. on the other hand, the ensemble model consisting of nbdriver, condel and mutationtaster could accurately predict all tier , out of tier , and out of tier mutations, achieving an overall accuracy of %. upon further investigation, we found that nbdriver was highly successful in identifying hotspot mutations present in the cmc. recurrent alterations at the same genomic site in cancer genes such as met, mpl, flt and kit have been implicated in many different cancer types [ ]–[ ] (supplementary table a).
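the tf-idf weighting of overlapping k-mers described in this section can be illustrated with a minimal sketch. it is not the authors' pipeline: the function names, the example sequences, and the choice of k are hypothetical, and the formula is the plain textbook variant (term frequency times log(N/df)); library implementations such as scikit-learn's vectorizers apply additional smoothing and normalisation, so their scores would differ numerically.

```python
import math
from collections import Counter


def overlapping_kmers(sequence, k):
    """Return all overlapping k-mers of a nucleotide string (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf_scores(sequences, k):
    """Compute a textbook tf-idf score for every k-mer in every sequence.

    tf  = count of the k-mer in the sequence / number of k-mers in the sequence
    idf = log(N / df), with N the number of sequences and df the number of
          sequences that contain the k-mer at least once.
    This is an illustrative variant, not necessarily the one used for nbdriver.
    """
    tokenised = [overlapping_kmers(seq, k) for seq in sequences]
    n_docs = len(tokenised)

    # Document frequency: in how many sequences does each k-mer occur?
    doc_freq = Counter()
    for kmers in tokenised:
        doc_freq.update(set(kmers))

    scores = []
    for kmers in tokenised:
        counts = Counter(kmers)
        total = len(kmers)
        scores.append({
            kmer: (count / total) * math.log(n_docs / doc_freq[kmer])
            for kmer, count in counts.items()
        })
    return scores


# Hypothetical neighborhood sequences (mutated base plus two flanking bases per side)
neighborhoods = ["ACGTA", "ACGTT", "TTGCA"]
scores = tfidf_scores(neighborhoods, k=2)
```

in this toy example a k-mer that occurs in only one neighborhood (such as "GC") receives a higher weight than one shared by two neighborhoods (such as "AC"), which is the sense in which a higher score marks a k-mer as more specific to a given set of sequences.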
cancer genome interpreter database

using pathogenic mutations compiled from various sources, we found that nbdriver could accurately identify out of non-overlapping missense driver mutations, achieving an overall accuracy of %. the model correctly identified all three mutations from the cancer biomarkers database, out of mutations from the docm database, out of mutations from the martelotto et al. study [ ], and out of mutations from the oncokb database. on the other hand, the ensemble model comprising nbdriver, condel and mutationtaster could accurately predict out of mutations, achieving an overall accuracy of %.

recurrent driver mutations

out of the top hotspot mutations identified as recurrently mutated in the study conducted by rheinbay et al. [ ], nbdriver correctly identified as drivers. however, mutation taster displayed superior performance by identifying all mutations correctly. except for kras, nbdriver correctly identified all mutations from the other four genes (nras, tp , pik ca, and idh ) as cancer drivers. hotspot mutations in these four genes, reported by rheinbay et al. [ ] and correctly identified as drivers by nbdriver, have been implicated in many different cancers [ ]–[ ] (supplementary table a).

rare driver mutations found in glioblastoma and ovarian cancer

using the list of rare drivers reported by the developers of the driver prediction tool candra [ ], we evaluated nbdriver's ability to identify less frequent alterations in the cancer genome.
overall, nbdriver alone could identify out of ( %) glioblastoma mutations and out of ( %) ovarian cancer mutations. all these mutations belonged to eight known ovc-related genes (arid a, cdk , erbb , mlh , msh , msh , pik r , pms ) and seven known gbm-related genes (atm, egfr, mdm , nf , pdgfra, pik ca, ros ). all eight ovc-related genes correctly identified as drivers by nbdriver have been implicated in ovarian cancer through observations made from multiple studies [ ]–[ ] (supplementary table b). the ensemble model made up of nbdriver, condel and mutation taster performed better than the single predictor by identifying out of ( %) glioblastoma mutations and out of ( %) ovarian cancer mutations.

stratification of the predicted driver genes based on literature evidence

we combined the list of genes with at least one true positive missense driver mutation prediction from nbdriver into a catalog of putative driver genes. we then compared our gene set against those already published in six landmark pan-cancer studies for driver gene identification. bailey et al. [ ] identified driver genes from tumor exomes by combining the predictions from different computational tools. martincorena et al. [ ] used the normalized ratio of non-synonymous to synonymous mutations (dn/ds model) to identify driver genes from tumors and reported a total of putatively positively-selected driver genes and known cancer genes from three main sources: 1) cancer genes from the version of the cosmic database [ ]; 2) significantly mutated genes across tumors identified by lawrence et al. [ ] using the mutsigcv tool; and 3) genes identified through a literature search. two marker papers from tcga [ ], [ ] identified significantly mutated genes using the mutsigcv tool. tamborero et al. [ ] identified a list of high-confidence drivers from tumor samples using a rule-based approach. dietlein et al. [ ] modelled the nucleotide context
around driver mutations and identified driver genes based on nucleotide context. apart from the aforementioned studies, we also report the overlap between our list of genes and two well-established cancer gene repositories: the cancer gene census [ ], [ ] and the intogen database [ ]. we identified (= %) of our predicted driver genes as canonical cancer genes present in the cancer gene census. among the remaining genes, six were catalogued as drivers in at least two of the pan-cancer studies or mutation databases mentioned above (supplementary table ). a total of eight genes (ctla , igf r, pik cd, tgfbr , rad l, shoc , cdkn b and xrcc ) were not identifiable from any of the landmark studies or databases and required further validation.

discussion

our investigation aimed to compare the raw neighborhood sequences of driver and passenger mutations and exploit any observed distributional differences to build robust classification models.
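as an illustration of how class-wise distributional differences can be turned into a classifier, the following is a minimal sketch of a kde-based generative classifier on a single numeric feature. it is a simplified stand-in, not the nbdriver model: the gaussian kernel, the fixed bandwidth, the one-dimensional feature, and all names and training values are assumptions made for the example.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function evaluating a 1-D Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def fit_kde_classifier(features_by_class, bandwidth):
    """Fit one density per class and return a predict(x) function.

    Prediction picks the class that maximises prior * class-conditional
    density, i.e. Bayes' rule with the shared evidence term dropped.
    """
    total = sum(len(values) for values in features_by_class.values())
    models = {
        label: (len(values) / total, gaussian_kde(values, bandwidth))
        for label, values in features_by_class.items()
    }

    def predict(x):
        return max(models, key=lambda label: models[label][0] * models[label][1](x))

    return predict


# Hypothetical feature values (e.g. one neighborhood-derived score per mutation)
training = {
    "driver": [0.62, 0.70, 0.75, 0.81],
    "passenger": [0.10, 0.18, 0.25, 0.31],
}
predict = fit_kde_classifier(training, bandwidth=0.1)
```

with these toy values, `predict(0.72)` falls under the driver density and `predict(0.20)` under the passenger density. a real model would use many features and a bandwidth chosen by cross-validation.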
we showed that except for one window size (n= ), a significant difference in the distributions between the neighborhoods of driver and passenger mutations was consistently present in our cohort. using tf-idf and count vectorizer scores derived from the overlapping k-mers, we trained a kde-based generative classifier and two tree-based classifiers. one crucial distinction between nbdriver and other methods is the inclusion of overlapping k-mers extracted from the neighborhood of mutations as features for further analysis. nbdriver was trained using a small set (= ) of highly discriminative features, % of which were neighborhood scores. using this model, we could accurately predict % of all the literature-curated mutations outlined in the martelotto et al. study [ ], % of the high confidence list of mutations recently published by the cancer mutation census, % of all the actionable alterations reported in the cancer genome interpreter, % of all the hotspot mutations reported from a pan-cancer genome analysis, and % and % of rare driver mutations found in glioblastoma and ovarian cancer, respectively. ensemble models obtained by combining the predictions from other state-of-the-art mutation effect predictors with nbdriver performed significantly better than the individual predictors in all five validation datasets. these results underscore the importance of including neighborhood features to build mutation effect prediction algorithms. although our method's focus was to identify missense driver mutations from sequenced cancer genomes, the majority of the genes ( out of ) containing at least one predicted mutation belonged to the cancer gene census or other large-scale driver gene discovery studies. the protein products of the eight remaining genes not flagged as drivers by any of the databases/studies had known functional roles in maintaining the cancer genome's stability and promoting tumor development.
the ctla gene modulates immune response by serving as a checkpoint for t-cell activation, essentially decreasing the t cells' ability to attack cancer cells. immune checkpoint inhibitors, which are designed to “block” these checkpoints, have drastically changed the treatment outcomes for several cancers [ ]. transcriptomic profiling of blood samples drawn from cervical cancer patients identified igf r as a biomarker for increased risk of treatment failure [ ]. overexpression of the pik cd gene has been associated with cell proliferation in colon cancer and with poor prognosis among patients [ ]. multiple studies have indicated an association between polymorphisms observed in tgfbr and cancer susceptibility [ ], [ ]. similarly, polymorphisms detected in rad l are genetic markers associated with the development of meningeal tumours [ ]. shoc has been reported to be a regulator of the ras signalling pathway and is associated with poor prognosis among breast cancer patients [ ]. similarly, the inactivation of the cdkn b gene is responsible for the progression of pancreatic cancer [ ]. with the help of massively parallel sequencing studies, rare mutations in the xrcc gene have been linked to increased breast cancer susceptibility among patients [ ]. our study does have some limitations.
first, we used a representative dataset of driver and passenger mutations whose labels were not in silico predictions from other mutation effect prediction algorithms but were derived from experimentally validated functional and transforming impacts from various sources. this resulted in a relatively small sample size for supervised classification. however, this approach also minimized the chances of inadvertently introducing false-positive mutations into the training set used to derive the driver and passenger neighborhoods' class-wise density estimates or the machine learning models. evidence [ ] suggests that a sizeable proportion of mutations present in large mutational databases are false positives, reflecting sequencing errors due to dna damage. moreover, nbdriver derived using this high confidence list of mutations performed reasonably well across all five independent validation sets and produced driver genes with sufficient literature evidence, suggesting that our initial choice of the training dataset was overall beneficial. second, since missense mutations are the most abundant form of somatic alterations [ ], our machine learning models were all trained using missense mutations only. however, in principle, our approach could be extended to other types of mutations as well. additionally, during the external validation analysis, although nbdriver performed very well in terms of ppv (= . ), the npv (= . ) was relatively low (table a). when identifying biologically relevant mutations for further functional validation, npv is often overlooked as a classification metric. a high npv allows us to exclude passenger mutations with greater confidence and reduces the number of driver mutations incorrectly labeled as passengers (false negatives). however, we observed that adding different combinations of multiple single predictors into ensemble models resulted in a significant improvement in the npv (table b).
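the relationship between ppv, npv and false negatives discussed above follows directly from the confusion-matrix definitions. the sketch below computes the standard metrics from raw counts, treating drivers as the positive class. the function name and the example counts are hypothetical and are not taken from the paper's tables; the study's composite score is not reproduced here because its definition lies outside this section.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Derive standard binary-classification metrics from confusion counts.

    Drivers are the positive class, passengers the negative class.
    Undefined ratios (zero denominator) are returned as NaN.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # drivers correctly recovered
        "specificity": ratio(tn, tn + fp),   # passengers correctly recovered
        "ppv": ratio(tp, tp + fp),           # predicted drivers that are real
        "npv": ratio(tn, tn + fn),           # predicted passengers that are real
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts illustrating a classifier with high PPV but lower NPV
metrics = classification_metrics(tp=80, fp=5, tn=45, fn=20)
```

in this example the 20 false negatives leave ppv untouched (80/85) but pull npv down to 45/65, which is why a model can look strong on ppv while still mislabeling a noticeable share of drivers as passengers.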
our observations on the ensemble models' performances were similar to those made by martelotto et al. [ ]. last, we trained our machine learning models using the combined dataset containing mutational effects determined from experimental assays not specific to any cancer type. hence, all our models were pan-cancer based. consequently, a cancer-type specific analysis in the future would require the list of known driver and passenger mutations from specific tumor types.

conclusion

in this study, we showed that there is a significant difference in the nucleotide contexts surrounding driver and passenger mutations obtained from sequenced cancer genomes. using efficient feature representation, we generated robust classification models that gave comparable performances across five independent validation sets. the predicted true positive mutations were part of genes with experimental support from multiple sources for being functionally relevant. future experiments using a much larger sample size need to be performed to derive neighborhood-sequence-based classification scores for all possible missense mutations in the genome across several cancer types.
this would be possible if future large-scale sequencing studies such as msk-impact [ ], pcawg [ ], icgc [ ], and genie [ ] produce a complete catalog of missense driver mutations with functional evidence in a cancer-type specific manner. this relatively novel strategy of utilizing the sequence neighborhoods for driver mutation identification can dramatically improve the efficiency of the annotation process for unknown mutations.

acknowledgements

this work was supported by the department of biotechnology, government of india (dbt) (bt/pr /bid/ / / ), iit madras, initiative for biological systems engineering (ibse) and robert bosch center for data science and artificial intelligence (rbc-dsai).

conflicts of interest

the authors declare no conflict of interest.

references

[ ] m. r. stratton, p. j. campbell, and p. a. futreal, “the cancer genome,” nature, vol. , no. , pp. – , .
[ ] j. m. samet, “radon and lung cancer,” jnci: journal of the national cancer institute, vol. , no. , pp. – , .
[ ] j. w. drake, “mutagenic mechanisms,” annual review of genetics, vol. , no. , pp. – , .
[ ] w. zhu, s. wu, and y. a. hannun, “contributions of the intrinsic mutation process to cancer mutation and risk burdens,” ebiomedicine, vol. , pp. – , .
[ ] b. j. raphael, j. r. dobson, l. oesper, and f. vandin, “identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine,” genome medicine, vol. , no. , pp. – , .
[ ] s. a. forbes et al., “cosmic: exploring the world’s knowledge of somatic mutations in human cancer,” nucleic acids research, vol. , no. d , pp. d –d , .
[ ] j. zhang et al., “international cancer genome consortium data portal—a one-stop shop for cancer genomics data,” database (oxford), vol. , sep. , doi:
. /database/bar .
[ ] j. n. weinstein et al., “the cancer genome atlas pan-cancer analysis project,” nature genetics, vol. , no. , pp. – , .
[ ] e. cerami et al., “the cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” cancer discov, vol. , no. , pp. – , may , doi: . / - .cd- - .
[ ] b. j. ainscough et al., “docm: a database of curated mutations in cancer,” nature methods, vol. , no. , pp. – , .
[ ] l. a. garraway, “genomics-driven oncology: framework for an emerging paradigm,” journal of clinical oncology, vol. , no. , pp. – , .
[ ] m. s. lawrence et al., “mutational heterogeneity in cancer and the search for new cancer-associated genes,” nature, vol. , no. , pp. – , .
[ ] n. d. dees et al., “music: identifying mutational significance in cancer genomes,” genome research, vol. , no. , pp. – , .
[ ] c. h. mermel, s. e. schumacher, b. hill, m. l. meyerson, r. beroukhim, and g. getz, “gistic . facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers,” genome biology, vol. , no. , pp. – , .
[ ] p. kumar, s. henikoff, and p. c. ng, “predicting the effects of coding non-synonymous variants on protein function using the sift algorithm,” nature protocols, vol. , no. , p. , .
[ ] y. choi, g. e. sims, s. murphy, j. r. miller, and a. p. chan, “predicting the functional effect of amino acid substitutions and indels,” plos one, vol. , no. , p. e , oct. , doi: . /journal.pone. .
[ ] i. adzhubei, d. m. jordan, and s. r. sunyaev, “predicting functional effect of human missense mutations using polyphen- ,” curr protoc hum genet, vol. chapter , p. unit . , jan. , doi: . / .hg s .
[ ] h. carter et al., “cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations,” cancer res., vol. , no. , pp. – , aug. , doi: . / - .can- - .
[ ] h. a. shihab, j. gough, d. n.
cooper, i. n. m. day, and t. r. gaunt, “predicting the functional consequences of cancer-associated amino acid substitutions,” bioinformatics, vol. , no. , pp. – , jun. , doi: . /bioinformatics/btt .
[ ] d. chakravarty et al., “oncokb: a precision oncology knowledge base,” jco precis oncol, vol. , jul. , doi: . /po. . .
[ ] e. cerami, e. demir, n. schultz, b. s. taylor, and c. sander, “automated network analysis identifies core pathways in glioblastoma,” plos one, vol. , no. , feb. , doi: . /journal.pone. .
[ ] f. vandin, e. upfal, and b. j. raphael, “algorithms for detecting significantly mutated pathways in cancer,” journal of computational biology, vol. , no. , pp. – , .
[ ] h. carter, c. douville, p. d. stenson, d. n. cooper, and r. karchin, “identifying mendelian disease genes with the variant effect scoring tool,” bmc genomics, vol. , no. , p. s , may , doi: . / - - -s -s .
[ ] c. tokheim and r. karchin, “chasmplus reveals the scope of somatic missense mutations driving human cancers,” cell systems, vol. , no. , pp. - .e , jul. , doi: . /j.cels. . . .
[ ] b. reva, y. antipin, and c. sander, “predicting the functional impact of protein mutations: application to cancer genomics,” nucleic acids res., vol. , no. , p. e , sep. ,
doi: . /nar/gkr .
[ ] a. gonzalez-perez, j. deu-pons, and n. lopez-bigas, “improving the prediction of the functional impact of cancer mutations by baseline tolerance transformation,” genome medicine, vol. , no. , p. , nov. , doi: . /gm .
[ ] y. mao, h. chen, h. liang, f. meric-bernstam, g. b. mills, and k. chen, “candra: cancer-specific driver missense mutation annotation with optimized features,” plos one, vol. , no. , oct. , doi: . /journal.pone. .
[ ] p. c. ng and s. henikoff, “predicting deleterious amino acid substitutions,” genome research, vol. , no. , pp. – , .
[ ] a. hodgkinson and a. eyre-walker, “variation in the mutation rate across mammalian genomes,” nat. rev. genet., vol. , no. , pp. – , oct. , doi: . /nrg .
[ ] t. sjöblom et al., “the consensus coding sequences of human breast and colorectal cancers,” science, vol. , no. , pp. – , oct. , doi: . /science. .
[ ] a. f. rubin and p.
green, “mutation patterns in cancer genomes,” pnas, vol. , no. , pp. – , dec. , doi: . /pnas. . [ ] v. aggarwala and b. f. voight, “an expanded sequence context model broadly explains variability in polymorphism levels across the human genome,” nat. genet., vol. , no. , pp. – , apr. , doi: . /ng. . [ ] z. zhao and e. boerwinkle, “neighboring-nucleotide effects on single nucleotide polymorphisms: a study of . million polymorphisms across the human genome,” genome res, vol. , no. , pp. – , nov. , doi: . /gr. . [ ] l. b. alexandrov, s. nik-zainal, d. c. wedge, p. j. campbell, and m. r. stratton, “deciphering signatures of mutational processes operative in human cancer,” cell rep, vol. , no. , pp. – , jan. , doi: . /j.celrep. . . . [ ] d. tamborero et al., “cancer genome interpreter annotates the biological and clinical relevance of tumor alterations,” genome medicine, vol. , no. , p. , . [ ] l. b. alexandrov and m. r. stratton, “mutational signatures: the patterns of somatic mutations hidden in cancer genomes,” curr. opin. genet. dev., vol. , pp. – , feb. , doi: . /j.gde. . . . [ ] a.-l. brown, m. li, a. goncearenco, and a. r. panchenko, “finding driver mutations in cancer: elucidating the role of background mutational processes,” plos computational biology, vol. , no. , p. e , . [ ] l. g. martelotto et al. , “benchmarking mutation effect prediction algorithms using functionally validated cancer-related missense mutations,” genome biology, vol. , no. , p. , oct. , doi: . /s - - - . [ ] m. jeffers et al. , “activating mutations for the met tyrosine kinase receptor in human cancer,” proceedings of the national academy of sciences, vol. , no. , pp. – , . [ ] t. s. akpınar, v. s. hançer, m. nalçacı, and r. diz-küçükkaya, “mpl w l/k mutations in chronic myeloproliferative neoplasms,” turk j haematol, vol. , no. , pp. – , mar. , doi: . /tjh. . [ ] d. liang et al., “flt -tkd mutation in childhood acute myeloid leukemia,” leukemia, vol. , no. , pp. – , . [ ] j. a. 
fletcher, c. d. fletcher, b. p. rubin, l. k. ashman, c. l. corless, and m. c. heinrich, “kit gene mutations in gastrointestinal stromal tumors: more complex than previously recognized?,” the american journal of pathology, vol. , no. , p. , . .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm 
[ ] s. yui et al., “d mutation of the kit gene in core binding factor acute myeloid leukemia is associated with poorer prognosis than other kit gene mutations,” annals of hematology, vol. , no. , pp. – , .
[ ] e. rheinbay et al., “discovery and characterization of coding and non-coding driver mutations in more than , whole cancer genomes,” biorxiv, p. , .
[ ] g. a. hobbs, c. j. der, and k. l. rossman, “ras isoforms and mutations in cancer at a glance,” journal of cell science, vol. , no. , pp. – , .
[ ] e. h. baugh, h. ke, a. j. levine, r. a. bonneau, and c. s. chan, “why are there hotspot mutations in the tp gene in human cancers?,” cell death & differentiation, vol. , no. , pp. – , .
[ ] d. a. fruman and c. rommel, “pi k and cancer: lessons, challenges and opportunities,” nat rev drug discov, vol. , no. , pp. – , feb. , doi: . /nrd .
[ ] f. e. bleeker et al., “idh mutations at residue p. r (idh r ) occur frequently in high-grade gliomas but not in other solid tumors,” human mutation, vol. , no. , pp. – , .
[ ] k. c. wiegand et al., “arid a mutations in endometriosis-associated ovarian carcinomas,” new england journal of medicine, vol. , no. , pp. – , .
[ ] t. popova et al., “ovarian cancers harboring inactivating mutations in cdk display a distinct genomic instability pattern characterized by large tandem duplications,” cancer research, vol. , no. , pp. – , .
[ ] h. luo, x. xu, m. ye, b. sheng, and x. zhu, “the prognostic value of her in ovarian cancer: a meta-analysis of observational studies,” plos one, vol. , no. , p. e , .
[ ] c. zhao, s. li, m. zhao, h. zhu, and x. zhu, “prognostic values of dna mismatch repair genes in ovarian cancer patients treated with platinum-based chemotherapy,” archives of gynecology and obstetrics, vol. , no. , pp. – , .
[ ] a. j. philp et al., “the phosphatidylinositol ′-kinase p α gene is an oncogene in human ovarian and colon tumors,” cancer research, vol. , no. , pp. – , .
[ ] m. h. bailey et al., “comprehensive characterization of cancer driver genes and mutations,” cell, vol. , no. , pp. – , .
[ ] i. martincorena et al., “universal patterns of selection in cancer and somatic tissues,” cell, vol. , no. , pp. – , .
[ ] m. s. lawrence et al., “discovery and saturation analysis of cancer genes across tumour types,” nature, vol. , no. , pp. – , .
[ ] k. a. hoadley et al., “cell-of-origin patterns dominate the molecular classification of , tumors from types of cancer,” cell, vol. , no. , pp. – , .
[ ] cancer genome atlas research network, “comprehensive molecular profiling of lung adenocarcinoma,” nature, vol. , no. , pp. – , .
[ ] f. dietlein et al., “identification of cancer driver genes based on nucleotide context,” nature genetics, vol. , no. , pp. – , .
[ ] p. a. futreal et al., “a census of human cancer genes,” nature reviews cancer, vol. , no. , pp. – , .
[ ] f. martínez-jiménez et al., “a compendium of mutational cancer driver genes,” nature reviews cancer, vol. , no. , pp. – , .
[ ] a. rotte, “combination of ctla- and pd- blockers for treatment of cancer,” journal of experimental & clinical cancer research, vol. , no. , p. , .
[ ] p. moreno-acosta et al., “igf r gene expression as a predictive marker of response to ionizing radiation for patients with locally advanced hpv -positive cervical cancer,” anticancer res, vol. , no. , pp. – , oct. .
[ ] j. chen et al., “pik cd induces cell growth and invasion by activating akt/gsk- β/β-catenin signaling in colorectal cancer,” cancer science, vol. , no. , pp. – , .
[ ] b. pasche, m. j. pennison, h. jimenez, and m. wang, “tgfbr and cancer susceptibility,” transactions of the american clinical and climatological association, vol. , p. , .
[ ] y. wang, x. qi, f. wang, j. jiang, and q. guo, “association between tgfbr polymorphisms and cancer risk: a meta-analysis of case-control studies,” plos one, vol. , no. , p. e , .
[ ] p. e. leone, m. mendiola, j. alonso, c. paz-y-miño, and a. pestaña, “implications of a rad l polymorphism ( c/t) in human meningiomas as a risk factor and/or a genetic marker,” bmc cancer, vol. , no. , p. , .
[ ] w. geng, k. dong, q. pu, y. lv, and h. gao, “shoc is associated with the survival of breast cancer cells and has prognostic value for patients with breast cancer,” molecular medicine reports, vol. , no. , pp. – , .
[ ] q. tu et al., “cdkn b deletion is essential for pancreatic cancer development instead of unmeaningful co-deletion due to juxtaposition to cdkn a,” oncogene, vol. , no. , pp. – , .
[ ] d. park et al., “rare mutations in xrcc increase the risk of breast cancer,” the american journal of human genetics, vol. , no. , pp. – , .
[ ] l. chen, p. liu, t. c. evans, and l. m. ettwiller, “dna damage is a pervasive cause of sequencing errors, directly confounding variant identification,” science, vol. , no. , p. , feb. , doi: . /science.aai .
[ ] b. vogelstein, n. papadopoulos, v. e. velculescu, s. zhou, l. a. diaz, and k. w. kinzler, “cancer genome landscapes,” science, vol. , no. , p. , mar. , doi: . /science. .
[ ] d. t. cheng et al., “comprehensive detection of germline variants by msk-impact, a clinical diagnostic platform for solid tumor molecular oncology and concurrent cancer predisposition testing,” bmc medical genomics, vol. , no. , p. , .
[ ] aacr project genie consortium, “aacr project genie: powering precision medicine through an international consortium,” cancer discovery, vol. , no. , pp. – , .
[ ] m. olivier, r. eeles, m. hollstein, m. a. khan, c. c. harris, and p. hainaut, “the iarc tp database: new online mutation analysis and recommendations to users,” human mutation, vol. , no. , pp. – , , doi: . /humu. .
[ ] b. b. campbell et al., “comprehensive analysis of hypermutation in human cancer,” cell, vol. , no. , pp. – , .
[ ] p. k.-s. ng et al., “systematic functional annotation of somatic mutations in cancer,” cancer cell, vol. , no. , pp. – , .
[ ] l. m. starita et al., “massively parallel functional analysis of brca ring domain variants,” genetics, vol. , no. , pp. – , .
[ ] k. mahmood et al., “variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics,” hum. genomics, vol. , no. , p. , , doi: . /s - - - .
[ ] w. zhou et al., “transvar: a multilevel variant annotator for precision genomics,” nature methods, vol. , no. , art. no. , nov. , doi: . /nmeth. .
[ ] m. j. landrum et al., “clinvar: public archive of interpretations of clinically relevant variants,” nucleic acids research, vol. , no. d , pp. d –d , .
[ ] k. wang, m. li, and h. hakonarson, “annovar: functional annotation of genetic variants from high-throughput sequencing data,” nucleic acids research, vol. , no. , pp. e –e , .
[ ] f. pedregosa et al., “scikit-learn: machine learning in python,” machine learning in python, p. .
[ ] g. r. warnes, b. bolker, t. lumley, and r. c. johnson, “gmodels: various r programming tools for model fitting,” r package version, vol. , no. , .
[ ] d. l. wilson, “asymptotic properties of nearest neighbor rules using edited data,” ieee transactions on systems, man, and cybernetics, no. , pp. – , .
[ ] j. m. schwarz, c. rödelsperger, m. schuelke, and d. seelow, “mutationtaster evaluates disease-causing potential of sequence alterations,” nature methods, vol. , no. , art. no. , aug. , doi: . /nmeth - .
[ ] n.-l. sim, p. kumar, j. hu, s. henikoff, g. schneider, and p. c. ng, “sift web server: predicting effects of amino acid substitutions on proteins,” nucleic acids research, vol. , no. w , pp. w –w , jul. , doi: . /nar/gks .
[ ] k. a. pagel et al., “integrated informatics analysis of cancer-related variants,” jco clinical cancer informatics, vol. , pp. – , .
[ ] c. k. y. ng et al., “predictive performance of microarray gene signatures: impact of tumor heterogeneity and multiple mechanisms of drug resistance,” cancer res, vol. , no. , pp. – , jun. , doi: . / - .can- - .
https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://www.zotero.org/google-docs/?hsltkm https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / table : summary of datasets used in this study type study/ database name description sample size training brown et al. missense mutations from cancer genes generated from experimental assays mutations (driver: passenger: ) validation martelotto et al. a literature curated list of mutations from cancer genes used to benchmark mutation-effect prediction algorithms mutations (driver: passenger: ) validation catalog of validated oncogenic mutations high confidence pathogenic missense variants compiled from several sources driver mutations validation rheinbay et al. recurrent single point driver mutations in the coding region compiled from the pan-cancer analysis of whole genomes consortium driver mutations validation mao et al. rare driver mutations from gbm and ovc cancer types gbm: driver mutations ovc: driver mutations validation cancer mutation census (cosmic v ) cosmic mutation data categorized into different functional classes both through manual curation and computational predictions driver mutations .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / table : number of one-hot encoded features and possible k-mers for a given window size. 
the size of the vocabulary (or n) is given in brackets window size number of one-hot encoded features number of k-mers possible for a given k-mer size k= (n= ) k= (n= ) k= (n= ) w= w= w= w= w= w= w= w= w= w= .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / table : median js distances for both the original and randomized experiments for different window sizes window size feature type median js distance (original) median js distance (randomized) p-value tf (k= ) . . not significant ohe . . < . cv (k= ) . . < . tf (k= ) . . < . cv (k= ) . . < . tf (k= ) . . < . cv (k= ) . . < . tf (k= ) . . < . tf (k= ) . . < . tf (k= ) . . < . .cc-by-nc . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc/ . / table a: comparison of the generated binary classifiers with other mutation effect prediction algorithms using the benchmarking dataset by martelotto et al. table b: evaluating the contribution of nbdriver to the top performing ensemble algorithm accuracy sensitivity specificity ppv npv cs mcc mutation taster . . . . . . . fathmm (cancer) . . . . . . . chasmplus (pancancer) . . . . . . . nbdriver . . . . . . . neighborhood-only model . . . . . . . condel . . . . . . . fathmm (missense) . . . . . . . provean . . . . . . . sift . . . . . . . polyphen- . . . . . . . mutation assessor . . . . . . . vest . . . . . . . 
candraplus (cancer-in-general) | . | . | . | . | - | .
algorithm | accuracy | sensitivity | specificity | ppv | npv | cs | mcc
nbdriver + chasmplus + fathmm (cancer) + mutation taster + condel | . | . | . | . | . | . | .
chasmplus + fathmm (cancer) + mutation taster + condel | . | . | . | . | . | . | .
a smaller ensemble that gave no significant change in composite score and mcc compared to the previous ensemble (first row of table b):
nbdriver + mutation taster + condel | . | . | . | . | . | . | .
figure legends
figure : a diagram representing the features derived from the neighborhood nucleotide sequences of the point mutations for an arbitrary window size of is shown here. the mutated position is represented as a triplet (chromosome: position: substitution type). (i) the original sequence is represented here with the mutated nucleotide (ch : :g>t) in bold. (ii) one-hot encoding was used to derive the -bit binary one-hot encoded vector for each nucleotide. (iii) overlapping k-mers of sizes , and are shown here. in this case, the neighborhood features also include the wildtype nucleotide at the mutated position. the overlapping k-mers were encoded into a numerical format using the countvectorizer and the tfidf vectorizer and the resulting word matrix was derived. the samples (or individual neighborhoods) are represented as rows and the k-mers are represented as columns. for both types of feature representation, the chromosome number and the substitution type (a>t, g>c etc) were included as additional features.
figure : the workflow depicting one run of the kernel density estimation experiment is shown in this figure.
all mutations from the brown et al. study were used to derive the estimates. (a) first, an equal number of driver and passenger mutations were sampled with replacement. (b) the “bandwidth” hyperparameter was tuned using a -fold cross-validation approach, and the resulting tuned hyperparameter was used to estimate the densities. (c) the kernel density estimates for the driver and passenger neighborhoods were obtained separately, and the distance between them was calculated using the jensen-shannon (js) distance. the js distance is used to quantify how “distinguishable” two probability distributions are from each other. it is bounded between and , where represents the case where the two probability distributions are equal and vice versa. (d) the bootstrapping experiment to compute the significance of the density estimates calculated in (c) is shown in this figure. first, it involved random sampling of twice the driver or passenger mutations from (a) irrespective of the labels, followed by randomly splitting the data into driver and passenger labels. (e) hyperparameter tuning and density estimation was performed similarly to (b). (f) the bootstrapped js distance between the driver and passenger neighborhoods was derived. all six steps (a-f) of the density estimation experiments were repeated times for all possible window sizes between and and seven different feature representations. the significance of the difference between the medians of the original and the bootstrapped js distances was then reported. figure : the workflow depicting one run of the -fold cross-validation experiments is shown in this figure. (a) in the first step, the entire dataset was split into ten equal parts. nine of the ten subsets were combined into one training set, and one part was left as the test set. (b) seven different feature representations [ohe, count vectorizer (k= , , ) and tf-idf vectorizer (k= , , )] were considered for further analysis. 
after feature selection using a tree-based classifier, hyperparameter tuning was performed for three classifiers, and the corresponding models were derived. finally, validation of each of the classifiers on the test set was performed, and the corresponding performance metrics were reported.
figure : variation in js distances between the estimated densities for every window size between and is shown in this figure. all mutations from the original study were used here. two types of boxplots, one for the original and another for the randomized experiments, are shown here along with the p-values, which approximate the probability that the original median distance can be obtained by chance. except for window , all window sizes had a significant (** p < ) difference between the original and the randomized js distances.
figure : the variation in the classification performances with different window sizes obtained during the repeated cross-validation experiments using the initial training set of mutations is shown in this figure. for each window size, the feature representations among cv (countvectorizer), tf (tf-idf vectorizer) and ohe (one-hot encoding) that gave the best performances in terms of (a) sensitivity, (b) specificity, (c) auc and (d) mcc are displayed.
figure : plot showing the variation in auroc with the different classification thresholds obtained while deriving nbdriver is shown here. nbdriver was trained on a reduced training set of mutations after removing all overlapping mutations from the original study and martelotto et al.
for an imbalanced classification problem, using the default threshold of . is often not advisable. in our case, the best auroc was obtained using a threshold of . . consequently, all mutations with prediction scores greater than this threshold were classified as drivers and vice versa. figure : differences in the distribution of features between driver and passenger mutations observed from the training data used to derive nbdriver. (a) predrsae (predicted residue solvent accessibility - exposed) gives the probability of the wild type residue being exposed. from the plot it is clear that probability of driver mutations occurring in residues that are exposed is significantly less (wilcoxon test; p= . e- ) than that of passengers. (b) predbfactors (high predicted bfactor) gives the probability that the wild type residue backbone is stiff. from the plot it is clear that the probability of driver mutations occurring in residues with stiff backbones is significantly higher (wilcoxon test; p= . e- ) than that of passengers. (c) gerp conservation scores give the evolutionary conservativeness scores for specific sites where mutations have occurred. from the plot it is clear that driver mutations occur in sites with gerp scores that are significantly higher (wilcoxon test; p< . e- ) than passenger mutations. (d) hmmphc (positional hidden markov model (hmm) conservation score) is a measure which is calculated on the basis of the degree of conservation of the residue, the mutation and the most probable amino acid. from the plot it is clear that driver mutations tend to occur in residues with hmmphc scores significantly higher (wilcoxon test; p= . e- ) than passenger mutations. (e) uniprotdom_postmodenz is a feature based on protein domain knowledge which tells us whether a site in an enzymatic domain is responsible for any kind of post translational modification (or ptm). ʻpresenceʼ indicates that the mutation occurs in a site responsible for ptm and vice versa. 
from the plot it is clear that more driver mutations occur in ptm-associated sites as compared to passengers. (f) uniprotregions is a binary variable which tells us whether a mutation occurs in a region of interest in the protein sequence. ʻpresenceʼ indicates that the mutation occurs in a region of interest and vice versa. from the plot it is clear that more driver mutations cluster in regions of interest in the protein sequence as compared to passengers thereby making them mechanistically influential for the progression of the disease.
figure : plot showing the class-wise variation in the mean tf-idf scores for the neighborhood-sequence features used to train nbdriver. the x-axis represents the -mers used in the analysis, and the y-axis represents the mean tf-idf scores. from the plot, it is evident that the mean tf-idf scores are consistently higher for drivers as compared to passengers. since a higher tf-idf score indicates the relevance or importance of a particular k-mer, we can conclude that the -mers used to derive nbdriver are more specific to the driver neighborhoods than passengers.
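the class-wise mean tf-idf scores described in the last figure legend can be illustrated with a small python sketch. it is not the pipeline used to derive nbdriver: the function names (`kmers`, `tfidf`, `class_means`), the toy sequences, the count-normalised term frequency and the smoothed idf are all assumptions made for illustration, since the vectorizer settings of the study are not given in this section.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Overlapping k-mers of a neighborhood sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(neighborhoods, k):
    """Return (vocabulary, rows); each row holds one TF-IDF score per k-mer.

    Assumed weighting: tf = count / number of k-mers in the sequence,
    idf = ln((1 + N) / (1 + df)) + 1 (a smoothed variant).
    """
    docs = [Counter(kmers(s, k)) for s in neighborhoods]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency of each k-mer
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        total = sum(doc.values())
        rows.append([doc[t] / total * idf[t] if total else 0.0 for t in vocab])
    return vocab, rows


def class_means(rows, labels):
    """Mean TF-IDF score of every k-mer (column) within each class."""
    means = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        means[label] = [sum(col) / len(members) for col in zip(*members)]
    return means


# toy neighborhoods (hypothetical), one driver and one passenger
vocab, rows = tfidf(["ACG", "CGT"], 2)
means = class_means(rows, ["driver", "passenger"])
```

comparing `means["driver"]` and `means["passenger"]` column by column gives the kind of per-k-mer contrast that the figure plots.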
on the application of bert models for nanopore methylation detection
genome analysis
yao-zhong zhang∗, sera hatakeyama, kiyoshi yamaguchi, yoichi furukawa, satoru miyano, rui yamaguchi, and seiya imoto∗
institute of medical science, the university of tokyo, tokyo, - , japan
m&d data science center, tokyo medical and dental university, tokyo, - , japan
aichi cancer center research institute, nagoya, - , japan
∗to whom correspondence should be addressed.
abstract
motivation: dna methylation is a common epigenetic modification, which is widely associated with various biological processes, such as gene expression, aging, and disease.
nanopore sequencing provides a promising methylation detection approach by monitoring abnormal signal shifts to detect modified bases in target motif regions. recently, model-based approaches, especially those using deep learning models, have achieved significant performance improvements on nanopore methylation detection. in this work, we explore using bidirectional encoder representations from transformers (bert) for this task, which provides a non-recurrent neural structure suited to fast parallel computation.
results: we find that the original bert architecture does not work as well as the bidirectional recurrent neural network (birnn) on the nanopore methylation prediction task. through further analysis, we observe recurrent patterns of positional signal-shift in the context window surrounding target -methylcytosine ( mc) and n -methyladenine ( ma) motifs. we propose a refined bert with relative position representation and center hidden units concatenation, which builds these task-specific characteristics into the model. we perform systematic in-sample and cross-sample evaluations. the experimental results show that the refined bert model achieves competitive or even better results than the state-of-the-art birnn model, while its inference speed is about x faster. in addition, in the cross-sample evaluation on datasets from different research groups, bert models demonstrate good generalization performance.
availability: the source code and data are available at https://github.com/yaozhong/methbert
contact: yaozhong@ims.u-tokyo.ac.jp
introduction
methylation of dna/rna/histone is commonly observed in developmental disorders, aging, and genomic diseases such as cancer. fast and accurate detection of methylation status is a fundamental requirement for finding distinctive biomarkers for aging/disease profiling. for a virome/metagenome study, quick and accurate epi-transcriptome detection also plays an important role in understanding unseen strains (kim et al., ).
one commonly used dna methylation detection approach is whole-genome bisulfite sequencing (wgbs). to detect modified bases, wgbs applies sodium bisulfite conversion before sequencing. as the bisulfite conversion is a relatively harsh chemical pre-treatment, it fragments the dna, and a large amount of dna is usually required. also, because read length is limited, it is difficult to align short reads in low-complexity regions and to analyze methylation patterns over long ranges. the data processing of wgbs is sophisticated and time-consuming. various biases (e.g. gc and fragment length), including those introduced by bisulfite treatment, have to be dealt with in the data analysis. wgbs can only be used for dna samples, which rules out its use for detecting rna methylation. single-molecule sequencing (e.g., pacbio and nanopore) provides a promising approach by detecting abnormal signals in target motif regions, as modified bases usually have different current signals. compared with the sodium bisulfite approach, no extra chemical treatment is required, which helps to reduce potential biases. existing nanopore methylation detection methods can be categorized into two types. one is testing-based (e.g., tombo (stoiber et al., )), the other is model-based (e.g., nanopolish (simpson et al., ), deepmod (liu et al., ) and deepsignal (ni et al., )). a testing-based approach performs a statistical test on paired signals (candidate and reference) and does not require any training process. it can also be applied to any chemical modification. a model-based approach trains a model
on known chemical modifications and predicts whether a signal sequence contains methylation signals or not. sequential models, such as the hidden markov model (hmm) and the bidirectional recurrent neural network (birnn), are commonly used in the model-based approach. although model-based approaches have already achieved competitive results, their sequential order of computation makes them difficult to parallelize for fast inference. meanwhile, finding discriminative signal patterns for identifying methylated signals is also important for developing novel detection algorithms.
fig. : basic bert’s and refined bert’s model structure used for methylation detection. (a) basic bert for methylation detection. (b) refined bert with relative position representation. compared with the basic bert, enhanced constraints and additional edges are highlighted in red color.
in this work, based on the bidirectional encoder representations from transformers (bert), we explore the non-recurrent modeling approach for nanopore methylation detection. through analyzing nucleotide sequences with both methylated and unmethylated signals, we profile positional signal-shift for different motifs and methyltransferases. we find that the ± bp region surrounding the center methylation candidate shows significant signal-shifts.
different methylation types, such as -methylcytosine ( mc) and n -methyladenine ( ma), also demonstrate different signal-shift patterns. we hence propose a refined bert model that takes these signal-shift patterns into account. we evaluate the proposed methods on the publicly available benchmark dataset. in both in-sample and cross-sample evaluation, the proposed refined bert model achieves a competitive or even better result when compared with the state-of-the-art birnn model, while its model inference speed is about x faster. in the cross-sample evaluation, bert models also demonstrate their transfer learning ability across different datasets.
methods
in this section, we introduce bert (devlin et al., ) and the refined bert applied to nanopore methylation detection. bert is built on the transformer (vaswani et al., ), which employs self-attention as the core module in its stacked network structure. the transformer was proposed to replace recurrent and convolution operations with purely attention-based mechanisms. a typical transformer network consists of an encoding and a decoding module. bert only uses the encoding module of a typical transformer for pre-training on unsupervised data. bert has achieved breakthrough results on many natural language understanding tasks. in this work, we explore applying the bert model to the nanopore methylation detection task to leverage the power of advanced deep learning models.
bert and refined bert model
figure shows the model structures of bert models used for nanopore methylation detection. we explore two types of bert models. one is the most commonly used bert (figure (a)), the other is the refined bert (figure (b)), which is optimized for nanopore methylation detection.
embedding module
given extracted features for each position in a sequence, the embedding layer maps input vectors into hidden spaces. in the embedding layer, besides event embedding, positional embedding (pe) is also included.
as bert is used to learn bidirectional contextual information, positional information is important in the modeling. the original pe (vaswani et al., ) uses a sinusoid embedding, which is fixed and not learnable:
$$\mathrm{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad \mathrm{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$
where $pos$ is the position and $i$ is the embedding dimension. for any fixed offset $k$, $\mathrm{PE}_{pos+k}$ can be represented as a linear function of $\mathrm{PE}_{pos}$. according to recent progress (huang et al., ), learnable pe and relative position embedding can help to further improve bert’s performance. therefore, in the refined bert model, we use learnable pe and relative position representation. the learnable pe takes positional embedding vectors as parameters, which are updated during the learning process.
self-attention module
following the embedding layer, there are three stacked transformer blocks. each transformer block consists of a multi-head self-attention layer and a position-wise fully connected feed-forward network. the self-attention mechanism is a way of describing context information for different positions of the input within a deep learning framework. it imitates the human sight mechanism and gives a model the ability to zoom in or out on a particular position of an input sequence. it has proven effective in many different tasks including natural language understanding, image recognition, and several bioinformatics applications.
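the fixed sinusoidal positional embedding of the embedding module, and the statement that pe at position pos+k is a linear function of pe at position pos, can be checked numerically. the sketch below is a plain-python illustration under stated assumptions: an even `d_model`, the base of 10000 from vaswani et al., and hypothetical helper names `sinusoidal_pe` and `shift_pe`. it is not the methbert implementation.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Fixed sinusoidal embedding of one position (d_model assumed even)."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)      # even index: sine
        pe[i + 1] = math.cos(angle)  # odd index: cosine
    return pe


def shift_pe(pe, k, d_model):
    """Obtain PE(pos + k) from PE(pos) with a linear map.

    Each (sin, cos) pair is rotated by an angle that depends only on the
    offset k, which is why the map is linear in PE(pos).
    """
    out = [0.0] * d_model
    for i in range(0, d_model, 2):
        delta = k / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out[i] = s * math.cos(delta) + c * math.sin(delta)
        out[i + 1] = c * math.cos(delta) - s * math.sin(delta)
    return out


# PE at position 3 shifted by 5 should match PE at position 8
shifted = shift_pe(sinusoidal_pe(3, 8), 5, 8)
direct = sinusoidal_pe(8, 8)
max_gap = max(abs(a - b) for a, b in zip(shifted, direct))
```

a learnable pe, as used in the refined bert, would replace `sinusoidal_pe` with a trainable lookup table indexed by position.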
an attention function maps a query q and a set of key-value (k, v) pairs to an output. formally, for an input $x = (x_1, \ldots, x_n)$ of $n$ elements where $x_i \in \mathbb{R}^{d_x}$, we calculate query q, key k and value v vectors of dimension $d_k$ based on the embedding vector embed(x). the attention module generates a new sequence $z = (z_1, \ldots, z_n)$ of the same length as $x$. $z_i$ is calculated as a weighted sum of linearly transformed input elements as follows:
$$z_i = \sum_{j=1}^{n} a_{ij}\,(x_j W^V)$$
$$a_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}$$
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}},$$
where $W^Q, W^K, W^V \in \mathbb{R}^{d_x \times d_z}$ are parameter matrices. self-attention computes a pairwise correlation of embed($x_i$) and embed($x_j$), which can be calculated in parallel. in a birnn, by contrast, recurrent hidden units must be computed successively. this architectural difference allows bert to be optimized for fast inference.
relative position representation in self-attention heads
for nanopore sequencing, signals are expected to be affected most by the nucleotide passing through the pore. its surrounding nucleotides may also affect the current signals. for nucleotides that are far away in a context window, it is intuitive to assume that they have less effect on the detected current signals. in the refined bert model, we add relative position representation in the attention module following the method proposed by shaw et al. ( ). for any two input elements $x_i$ and $x_j$, the relative position information is modeled with two distinct edge representations $a^V_{ij}$ and $a^K_{ij}$. for linear sequences, those edges capture the relative position differences between input elements. as the precise relative position is not useful beyond a certain distance, we clip the maximum distance (e.g. ± bp) in calculating attention $a_{ij} \in A$:
$$a^K_{ij} = w^K_{\mathrm{clip}(j-i,\,k)}$$
$$a^V_{ij} = w^V_{\mathrm{clip}(j-i,\,k)}$$
$$\mathrm{clip}(x, k) = \max(-k, \min(k, x))$$
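the clipped relative position representation can be made concrete with a single-head, plain-python sketch. it follows the formulation of shaw et al., in which the key edge is added to the projected key before the dot product and the value edge is added to the projected value before the weighted sum; this section states only the edge definitions, so that placement is an assumption about the refined bert. the function and argument names are hypothetical, and a practical implementation would use batched tensor operations (the paper uses pytorch) rather than python loops.

```python
import math


def clip(x, k):
    """clip(x, k) = max(-k, min(k, x))."""
    return max(-k, min(k, x))


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[r] * w[r][c] for r in range(len(x))) for c in range(len(w[0]))]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relative_self_attention(xs, wq, wk, wv, rel_k, rel_v, k):
    """Single-head self-attention with clipped relative position edges.

    rel_k and rel_v map every offset in [-k, k] to a vector of size d_z;
    they play the role of the learned edge embeddings w^K and w^V.
    """
    q = [matvec(x, wq) for x in xs]
    key = [matvec(x, wk) for x in xs]
    val = [matvec(x, wv) for x in xs]
    n, d_z = len(xs), len(q[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            a_k = rel_k[clip(j - i, k)]
            dot = sum(q[i][d] * (key[j][d] + a_k[d]) for d in range(d_z))
            scores.append(dot / math.sqrt(d_z))
        alpha = softmax(scores)
        z = [0.0] * d_z
        for j in range(n):
            a_v = rel_v[clip(j - i, k)]
            for d in range(d_z):
                z[d] += alpha[j] * (val[j][d] + a_v[d])
        out.append(z)
    return out


# toy check: identity projections and zero edges reduce to plain attention
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_edges = {offset: [0.0, 0.0] for offset in range(-1, 2)}
toy_out = relative_self_attention([[1.0, 2.0]] * 3, identity, identity, identity,
                                  zero_edges, zero_edges, 1)
```

because offsets are clipped, only 2k+1 edge vectors are needed per head regardless of the length of the context window.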
final fully connected layer
after the stacked transformer blocks, the hidden units of the center position are fed to a fully connected linear layer that makes the final prediction of whether a given input contains a methylated motif or not. in the refined bert, besides the hidden units of the center position, the hidden units in its surrounding window (e.g., ± bp) are concatenated as the input of the final fully connected layer.
applying bert models for nanopore methylation detection
the bert models are then used in place of other classification models (e.g. birnn) in a typical model-based methylation detection framework. in this framework, raw signals of each read are first translated into nucleotide sequences (basecalling). signals are then aligned to the corresponding reference nucleotides through the re-squiggle process. after that, the target motif (e.g. cpg) and its context regions are localized through nucleotide matching, and signals in a context window of a fixed length (e.g. bp) are transformed into event-based features as the input of methylation callers. typical event-based features include signal mean, signal standard deviation, event length, and nucleotide information (liu et al., ). here, we utilize the framework of deepmod and apply the same pre-processing to the data. we use tombo (ver . . ) to perform re-squiggling and minimap (ver . -r ) to align events to the reference genome. we use e.coli k- mg and h.sapiens grch as the reference genomes.
experiments
we compare bert models with the state-of-the-art birnn model, which is used as the basic network structure in deepmod (liu et al., ) and deepsignal (ni et al., ). to compare with other non-deep-learning-based methods, we use the cpg benchmark pipeline (yuen et al., ) as a pivot.
data and model parameters
we train and test the models on the publicly accessible mc (stoiber et al., ; simpson et al., ) and ma (stoiber et al., ) datasets.
the datasets include samples of e.coli k- mg , k- er , and h.sapiens na . negative control samples are amplified with pcr and contain no modified bases. in positive control samples, modifications are introduced synthetically by specific enzymes after pcr amplification, namely the sssi, hhai and mpei methylases for mc, and taqi, ecori and dam for ma. we use the samples that are sequenced with oxford nanopore r flow cells. for each dataset, we randomly shuffle reads in positive and negative controls and construct the training, validation and test sets according to a split proportion of / / for in-sample evaluation. for the cross-sample evaluation, we train models on one dataset and test on the other dataset. birnn uses the default model architecture and parameter setting of deepmod, which consists of three stacked bi-directional recurrent layers (hidden_size= ) and one fully connected layer for the center position. the total number of birnn parameters is , for an input length of bp. the bert models use three attention layers (hidden_size= , attention_head= ) and one fully connected layer. for the refined bert, learnable positional encoding, attention with relative position representation and center-hidden-concatenation are used. bert and refined bert have a total of , and , parameters, respectively, which is around % fewer than birnn. more detailed information on the model structures is given in the supplementary material. we implement the three models using pytorch. all the models are optimized using the adam optimizer (kingma and ba, ) with a learning rate of e − and a maximum of epochs. model parameters are selected based on the minimum validation loss.
exploring differentiated signal positions in the context window surrounding target motifs
ideally, we assume a modified nucleotide (e.g., the center position of xxxxxxxxxxc mcgxxxxxxxxx) has different current signals,
when compared with the unmodified one. as the boundaries of nucleotide/k-mer signals are not rigorous and surrounding nucleotides may also be affected, it is worthwhile investigating signal-shift patterns related to methylation in a large context.
fig. : boxplot of positional signal-shift for mc and ma datasets of the specific motif and methyltransferase. (a ), (a ) and (a ) are on stoiber’s e.coli mc dataset. (b ) and (b ) are on simpson’s mc dataset. (c ), (c ) and (c ) are on stoiber’s e.coli ma dataset. each dataset is represented in the format datasource_motif_methyltransferase. panels: (a ) stoiber-e.coli_cg_sssi, (a ) stoiber-e.coli_cg_mpei, (a ) stoiber-e.coli_gcgc_hhal, (b ) simpson-e.coli_cg_sssi, (b ) simpson-h.sapiens_cg_sssi, (c ) stoiber-e.coli_gaattc_ecori, (c ) stoiber-e.coli_tcga_taqi, (c ) stoiber-e.coli_gatc_dam.
to identify the signal-shift caused by methylation for a specific dataset, we use a simple quantification approach to calculate significant signal changes at each position in the context window. given a dataset of a specific motif and methyltransferase, we first cluster instances with the same nucleotide sequence to remove the effect of the nucleotide sequence itself. we retain sequence clusters that contain both methylated and unmethylated instances (≥ ). for each sequence cluster, we normalize the event signal values of the methylated samples against the corresponding averaged event signal values of the unmodified samples at each position. the i-th positional signal-shift is then calculated as $s_i^{meth} - \mathrm{avg}(s_i^{unmeth})$.
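the quantification above can be sketched in a few lines of python. the input layout (tuples of sequence, methylation flag and per-position event signals), the function name `positional_signal_shift` and the `min_count` parameter are assumptions for illustration; the minimum number of instances per cluster is left as a parameter rather than fixed.

```python
from collections import defaultdict


def positional_signal_shift(instances, min_count=1):
    """Per-position signal shift of methylated instances.

    instances: iterable of (sequence, is_methylated, signals), where signals
    holds one event signal value per position of the context window.
    Returns one shift vector per methylated instance, computed as
    s_meth[i] - avg(s_unmeth[i]) within clusters of identical sequence.
    """
    clusters = defaultdict(lambda: {True: [], False: []})
    for seq, is_meth, signals in instances:
        clusters[seq][bool(is_meth)].append(signals)

    shifts = []
    for groups in clusters.values():
        meth, unmeth = groups[True], groups[False]
        # keep only clusters that contain both kinds of instances
        if len(meth) < min_count or len(unmeth) < min_count:
            continue
        width = len(unmeth[0])
        avg_unmeth = [sum(s[i] for s in unmeth) / len(unmeth) for i in range(width)]
        for s in meth:
            shifts.append([s[i] - avg_unmeth[i] for i in range(width)])
    return shifts


# toy data (hypothetical): the "TTT" cluster has no unmethylated instance
toy = [
    ("ACG", True, [1.0, 2.0, 3.0]),
    ("ACG", False, [1.0, 1.0, 1.0]),
    ("ACG", False, [1.0, 3.0, 1.0]),
    ("TTT", True, [5.0, 5.0, 5.0]),
]
toy_shifts = positional_signal_shift(toy)
```

collecting the i-th entry of every shift vector gives the sample from which a per-position boxplot can be drawn.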
for those normalized methylation samples, we calculate basic statistics of the signal-shift at each position and draw boxplots for the mc and ma training sets. as shown in figure , for all datasets, positions with significant signal-shift are located within ± bp of the center position (the th), where the target nucleotide is located. for the remaining off-center positions, the averaged signal-shift values are close to . this indicates that a modified nucleotide affects not only its own current signals but also the signals of its surrounding nucleotides. in addition, the mc and ma datasets show different positional signal-shift patterns. specific positions, such as the - bp position ( th) in the mc dataset and the + bp position ( th) in the ma dataset, have larger averaged signal-shift values. such patterns generalize across different datasets with the same motif and methyltransferase. for example, figure (a ), (b ) and (b ) show a similar positional signal-shift pattern. among the different methyltransferases, hhai (figure (a )) shows a pattern similar to that of sssi, while mpei does not show an obviously similar pattern (figure (a )). these positional signal patterns can be directly modeled by a birnn, while the basic bert does not specifically consider them in its model structure. in a birnn, such as the implementation in deepmod, the last fully connected layer uses the hidden units of the center time step as its input. meanwhile, the bi-directional structure and the decay of information from both ends toward the center position make the model focus more on center positions. in the basic bert, as any arbitrary pair of time steps is processed with the same attention module, the importance of center positions is not specifically considered in the model. therefore, we propose a refined bert model to solve this problem.
We incorporate relative-position attention and center-hidden-units concatenation to enable a BERT model to pay more attention to center positions.

In-sample evaluation

To evaluate model performance, we first perform the in-sample evaluation on the 5mC and 6mA datasets. The predictions of the different models are evaluated at the read level and at the genomic level. For the genomic-level evaluation, we group all reads aligned to the same genomic coordinate and use a threshold on the predicted methylation percentage (≥ ., the same as DeepMod) to obtain the prediction for that genomic position. In general, on the five 5mC datasets, the AUC performance of the three models is relatively close at both the read level and the genomic level. The basic BERT model does not work as well as the BiRNN model; its AUC scores are lower. The refined BERT model achieves equivalent or better AUC scores at the genomic level.

Table columns: dataset, species, motif_methyltransferase, model, then AUC, precision and recall for single reads (read level) and for groups (>= , genomic level). Rows: BiRNN, BERT_basic and BERT_refined for each of Stoiber E. coli GCGC_HhaI, Stoiber E. coli CG_MpeI, Stoiber E. coli CG_SssI, Simpson E. coli CG_SssI and Simpson H. sapiens CG_SssI.
Table .
In-sample evaluation of different deep learning models on 5mC datasets. The best score for each dataset is highlighted in bold.

Table columns: dataset, species, motif_methyltransferase, model, then AUC, precision and recall for single reads (read level) and for groups (>= , genomic level). Rows: BiRNN, BERT_basic and BERT_refined for each of Stoiber E. coli GAATTC_EcoRI, TCGA_TaqI and GATC_Dam.
Table . In-sample evaluation of different deep learning models on 6mA datasets. The best score for each dataset is highlighted in bold.

On the datasets Stoiber_E.coli_CG_MpeI and Simpson_E.coli_CG_SssI, although the read-level AUC of the refined BERT is . and . lower than that of BiRNN, the genomic-level performance of the refined BERT is equal to or significantly better than that of BiRNN. This can be explained by more accurate predictions in several low read-coverage regions. On the 6mA datasets, the refined BERT model achieves the best AUC at both the read level and the genomic level. The performance of the basic BERT model is variable and unstable. On Stoiber_E.coli_GAATTC_EcoRI and Stoiber_E.coli_GATC_Dam, the basic BERT performs slightly better than BiRNN on the read-level AUC, but has a large performance gap on Stoiber_E.coli_GAATTC_EcoRI. In summary, in the in-sample evaluation, the refined BERT model achieves competitive or better results than the BiRNN model on the benchmark 5mC and 6mA datasets.

Cross-sample evaluation

We then conduct the cross-sample evaluation. To compare with other, non-deep-learning-based methods, we use the benchmark pipeline (Yuen et al., ) as a pivot. We test the models on the same benchmark dataset, which is generated from Simpson's E. coli dataset with different methylation levels. In the dataset, arbitrary sites are selected that contain a singleton CpG in a window of nt, from both methylated and unmethylated instances in Simpson's E. coli dataset. Yuen et al.
created specific mixtures of methylated and unmethylated reads, containing %, %, ..., % of methylated reads. Each mixture contains approximately reads. More detailed information can be found in (Yuen et al., ). The DeepMod model used in the original benchmark pipeline is pre-trained on a mixture dataset of all 5mC positive controls (CG_SssI, CG_MpeI and GCGC_HhaI) and negative controls (UMR, con , and con ). In contrast, here we test two different models, each trained on a single dataset with the same methyltransferase, to reduce potential overlap between the training and testing sets. All three models are trained on Stoiber_Ecoli_CG_SssI and Simpson_Hsapiens_CG_SssI, separately. Simpson_Hsapiens_CG_SssI is sequenced by the same group on a different species, while Stoiber_Ecoli_CG_SssI is sequenced by a different group on the same species. We use the METEORE pipeline (Yuen et al., ) to generate violin plots of the model predictions on each mixture. Pearson's correlation r, the coefficient of determination r² and the root mean square error (RMSE) are used as the evaluation metrics for each model.

With Simpson_Hsapiens_CG_SssI as training data, all three models achieve performances ranked next to the best reported results, those of Megalodon (r = ., r² = ., RMSE = .), on the dataset (Yuen et al., ). BiRNN achieves the best Pearson correlation r = . and r² = ., while the refined BERT achieves the minimal RMSE of . among the three evaluated models. When Stoiber_Ecoli_CG_SssI is used for training, the performance of all three models decreases. This indicates the challenge of using datasets sequenced by different research groups. Here, both BERT models perform better than BiRNN, as shown in panel (b) of the violin plot figure. The refined
BERT achieves the best r = ., r² = . and RMSE of . among the three models, which demonstrates its generalization ability on datasets sequenced by different research groups. Based on the reported benchmark results, its Pearson correlation ranks between those reported for DeepMod and DeepSignal (Megalodon > DeepMod MixModel ( . ) > refined BERT > DeepSignal human_hx ( . ) > Guppy > Nanopolish > Tombo).

Fig. : Violin plots of prediction results of models trained on different datasets. (a) Models trained with the Simpson_Hsapiens_CG_SssI dataset. (b) Models trained with the Stoiber_Ecoli_CG_SssI dataset.

Model inference speed

The main motivation for applying BERT models is to use a non-recurrent modeling approach for the nanopore methylation detection task in order to improve model inference speed. We performed a speed test on a server with CPU cores (Intel(R) Xeon(R) Gold CPU @ . GHz) and one V NVIDIA GPU card. During the run, the CPUs are responsible for data loading and feature extraction, while the GPU performs model inference.

Table . Model inference time and total running time on the benchmark dataset for all reads. Columns: model, model inference time (s), total running time (s). Rows: BiRNN, BERT_basic, BERT_refined.
We tested the model inference time and the total running time of the three models on the benchmark dataset. For each mixture split, we repeated the run times and took the average. As shown in the timing table, the model inference speed of the BERT models is around x∼ x faster than that of the BiRNN model (BERT_refined: . x, BERT_basic: . x). The refined BERT is only slightly slower at inference than the basic BERT model. The gap in total running time is smaller (BERT_refined: . x, BERT_basic: . x), because data I/O and feature extraction take most of the time. In the current implementation of BERT, we use reads as the basic data unit and integrate the data pre-processing into the read-batch loading process. The data I/O and feature extraction steps can be further accelerated.

Discussion

BERT is commonly used in a pre-training and fine-tuning approach. In the pre-training phase, BERT learns bi-directional representations from unlabeled data. The learned feature representations are then fine-tuned on task-specific data. This approach has led to state-of-the-art results on many downstream tasks in language understanding. Depending on the data scale, the number of BERT parameters is usually large, and training such a model requires a huge amount of computational resources. For example, the BERT used for natural language modeling has a parameter scale ranging from M to M (Devlin et al., ). In this work, we did not follow this scheme. Instead, we used the BERT model architecture to provide a lightweight, non-recurrent replacement for the recurrent BiRNN model. In our experiment, the BERT uses three attention layers with attention heads and hidden units for each layer. The total number of model parameters is around . M, which is even less than that of BiRNN ( . M). In the future, when more nanopore methylation data becomes available, a larger BERT model and a pre-training and fine-tuning scheme can be further explored.
Conclusion

In this work, we explored applying BERT models to nanopore methylation detection, with the aim of using a non-recurrent modeling approach for fast inference. We quantified the positional signal-shift related to methylation for different datasets of specific motifs and methylases and found patterns across datasets. During evaluation, we found that the original BERT architecture does not work as well as BiRNN. We therefore proposed a refined BERT that builds task-specific characteristics into the model. Compared with the original BERT, the refined BERT uses learnable positional encoding and self-attention with relative position representation, and focuses more on the center positions within a ± bp range. The experimental results show that the refined BERT achieves competitive and even better results than the state-of-the-art BiRNN model on a set of 5mC and 6mA benchmark datasets, while its model inference speed is about x faster. In the cross-sample evaluation, when the training and test data come from different research groups, the BERT models (including the original BERT) perform better than BiRNN.

Acknowledgements

We would like to thank Marcus Stoiber and Jared Simpson for making nanopore methylation data publicly available, Zaka Wing-Sze Yuen for providing the benchmark dataset and pipeline, and the authors of DeepMod and DeepSignal for providing their source code.

References

Devlin, J. et al. ( ). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv: . .
Huang, Z. et al. ( ). Improve transformer models with better relative position embeddings. arXiv preprint arXiv: . .
Kim, D. et al. ( ). The architecture of SARS-CoV-2 transcriptome. Cell, ( ), – .
Kingma, D. P. and Ba, J. ( ). Adam: A method for stochastic optimization. arXiv preprint arXiv: . .
Liu, Q. et al. ( ). Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nature Communications, ( ), – .
Ni, P. et al. ( ).
DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics, ( ), – .
Shaw, P. et al. ( ). Self-attention with relative position representations. arXiv preprint arXiv: . .
Simpson, J. T. et al. ( ). Detecting DNA cytosine methylation using nanopore sequencing. Nature Methods, ( ), .
Stoiber, M. H. et al. ( ). De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv, page .
Vaswani, A. et al. ( ). Attention is all you need. Pages – .
Yuen, Z. W.-S. et al. ( ). Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing. bioRxiv.

A comparative study of genomic adaptations to low nitrogen availability in Genlisea aurea

Thibaut Goldsborough (University of St Andrews, tg @st-andrews.ac.uk)

Abstract:

Genlisea aurea is a carnivorous plant that grows on nitrogen-poor, waterlogged sandstone plateaus and is thought to have evolved carnivory as an adaptation to very low nitrogen levels in its habitat. The plant is also unusual for having one of the smallest genomes among flowering plants. Genomic DNA is known to have a high nitrogen content and yet, to the author's knowledge, no published study has linked nitrogen starvation of G. aurea with genome size reduction. This comparative study of the carnivorous plant G.
aurea, the model organism Arabidopsis thaliana (Brassicaceae) and the nitrogen-fixing Trifolium pratense (Fabaceae) attempts to investigate whether the genome, transcriptome and proteome of G. aurea show evidence of adaptations to low nitrogen availability. It was found that although G. aurea's genome, CDS and non-coding DNA were much lower in nitrogen than those of T. pratense and A. thaliana, this was solely due to the length of the genome, CDS and non-coding sequences rather than the composition of these sequences.

Introduction:

Genlisea aurea (Lentibulariaceae) is a carnivorous plant found in Brazil that grows on waterlogged sandstone plateaus. It is thought to have evolved carnivory as an adaptation to very low nitrogen levels in its habitat (Müller K. et al. ). G. aurea is also unusual for having one of the smallest genomes among flowering plants, with a genome length of just . Mb, resulting from a process called genome reduction in which intergenic regions and duplicated genes are removed (Leushkin, E.V. et al. ). The tiny genome of another carnivorous plant, Utricularia gibba (Lentibulariaceae), shows that G. aurea is not the only carnivorous plant growing in nitrogen-poor habitats to have undergone genome size reduction (Ibarra-Laclette, E. et al. ). Despite DNA having a high nitrogen content, to the author's knowledge, no published study has linked nitrogen starvation of G. aurea with genome size reduction. This project investigates whether the genome, transcriptome and proteome of G. aurea show evidence of adaptations to low nitrogen availability.

In this study, the genome of G. aurea is compared to that of the model organism Arabidopsis thaliana (Brassicaceae) and to that of the nitrogen-fixing Trifolium pratense (Fabaceae). T. pratense is known to fix nitrogen gas (N2) from the atmosphere with the help of nitrogen-fixing bacteria found in its roots (Davey A.G. et al. ). For this reason, T.
pratense was taken as a control for a plant that is not nitrogen deprived.

Reduced nitrogen usage in proteomes has already been recorded when comparing plant proteins and animal proteins. Plants are generally regarded as nitrogen limited in comparison with animals, and one study found a . % reduction in nitrogen use in the amino acid side chains of plant proteins compared to animal proteins (Acquisti, C., Kumar, S. and Elser, J.J. ). Another study found that parasitic microorganisms showed altered codon usage and genome composition as a response to nitrogen limitation (Seward, E.A. and Kelly, S. ).

bioRxiv preprint, this version posted February , . The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY . International license (http://creativecommons.org/licenses/by/ . /).

Methods:

All genomic data were obtained from the NCBI genome database GenBank (available at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant). For all three species, the genome FASTA file, the genome annotation GFF file and the complete CDS were obtained.

1) Determination of genomic nitrogen content

The genomic nitrogen content of each species was calculated from the genome FASTA file with a Python script, by counting the number of occurrences of each nucleotide and multiplying by the corresponding number of nitrogen atoms. A guanine-cytosine pair has eight nitrogen atoms, whereas an adenine-thymine pair has only seven. The nitrogen content of the entire genome was determined first; the nitrogen content of the CDS and non-CDS regions was then calculated using the CDS file.
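The counting step for genomic DNA can be sketched as follows. This is a minimal illustration, not the author's script: the function names and the toy FASTA content are made up, each base in the FASTA sequence is assumed to stand for one base pair, and ambiguous bases such as N are assumed to be skipped.

```python
from collections import Counter

# Nitrogen atoms per base pair: A-T has 7 (5 + 2), G-C has 8 (5 + 3).
NITROGEN_PER_BASE_PAIR = {"A": 7, "T": 7, "G": 8, "C": 8}


def read_fasta(lines):
    """Yield the sequence lines of a FASTA file, skipping '>' headers."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line


def nitrogen_content(sequences):
    """Return (total nitrogen atoms, nitrogen atoms per base pair)."""
    counts = Counter()
    for sequence in sequences:
        counts.update(sequence.upper())
    # Only unambiguous bases are counted (an assumption of this sketch).
    base_pairs = sum(counts[base] for base in NITROGEN_PER_BASE_PAIR)
    nitrogen = sum(
        atoms * counts[base] for base, atoms in NITROGEN_PER_BASE_PAIR.items()
    )
    per_pair = nitrogen / base_pairs if base_pairs else 0.0
    return nitrogen, per_pair


# Made-up toy input; a real run would pass an open genome FASTA file instead.
toy_fasta = [">contig_1 (made-up example)", "ATGCGC", "NNAT"]
total, per_pair = nitrogen_content(read_fasta(toy_fasta))
print(total, per_pair)
```

Returning both the total and the per-base-pair average keeps the two quantities separate: the total scales with genome length, while the average depends only on base composition.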
2) Determination of transcriptomic nitrogen content and codon usage bias

The nitrogen content of all the pre-mRNA was determined by transcribing the regions annotated as 'gene' in the GFF file using the Biopython (Bio.Seq) library. Adenine has 5 nitrogen atoms, uracil has 2, guanine has 5 and cytosine has 3. The nitrogen content of the introns and exons was also determined separately. Finally, a codon usage table was obtained to examine preferential codon usage.

3) Determination of proteome nitrogen content

The nitrogen content of all the proteins encoded by the CDS regions of each species was determined by counting the occurrences of each amino acid and multiplying by the corresponding number of nitrogen atoms. The CDS sequences were converted to amino acid sequences using the Biopython library.

4) Determination of transfer-RNA nitrogen content and usage

tRNA genes were identified in each genome using the tRNAscan-SE software by Lowe, T.M. and Chan, P.P. ( ). tRNAscan-SE uses an advanced methodology for tRNA gene detection and functional prediction (determination of the tRNA anticodon). The nitrogen content of the tRNAs was determined by combining the tRNAscan-SE results with a Python script. Using the codon usage tables obtained in part 2), it was possible to link codon biases with the nitrogen content of the corresponding tRNAs. The aim was to examine whether, among codons coding for the same amino acid, codons whose tRNAs had a low nitrogen content were more frequently represented than codons associated with tRNAs of higher nitrogen content.

Results and Discussion:

Investigating the nitrogen content of the genomes of the three species reveals that G. aurea has a considerably lower number of nitrogen atoms in its genome than the two other plant species. A comparison with T. pratense shows that the vast difference in nitrogen content can mostly be explained
by the reduction of the number of nitrogen atoms in the non-coding DNA sequences of G. aurea (Fig. 1). Although T. pratense has times more nitrogen in its CDS than G. aurea, the carnivorous plant has times less nitrogen in its non-coding DNA sequences. At first glance, this observation supports the theory that the genome reduction of G. aurea was motivated by nitrogen starvation. However, it should come as no surprise that a vast reduction in genome size is accompanied by a vast reduction in genomic nitrogen content. This observation alone does not show whether G. aurea preferentially uses nitrogen-poor nucleotides (A-T base pairs).

Figure 1: Number of nitrogen atoms in the entire genomic DNA, CDS and non-coding DNA of G. aurea (red), A. thaliana (blue) and T. pratense (green).

In this report, the term molecular unit refers to a DNA base pair, an RNA nucleotide or a protein amino acid. Relative nitrogen content refers to the average number of nitrogen atoms per molecular unit. Upon examination of the relative nitrogen content of the DNA, RNA and protein of the three plant species, an unexpected pattern emerges (Fig. 2). The nitrogen-starved carnivorous plant has higher nitrogen counts per molecular unit in genomic DNA, CDS, non-coding DNA, protein, mRNA, exons and introns. These data do not support the hypothesis that nitrogen starvation has caused preferential usage of molecular units that are lower in nitrogen. Inter-species variation aside, CDS DNA was found to be higher in nitrogen than non-coding DNA, and similarly exons were found to be higher in nitrogen than introns. Interestingly, in all plots, A.
thaliana, which has an intermediate genome length compared to G. aurea and T. pratense, was also found to have an intermediate nitrogen usage.

[Figure 1 panels: Genomic DNA, CDS, Non-coding DNA; y-axis: nitrogen atoms; x-axis: G. aurea, A. thaliana, T. pratense.]

A first explanation of why G. aurea has a higher nitrogen usage in its DNA, RNA and proteins could be that there is not enough selective pressure on each molecular unit, because each molecular change saves only a few nitrogen atoms. For example, a single substitution of a GC base pair by an AT base pair lowers nitrogen usage by only one nitrogen atom. In practice, it may be easier to remove whole non-coding or repeated DNA sequences to optimize nitrogen usage. Some RNA transcripts may be expressed only for very short periods of the plant's life cycle, which further reduces the selective pressure for nitrogen optimization in these transcripts. However, this cannot explain why longer genomes are associated with lower nitrogen usage per molecular unit, at least for the three species considered in this project. It is possible that the tiny genome of G. aurea, combined with the additional nitrogen captured through carnivory, gives the species more leeway to use nitrogen-rich amino acids and nucleotides. Finally, a last hypothesis is that G.
aurea is actually using its transcriptome and proteome as a nitrogen bank. The high nitrogen content and the ubiquitous recycling of RNA and proteins in cells could make nitrogen storage in proteomes and transcriptomes possible.

Figure 2: Average number of nitrogen atoms per molecular unit in genomic DNA, CDS, non-coding DNA, protein, mRNA, exons, introns and tRNA of G. aurea (red), A. thaliana (blue) and T. pratense (green). Error bars correspond to % confidence intervals. In this figure, mRNA refers to the pre-mRNA that has not undergone removal of introns by splicing. Three different scales have been set for DNA sequences (A-C), protein sequences (D) and RNA sequences (E-H).

[Figure 2 panels: (A) genomic DNA, (B) CDS, (C) non-coding DNA, in nitrogen atoms per base pair; (D) protein, in nitrogen atoms per amino acid; (E) mRNA, (F) exons, (G) introns, (H) tRNA, in nitrogen atoms per nucleotide; x-axis: G. aurea, A. thaliana, T. pratense.]

Interestingly, the relative nitrogen content of tRNA may be lower in G.
aurea than in the two other species (Fig. 2, plot H). RNA sequencing data from Westermann, A., Gorski, S. and Vogel, J. ( ) show that in eukaryotic cells there is about times more tRNA than mRNA (measured by weight). The paper also states that cells contain almost times more rRNA than tRNA; however, due to time constraints and the generally poor annotation of rRNA genes, rRNA nitrogen content was not determined. The fact that G. aurea had a lower nitrogen content in tRNA sequences but not in other types of RNA or DNA sequences supports the hypothesis that there is not enough selective pressure on each molecular unit of DNA and mRNA to motivate nucleotide substitutions.

Figure 3: Bar graph representing the codon usage bias and tRNA nitrogen content in G. aurea. For each amino acid, the codon usage bias was determined and the relative proportion of each codon is represented. Codons that are complementary to tRNAs that are low in nitrogen are lighter in colour than codons complementary to tRNAs that are rich in nitrogen. When no tRNA sequences were found, the codon is represented in grey. When multiple tRNA sequences were found for a single codon, the average nitrogen content of the tRNAs is represented. The colour scale bar ranges from N atoms (pure white) to N atoms (pure red). Note that tryptophan (W) was removed for aesthetic reasons; no tRNA gene was found by tRNAscan-SE for tryptophan (grey).

Studying the codon usage bias of G.
aurea (Figure 3) revealed that the carnivorous plant uses the entire genetic code. This preliminary attempt to link codon usage bias with tRNA nitrogen content cannot establish whether codons complementary to nitrogen-rich tRNAs are less represented than codons complementary to nitrogen-poor tRNAs. For the majority of codons, multiple tRNA sequences were found to be complementary to that codon, and taking the mean nitrogen content of these sequences assumes that all the tRNAs are equally expressed in G. aurea. This is extremely unlikely, so sequencing the tRNA transcriptome of G. aurea is the only way to make these data more accurate.

Conclusion:

This comparative study of the carnivorous plant Genlisea aurea, the model organism Arabidopsis thaliana (Brassicaceae) and the nitrogen-fixing Trifolium pratense (Fabaceae) attempted to investigate whether the genome, transcriptome and proteome of G. aurea showed evidence of adaptations to low nitrogen availability. It was found that although G. aurea's genome, CDS and non-coding DNA were much lower in nitrogen than those of T. pratense and A. thaliana, this was solely due to the length of the genome, CDS and non-coding sequences rather than the composition of these sequences. In fact, in the genomic DNA, CDS, non-coding DNA, mRNA, exons, introns and proteins of G. aurea, the relative nitrogen content was found to be greater than in the two other species, suggesting that nitrogen starvation might not put enough selective pressure on each molecular unit to motivate nucleotide substitutions. In tRNA sequences, which are about times more abundant than mRNA in eukaryotes, G. aurea may have a lower relative nitrogen content. Finally, an attempt to link codon usage bias with the nitrogen content of complementary tRNAs proved inconclusive, possibly because multiple tRNAs can be complementary to a single codon.
Future studies should determine the relative nitrogen content of ribosomal RNAs and perform transcriptome sequencing to determine the nitrogen content of the three species' transcriptomes.

References:

Acquisti, C., Kumar, S. and Elser, J.J. ( ) Signatures of nitrogen limitation in the elemental composition of the proteins involved in the metabolic apparatus. Royal Society, Biological Sciences ; : – .
Carlsson, G., Huss-Danell, K. ( ) Nitrogen fixation in perennial forage legumes in the field. Plant and Soil , – . https://doi.org/ . /a:
Ibarra-Laclette, E., Lyons, E., Hernández-Guzmán, G. et al. ( ) Architecture and evolution of a minute plant genome. Nature , – . https://doi.org/ . /nature
Leushkin, E.V., Sutormin, R.A., Nabieva, E.R. et al. ( ) The miniature genome of a carnivorous plant Genlisea aurea contains a low number of genes and short non-coding sequences. BMC Genomics , ( ). https://doi.org/ . / - - -
Lowe, T.M. and Chan, P.P. ( ) tRNAscan-SE On-line: search and contextual analysis of transfer RNA genes. Nucleic Acids Research. : W - .
Müller K, Borsch T, Legendre L, Porembski S, Theisen I, Barthlott W. ( ) Evolution of carnivory in Lentibulariaceae and the Lamiales. Plant Biology, Jul; ( ): - . doi: . /s- - . PMID: .
Seward, E.A. and Kelly, S. ( ) Dietary nitrogen alters codon bias and genome composition in parasitic microorganisms. Genome Biology , . https://doi.org/ . /s - - -
Westermann, A., Gorski, S. and Vogel, J. ( ) Dual RNA-seq of pathogen and host. Nature Reviews Microbiology , – . https://doi.org/ . /nrmicro
ACE: Explaining cluster from an adversarial perspective

Yang Young Lu, Timothy C. Yu, Giancarlo Bonora, William Stafford Noble

Abstract

A common workflow in single-cell RNA-seq analysis is to project the data to a latent space, cluster the cells in that space, and identify sets of marker genes that explain the differences among the discovered clusters. A primary drawback of this three-step procedure is that each step is carried out independently, thereby neglecting the effects of the nonlinear embedding and inter-gene dependencies on the selection of marker genes. Here we propose an integrated deep learning framework, Adversarial Clustering Explanation (ACE), that bundles all three steps into a single workflow. The method thus moves away from the notion of "marker genes" to instead identify a panel of explanatory genes. This panel may include genes that are not only enriched but also depleted relative to other cell types, as well as genes that exhibit differences between closely related cell types. Empirically, we demonstrate that ACE is able to identify gene panels that are both highly discriminative and nonredundant, and we demonstrate the applicability of ACE to an image recognition task.
Department of Genome Sciences, University of Washington, Seattle, WA; Graduate Program in Molecular and Cellular Biology, University of Washington, Seattle, WA; Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA. Correspondence to: William Stafford Noble. Preliminary work. Under review.

1. Introduction

Single-cell sequencing technology has enabled the high-throughput interrogation of many aspects of genome biology, including gene expression, DNA methylation, histone modification, chromatin accessibility and genome 3D architecture (Stuart & Satija, ). In each of these cases, the resulting high-dimensional data can be represented as a sparse matrix in which rows correspond to cells and columns correspond to features of those cells (gene expression values, methylation events, etc.). Empirical evidence suggests that this data resides on a low-dimensional manifold with latent semantic structure (Welch et al., ). Accordingly, identifying groups of cells in terms of their inherent latent semantics and thereafter reasoning about the differences between these groups is an important area of research (Plumb et al., ).

In this study, we focus on the analysis of single-cell RNA-seq (scRNA-seq) data. This is the most widely available type of single-cell sequencing data, and its analysis is challenging not only because of the data’s high dimensionality but also due to noise, batch effects, and sparsity (Amodio et al., ). The scRNA-seq data itself is represented as a sparse, cell-by-gene matrix, typically with tens to hundreds of thousands of cells and tens of thousands of genes.
A common workflow in scRNA-seq analysis (Pliner et al., ) consists of three steps: (1) learn a compact representation of the data by projecting the cells to a lower-dimensional space; (2) identify groups of cells that are similar to each other in the low-dimensional representation, typically via clustering; and (3) characterize the differences in gene expression among the groups, with the goal of understanding what biological processes are relevant to each group. Optionally, known “marker genes” may be used to assign cell type labels to the identified cell groups.

A primary drawback to the above three-step procedure is that each step is carried out independently. Here, we propose an integrated, deep learning framework for scRNA-seq analysis, Adversarial Clustering Explanation (ACE), that projects scRNA-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters (Figure 1). At a high level, ACE first “neuralizes” the clustering procedure by reformulating it as a functionally equivalent multi-layer neural network (Kauffmann et al., ). In this way, in concatenation with a deep autoencoder that generates the low-dimensional representation, ACE is able to attribute the cell’s group assignments all the way back to the input genes by leveraging gradient-based neural network explanation methods. Next, for each sample, ACE seeks small perturbations of its input gene expression profile that lead the neuralized clustering model to alter the group assignments. These adversarial perturbations allow ACE to define a concise gene set signature for each cluster or pair of clusters. In particular, ACE attempts to answer the question, “For a given cell cluster, can we identify a subset of genes whose expression profiles are sufficient to identify members
of this cluster?” We frame this problem as a ranking task, where thresholding the ranked list yields a set of explanatory genes.

ACE’s joint modeling approach offers several benefits relative to the existing state of the art. First, most existing methods for the third step of the analysis pipeline (identifying genes associated with a given group of cells) treat each gene independently (Love et al., ). These approaches ignore the dependencies among genes that are induced by gene networks, and often yield lists of genes that are highly redundant. ACE, in contrast, aims to find a small set of genes that jointly explain a given cluster or pair of clusters. Second, most current methods identify genes associated with a group of cells without considering the nonlinear embedding model which maps the gene expression to the low-dimensional representation where the groups are defined in the first place. To our knowledge, the only exception is the Global Counterfactual Explanation (GCE) algorithm (Plumb et al., ), but that algorithm is limited to using a linear transformation. A third advantage of ACE’s integrated approach is its ability to take into account batch effects during the assignment of genes to clusters. Standard nonlinear embedding methods, such as t-SNE (van der Maaten & Hinton, ) and UMAP (McInnes & Healy, ; Becht et al., ), cannot take such structure into account and hence may lead to incorrect interpretation of the data (Amodio et al., ; Li et al., ). To address this problem, deep autoencoders with integrated denoising and batch correction can be used for scRNA-seq analysis (Lopez et al., ; Amodio et al., ; Li et al., ). We demonstrate below that batch effect structure can be usefully incorporated into the ACE model.
A notable feature of ACE’s approach is that, by identifying genes jointly, the method moves away from the notion of a “marker gene” to instead identify a “gene panel”. As such, genes in the panel may not be solely enriched in a single cluster, but may together be predictive of the cluster. In particular, in addition to a ranking of genes, ACE assigns a Boolean to each gene indicating whether its inclusion in the panel is positive or negative, i.e., whether the gene’s expression is enriched or depleted relative to cluster membership. We have applied ACE to both simulated and real datasets to demonstrate its empirical utility. Our experiments demonstrate that ACE identifies gene panels that are highly discriminative and exhibit low redundancy. We further provide results suggesting that ACE is useful in domains beyond biology, such as image recognition.

The Apache licensed source code of ACE (see submitted file) will be made publicly available upon acceptance.

2. Related work

ACE falls into the paradigm of deep neural network interpretation methods, which have been developed primarily in the context of classification problems. These methods can be loosely categorized into three types: feature attribution methods, counterfactual-based methods, and model-agnostic approximation methods. Feature attribution methods assign an importance score to individual features so that higher scores indicate higher importance to the output prediction (Simonyan et al., ; Shrikumar et al., ; Lundberg & Lee, ). Counterfactual-based methods typically identify the important subregions within an input sample by perturbing the subregions (by adding noise, rescaling (Sundararajan et al., ), blurring (Fong & Vedaldi, ), or inpainting (Chang et al., )) and measuring the resulting changes in the predictions.
Lastly, model-agnostic approximation methods approximate the model being explained by using a simpler, surrogate function which is self-explainable (e.g., a sparse linear model) (Ribeiro et al., ). Recently, some interpretation methods have emerged to understand models beyond classification tasks (Samek et al., ; Kauffmann et al., ; ), including the one we present in this paper for the purpose of cluster explanation.

ACE’s perturbation approach draws inspiration from adversarial machine learning (Xu et al., ), where imperceptible perturbations are maliciously crafted to mislead a machine learning model into predicting incorrect outputs. In particular, ACE’s approach is closest to the setting of a “white-box attack,” which assumes complete knowledge of the model, including its parameters, architecture, gradients, etc. (Szegedy et al., ; Kurakin et al., ; Madry et al., ; Carlini & Wagner, ). In contrast to these methods, ACE re-purposes the malicious adversarial attack for a constructive purpose, identifying sets of genes that explain clusters in scRNA-seq data.

ACE operates in concatenation with a deep autoencoder that generates the low-dimensional representation. In this paper, ACE uses SAUCIE (Amodio et al., ), a commonly used scRNA-seq embedding method that incorporates batch correction. In principle, ACE is generalizable to any off-the-shelf scRNA-seq embedding method, including SLICER (Welch et al., ), scVI (Way & Greene, ), scANVI (Xu et al., ), DESC (Li et al., ), and ItClust (Hu et al., ).

3. Approach

3.1. Problem setup

We aim to carry out three analysis steps for a given scRNA-seq dataset, producing a low-dimensional representation of each cell’s expression profile, a cluster assignment for each cell, and a concise set of “explanatory genes” for each
cluster or pair of clusters.

Figure 1. ACE workflow. ACE takes as input a single-cell gene expression matrix and learns a low-dimensional representation for each cell. Next, a neuralized version of the k-means algorithm is applied to the learned representation to identify cell groups. Finally, for pairs of groups of interest (either each group compared to its complement, or all pairs of groups), ACE seeks small perturbations of its input gene expression profile that lead the neuralized clustering model to alter the assignment from one group to the other. The workflow employs a combined objective function to induce the nonlinear embedding and clustering jointly. ACE produces as output the learned embedding, the cell group assignments, and a ranked list of explanatory genes for each cell group.
[Figure 1 panels: input gene expression matrix (cells by genes); deep autoencoder (encoder, decoder) learns the low-dimensional embedding; clustering is neuralized and concatenated with the encoder; differentiation analysis by ACE perturbs a cell from the source group assignment to the target group assignment; output: gene relevance ranking.]

Let $X = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^{n \times p}$ be the normalized gene expression matrix obtained from a scRNA-seq experiment, where rows correspond to $n$ cells and columns correspond to $p$ genes. ACE relies on the following three components: (1) an autoencoder to learn a low-dimensional representation of the scRNA-seq data, (2) a neuralized clustering algorithm to identify groups of cells in the low-dimensional representation, and (3) an adversarial perturbation scheme to explain differences between groups by identifying explanatory gene sets.
3.2. Learning the low-dimensional representation

Embedding scRNA-seq expression data into a low-dimensional space aims to capture the underlying structure of the data, based upon the assumption that the biological manifold on which cellular expression profiles lie is inherently low-dimensional. Specifically, ACE aims to learn a mapping $f(\cdot) : \mathbb{R}^p \to \mathbb{R}^d$ that transforms the cells from the high-dimensional input space $\mathbb{R}^p$ to a lower-dimensional embedding space $\mathbb{R}^d$, where $d \ll p$. To accurately represent the data in $\mathbb{R}^d$, we use an autoencoder consisting of two components, an encoder $f(\cdot) : \mathbb{R}^p \to \mathbb{R}^d$ and a decoder $g(\cdot) : \mathbb{R}^d \to \mathbb{R}^p$. This autoencoder optimizes the generic loss

$$\min_{\theta} \sum_{i=1}^{n} \| x_i - g(f(x_i)) \|_2^2 \tag{1}$$

Finally, we denote $Z = (z_1, z_2, \dots, z_n)^T \in \mathbb{R}^{n \times d}$ as the low-dimensional representation obtained from the encoder, where $z_i = f(x_i) \in \mathbb{R}^d$ is the embedded representation of cell $x_i$.

The autoencoder in ACE can be extended in several important ways. For example, in some settings, Equation 1 is augmented with a task-specific regularizer $\Omega(X)$:

$$\min_{\theta} \sum_{i=1}^{n} \| x_i - g(f(x_i)) \|_2^2 + \Omega(X). \tag{2}$$

As mentioned in Section 2, the scRNA-seq embedding method used by ACE, SAUCIE, encodes in $\Omega(X)$ a batch correction regularizer by using maximum mean discrepancy. In this paper, ACE uses SAUCIE coupled with a feature selection layer (Abid et al., ), with the aim of minimizing redundancy and facilitating selection of diverse explanatory gene sets.

3.3. Neuralizing the clustering step

To carry out clustering in the low-dimensional space learned by the autoencoder, ACE uses a neuralized version of the k-means algorithm. This clustering step aims to partition $Z \in \mathbb{R}^{n \times d}$ into $C$ groups, where each group potentially corresponds to a distinct cell type.

The standard k-means algorithm aims to minimize the following objective function by identifying a set of group centroids $\{ \mu_c \in \mathbb{R}^d : c = 1, 2, \dots, C \}$:

$$\min \sum_{i,c} \delta_{ic} \, o_c(z_i) \tag{3}$$

where $\delta_{ic}$ indicates whether cell $z_i$ belongs to group $c$, and the “outlierness” measure $o_c(z_i)$ of cell $z_i$ relative to group $c$ is defined as $o_c(z_i) = \| z_i - \mu_c \|^2$.

Following Kauffmann et al. ( ), we neuralize the k-means algorithm by creating a neural network containing $C$ modules, each with two layers. The architecture is motivated by a soft assignment function that quantifies, for a particular cell $z_i$ and a specified group $c$, the group assignment probability score

$$P_c(z_i) = \frac{\exp(-\beta \, o_c(z_i))}{\sum_k \exp(-\beta \, o_k(z_i))} \tag{4}$$

where the hyperparameter $\beta$ controls the clustering fuzziness. As $\beta$ approaches infinity, Equation 4 approaches the indicator function for the closest centroid and thus reduces to hard clustering. To measure the confidence of group assignment, we use a logit function written as

$$m_c(z_i) = \log\left( \frac{P_c(z_i)}{1 - P_c(z_i)} \right) = \beta \cdot \min^{\beta}_{k \neq c} \left\{ \| z_i - \mu_k \|^2 - \| z_i - \mu_c \|^2 \right\} \tag{5}$$

where $\min^{\beta}_{k \neq c}\{\cdot\} = -\frac{1}{\beta} \log \sum_{k \neq c} \exp(-\beta(\cdot))$ indicates a soft min-pooling layer. (See Kauffmann et al. ( ) for a detailed derivation.) The rationale for using the logit function is that if there is as much confidence supporting the group membership as against it, then the confidence score $m_c(z) = 0$. Additionally, Equation 5 has the following interpretation: the data point $z$ belongs to the group $c$ if and only if the distance to its centroid is smaller than the distance to all other competing groups. Equation 5 further decomposes into a two-layer neural network module:

$$h_{ck}(z_i) = w_{ck}^T z_i + b_{ck}, \qquad m_c(z_i) = \beta \cdot \min^{\beta}_{k \neq c} \{ h_{ck}(z_i) \} \tag{6}$$

where the first layer is a linear transformation layer with parameters $w_{ck} = 2 \cdot (\mu_c - \mu_k)$ and $b_{ck} = \| \mu_k \|^2 - \| \mu_c \|^2$, and the second layer is the soft min-pooling layer introduced in Equation 5. ACE constructs one such module for each of the $C$ clusters, as illustrated in Figure 1.
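The equivalence between the soft-assignment logit and the two-layer module described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the toy centroids, and the value of beta are assumptions made for illustration, and plain Python lists stand in for the embedding vectors.

```python
import math


def outlierness(z, mu):
    """o_c(z) = ||z - mu_c||^2, the squared distance to a centroid."""
    return sum((a - b) ** 2 for a, b in zip(z, mu))


def soft_assignment(z, centroids, beta):
    """P_c(z): softmax over clusters of -beta * o_c(z)."""
    scores = [-beta * outlierness(z, mu) for mu in centroids]
    top = max(scores)  # shift by the max so exp() cannot overflow
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def soft_min(values, beta):
    """Soft min-pooling: -(1/beta) * log(sum_k exp(-beta * v_k))."""
    low = min(values)  # factor out the minimum for numerical stability
    return low - math.log(sum(math.exp(-beta * (v - low)) for v in values)) / beta


def cluster_logit(z, centroids, c, beta):
    """m_c(z) from the two-layer module: linear layer, then soft min-pooling."""
    mu_c = centroids[c]
    hidden = []
    for k, mu_k in enumerate(centroids):
        if k == c:
            continue
        w = [2.0 * (a - b) for a, b in zip(mu_c, mu_k)]  # w_ck = 2 (mu_c - mu_k)
        b = sum(v * v for v in mu_k) - sum(v * v for v in mu_c)  # b_ck
        hidden.append(sum(wi * zi for wi, zi in zip(w, z)) + b)
    return beta * soft_min(hidden, beta)


# Toy embedding with three centroids (values chosen only for illustration).
centroids = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
z = [0.5, 0.2]
probs = soft_assignment(z, centroids, beta=1.5)
logits = [cluster_logit(z, centroids, c, beta=1.5) for c in range(3)]
```

Each entry of `logits` should agree with `log(P_c / (1 - P_c))` computed from `probs`, and only the cluster whose centroid is closest to `z` receives a positive logit.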
3.4. Explaining the groups

ACE’s final step aims to induce, for each cluster identified by the neuralized k-means algorithm, a ranking on genes such that highly ranked genes best explain that cluster. We consider two variants of this task: the one-vs-rest setting compares the group of interest $Z_s = f(X_s) \subseteq Z$ to its complement set $Z_t = f(X_t) \subseteq Z$, where $X_t = X \setminus X_s$; the one-vs-one setting compares one group of interest $Z_s = f(X_s) \subseteq Z$ to a second group of interest $Z_t = f(X_t) \subseteq Z$. In each setting, the goal is to identify the key differences between the source group $X_s \subseteq X$ and the target group $X_t \subseteq X$ in the input space, i.e., in terms of the genes.

We treat this as a neural network explanation problem by finding the minimal perturbation within the group of interest, $x \in X_s$, that alters the group assignment from the source group $s$ to the target group $t$. Specifically, we optimize an objective function that is a mixture of two terms: the first term is the difference between the current sample $x$ and the perturbed sample $\hat{x} = x + \delta$, where $\delta \in \mathbb{R}^p$, and the second term quantifies the difference in group assignments induced by the perturbation.

The objective function for the one-vs-one setting is

$$\min_{\delta} \| \delta \|_1 + \lambda \max(0, \alpha + m_s(x + \delta) - m_t(x + \delta)) \tag{7}$$

where $\lambda > 0$ is a tradeoff coefficient that encourages a small perturbation of $x$ when small and a stronger alteration toward the target group when large. The second term penalizes the situation where the group logit for the source group $s$ is still larger than that of the target group $t$, up to a pre-specified margin $\alpha > 0$. In this paper we fix α = . . The difference between the current sample $x$ and the perturbed sample $\hat{x}$ is measured by the L1 norm to encourage sparsity and non-redundancy. Note that Equation 7 assumes that the input expression matrix is normalized so that a perturbation added to one gene is equivalent to that same perturbation added to a different gene.
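To make the one-vs-one objective concrete, the sketch below minimizes it on a toy problem and ranks genes by the absolute value of the resulting perturbation. Several things here are assumptions rather than details from the paper: the embedding is taken to be the identity map, the solver is a proximal subgradient loop with finite-difference gradients chosen for brevity, and `lam`, `alpha`, `step`, `iters`, and the toy data are illustrative values.

```python
import math


def logit(x, centroids, c, beta=1.0):
    """Cluster logit m_c(x); the embedding is the identity map in this toy."""
    dist = [sum((a - b) ** 2 for a, b in zip(x, mu)) for mu in centroids]
    gaps = [dist[k] - dist[c] for k in range(len(centroids)) if k != c]
    low = min(gaps)
    soft_min = low - math.log(sum(math.exp(-beta * (g - low)) for g in gaps)) / beta
    return beta * soft_min


def penalty(delta, x, centroids, s, t, lam, alpha):
    """lam * max(0, alpha + m_s(x + delta) - m_t(x + delta))."""
    moved = [xi + di for xi, di in zip(x, delta)]
    return lam * max(0.0, alpha + logit(moved, centroids, s) - logit(moved, centroids, t))


def objective(delta, x, centroids, s, t, lam, alpha):
    """L1 norm of the perturbation plus the hinge penalty."""
    return sum(abs(d) for d in delta) + penalty(delta, x, centroids, s, t, lam, alpha)


def explain(x, centroids, s, t, lam=1.0, alpha=1.0, step=0.01, iters=300, eps=1e-5):
    """Proximal subgradient descent (illustrative solver, not the authors')."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        grad = []
        for i in range(len(x)):  # central finite differences on the hinge term
            up, down = delta[:], delta[:]
            up[i] += eps
            down[i] -= eps
            grad.append((penalty(up, x, centroids, s, t, lam, alpha)
                         - penalty(down, x, centroids, s, t, lam, alpha)) / (2 * eps))
        stepped = [d - step * g for d, g in zip(delta, grad)]
        # Soft-thresholding is the proximal step for the L1 term.
        delta = [math.copysign(max(abs(v) - step, 0.0), v) for v in stepped]
    return delta


# Two clusters that differ only in gene 0; genes 1 and 2 carry no signal.
centroids = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
x = [0.1, 0.5, 0.3]
delta = explain(x, centroids, s=0, t=1)
ranking = sorted(range(len(x)), key=lambda i: -abs(delta[i]))
```

In this toy setting the perturbation concentrates on gene 0, the only gene that distinguishes the two centroids, while the L1 term keeps the uninformative genes at zero. A gradient-based optimizer operating on the actual encoder would replace the finite-difference loop in practice.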
Analogously, in the one-vs-rest case, the objective function for the optimization is

$$\min_{\delta} \| \delta \|_1 + \lambda \max(0, \alpha + m_s(x + \delta) - \max_{t \neq s} m_t(x + \delta)) \tag{8}$$

where the second term penalizes the situation in which the group logit for the source group $s$ is larger than all non-source target groups.

Finally, with the $\delta \in \mathbb{R}^p$ obtained by optimizing either Equation 7 or Equation 8, ACE quantifies the importance of the $i$th gene relative to a perturbation from source group $s$ to target group $t$ as the absolute value of $\delta_i$, thereby inducing a ranking in which highly ranked genes are more specific to the group of interest.

4. Baseline methods

We compare ACE against six methodologically distinct baseline methods, each of which induces a ranking on genes in terms of group-specific importance, analogous to ACE.

DESeq2 (Love et al., ) is a representative statistical hypothesis testing method that tests for differential gene expression based on a negative binomial model. The main caveat of DESeq2 is that it treats each gene as independent.

The Jensen-Shannon distance (JSD) (Cabili et al., ) is a representative distribution distance-based method which quantifies the specificity of a gene to a cell group. Similar to DESeq2, JSD considers each gene independently.

Global counterfactual explanation (GCE) (Plumb et al., ) is a compressed sensing method that aims to identify consistent differences among all pairs of groups. Unlike ACE, GCE requires a linear embedding of the scRNA-seq data.

The gene relevance score (GRS) (Angerer et al., ) is a gradient-based explanation method that aims to attribute a low-dimensional embedding back to the genes. The main limitations of GRS are two-fold.
First, the embedding used in GRS is constrained to be a diffusion map, which is chosen specifically to make the gradient easy to calculate. Second, taking the gradient with respect to the embedding only indirectly measures the group differentiation, compared to taking the gradient with respect to the group difference directly, as in ACE.

SmoothGrad (Smilkov et al., ) and SHAP (Lundberg & Lee, ), which are designed primarily for classification problems, are two representative feature attribution methods. Each one computes an importance score that indicates each gene’s contribution to the clustering assignment. SmoothGrad relies on knowledge of the model, whereas SHAP does not.

5. Results

5.1. Performance on simulated data

To compare ACE to each of the baseline methods, we used a recently reported simulation method, SymSim (Zhang et al., ), to generate two synthetic scRNA-seq datasets: one “clean” dataset and one “complex” dataset. In both cases, we simulated many redundant genes, in order to adequately challenge methods that aim to detect a minimal set of informative genes. The simulation of the clean dataset uses a protocol similar to that of Plumb et al. ( ). We first used SymSim to generate a background matrix containing simulated counts from cells, genes, and five distinct clusters. We then used this background matrix to construct our simulated dataset of cells by genes. The simulated data comprises three sets of genes: causal genes, dependent genes, and noise genes.

To select the causal genes, we identified all genes that are differentially expressed by SymSim’s criteria (nDiff-EVFgene > and |log fold-change| > . ) between at least one pair of clusters, and we selected the genes that exhibit the largest average fold-change across all pairs of clusters in which the gene was differentially expressed. A UMAP embedding on these causal genes alone confirms that they are jointly capable of separating cells into their respective clusters (Fig. 2a).
Next, we simulated dependent genes, which are weighted sums of – randomly selected causal genes, with added Gaussian noise. As such, a dependent gene is highly correlated with a causal gene or with a linear combination of multiple causal genes. The weights were sampled from a continuous uniform distribution, U( . , . ), and the Gaussian noise was sampled from N( , ). As expected, the dependent genes are also jointly capable of separating cells into their respective clusters (Fig. 2a). Lastly, we found all genes that were not differentially expressed between any cluster pair in the ground truth, and we randomly sampled noise genes. These genes provide no explanation of the clustering structure (Fig. 2a).

To simulate the complex dataset, we used SymSim to add dropout events and batch effects to the background matrix generated previously. We then selected exactly the same causal and noise genes as in the clean dataset, and used exactly the same random combinations and weights to generate the dependent genes. Thus, the clean and complex datasets contain the same genes; however, the complex dataset enables us to gauge how robust ACE is to artifacts of technical noise observed in real single-cell RNA-seq datasets (Fig. 2b).

To compare the different gene ranking methods, we need to specify the ground truth cluster labels and a performance measure. We observe that the embedding representation learned by ACE exhibits clear cluster patterns even in the presence of dropout events and batch effects, and thus ACE’s k-means clustering is able to recover these clusters (Appendix Figure A. ). Accordingly, to compare different methods for inducing gene rankings, we provide ACE and each baseline method with the ground truth clustering labels from the original study (Zheng et al., ). ACE then calculates the group centroid used in Equation 3 by averaging the data points of the corresponding ground truth cluster.
The embedding layer together with the group centroids are then used to build the neuralized clustering model (Equation 6). Each method produces gene rankings for every cluster in a one-vs-rest fashion.

Figure 2. Comparing ACE to baseline methods on simulated scRNA-seq datasets. Each dataset consists of causal genes, dependent genes, and noise genes. (a) UMAP embeddings of cells composing the clean dataset. Panels correspond to embeddings using the three subsets of genes (causal, dependent, and noise), as well as all of the genes together. (b) Same as panel a, but for the complex dataset. (c) Comparison of methods via Jaccard distance as a function of the number of genes in the ranking. ACE performs substantially better than each of the baseline methods on the clean dataset. The gray dashed line indicates the mean Jaccard distance achieved by the causal genes alone. (d) Same as panel c, but for the complex dataset.

To measure how well a gene ranking captures clustering structure, we use the Jaccard distance to measure the similarity between a cell’s k nearest neighbors (k-NN) when using a subset of top-ranked genes and a cell’s k-NN when using all genes. To compute the k-NN, we use the Euclidean distance metric. The Jaccard distance is defined as

$$JD(i) = 1 - \frac{| S_{\mathrm{full}} \cap S_{\mathrm{sub}} |}{| S_{\mathrm{full}} \cup S_{\mathrm{sub}} |} \tag{9}$$

where $S_{\mathrm{full}}$ represents cell $i$’s k-NN when using all genes, and $S_{\mathrm{sub}}$ represents cell $i$’s k-NN when using a subset of top-ranked genes. If the subset of top-ranked genes does a good job of explaining a cluster of cells, then $S_{\mathrm{full}} \cap S_{\mathrm{sub}}$ and $S_{\mathrm{full}} \cup S_{\mathrm{sub}}$ should be nearly equal, and the Jaccard distance should approach 0.
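The k-NN Jaccard distance can be sketched in a few lines. This is an illustrative version under stated assumptions, not the evaluation code used in the paper: `cells` is a small dense list of expression vectors, `rankings` maps each cluster label to a one-vs-rest gene ranking, and the toy data, `n_top`, and `k` are invented for the example.

```python
def knn(cells, i, genes, k):
    """Indices of the k nearest neighbours of cell i, using only `genes`."""
    def sq_dist(j):  # squared Euclidean distance gives the same ordering
        return sum((cells[i][g] - cells[j][g]) ** 2 for g in genes)
    others = [j for j in range(len(cells)) if j != i]
    return set(sorted(others, key=sq_dist)[:k])


def jaccard_distance(cells, i, top_genes, k):
    """JD(i) = 1 - |S_full & S_sub| / |S_full | S_sub|."""
    s_full = knn(cells, i, range(len(cells[i])), k)
    s_sub = knn(cells, i, top_genes, k)
    return 1.0 - len(s_full & s_sub) / len(s_full | s_sub)


def mean_jaccard_distance(cells, labels, rankings, n_top, k):
    """Average JD over cells, using each cell's own cluster-vs-rest ranking."""
    total = 0.0
    for i, label in enumerate(labels):
        total += jaccard_distance(cells, i, rankings[label][:n_top], k)
    return total / len(cells)


# Toy data: gene 0 separates two clusters, gene 1 is uninformative.
cells = [[0.0, 0.00], [1.0, 0.30], [2.0, 0.10],
         [10.0, 0.05], [11.0, 0.35], [12.0, 0.20]]
labels = [0, 0, 0, 1, 1, 1]
good = mean_jaccard_distance(cells, labels, {0: [0, 1], 1: [0, 1]}, n_top=1, k=2)
bad = mean_jaccard_distance(cells, labels, {0: [1, 0], 1: [1, 0]}, n_top=1, k=2)
```

Ranking the informative gene first reproduces every cell's neighbourhood, so `good` is zero, whereas ranking the uninformative gene first scrambles the neighbourhoods and gives a larger `bad`. The brute-force neighbour search is quadratic in the number of cells, so a real dataset would call for an indexed or approximate k-NN search.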
We select the gene ranking used to derive a subset of top-ranked genes based on the cell cluster assignment. For example, if the cell belongs in cluster , we use the cluster vs. rest gene ranking. Thus, to obtain a global measure of how well a clustering structure is captured on a subset of top-ranked genes, we report the mean Jaccard distance across all cells.

Our analysis shows that ACE considerably outperforms each of the baseline methods on the clean dataset, indicating that it is superior at identifying the minimal set of informative genes (Fig. 2c). Notably, ACE outperforms the mean Jaccard distance achieved by the causal genes alone before reaching genes used, suggesting that the method successfully identifies dependent genes that are more informative than individual causal genes. ACE also performs strongly on the complex dataset, though it appears to perform on par with SmoothGrad and SHAP (Fig. 2d). Notably, these three methods (ACE, SHAP, and SmoothGrad) share a common feature: each employs the SAUCIE framework, which facilitates automatic batch effect correction. This highlights the utility of DNN-based dimensionality reduction and interpretation methods for single-cell RNA-seq applications.

5.2. Real data analysis

We next applied ACE to a real dataset of peripheral blood mononuclear cells (PBMCs) (Zheng et al., ), represented as a cell-by-gene log-normalized expression matrix containing cells and highly variable genes. The cells in the dataset were previously categorized into eight cell types, obtained by performing Louvain clustering (Blondel et al., ) and annotating each cluster on the basis of differentially expressed marker genes. As shown in Figure 3a and Appendix Figure A. , ACE’s k-means clustering successfully recovers the reported cell types based upon the -dimensional embedding learned by SAUCIE.

We first aimed to quantify the discriminative power of the top-ranked genes identified by ACE in comparison to the six baseline methods.
To do this, we applied all six baseline methods to the PBMC dataset using the groups identified by the k-means clustering based on the SAUCIE embedding. For each group of cells, we extracted the top-k group-specific genes reported by each method, where k ranges from %, %, ..., % among all genes. Given the selected gene subset, we then trained a support vector machine (SVM) classifier with a radial basis function kernel to separate the target group from the remaining groups. The SVM training involves two hyperparameters, the regularization coefficient C and the bandwidth parameter σ. The σ parameter is adaptively chosen so that the training data is z-score normalized, using the default settings in scikit-learn (Pedregosa et al., ). The C parameter is selected by grid search from { − , − , ..., , ..., , }. The classification performance, in terms of area under the receiver operating characteristic curve (AUROC), is evaluated by -fold stratified cross-validation, and an additional -fold cross-validation is applied within each training split to determine the optimal C hyperparameter. Finally, AUROC scores from each test split across different target groups are aggregated and reported, in terms of the mean and the standard error of the mean. Two cell types (megakaryocytes and dendritic cells) are excluded due to insufficient sample size (< ). As shown in Figure 3b, the top-ranked genes reported by ACE are among the most discriminative across all methods, particularly when the inclusion size is small (≤ %). The only method that yields superior performance is DESeq2.

Figure 3. Comparing ACE to baseline methods on PBMC dataset. (a) UMAP embedding of PBMC cells labelled by ACE’s k-means clustering assignment. (b) Classification performance of each method, as measured by AUROC, as a function of the number of genes in the set. Error bars correspond to the standard error of the mean of AUROC scores from each test split across different target groups. (c) Redundancy among the top k genes, as measured by Pearson correlation, as a function of k. Error bars correspond to the standard error of the mean calculated from the group-specific correlations. (d) The figure plots overlaps among the top genes (corresponding to % of genes) identified by all seven methods with respect to the CD T cell cluster.
[Figure 3 legend and axes: cell types in panel a are B cells, CD + monocytes, CD T cells, CD T cells, dendritic cells, FCGR A+ monocytes, megakaryocytes, and NK cells; panels b and c plot AUROC and Pearson correlation against the number of genes included, from most to least important.]

We next tested the redundancy of top-ranked genes, as it is desirable to identify diverse explanatory gene sets with minimum redundancy. Specifically, for each target group of cells, we calculate the Pearson correlations between all gene pairs within the top k genes, for varying values of k. The mean and standard error of the mean of these correlations are computed within each group and then averaged across different target groups. The results of this analysis (Figure 3c) suggest that the top-ranked genes reported by ACE are among the least redundant across all methods. Other methods that exhibit low redundancy include GRS and the two methods that use the same SAUCIE model (i.e., SmoothGrad and SHAP). In conjunction with the discriminative power analysis in Figure 3b, we conclude that ACE achieves a powerful combination of high discriminative power and low redundancy.
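The redundancy measure described above reduces to a mean pairwise Pearson correlation over the top k genes. The sketch below shows that computation for a single ranking. It is an assumption-laden illustration: `expr` is a small cells-by-genes list of lists, the signed (not absolute) correlation is averaged because the text does not say otherwise, and the aggregation across target groups is omitted.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def redundancy(expr, ranking, k):
    """Mean Pearson correlation over all gene pairs among the top k genes."""
    top = ranking[:k]
    columns = {g: [row[g] for row in expr] for g in top}
    pairs = list(combinations(top, 2))
    return sum(pearson(columns[a], columns[b]) for a, b in pairs) / len(pairs)


# Toy matrix: gene 1 is an exact multiple of gene 0, gene 2 is unrelated.
expr = [[1.0, 2.0, 1.0],
        [2.0, 4.0, -1.0],
        [3.0, 6.0, 1.0],
        [4.0, 8.0, -1.0]]
redundant = redundancy(expr, [0, 1, 2], k=2)
diverse = redundancy(expr, [0, 2, 1], k=2)
```

A ranking that places two collinear genes at the top scores a redundancy close to one, while a ranking that pairs the first gene with an unrelated one scores much lower. Genes with zero variance would need to be filtered out first, since their correlation is undefined.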
Finally, to better understand how these methods differ from one another, we investigated the consistency among the top-ranked genes reported by each method. For this analysis, we focused on one particular group, CD T cells. We discover strong disagreement among the methods (Figure 3d). Surprisingly, no single gene is selected among the top % by all methods. Among all methods, ACE covers the most genes that are reported by at least one other method ( out of genes). The four genes that ACE uniquely identifies (red bar in Figure 3d), namely CCL , GZMK, SPOCD , and SNRNP , are depleted rather than enriched relative to other cell types. It is worth mentioning that both CCL and GZMK are enriched in CD T cells (Thul et al., ), the closest cell type to CD T cells (Figure 3a). This observation suggests that ACE identifies genes that exhibit highly discriminative changes in expression between two closely related cell types. Indeed, among ACE’s -gene panel, genes are depleted rather than enriched, suggesting that much of CD ’s cell identity may be due to inhibition rather than activation of specific genes. In summary, ACE is able to move away from the notion of a “marker gene” to instead identify a highly discriminative, nonredundant gene panel.

5.3. Image analysis

Although we developed ACE for application to scRNA-seq data, we hypothesized that the method would be useful in domains beyond biology. Explanation methods are potentially useful, for example, in the analysis of biomedical images, where the explanations can identify regions of the image responsible for assignment of the image to a particular phenotypic category. As a proof of principle for this general domain, we applied ACE to the MNIST handwritten digits dataset (LeCun, ), with the aim of studying whether ACE can identify which pixels in a given image explain why the image was assigned to one digit versus another.
specifically, we solve the optimization problem in equation for each input image, seeking an image-specific set of pixel modifications, subject to the constraint that the perturbed image pixel values are restricted to lie in the range [ , ]. note that this task is somewhat different from the scrna-seq case: in the mnist case, ace finds a different set of explanatory pixels for each image, whereas in the scrna-seq case, ace seeks a single set of genes that explains label differences across all cells in the dataset.

figure . applying ace to the mnist dataset. ace is able to explain types of digit transitions in a pixel-wise manner. these digit transitions are chosen such that each digit category is covered at least once in both directions. (panel annotations: perturbation range; pixels in initial digit.)

ace was applied to this dataset as follows. we used a simple convolutional neural network architecture containing two convolution layers, each with a modest filter size ( , ), a modest number of filters ( ) and relu activation, followed by a max pooling layer with a pool size ( , ), a fully connected layer, and a softmax layer. the model was trained on the mnist training set ( , examples) for epochs, using adam (kingma & ba, ) with an initial learning rate of . . the network achieves . % classification accuracy on the test set of , images. we observe that the embedding representation in the last pooling layer exhibits well-separated cluster patterns (appendix figure a. ).
since our goal is not to learn the cluster structure per se, for simplicity, we fixed the number of groups to be the number of digit categories (i.e., ) and calculated the group centroid used in equation by averaging the data points of the corresponding category. the embedding layer and the group centroids are then used to build the neuralized clustering model (equation ).

the results of this analysis show that ace does a good job of identifying sets of pixels that accurately explain differences between pairs of digits. we examined the pixel-wise explanations of pairs of digits, randomly selected to cover each digit category at least once in both directions (fig. ). for example, to convert “ ” to “ ,” ace disconnects the top right and bottom left of “ ,” as expected. similarly, to convert “ ” to “ ,” ace disconnects the top left and bottom left of “ .” it is worth noting that the modifications introduced by ace are inherently symmetric. for example, to convert “ ” to “ ” and back again, ace suggests adding and removing the same part of “ .”

discussion and conclusion

in this work, we have proposed a deep learning-based scrna-seq analysis pipeline, ace, that projects scrna-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters. compared to existing state-of-the-art methods, ace jointly takes into consideration both the nonlinear embedding of cells to a low-dimensional representation and the intrinsic dependencies among genes. as such, the method moves away from the notion of a “marker gene” to instead identify a panel of genes. this panel may include genes that are not only enriched but also depleted relative to other cell types, as well as genes that exhibit important differences between closely related cell types. our experiments demonstrate that ace identifies gene panels that are highly discriminative and exhibit low redundancy.
we also provide results suggesting that ace’s approach may be useful in domains beyond biology, such as image recognition.

this work points to several promising directions for future research. in principle, ace can be used in conjunction with any off-the-shelf scrna-seq embedding method. thus, empirical investigation of the utility of generalizing ace to use embedders other than saucie would be interesting. another possible extension is to apply neuralization to alternative clustering algorithms. for example, in the context of scrna-seq analysis the louvain algorithm (blondel et al., ) is commonly used and may be a good candidate for neuralization. a promising direction for future work is to provide confidence estimation for the top-ranked group-specific genes, in terms of q-values (storey, ), with the help of the recently proposed knockoffs framework (barber & candès, ; lu et al., ).

references

abid, a., balin, m. f., and zou, j. concrete autoencoders for differentiable feature selection and reconstruction. international conference on machine learning, .
amodio, m., dijk, d. v., srinivasan, k., chen, w. s., mohsen, h., moon, k. r., campbell, a., zhao, y., wang, x., venkataswamy, m., and krishnaswamy, s. exploring single-cell data with deep multitasking neural networks. nature methods, pp. – , .
angerer, p., fischer, d. s., theis, f. j., scialdone, a., and marr, c. automatic identification of relevant genes from low-dimensional embeddings of single cell rnaseq data. bioinformatics, .
barber, r. f. and candès, e. j. controlling the false discovery rate via knockoffs. the annals of statistics, ( ): – , .
becht, e., mcinnes, l., healy, j., dutertre, c., kwok, i.
w. h., ng, l. g., ginhoux, f., and newell, e. w. dimensionality reduction for visualizing single-cell data using umap. nature biotechnology, ( ): – , .
blondel, v. d., guillaume, j.-l., lambiotte, r., and lefebvre, e. fast unfolding of communities in large networks. journal of statistical mechanics: theory and experiment, ( ):p , .
cabili, m. n., trapnell, c., goff, l., koziol, m., tazon-vega, b., regev, a., and rinn, j. l. integrative annotation of human large intergenic noncoding rnas reveals global properties and specific subclasses. genes dev, ( ): – , .
carlini, n. and wagner, d. towards evaluating the robustness of neural networks. in ieee symposium on security and privacy (sp), pp. – . ieee, .
chang, c., creager, e., goldenberg, a., and duvenaud, d. explaining image classifiers by counterfactual generation. arxiv preprint arxiv: . , .
fong, r. and vedaldi, a. interpretable explanations of black boxes by meaningful perturbation. in proceedings of the ieee international conference on computer vision, pp. – , .
hu, j., li, x., hu, g., lyu, y., susztak, k., and li, m. iterative transfer learning with neural network for clustering and cell type classification in single-cell rna-seq analysis. nature machine intelligence, ( ): – , .
kauffmann, j., esders, m., montavon, g., samek, w., and müller, k. from clustering to cluster explanations via neural networks. arxiv preprint arxiv: . , .
kauffmann, j., müller, k., and montavon, g. towards explaining anomalies: a deep taylor decomposition of one-class models. pattern recognition, : , .
kingma, d. and ba, j. adam: a method for stochastic optimization. in proceedings of the rd international conference on learning representations, .
kurakin, a., goodfellow, i., and bengio, s. adversarial examples in the physical world. arxiv preprint arxiv: . , .
lecun, y. the mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, .
li, x., wang, k., lyu, y., pan, h., zhang, j., stambolian, d., susztak, k., reilly, m. p., hu, g., and li, m. deep learning enables accurate clustering with batch effect removal in single-cell rna-seq analysis. nature communications, ( ): – , .
lopez, r., regier, j., cole, m. b., jordan, m. i., and yosef, n. deep generative modeling for single-cell transcriptomics. nature methods, ( ): – , .
love, m., huber, w., and anders, s. moderated estimation of fold change and dispersion for rna-seq data with deseq2. genome biology, ( ), .
lu, y. y., fan, y., lv, j., and noble, w. s. deeppink: reproducible feature selection in deep neural networks. in advances in neural information processing systems, .
lundberg, s. and lee, s. a unified approach to interpreting model predictions. advances in neural information processing systems, .
madry, a., makelov, a., schmidt, l., tsipras, d., and vladu, a. towards deep learning models resistant to adversarial attacks. arxiv preprint arxiv: . , .
mcinnes, l. and healy, j. umap: uniform manifold approximation and projection for dimension reduction. arxiv, .
pedregosa, f., varoquaux, g., gramfort, a., michel, v., thirion, b., grisel, o., blondel, m., prettenhofer, p., weiss, r., dubourg, v., vanderplas, j., passos, a., cournapeau, d., brucher, m., perrot, m., and duchesnay, e. scikit-learn: machine learning in python. journal of machine learning research, : – , .
pliner, h. a., shendure, j., and trapnell, c. supervised classification enables rapid annotation of cell atlases. nature methods, ( ): – , .
plumb, g., terhorst, j., sankararaman, s., and talwalkar, a. explaining groups of points in low-dimensional representations.
icml, .
ribeiro, m., singh, s., and guestrin, c. "why should i trust you?": explaining the predictions of any classifier. in proceedings of the nd acm sigkdd international conference on knowledge discovery and data mining, kdd ’ , pp. – , new york, ny, usa, . acm.
samek, w., montavon, g., lapuschkin, s., anders, c. j., and müller, k. r. toward interpretable machine learning: transparent deep neural networks and beyond. arxiv preprint arxiv: . , .
shrikumar, a., greenside, p., shcherbina, a., and kundaje, a. learning important features through propagating activation differences. in international conference on machine learning, .
simonyan, k., vedaldi, a., and zisserman, a. deep inside convolutional networks: visualising image classification models and saliency maps. arxiv preprint arxiv: . , .
smilkov, d., thorat, n., kim, b., viégas, f., and wattenberg, m. smoothgrad: removing noise by adding noise. arxiv preprint arxiv: . , .
storey, j. d. the positive false discovery rate: a bayesian interpretation and the q-value. the annals of statistics, ( ): – , .
stuart, t. and satija, r. integrative single-cell analysis. nature reviews genetics, : – , .
sundararajan, m., taly, a., and yan, q. axiomatic attribution for deep networks. in international conference on machine learning, .
szegedy, c., zaremba, w., sutskever, i., bruna, j., erhan, d., goodfellow, i., and fergus, r. intriguing properties of neural networks. arxiv preprint arxiv: . , .
thul, p., Åkesson, l., wiking, m., mahdessian, d., geladaki, a., blal, h., alm, t., asplund, a., björk, l., breckels, l., et al. a subcellular map of the human proteome. science, ( ), .
van der maaten, l. and hinton, g. visualizing data using t-sne. journal of machine learning research, ( - ): , .
way, g. and greene, c. bayesian deep learning for single-cell analysis. nature methods, ( ): – , .
welch, j., hartemink, a., and prins, j.
slicer: inferring branched, nonlinear cellular trajectories from single cell rna-seq data. genome biology, ( ): – , .
welch, j. d., hartemink, a. j., and prins, j. f. matcher: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. genome biology, ( ): , .
xu, c., lopez, r., mehlman, e., regier, j., jordan, m., and yosef, n. probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. molecular systems biology, ( ):e , .
xu, h., ma, y., liu, d., liu, h., tang, j., and jain, a. adversarial attacks and defenses in images, graphs and text: a review. international journal of automation and computing, ( ): – , .
zhang, x., xu, c., and yosef, n. simulating multiple faceted variability in single cell rna sequencing. nature communications, ( ): – , .
zheng, g. x. y., terry, j. m., belgrader, p., ryvkin, p., bent, z. w., wilson, r., ziraldo, s. b., wheeler, t. d., mcdermott, g. p., zhu, j., gregory, m. t., shuga, j., montesclaros, l., underwood, j. g., masquelier, d. a., nishimura, s. y., schnall-levin, m., wyatt, p. w., hindson, c. m., bharadwaj, r., wong, a., ness, k. d., beppu, l. w., deeg, h. j., mcfarland, c., loeb, k. r., valente, w. j., ericson, n. g., stevens, e. a., radich, j. p., mikkelsen, t. s., hindson, b. j., and bielas, j. h. massively parallel digital transcriptional profiling of single cells. nature communications, : , .

(umap scatter plot; points colored by digit label.)
figure a. .
the embedding representation in the last pooling layer of the convolutional neural network exhibits well-separated cluster patterns among digits on the mnist dataset.

figure a. . the embedding representation learned by saucie exhibits well-separated cluster patterns on both (a) clean and (b) complex simulated scrna-seq datasets. (umap scatter plots; points colored by group.)

figure a. . the embedding representation learned by saucie exhibits similar cluster patterns by using either (a) the louvain algorithm or (b) k-means clustering on the pbmc dataset.

patient-specific cell communication networks associate with disease progression in cancer

david l gibbs , boris aguilar , vésteinn thorsson , alexander v ratushny , ilya shmulevich
institute for systems biology, terry avenue north, seattle, wa , usa; bristol-myers squibb, dexter avenue north, suite , seattle, wa , usa
correspondence: david l gibbs david.gibbs@isbscience.org

abstract
the maintenance and function of tissues in health and disease depend on cell-cell communication. this work shows how high-level features, representing cell-cell communication, can be defined and used to associate certain signaling 'axes' with clinical outcomes.
using cell-sorted gene expression data, we generated a scaffold of cell-cell interactions and defined a probabilistic method for creating per-patient weighted graphs based on gene expression and cell deconvolution results. with this method, we generated over , graphs for tcga patient samples, each representing likely channels of intercellular communication in the tumor microenvironment. particular edges were strongly associated with disease severity and progression, in terms of survival time and tumor stage. within individual tumor types, there are predominant cell types, and the collection of associated edges was found to be predictive of clinical phenotypes. additionally, genes associated with differentially weighted edges were enriched in gene ontology terms associated with tissue structure and immune response. code, data, and notebooks are provided to enable the application of this method to any expression dataset (https://github.com/ilyalab/pan-cancer-cell-cell-comm-net).

keywords
networks, cell communication, immuno-oncology, computational oncology, bioinformatics, systems biology

biorxiv preprint (not certified by peer review), made available under a cc-by international license.

introduction
the maintenance and function of tissues depend on cell-cell communication (wilson et al., ; haass and herlyn, ). while cell communication can take place through physically binding cell membrane surface proteins, cells also release ligand molecules that diffuse and bind to receptors on other cells
(paracrine or endocrine), or even the same cell (autocrine), triggering a signaling cascade that can potentially activate a gene regulatory program (cameron and kelvin, ; heldin et al., ; cohen and nelson, ). more generally, a message is sent and received, transferring some information as part of a large network (frankenstein et al., ). cells communicate in order to coordinate activity, for example to respond correctly (and jointly) to environmental changes (song et al., ). altered cellular communication can cause disease, and conversely diseases can alter communication (wei et al., ).

cancer, once thought of as purely a disease of genetics, is now recognized as being enmeshed in complex cellular interactions within the tumor microenvironment (tme) (trosko and ruch, ). these cell-cell interactions are important for cell differentiation, tumor growth (west and newton, ), and response to therapeutics (kumar et al., ). between cells, information transfer is directional: cells produce molecules that are received by the properly paired, and expressed, receptor. there is often a sender and a receiver, which makes the cell-cell networks directionally linked by molecules. the dynamics of the signal are highly important (fridman et al., ; behar et al., ), but unfortunately are difficult to detect in bulk sequencing experiments.

one approach to studying cell interactions is through the use of graphical models of communication networks (morel et al., ). by incorporating experimental data, the graphical models can become quantitative, providing predictions that can be tested and used in discovering novel drug targets and developing optimal intervention strategies. in recent work (thorsson et al., ), we developed a method to identify cellular communication networks at work in the tumor microenvironment.
given a set of samples with a similar tumor microenvironment, the method identified ligands, receptors and cells meeting certain criteria of abundance and concordance within that set of samples. the method was applied to identify networks playing a role within specific tumor types and molecular subtypes and is available as a workflow and interactive module on the iatlas portal for immuno-oncology (eddy et al., ).

in this work, we have combined multiple sources of data with a new probabilistic method for constructing patient-specific cell-cell communication networks (figure ). in total, we built networks for , samples in the cancer genome atlas (tcga), starting from a network of cell types and , ligand-receptor pairs. this is a rich feature set from which to investigate biological alterations in cell communication within the tumor microenvironment. we identified informative network features that are associated with disease progression. the method can be applied to any cancer type, but in this manuscript we focus on a selection of cancer types with very high mortality rates, including pancreatic adenocarcinoma (paad), melanoma (skcm), lung squamous cell carcinoma (lusc), and cancers of the gastrointestinal tract (esca, stad, coad, read) (cancer genome atlas network, ).

this represents a new method that provides information on possible modes of intercellular signaling in the tme, something that is currently lacking. while there are many methods for gene set scoring, cellular abundance estimation, and differential expression, there are still few ways to investigate cell-cell communication diversity in the tme with respect to patient outcomes. fortunately, new databases of receptor-ligand pairs are becoming available, making work in this area possible (efremova et al., ; jin et al., ; nath and leier, ; shao et al., ). the methods, code, data, and complete results are available and open to all researchers (https://github.com/ilyalab/pan-cancer-cell-cell-comm-net).
methods

data aggregation and integration

data sources including tcga and cell-sorted gene expression, bulk tumor expression, cell type scores, cell-ligand and cell-receptor presence estimations were used for network construction and probabilistic weighting on a per-sample basis. each tumor sample is composed of a mixture of cell types including tumor, immune, and stromal cells. recently, methods have been developed to 'deconvolve' mixed samples into estimated fractions of cell type quantities. for example, xcell, which resembles gene set enrichment, has performed this estimation for cell types across most tcga samples (aran et al., ). we use these xcell estimates of cellular fractions in this work. ramilowski et al.
performed a comprehensive survey of cellular communication, generating a compendium that includes , ligand-receptor pairs, and a mapping between cell types and expression of ligand or receptor molecules (ramilowski et al., ). the compendium was shared via the th edition of the fantom project, fantom . these ligand-receptor pairs were adopted for this study. unfortunately, the fantom collection of cell types does not overlap well with cell types in xcell. in order to integrate the xcell and fantom data resources, it was necessary to determine the expressed ligands and receptors for each of the cell types in xcell, using the source gene expression data. the xcell project used six public cell-sorted bulk gene expression data sets in order to generate gene signatures and score each tcga sample. across the data sets, there is some discrepancy in cell type nomenclature, making it necessary to manually curate cell type names to improve alignment across experiments (supplementary table ). typically, for a given cell type, there are several replicate expression profiles, often across the data sets.

building the cell-cell communication network scaffold

in the fantom 'draft of cellular communication', an expression threshold of tpms was used to link a cell type to a ligand or receptor. when considering the distribution of expression in the fantom project, tpms is close to the median. to construct our scaffold, we used a majority voting scheme based on comparing expression levels to median levels. for each cell type, paired with ligands and receptors, if the expression level was greater than the median, it was counted as a vote (i.e., ligand expressed in this cell type). if a ligand or receptor received a majority vote across all available data sources, it was accepted, and entered into the cell-cell scaffold. with this procedure, a network scaffold is induced, where cells produce ligands that bind to receptors on receiving cells.
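the voting scheme above can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of `None` for a data source that lacks the cell type, and the rule that a tie does not count as a majority are all choices made for the example.

```python
def passes_majority_vote(expression, medians):
    """True if expression exceeds the median in a strict majority of sources.

    expression : one value per data source for a (cell type, gene) pair;
                 None marks a source without this cell type (assumed encoding)
    medians    : median expression level of each data source
    """
    votes = 0
    available = 0
    for value, median in zip(expression, medians):
        if value is None:
            continue                      # source unavailable: no vote cast
        available += 1
        if value > median:
            votes += 1
    return available > 0 and votes > available / 2

def build_scaffold(ligand_cells, receptor_cells, pairs):
    """Enumerate (sender cell, ligand, receptor, receiver cell) edges.

    ligand_cells / receptor_cells map a gene name to the set of cell types
    accepted for it by the vote; pairs lists known ligand-receptor pairs.
    """
    edges = []
    for ligand, receptor in pairs:
        for sender in sorted(ligand_cells.get(ligand, ())):
            for receiver in sorted(receptor_cells.get(receptor, ())):
                edges.append((sender, ligand, receptor, receiver))
    return edges

# hypothetical toy inputs, for illustration only
accepted = passes_majority_vote([12.0, 3.0, 15.0], [10.0, 10.0, 10.0])
tied = passes_majority_vote([12.0, 3.0, None], [10.0, 10.0, 10.0])
edges = build_scaffold({"L1": {"A", "B"}}, {"R1": {"C"}}, [("L1", "R1")])
```

because every accepted sender is crossed with every accepted receiver for each ligand-receptor pair, the number of edges grows multiplicatively, which is consistent with a scaffold containing a very large number of edges.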
one edge in the network is composed of the components cell - ligand - receptor - cell. this produced a cell-cell communication network with over m edges. each edge represents a possible interaction in the tumor microenvironment. we subsequently determine the probability that an edge is active in a particular patient sample using a probabilistic method described below.

patient level cell-cell communication network weights

with a cell-cell scaffold, expression values and cell type estimations per sample, we can produce a per-sample weighted cell-cell communication network (figure ). this is done probabilistically, using the following definition:

$P(e_i) = P(l_a, c_l) \cdot P(r_b, c_r)$, (eq. )

where $e_i$ is edge i, $l_a$ is ligand a, $r_b$ is receptor b, and $c_l$ and $c_r$ are cells that can produce ligand a and receptor b respectively. $P(e_i)$ represents a probability that edge i is active and is based on the premise that the physical and biochemical link and activation is possible only if all the components are present, and that activity becomes increasingly possible with greater availability of those components. the joint probabilities can be decomposed to:

$P(l_a, c_l) = P(l_a \mid c_l)\, P(c_l)$ and $P(r_b, c_r) = P(r_b \mid c_r)\, P(c_r)$. (eq. )

the $P(c_{lk})$ is short for the cdf value $P(C_l < c_{lk})$, which indicates the probability that a randomly sampled value from the empirical $C_l$ distribution (over all k tcga samples) would be less than the cell estimate for cell type l, in sample k.
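the definitions above can be sketched numerically as follows. this is a hedged illustration, not the published code: the function names are hypothetical, the empirical cdf uses "less than or equal" (the convention of r's `ecdf`), and the conditional term is estimated within quantile bins of cell quantity, with four bins assumed here.

```python
import numpy as np

def ecdf_value(values, query):
    """Empirical CDF at `query`: fraction of observed values <= query."""
    return float(np.mean(np.asarray(values, dtype=float) <= query))

def joint_probability(cell_scores, gene_expr, k, n_bins=4):
    """P(gene, cell) = P(gene | cell) * P(cell) for sample index k.

    cell_scores : cell-type estimate for every sample (one cell type)
    gene_expr   : expression of one ligand or receptor gene in every sample
    """
    cell_scores = np.asarray(cell_scores, dtype=float)
    gene_expr = np.asarray(gene_expr, dtype=float)
    p_cell = ecdf_value(cell_scores, cell_scores[k])
    # assign each sample to a quantile bin of cell quantity
    cuts = np.quantile(cell_scores, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(cuts, cell_scores, side="right")
    same_bin = bins == bins[k]
    # conditional term: ECDF of expression among samples in the same bin
    p_gene_given_cell = ecdf_value(gene_expr[same_bin], gene_expr[k])
    return p_gene_given_cell * p_cell

def edge_weight(lig_cell, lig_expr, rec_cell, rec_expr, k):
    """Edge probability for sample k as the product of the two joint terms."""
    return (joint_probability(lig_cell, lig_expr, k)
            * joint_probability(rec_cell, rec_expr, k))

# toy data: eight samples with increasing cell scores and expression
cells = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
genes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
low = joint_probability(cells, genes, k=0)
high = edge_weight(cells, genes, cells, genes, k=7)
```

in this toy setting the sample with the lowest cell score and expression gets a small joint probability, while the sample with the highest values gets an edge weight of 1, matching the premise that activity becomes more plausible as the components become more available.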
to do this, for a given cell type, using all samples available, an empirical distribution $P(C_l)$ is computed, and for any query value $c_{lk}$, the probability can be found by integrating from to $c_{lk}$. to compute $P(l_a \mid c_l)$, each $C_l$ distribution was divided into quartiles, and then (again using the k samples) empirical gene expression distributions within each quartile were fit. this expresses the probability that, given an observed cell quantity (and thus a quartile), a randomly selected gene expression value (for gene $l_a$) would be lower than what is observed in sample k.

we refer to the probability shown in eq. ( ) as the "edge weight". to compute edge weights, each tcga sample was represented as a column vector of gene expression and a column vector of cell quantities (or enrichments). for each edge in the scaffold (cell-ligand-receptor-cell), data was used to look up probabilities using the defined empirical distributions, and the product was taken as the resulting edge weight probability. this leads to over k tumor-specific weighted networks, one for each tcga participant.

probability distributions were precomputed using the r language empirical cumulative distribution function (ecdf). for example, fitting p(cd t cells) is done by taking all available estimates across the pan-cancer samples and computing the ecdf. then, for a sample k, we find $P(C_l < c_{lk})$ using the ecdf. the same technique is used to find the conditional probability functions, where for each gene, the expression values are selected after binning samples using the r function 'quantile', and then used to compute the ecdf. with all distributions precomputed, . billion joint probability functions were
computed using an hpc environment, then transferred to a google bigquery table where analysis proceeded. this table of network weights was structured so that each row contained one weight from one edge and one tumor sample. although the table is large ( . billion rows, nearly gb), bigquery allows for fast analytical queries that can produce statistics using a selection of standard mathematical functions.

association of network features and survival-based phenotypes

the s statistic is a robust measure based on the difference of medians (yahaya et al., ; ahad et al., ; babu et al., ; hubert et al. ), in this case the median of edge weights for a defined phenotypic group. s statistics were computed using the nci cancer research data commons cloud resource, the isb-cgc, per tissue type.

$$S = \frac{\mathrm{median}(X) - \mathrm{median}(Y)}{\sqrt{\,.\ \mathrm{MAD}(X) + .\ \mathrm{MAD}(Y)}}$$

this statistic allowed for cell-cell interactions to be ranked within a defined context. the results were again saved to bigquery tables to allow for further cloud-based analysis and integration with underlying data. to judge the magnitude of the statistic with respect to a random context (figure ), an ensemble of three edge-weight sample-pools was generated, each with k weights. then, for each member of the ensemble, million s statistics were generated using sample sizes that match the analyzed data. these random s statistic distributions were compared to the observed results (i.e., a resampling procedure).

as an initial examination of the interplay of cell communication and disease, two proxies of disease severity were investigated: progression-free interval (pfi) and tumor stage (liu et al., ). the staging variable used the ajcc pathologic tumor stage. the pfi feature was computed using days until a progression event.
the staging variable was binarized by binning stages i-ii together (“early stage”), and iii-iv together (“late stage”). a binary pfi variable was created by computing the median pfi on non-censored samples and then applying the split to all samples. both clinical features were computed by tissue type (tcga study). as liu et al. write, "the event time is the shortest period from the date of initial diagnosis to the date of an event. the censored time is from the date of initial diagnosis to the date of last contact or the date of death without disease." for example, in lusc, the median time to pfi event was days ( . months) and in the censored group, days ( . months). after splitting samples at days ( months), the short pfi group was composed of uncensored samples and censored samples. the long pfi group was composed of uncensored samples and censored samples.

null distributions, using these same sample sizes (e.g., one group of and another group of ), were generated by repeatedly drawing from the previously described ensemble of three sample-pools. the distributions, while heavy tailed, were close to normal (supplemental figure ). the s statistics scale with the difference in median values (supplemental figure ).
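the s statistic and its resampled null distribution can be sketched as follows. this is an illustration under stated assumptions: the numeric coefficients on the mad terms are not recoverable from the text here, so `mad_coef` is a placeholder defaulting to 1; the function names are hypothetical; and drawing without replacement from a single pool is an assumption about the resampling scheme.

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

def s_statistic(x, y, mad_coef=1.0):
    """Difference of medians scaled by the root of the summed MAD terms.

    mad_coef is a placeholder for the coefficient in the source formula.
    Returns nan when both groups have zero MAD.
    """
    denom = np.sqrt(mad_coef * mad(x) + mad_coef * mad(y))
    if denom == 0:
        return float("nan")
    return float((np.median(x) - np.median(y)) / denom)

def null_s_distribution(pool, n_x, n_y, n_draws, seed=0):
    """S statistics for random groups of sizes n_x and n_y drawn from a pool."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(pool, dtype=float)
    out = np.empty(n_draws)
    for i in range(n_draws):
        draw = rng.choice(pool, size=n_x + n_y, replace=False)
        out[i] = s_statistic(draw[:n_x], draw[n_x:])
    return out

# toy check with hand-computable values
s_same = s_statistic([1, 2, 3], [1, 2, 3])
s_shift = s_statistic([3, 4, 5], [1, 2, 3])
null = null_s_distribution(np.linspace(0.0, 1.0, 200), n_x=20, n_y=30, n_draws=50)
```

an observed s statistic can then be compared against a high percentile of `null` (for example via `np.percentile`) to decide whether an edge stands out from the random context.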
after combining resampled statistics across the ensemble, an edge weight was considered high if it was in the top millionth percentile when compared to the null. each tissue and contrast generates a weighted subgraph of the starting scaffold, which is retained for further analysis (e.g., a lusc-pfi network).

to identify informative cell-cell edges that relate to disease progression, machine learning models were trained on binarized clinical data as described. with clinical features such as progression-free interval (pfi) and tumor stage for each sample, a matrix of patient-specific edge weights was constructed representing each tissue and contrast. classification of samples was performed with xgboost classifiers (chen and guestrin, ), which are composed of an ensemble of tree classifiers. to avoid overfitting the models, the tree depth was set at a maximum of and the early-stopping parameter was set at rounds (training was stopped after the classification error did not improve on a test set for two rounds). xgboost provides methods for determining the information gain of each feature in the model, which were used to rank the edges that are most informative for classification.

gene ontology (go) term enrichment was performed using the gonet tool (pomaznoy et al., ). the set of , genes in the cell-cell scaffold was used as the enrichment background. gonet builds on the "goenrich" software package, which maps genes onto terms and propagates them up the go graph, performs fisher's exact tests, and moderates results with fdr. to compare the results, random collections of genes were generated from the cell-cell scaffold and produced no significant results.

results

the scaffold network graph is heterogeneous, containing nodes representing cells, ligands (e.g. cytokines), and receptors. edges are directed, following communication routes from cell to cell.
to simplify the graph, however, each route in which a cell produces a ligand that binds a receptor found on another cell type is treated as a single edge, "lcell-ligand-receptor-rcell". in total, there were , , cell-cell edges in the network. the number of edges for ligand-producing cells varies from , for osteoblasts to , for multipotent progenitors (mpps). for receptor-producing cells, the range spans from , for platelets to , edges for mpps. applying the proposed probabilistic framework allowed for the creation of , weighted networks. the edge weight distributions generally follow an approximately exponentially decreasing function (supplemental figure ). there are few edges with strong weights and many with low (near zero) weights.

we first sought to find communication edges that were most characteristic of an individual tumor type. the s statistics comparing one tumor type to all other tissue types were computed, with a high score indicating a substantial difference in edge weights between the two groups. edges were found that clearly delineated tissues (figure ). for example, in skcm (skin cutaneous melanoma), the top scoring edge is between melanocytes, the cell of origin for cutaneous melanoma (melanocytes-mia-cdh -melanocytes, s score . , median edge weight . higher than in other tumor types). normal tissue differences can contribute to differences in edge weights, though in this case the central role of melanocytes in melanomas implies that the high scores are likely due to cancerous cell activity.
the study with the most similar edge weights is uveal melanoma (uvm), which arises from melanocytes resident in the uveal tract (robertson et al., ) (fig. a). additionally, we observed that when a cell type is highly prevalent in a particular tissue, and the scaffold has an autocrine loop, interactions between cells of that type tend to have high weights. if we exclude cell types communicating with self-types, then for skcm, osteoblasts, natural killer t cells, and mesenchymal stem cells (mscs) interact with melanocytes in the top scoring edges, consistent with the emerging role of these cell types in melanoma. an important role for osteoblasts is now coming to light for melanoma (ferguson et al., ). natural killer t cells are being investigated for their applicability in immunotherapy of cancers such as melanoma (wolf et al., ). mscs appear to interact with melanoma cells: zhang et al. (zhang et al., ) showed that mscs inhibited the proliferation of a cells (a melanoma cell line) and arrested their cell cycle, and that cell-cell signaling related to nf-κb was down-regulated. overall, the number of high weight edges in each tumor type did not associate with the number of samples, as might be expected (supplemental figure ).

to identify which elements of cellular communication networks might be associated with clinical progression of particular tumor types, we identified edges associated with disease. disease progression and severity were examined using dichotomous values of tumor stage and progression-free interval (pfi) as described in the methods. statistical scores were calculated comparing edge weight distributions between the two clinical groups using s . results were carried forward if larger than the threshold set by the millionth percentile of resampled statistics (supplemental figure - ), yielding differentially weighted edges (dwes).
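the thresholding step that turns observed s statistics into dwes can be sketched as follows. this is a hedged illustration, not the authors' code: the function names and the `tail_fraction` parameter are hypothetical (the exact percentile is not legible in this copy), and comparing absolute values, so that strongly negative scores are also kept, is an assumption.

```python
def upper_tail_threshold(null_stats, tail_fraction):
    """smallest absolute null value inside the top `tail_fraction` of the null."""
    if not 0 < tail_fraction < 1:
        raise ValueError("tail_fraction must lie strictly between 0 and 1")
    ordered = sorted(abs(s) for s in null_stats)
    # number of null values that make up the upper tail (at least one)
    n_tail = max(1, round(len(ordered) * tail_fraction))
    return ordered[len(ordered) - n_tail]


def select_dwes(observed, null_stats, tail_fraction):
    """keep edges whose |s| is larger than the null threshold.

    `observed` maps an edge identifier to its s statistic.
    """
    threshold = upper_tail_threshold(null_stats, tail_fraction)
    return {edge: s for edge, s in observed.items() if abs(s) > threshold}
```

with a resampled null of millions of statistics, `tail_fraction` can be made very small while still leaving enough null values in the tail to define a stable threshold.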
most tumor types showed dwes for pfi, and fewer for the early to late tumor stage comparison (supplemental figure ). for example, stad (gastric adenocarcinoma) had several hundred edges for both comparisons, while paad (pancreatic adenocarcinoma) showed fewer dwes, and only for pfi. figure shows median edge weights between the two groups for the selected studies. some tumor types, like skcm, show much stronger deviations between the medians than other studies like stad, esca, and lusc, which may be an indication of a stronger immune response. according to cri-iatlas (eddy et al., ), among our example studies, skcm has the highest estimated level of cd t cells and generally has a robust immune response.

tumor stage comparison showed dwes in of studies and ranged widely from edges for meso (mesothelioma) to over k edges for blca (bladder urothelial carcinoma). the pfi comparison showed results in / studies and ranged from edges in read to over k in lihc. see table for edge counts from selected studies. the studies with larger numbers of samples had permuted s distributions that were narrow compared to studies with few samples (supp. fig. ), but there was not a strong association between dwe counts and sample sizes. the variation thus more likely has to do with clinical factors.

within a tumor type and clinical response variable, the set of high scoring edges was usually dominated by a small number of cell types, ligands, or receptors (figure , supplemental figure a, b). for skcm, in the tumor stage contrast, the majority of ligand-producing cells are gmp cells, osteoblasts, mscs, and melanocytes, in order of prevalence. the number of edges starting with these four cells
accounts for % of dwes. certainly, melanocytes are well known in melanoma, and mesenchymal stem cells are drawn to inflammation. the role of osteoblasts is less well documented, although they have been associated with melanoma progression (ferguson et al., ).

in the pfi contrast with gastroesophageal cancers, megakaryocytes are the most common cell type in stad dwes ( edges out of ), and the second most common in esca ( edges out of , following cd + tcm interactions). the megakaryocyte dwes include ligands and receptors that represent both interleukins and ecm-associated molecules such as integrins and collagen, but also notch and pf (platelet factor ). for stad, most edge weights are lower with longer pfi. put another way, the shorter pfi intervals (adverse outcome) were associated with increased megakaryocyte-involved edge weights (supplemental figure ). however, the opposite is observed in esca, where higher edge weights were generally associated with longer pfi (negative s score). in esca, edges that show high weights for short pfi include neutrophils-hmgb -sdc -sebocytes ( . ). although esca has a much lower xcell mean megakaryocyte score than stad ( % lower), the xcell cell scores follow opposite trends, with stad decreasing with longer pfi and esca increasing with pfi. stad is among the tissues with the highest megakaryocyte scores ( , th rank out of for pfi , resp.), while esca is at a respectable rank of and out of for short-pfi, respectively.
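the tallies quoted above, such as which cell types start the most dwes and what share of the dwes they account for, can be sketched with a simple counter. this is an illustrative sketch only: the function name is hypothetical, edges are assumed to be stored as (lcell, ligand, receptor, rcell) tuples, and the toy edges use placeholder ligand and receptor labels rather than real results.

```python
from collections import Counter

# position of each cell field in an (lcell, ligand, receptor, rcell) tuple
ROLE_INDEX = {"lcell": 0, "rcell": 3}


def dominant_cells(edges, role="lcell", top=4):
    """most frequent cell types in one role, and their share of all edges."""
    if not edges:
        raise ValueError("no edges to summarize")
    counts = Counter(edge[ROLE_INDEX[role]] for edge in edges)
    leaders = counts.most_common(top)
    share = sum(n for _, n in leaders) / len(edges)
    return leaders, share


# toy edges; ligand and receptor labels are placeholders, not real results
toy_dwes = [
    ("megakaryocytes", "ligand_a", "receptor_a", "neutrophils"),
    ("megakaryocytes", "ligand_b", "receptor_b", "megakaryocytes"),
    ("neutrophils", "ligand_c", "receptor_c", "megakaryocytes"),
    ("gmp", "ligand_d", "receptor_d", "msc"),
]

leaders, share = dominant_cells(toy_dwes, role="lcell", top=1)
```

counting the `lcell` field gives the number of edges leaving each cell type and counting the `rcell` field gives the number arriving, so the same helper covers both ligand-producing and receptor-bearing summaries.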
in coad (colorectal adenocarcinoma), for ligand-producing cells, the dwes were dominated by astrocytes, mscs, megakaryocytes, and sebocytes, while receptor-producing cells included astrocytes, chondrocytes, and mscs in order of counts of dwes. by summarizing dwes we can possibly categorize cancer types based on which cells are taking part in potentially active interactions.

the above-described edge dominance is related to cells (graph nodes) with high degree. in the language of graphs, the degree is the count of edges connected to a given node or vertex. in stad the cell types with highest degree are megakaryocytes (degree ), followed by neutrophils ( ), clp cells ( ), and erythrocytes ( ) (supplemental figure a,b). however, if we look at the directionality of the directed graph, we see that while megakaryocytes are split nearly evenly in and out, cells like th cells have edges in and only a single edge out, whereas b cells have zero edges in and edges out. the network directionality should be considered in activities such as the modeling of dynamical systems.

within the tumor microenvironment, communication between the multitude of cells happens simultaneously through many ligand-receptor axes. by considering a set of differentially weighted edges within a tissue type, we can construct connected networks that potentially represent dynamic communication. dwes derived by comparing edge weights between clinical groups may indicate which parts of the cell-cell communication network shift together with disease severity.

we sought to identify which aspects of intercellular communication could relate to tumor staging or disease severity. the edges making up the differential networks were used to model clinical states for individual tumors. xgboost models (chen and guestrin, ) were fit on each clinical feature, using edge weights as predictive variables, to infer which edges carried the most information in classification (figure ).
the purpose of the modeling was within-data inference rather than classification outside of the tcga pan-cancer data set. after fitting, it is possible to examine which model features (edges) are most useful in classification. because the xgboost classifiers are regularized models, not all features are used, and often only a small subset of features is retained in the final model. we assess the relative usefulness of a feature by comparing the feature gain, that is, the improvement in accuracy when a feature is added to a tree.

all classification models had an accuracy between % (skcm, pfi) and % (coad, stage). as mentioned above, there can be a high degree of correlation between edge values in a data set. while features are selected first based on improving prediction, the machine learning model accounts for correlated features by selecting the one that has the best predictive power, leaving out other correlated features. the number of features selected by the model is therefore related to the correlation structure. in a set of uncorrelated features where all features add to the predictive power, all features will be selected, whereas for correlated features, only a small number will be selected. this is seen in the results here in terms of differences in the numbers of features compared with the starting network. in the coad-pfi case, the number of features was reduced by approximately %, keeping edges in the model. the stad-pfi features were reduced by approximately %.
other examples are lusc-pfi at % reduction, esca-pfi at %, and skcm-pfi at % ( edges selected), indicating a high degree of internal feature correlation. a similar pattern was observed in the tumor stage contrasts, where skcm-stage had a % reduction in features, stad-stage %, and read-stage %. for coad-stage, feature reduction was %, attributable to the large number of starting edges ( ) compared to the edges selected. a collection of the most predictive edges is given in table .

the collection of genes from each differential network was used for gene ontology (go) term enrichment using the gonet tool (pomaznoy et al., ). all tissue-contrast combinations with differentially weighted edges produced enriched go terms (fdr < . , within tissue contrasts) except the skcm-stage group, which, although it contained genes in the differential network, produced no enriched terms. common themes included structural go terms such as "extracellular structure organization" (for skcm), cell-substrate adhesion (esca, lusc), cell-cell adhesion (stad), and ecm / extracellular matrix organization (lusc, coad, read, stad). cell migration was also a common theme, with "cell migration" (stad), "epithelial cell migration" (skcm), and "regulation of cell migration" (lusc, coad/read). among immune related themes, go terms included "ifng signaling" and "antigen processing and presentation" (skcm), "regulation of immune processes" and "il " (stad), and "viral host response" (coad / read). see table for a summary and supplemental table for complete results.

discussion

patient outcome or response to therapy is not necessarily well predicted by tumor stage alone (kirilovsky et al., ). as fridman et al. wrote, "different types of infiltrating immune cells have different effects on tumour progression, which can vary according to cancer type" (fridman et al., ).
this idea has been developed further with the creation of the 'immunoscore', a prognostic based on the presence and density of particular immune cells in the tme context, expanded to include the peripheral margin as well as the tumor core. for example, the immunoscore in colorectal cancer depends on the density of both cd + lymphocytes (any t cell) and, specifically, cd + cytotoxic t cells in the tumor core and invasive margin (pagès et al., ). the differences in factors that relate to stage and survival are reflected in the current work in the identification of different cell-cell interactions of importance for each.

previous studies have shown that cellular interactions within the tumor microenvironment have an impact on patient survival, drug response, and tumor growth. zhou et al. (zhou et al., ) described alterations in ligand-receptor pair associations in cancer compared to normal tissue, the cell-cell communication structures thereby becoming a generalized phenotype for malignancy. using the same foundational database of possible interactions as this work, ligand-receptor pair expression correlation was compared between tumor and normal tissue. their "aggregate analysis revealed that … tumors of most cancer types generally had reduced (ligand-receptor) correlation compared with the normal tissues."
the ligand-receptor pairs that commonly showed such differences across the ten tissue types studied included plau-itga , liph-lpar , sem g-plxnb , semabd-tyrobp, ccl -ccr , ccl -ccr , and cgn-tyrobp. like the zhou et al. work, we found the collection of associated edges enriched for related biological processes, especially ecm organization and cell adhesion, possibly related to the progression towards dysplasia. for example, in zhou et al., the ligand-receptor pairs col a -itga , col a -itga , mdk-gpc and mmp -itga were found to be positively correlated in cancer but not in normal tissue. in the current work, integrins and laminins generally have elevated edge weights in late tumor stage. in the pfi contrasts, except for esca, such edges have higher weights in shorter pfis, corresponding to more severe progression. regarding sema a, found in the pfi stad results as a predictive feature, previous findings report that the collagen gene col a has been associated with metastasis, and sema a is known to play an important role in integrin-mediated signaling, functioning both in regulating cell migration and in immune responses.

cancers such as esophageal, gastric, and colorectal all show transitions to metaplasia and dysplasia, a process that breaks down the structural order of a tissue, replacing it with disorder and cell transdifferentiation. in our model, a host response is reflected in a change in s score, negative if the edge weight is higher with longer pfi times. in the pfi results, th cells appeared in high scoring edges in skcm, all with negative s values. also, for skcm and coad, ligand-producing (pro-inflammatory) m macrophage edges are present but show both positive and negative s scores. inflammation cytokines il b and il are both present in the results of esca and stad (figure ). in the tumor stage contrasts, we see th and nk cells with inflammation cytokines il a, il b, il , tnf in stad and coad.
so, while certain inflammatory signatures are observed, the absence of well-known canonical edges such as th -il -il rb -m macrophages may be due to essentially no difference, or undetectable differences, in the quantity of th cells or il a expression between pfi groups ( . vs . tcga pan-cancer rsem for short vs long pfi). these observations point to possible mechanisms of action for immune cells known to be important for cancer immune response, the cd + t helper cells and m macrophages, in relation to tumor progression.

in tissues susceptible to dysplasia, such as the tissues explored here, unexpected cell types may be detected. for example, the '...disruption of tissue organization appears to trigger a profound change in cellular commitment, which leads to hepatocyte differentiation in the “oval cells” in … the epithelial cells lining the small pancreatic ductules' (reddy et al., ). as another example, pancreatic cancer is known to have desmoplastic stroma, the source of which may include mscs, which are defined by their ability to differentiate into osteoblasts, chondrocytes, and adipocytes (mathew et al., ). in line with that finding, it has been observed that "...stromal cells isolated from the neoplastic pancreas can differentiate into osteoblasts, chondrocytes, and adipocytes" (mathew et al., ).
it has been reported (yáñez et al., ) that "granulocyte-monocyte progenitors (gmps) and monocyte-dendritic cell progenitors (mdps) produce monocytes during homeostasis and in response to increased demand during infection." or, as in (weston et al., ), "granulocyte-monocyte progenitor (gmp) cells play a vital role in the immune system by maturing into a variety of white blood cells, including neutrophils and macrophages, depending on exposure to cytokines such as various types of colony stimulating factors (csf)." in our results for skcm and coad, gmps had negative s statistics, meaning the late-stage cases had edges with higher weights. the gmp cells most often interacted with (as receptor-bearing cells) mscs, melanocytes, both m and m macrophages, and cd + tem (t effector-memory cells). the presence of gmp related edges may be indicative of the commonly observed 'myeloid dysfunction', which "can promote tumor progression through immune suppression, tissue remodeling, angiogenesis or combinations of these mechanisms" (messmer et al., ). also, "tumors secrete a variety of factors such as g-csf that act in a systemic way to reduce irf- within progenitor cells, releasing myelopoiesis from irf- control such that the granulocytic lineage (blue cell) undergoes hyperplasia, leading to increased immature suppressive cells to promote tumor growth." this is in line with our observations.

megakaryocytes, which derive from multipotent stem cells, typically reside in the bone marrow and produce platelets. megakaryocytes are also produced in the liver, kidney, and spleen. additionally, megakaryocytes have been observed in the lung and circulating blood, where they were useful as a biomarker in prostate cancer. case reports exist showing megakaryocytes in the metaplasia of gastric cancer patients (chatelain et al., ). megakaryocytes respond to a variety of cytokines such as il- , il- , il- , cxcl , cxcl , and ccl . a majority of interacting cells are leukocytes.
in both esophageal and gastric cancers, "...thrombocytosis has been reported in general to be associated with adverse clinical outcomes" (voutsadakis, ). additionally, there are reports of 'tumor educated platelets' that can be useful as part of a liquid biopsy (best et al., ) (haemmerle et al., ).

among the rich literature regarding oncological cytokine networks, there is a strong emphasis on the cancer cell as a central actor. much of the review literature and research focuses on cancer cell interactions in the tme. for example, cancer cells producing an overabundance of il or il have been associated with poor prognosis (burkholder et al., ; fisher et al., ; lippitz and harris, ). however, in this work, the focus has been put on the environment and less on the cancer cell itself. this is largely because, in performing cell deconvolution on gene expression data to determine the presence and quantity of different cell types in the mixed sample, reliable signatures for cancer cells are not readily available: in carcinomas, a cancer cell derives from the epithelium and in many ways remains similar to epithelial cells. even in single cell rna-seq studies, it is often difficult to determine which cells are cancerous, and picking this signature out of a mixed expression dataset remains an open question.

this work is based upon gene expression, rather than protein expression, cell-surface expression or secretion measurements. also, the base expression data is taken from sorted cells, rather than cells in tissue, with an assumption that we cannot get “new/non-scaffold edges” in a tissue/cancer context.
however, new data types and methods including scrna-seq and pic-seq will provide ways of determining new cell-cell interactions that are context specific (giladi et al., ). importantly, the physical and biochemical process of secretion, binding and activation cannot be identified with the current data and method. by identifying the propensity of edge constituents in particular tumor microenvironments in comparison with others, it becomes more likely that communication with activation can take place, as the presence of those constituents is a prerequisite.

with the data and results publicly available in a google bigquery table (supplemental figure ), this resource is open to researchers to explore and ask questions. it is a low-cost way (with free options) to achieve compute cluster performance for quickly answering such questions. the table is easily joined to clinical and molecular annotations and can be worked with from r and python notebooks. with the addition of resources like gtex, it should begin to be possible to tease aberrant, cancer specific interactions apart.

in terms of future work, it could be important to examine communication networks given the immune subtypes of (thorsson et al., ) and communication differences between tcga tumor molecular subtypes. new data types can be applied to enhance the scaffold with knowledge gained from (for example) single-cell rna-seq. in this work, we have introduced a method and identified lines of communication between cells that may play a role in disease.
these lines include both established/recognized cells in the context of cancer, as well as others that should be explored further with targeted methods.

acknowledgments

the authors would like to thank samuel danziger, david reiss, mark mcconnell, andrew dervan, matthew trotter, douglas bassett, robert hershberg, the shmulevich lab and the institute for systems biology for engaging and informative discussions. this study was supported by celgene, a wholly owned subsidiary of bristol-myers squibb, in part through a sponsored research award to d.l.g., b.a. and i.s., and by the cancer research institute (d.l.g., v.t., i.s.). we thank the isb-cgc for their ongoing support. isb-cgc has been funded in whole or in part with federal funds from the national cancer institute, national institutes of health, task order no. x under contract no. n d . the content of this publication does not necessarily reflect the views or policies of the department of health and human services, nor does mention of trade names, commercial products, or organizations imply endorsement by the u.s. government.

competing interests

d.l.g., b.a., v.t. and i.s. declare no competing interests. a.v.r.: bristol-myers squibb: employment, equity ownership.

author contributions

d.l.g., b.a., v.t., a.r., i.s. conceived of the idea. d.l.g. developed the method, wrote the code, and performed the computations. d.l.g. wrote the manuscript with contributions from b.a., v.t., a.r., i.s. and a.r. supervised the project.
all authors provided critical feedback and helped shape the research, analysis and manuscript. tables table . counts of differentially weighted edges compared to the number of samples in each study. study n samples pfi short/long pfi dwes selected feat. model accuracy go results? esca / . y stad / . y paad / - - y coad / . y read / - - y skcm / . y lusc / . y study n samples stage early/late stage dwes selected feat. model accuracy go results? esca / - - - stad / . y paad / - - - coad / . y read / . y skcm / n lusc / - - - study: tissue type, n samples: number of samples used, pfi short/long: number of samples in each group, pfi dwes: number of differentially weighted edges, model accuracy: accuracy of predicting group, go results?: if yes, significant go enrichments. .cc-by . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / table . top most predictive edges from xgboost models. contrast study edgeid lcell ligand receptor rcell s median diff information gain pfi coad megakaryocyt es bmp eng epithelial cells . . . pfi coad astrocytes tnc itga mv endothelial cells . . . pfi coad hepatocytes gdf eng epithelial cells . . . pfi coad astrocytes efnb ephb mesangial cells . . . pfi coad mep timp itgb mep . . . stage coad hepatocytes cgn tgfbr eosinophils - . - . . stage coad eosinophils lamb itgb eosinophils - . - . . stage coad memory b- cells bmp bmpr epithelial cells - . - . . stage coad nk cells tnfsf tnfrsf b cd + memory t- cells . . . stage coad mep b m kir dl idc . . . pfi esca keratinocytes gs adcy cd + tcm . . . pfi esca cd + tcm rbp notch pdc - . - . . pfi esca th cells calm gp naive b-cells . . . pfi esca mesangial cells spp cd tregs . . 
. pfi esca gmp hmgb thbd mep . . . pfi lusc plasma cells vegfa itgb gmp . . . pfi lusc idc vegfa itgb plasma cells . . . pfi lusc gmp adam itgb plasma cells . . . pfi lusc epithelial cells col a itgb cd + naive t-cells . . . pfi lusc keratinocytes thbs itga plasma cells . . . stage read mep tgfb tgfbr mep - . - . . stage read nkt gzmb pgrmc cd + memory t- cells . . . stage read hepatocytes cgn tgfbr cd + tem - . - . . stage read nkt gzmb igf r plasma cells . . . stage read nkt il il rg gmp . . . pfi skcm smooth muscle sema a plxnc pro b-cells . . . .cc-by . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by/ . / pfi skcm macrophages uba notch osteoblast - . - . . pfi skcm basophils vim cd nkt - . - . . pfi skcm smooth muscle psap sort preadipocytes . . . pfi skcm basophils calm ptpra th cells - . - . . stage skcm clp gi cxcr osteoblast . . . stage skcm gmp timp cd plasma cells - . - . . stage skcm clp gi f r mep . . . stage skcm cd + tcm gi tbxa r plasma cells - . - . . stage skcm gmp bst cav msc - . - . . pfi stad keratinocytes calm kcnq eosinophils - . - . . pfi stad mesangial cells tgfb acvr erythrocytes . . . pfi stad cd + t-cells il b il r megakaryocytes . . . pfi stad clp adam itga cd + t-cells . . . pfi stad epithelial cells vcan tlr clp . . . stage stad cd + tem calm kcnq macrophages . . . stage stad astrocytes fbn itgb epithelial cells - . - . . stage stad epithelial cells lamb itgav hepatocytes - . - . . stage stad hepatocytes shh ptch cd + t-cells - . - . . stage stad mesangial cells fgb itgav megakaryocytes - . - . . 
contrast: the groupwise test performed, study: tissue type, edge id: bigquery table lookup id, lcell: cell producing ligands, ligand: ligand gene symbol, receptor: receptor gene symbol, r cell: receptor producing cell, s : between group s statistic, median diff: difference in edge weights between groups, information gain: xgboost information gain after adding feature to model.

table . enriched go terms.

tissue contrast num gos ecm migration immune immune
skcm pfi extracellular structure organization epithelium cell migration ifng signaling antigen processing and presentation
esca pfi cell-substrate adhesion
stad pfi cell-cell adhesion mediated by integrin cell migration regulation of immune system process il
lusc pfi extracellular matrix organization positive regulation of cell migration
coad / read stage ecm regulation of epithelial cell migration viral host response
stad stage ecm / adhesion cell migration

tissue: tcga study, contrast: the groupwise test performed, num gos: number of gene ontology terms found significantly enriched, ecm: go categories involving ecm, migration: go terms involving cell migration, immune: go terms involving immune response, immune : additional go terms involving immune response.
figure legends

figure . overview of workflow showing the transition from data sources to results.

figure . illustration of the probabilistic model and edge weight computations. (a) for a given cell-cell communication edge, (b) per patient values are used to 'look up' probabilities from the distributions learned from all tcga data. those probabilities are then used to compute an edge weight.

figure . diagram of how differentially weighted edges were determined. three samples of edge weights were taken from the pool by tissue source. then matching the sample proportions in the clinical features, permutations were sampled and used for computing randomized s statistics. each sample was used to produce million permuted statistics, and taken together, the millionth percentile was used as a cutoff in determining important edges.

figure . top edges (by s scores) that can distinguish tissue types. each point represents a tumor sample and each panel represents one edge. (a) edgeid , melanocytes-mia-cdh -melanocyte skcm red, uvm blue, brca purple, paad orange. (b) edgeid , msc-tfpi-f -msc, paad red. (c) edgeid , sebocytes-wnt a-fzd -sebocytes, lusc red, luad blue, hnsc purple. (d) edgeid , th cells-il -il rg-megakaryocytes, stad red, read blue, coad purple, esca orange.

figure . (a) median values for each differentially weighted cell-cell edge (dwe) for the pfi categories (in row, dwe edges in columns). (b) examples of differentially weighted edges.

figure . edge member dominance in dwes shown by log counts of cell types.

figure . high probability edges (dwes) from pfi contrasts form predictive connected subnetworks. color indicates the magnitude and direction of s statistics (+ / -).

figure . informative edges selected by xgboost models for prediction within study. color indicates information gain.

figure . cell-cell interaction diagram demonstrating complexity in communication with three cell types that produce the il b ligand that have two possible binding partners on the same receptor bearing cell. edge weight violin plots are shown for two stad pfi groups, short (left) and long (right) pfi.
next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes

jianbo zhang and dilip r. panthee

department of horticultural science, north carolina state university, mountain horticultural crops research and extension center, research drive, mills river, nc , usa

this manuscript was compiled on november ,

the genomic region(s) that controls a trait of interest can be rapidly identified using bsa-seq, a technology in which next-generation sequencing (ngs) is applied to bulked segregant analysis (bsa). we recently developed the significant structural variant method for bsa-seq data analysis that exhibits higher detection power than standard bsa-seq analysis methods. our original algorithm was developed to analyze bsa-seq data in which genome sequences of one parent served as the reference sequences in genotype calling, and thus required the availability of high-quality assembled parental genome sequences. here we modified the original script to allow for the effective detection of the genomic region-trait associations using only bulk genome sequences. we analyzed a public bsa-seq dataset using our modified method and the standard allele frequency and g-statistic methods with and without the aid of the parental genome sequences.
our results demonstrate that, without the parental genome sequences, the genomic region(s) associated with the trait of interest could be reliably identified only via the significant structural variant method.

bsa-seq | pybsaseq | qtl | genomic region-trait association

bulked segregant analysis (bsa) was developed for the quick identification of genetic markers associated with a trait of interest ( , ). for a particular trait, two groups of individuals with contrasting phenotypes are selected from a segregating population. equal amounts of dna are pooled from each individual within a group. the pooled dna samples are then subjected to analysis, such as restriction fragment length polymorphism (rflp) or random amplification of polymorphic dna (rapd). fragments unique to either group are potential genetic markers that may be linked to the gene(s) that control phenotypic expression for the trait of interest. candidate markers are further tested against the population to verify the marker-trait associations. with the recent dramatic reductions in cost, next-generation sequencing (ngs) has been applied to more and more bsa studies ( – ). this new technology is referred to as bsa-seq. in bsa-seq, pooled dna samples are not subjected to rflp/rapd analysis, but are directly sequenced instead. genome-wide structural variants between bulks, such as single nucleotide polymorphisms (snps) and small insertions/deletions (indels), are identified based on the sequencing data. genomic regions linked to the trait-controlling gene(s) are then identified based on the enrichment of the snp/indel alleles in those regions in each bulk. the time-consuming and labor-intensive marker development and genetic mapping steps are eliminated in the bsa-seq method. moreover, snps/indels can be detected genome-wide via ngs, which allows for the reliable identification of trait-associated genomic regions across the entire genome.
for each snp/indel in a bsa-seq dataset, the base (or oligo in the case of an indel) that is the same as in the reference genome is termed the reference base (ref), and the other base is termed the alternative base (alt). because each bulk contains many individuals, the vast majority of snp loci in the dataset have both ref and alt bases. for each snp, the number of reads of its ref/alt alleles is termed allele depth (ad). because of the phenotypic selection via bulking, for trait-associated snps, the alt allele should be enriched in one bulk while the ref allele should be enriched in the other. however, for snps not associated with the trait, both alt and ref alleles would be randomly segregated in both bulks, and neither would be enriched in either bulk. hence the four ad values of a snp/indel (the ref and alt ads in each of the two bulks) can be used to assess how likely the snp/indel is to be associated with the trait.

we have previously developed the significant structural variant method for bsa-seq data analysis ( ). in this method, a snp/indel is assessed with fisher’s exact test using the ad values of both bulks. a snp/indel is considered significant if the p-value of fisher’s exact test is lower than a specific cut-off value, e.g., . . a genomic region normally contains many snps/indels. the ratio of the significant structural variants to the total structural variants is used to judge if this genomic region is associated with the trait of interest. we tested this method using the bsa-seq data of a rice cold-tolerance study ( ). one of the parents in this study was rice cultivar oryza sativa ssp. japonica cv. nipponbare. its high-quality assembled genome sequences were also used as the reference sequences for snp/indel calling, which makes the genotype calling and snp/indel filtering very straightforward: any locus in any bulk that is different from the ref allele is a valid snp/indel ( ).
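as a minimal sketch of the per-snp test described above, the following python code computes a two-sided fisher's exact test from the four ad values using only the standard library. the function names, the argument order and the alpha value in the usage lines are illustrative assumptions and are not taken from the pybsaseq script; the cut-off is deliberately left as a parameter.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def is_significant_snp(ref1, alt1, ref2, alt2, alpha):
    """Hypothetical helper: flag a SNP whose p-value is below the chosen cut-off."""
    return fisher_exact_two_sided(ref1, alt1, ref2, alt2) < alpha


# Illustrative allele depths (not real data): REF/ALT of bulk 1, then REF/ALT of bulk 2.
print(is_significant_snp(30, 5, 6, 28, alpha=0.01))
print(is_significant_snp(18, 17, 16, 19, alpha=0.01))
```

in practice a library routine such as scipy's fisher_exact would normally be preferred; the hand-written version is shown only to make the calculation explicit.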
significance statement

bsa-seq can be utilized to rapidly identify structural variant-trait associations, and our modified significant structural variant method allows the detection of such associations without sequencing the parental genomes, further lowering the sequencing cost and making bsa-seq more accessible to the research community and more applicable to species with a large genome.

author contributions: jz and drp conceived the study. jz developed the algorithm, wrote the python code, analyzed the data, and wrote and edited the manuscript. drp edited the manuscript and supervised the project.

the authors declare no conflict of interest.

to whom correspondence should be addressed. e-mail: dilip_panthee@ncsu.edu or zhang.jianbo@gmail.com

only high-quality assembled genome sequences can serve as the reference sequences in genotype calling, an essential step in bsa-seq data analysis. for most species, however, such sequences are available for only a single or limited number of lines. if lines without high-quality assembled genome sequences are used as the parents in bsa-seq studies, the parental genomes are often sequenced via ngs for the determination of the parental origin of snp alleles and the identification of parental heterozygous snps. modification of our original method to allow the analysis of bsa-seq data in the absence of assembled or ngs-generated parental genome sequences would provide greater flexibility and significantly reduce sequencing costs.
hence, we modified our original script to allow for the identification of the false-positive snps/indels and part of the heterozygous loci in the parents without the aid of the parental genome sequences. using the modified script, along with the scripts for the standard g-statistic and allele frequency methods ( , ), we analyzed a public bsa-seq dataset using either the genome sequences of both the parents and the bulks, or the bulk genome sequences alone. the results revealed that, when using only the bulk genome sequences, reliable detection of genomic region-trait associations can be achieved only via our modified script.

materials and methods

the sequencing data used in this study were generated by lahari et al. ( ). using the allele frequency method, the authors identified a single locus for root-knot nematode resistance in rice. in that study, the parents of the f population were ld and vialonenano, yielding an f population size of (plants), and both the resistant bulk and the susceptible bulk contained plants each. the dna samples of both the parents and the bulks were sequenced using the illumina miseq sequencing system and miseq v chemistry. the bsa-seq sequencing data (err : parent ld ; err : parent vialonenano; err : the resistant bulk from the f population; err : the susceptible bulk from the f population) were downloaded from the european nucleotide archive (ena) using the linux program wget, and the rice reference sequence (release ) was downloaded from https://plants.ensembl.org/oryza_sativa/info/index. sequencing data preprocessing and snp calling were performed as described previously ( ). when analyzing the bsa-seq data with the genome sequences of both the parents and the bulks, bulk/parent snp calling was performed separately. the common snps of the two snp datasets were used for the downstream analysis. the snp dataset generated via snp calling was processed with our python script to identify significant snp-trait associations.
a single script containing all three methods is available at https://github.com/dblhlx/pybsaseq. the workflow of the scripts is as follows:

1. read the .tsv input file generated via snp calling into a pandas dataframe.
2. perform snp filtering on the pandas dataframe.
3. identify the significant snps (ssnps) via fisher’s exact test (the significant structural variant method), calculate the Δaf (allele frequency difference between bulks) values (the allele frequency method), or calculate the g-statistic values (the g-statistic method) using the four ad values (adref and adalt of each of the two bulks) of each snp in the filtered pandas dataframe.
4. use the sliding window algorithm to plot the ssnp/totalsnp ratios, the Δaf values, or the g-statistic values against their genomic positions.
5. estimate the threshold of the ssnp/totalsnp ratio, the Δaf, or the g-statistic via simulation. the thresholds are used to identify the significant peaks/valleys in the plots generated in the previous step.

identification of the ssnps, calculation of the ssnp/totalsnp ratios, the g-statistic values, or the Δaf values, and estimation of their thresholds were carried out as described previously ( ). the . th percentile of simulated ssnp/totalsnp ratios or g-statistic values was used as the threshold for the significant structural variant method or the g-statistic method, and the % confidence interval of simulated Δaf values was used as the threshold for the allele frequency method. for all methods, the size of the sliding windows is mb and the incremental step is kb.

in our previous work, a parent was the japonica rice cultivar nipponbare, and its genome sequences were used as the reference sequences for snp/indel calling. in the current dataset, the parents were ld and vialonenano; many false-positive snps/indels and heterozygous loci in the parents would be included in the dataset if analyzing the bsa-seq data using the original script.
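the sliding-window step of the workflow can be sketched as follows. this is a simplified illustration in plain python, not the authors’ implementation: the function name, argument names, and the brute-force scan over all snps for every window are assumptions made for readability, and the window size and step are passed in as parameters rather than fixed to the values used in the study.

```python
def sliding_window_ratios(positions, is_ssnp, chrom_length, window, step):
    """Return (window_start, n_ssnp, n_total, ratio) for each sliding window.

    positions    : SNP coordinates (bp) on one chromosome
    is_ssnp      : booleans, True where the SNP passed Fisher's exact test
    chrom_length : chromosome length in bp
    window, step : window size and increment in bp (hypothetical parameters)
    """
    results = []
    start = 0
    while start < chrom_length:
        end = start + window
        # Collect the significance flags of all SNPs inside [start, end)
        flags = [sig for pos, sig in zip(positions, is_ssnp) if start <= pos < end]
        n_total = len(flags)
        n_ssnp = sum(flags)
        # A window without SNPs has no defined ratio
        ratio = n_ssnp / n_total if n_total else float("nan")
        results.append((start, n_ssnp, n_total, ratio))
        start += step
    return results
```

plotting the fourth element of each tuple against the first gives the kind of ssnp/totalsnp curve described in step 4; the same loop could average Δaf or g-statistic values per window instead of counting ssnps.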
hence, snp filtering is carried out a little differently from previously described ( ). the following snps are removed (see table s for examples):

• unmapped snps or snps mapped to the mitochondrial or chloroplast genome
• snps with an ‘na’ value in any column of the dataframe
• snps with zero ref read and a single alt allele in both bulks/parents
• snps with three or more alt alleles in any bulk/parent
• snps with two alt alleles and a non-zero ref read in any bulk/parent
• snps in which the bulk/parent genotypes do not agree with the ref/alt bases
• snps in which the bulk/parent genotypes are not consistent with the ad values
• snps with a genotype quality (gq) score less than in any bulk
• snps with very high reads
• snps heterozygous in any parent when parental genome sequences are available

additionally, for snps with two alt alleles and zero ref read in both bulks/parents, the ref allele is replaced with the first allele in the ‘alt’ field, and the alt allele is replaced with the second allele in the original ‘alt’ field. the ref read, and the comma after it, are removed from both allele depth (ad) fields (one for each bulk/parent). this step is carried out before checking the genotype agreement between bulks and the ref/alt fields. when parental genome sequences are involved, the common snp set is identified before filtering out the snps with a low gq score in the parental snp dataset.

the tightly linked snp alleles from the same parent tend to segregate together and should have a similar extent of allele enrichment, and thus similar ad values. in a snp dataset, the genotype of each bulk/parent is represented as ‘gtref/gtalt’ when a snp contains both the ref base and the alt base in the genotype (gt) field, and the ad values in each bulk/parent are represented as ‘adref,adalt’. the genotype and the ad value of the ref allele are always placed first in both fields.
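the re-labelling of snps with two alt alleles and no ref reads can be sketched as below. this is a hedged illustration only: the function name is hypothetical, and the field layout (comma-separated alt alleles, and ad strings of the form ‘ref,alt1,alt2’) is an assumption about the snp table rather than something stated in the text.

```python
def promote_alt_alleles(ref, alt, ad_fields):
    """Re-label a SNP that has two ALT alleles and zero REF reads.

    ref       : REF base from the SNP table, e.g. 'G'
    alt       : ALT field with comma-separated alleles, e.g. 'A,T'
    ad_fields : one AD string per bulk/parent, e.g. ['0,7,5', '0,2,9']

    Returns (new_ref, new_alt, new_ad_fields). Inputs that do not match
    the two-ALT / zero-REF pattern are returned unchanged.
    """
    alts = alt.split(",")
    depths = [field.split(",") for field in ad_fields]
    has_two_alts = len(alts) == 2
    no_ref_reads = all(len(d) == 3 and d[0] == "0" for d in depths)
    if not (has_two_alts and no_ref_reads):
        return ref, alt, list(ad_fields)
    # Drop the REF read (and its comma) from every AD field
    new_ad = [",".join(d[1:]) for d in depths]
    # First ALT allele becomes REF, second ALT allele becomes ALT
    return alts[0], alts[1], new_ad
```

for instance, `promote_alt_alleles('G', 'A,T', ['0,7,5', '0,2,9'])` returns `('A', 'T', ['7,5', '2,9'])`, after which the snp can be checked for genotype agreement like any biallelic locus.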
for a snp locus in the .tsv input file, the allele having the same genotype as that in the reference genome is defined as the ref allele. however, it is highly unlikely that all of the snp alleles in a parent are the same as those in the reference genome, except in instances where the reference genome sequences used in snp calling are from one of the parents, as in the cold-tolerance study mentioned above ( ). it is necessary to place the genotypes and ad values of all snp alleles from one parent (e.g., ld ) in the ref position, and those from the other parent (e.g., vialonenano) in the alt position in the gt and ad fields to make the bulk dataset consistent. thus, for a particular snp, if the ref base in the .tsv file is different from the genotype of ld (either parent will work), its gt/ad values are swapped, e.g., ‘g/a’ to ‘a/g’ and ‘ , ’ to ‘ , ’. ad/gt swapping is performed following snp filtering and is performed only when the parental genome sequences are used to aid bsa-seq data analysis. Δaf is calculated with the following equation, in which the subscripts 1 and 2 denote the two bulks; ad swapping ensures that adjacent snps have similar Δaf values.

$$\Delta\mathrm{af} = \frac{\mathrm{ad}_{\mathrm{alt},2}}{\mathrm{ad}_{\mathrm{ref},2} + \mathrm{ad}_{\mathrm{alt},2}} - \frac{\mathrm{ad}_{\mathrm{alt},1}}{\mathrm{ad}_{\mathrm{ref},1} + \mathrm{ad}_{\mathrm{alt},1}}$$

results

the original sequence reads were . g, . g, . g, and . g; they became . g, . g, . g, and . g after quality control, respectively, in err (parent ld ), err (parent vialonenano), err (the resistant bulk), and err (the susceptible bulk), which correspond to . ×, . ×, . ×, and . × coverage, respectively ( ). the preprocessed sequences were used for snp calling to generate a snp
dataset, which was analyzed using the modified significant structural variant method, the g-statistic method, and the allele frequency method with or without the aid of the parental genome sequences.

bsa-seq data analysis using the genome sequences of both the parents and the bulks. the snp calling-generated parent/bulk snp dataset was processed with the python script pybsaseq_wp.py (https://github.com/dblhlx/pybsaseq/blob/master/pybsaseq_wp.py). snp filtering was performed as described in the materials and methods section. the parental snp dataset was processed first, and the snps heterozygous in any parent were eliminated because all algorithms assume all snp loci are homozygous in the parental lines. threshold estimation is based on this assumption. although most rice breeding lines should be homozygous in most loci, more than % heterozygous snp loci ( homozygous and heterozygous) were identified in the parental snp dataset. however, because gatk’s variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity (https://gatk.broadinstitute.org/hc/en-us/articles/ -germline-short-variant-discovery-snps-indels-), we cannot rule out the possibility that some of the heterozygous loci were caused by sequencing artifacts. the bulk snp dataset was processed second. the snps with the same chromosome id and the same genomic coordinate in both datasets were considered common snps. common snps in the bulk dataset were used to detect snp-trait associations for all three methods.

table . chromosomal distribution of snps using the genome sequences of both the parents and the bulks (columns: chromosome, ssnps, totalsnps, ssnp/totalsnp; one row per chromosome plus a genome-wide row).
the significant structural variant method: each snp in the dataset was tested via fisher’s exact test using its four ad values, and snps with p-values less than . were defined as ssnps. the chromosomal distributions of the ssnps and the total snps are summarized in table . using the sliding window algorithm, the genomic distribution of the ssnps, the total snps, and the ssnp/totalsnp ratios of sliding windows were plotted against their genomic position (figure a and figure b). a genome-wide threshold was estimated as . via simulation as described previously ( ). two peaks above the threshold were identified: a minor one on chromosome and a major one on chromosome . the position of the peak on chromosome was at . mb, the sliding window contained ssnps and total snps, corresponding to an ssnp/totalsnp ratio of . ; the position of the peak on chromosome was at . mb, the sliding window contained ssnps and total snps, corresponding to an ssnp/totalsnp ratio of . . the sliding window-specific threshold was estimated for each peak via simulation, and the values were . and . , respectively, indicating both peaks were significant. both values are higher than the genome-wide threshold, probably due to the lower amounts of total snps in these sliding windows. the average snps per sliding window was .

the g-statistic method: the g-statistic value of each snp in the dataset was calculated, and its threshold was estimated via simulation as described previously ( ). using the sliding window algorithm, the g-statistic value of each sliding window, the average g-statistic values of all snps in that sliding window, was plotted against its genomic position (figure c), and the curve pattern was very similar to that in figure b. a significant peak was identified on chromosome ; its position was at . mb, its g-statistic value was . , well above the threshold . ( . th percentile).
the allele frequency method: the Δaf value of each snp in the dataset was calculated, and the Δaf threshold of the snp was estimated via simulation as described previously ( ). using the sliding window algorithm, the Δaf value of each sliding window, the average Δaf values of all snps in that sliding window, was plotted against its genomic position (figure d). a significant peak on chromosome was identified, the peak position was located at . mb, its Δaf value was . , and the % confidence interval was − . to . .

bsa-seq data analysis using only the bulk genome sequences. the snp calling-generated bulk snp dataset was processed with the python script pybsaseq.py (https://github.com/dblhlx/pybsaseq/blob/master/pybsaseq.py). all the methods and parameters were the same as above; the only difference was that the parental snp dataset was not used.

the significant structural variant method: the chromosomal distribution of the ssnps and total snps are summarized in table . the total number of snps was here, much higher than the above, which was . the genomic distribution of the ssnps, the total snps, and the ssnp/totalsnp ratios of the sliding windows are presented in figure a and figure b. the patterns of the curves were very similar to those in figure a and figure b. one of the obvious differences was that ssnp/totalsnp ratios of the sliding windows were much lower than those in figure b, leading to missing the minor locus on chromosome . only the peak on chromosome was significant; it was located at . mb, a kb shift compared to figure b. the sliding window contained ssnps and total snps, corresponding to a . ssnp/totalsnp ratio, well above the genome-wide threshold ( . ) and the sliding window specific threshold ( . ). the average snps per sliding window was .

the g-statistic method: the patterns of the g-statistic value plot (figure c) were very similar to that in figure c, but the g-statistic values were significantly lower than those in figure c, and the threshold did not change much.
only a single sliding window was above the threshold ( . ), its position was at . mb, and its g-statistic value was . .

figure . bsa-seq data analysis using the genome sequences of both the parents and the bulks. the red lines/curves are the thresholds. (a) genomic distributions of ssnps (blue) and totalsnps (black). (b) genomic distributions of ssnp/totalsnp ratios. (c) genomic distributions of g-statistic values. (d) genomic distributions of Δaf values. the x-axis of all panels is the genomic position (× mb) along each chromosome.

the allele frequency method: without the aid of the parental genome sequences, the pattern of the Δaf curve of chromosome (figure d), especially the genomic region associated with the trait, was drastically different from that in figure d. differences in the curve patterns were observed in other chromosomes as well, but they were relatively minor. all Δaf values were within the % confidence interval, although
ad swapping was performed on only snps, . % of total snps.

table . chromosomal distribution of snps using only the bulk genome sequences (columns: chromosome, ssnps, totalsnps, ssnp/totalsnp; one row per chromosome plus a genome-wide row).

discussion

we tested how parental genome sequences affected the detection of snp-trait associations via bsa-seq using a dataset of the rice root-knot nematode resistance. using the genome sequences of both the parents and bulks, a major locus on chromosome and a minor locus on chromosome were detected via the significant structural variant method. however, only the major locus was detected via the g-statistic method and the allele frequency method. the positions of the peaks detected via different methods were not the same, but they were very close to each other. using only the bulk genome sequences, the major locus can be detected via only the significant structural variant and g-statistic methods. the allele frequency method uses the Δaf value of a snp to measure allele (ref/alt) enrichment in the snp locus, and the g-statistic method uses the g-statistic value of a snp to measure the allele enrichment. Δaf and g-statistic are parameters at the snp level; therefore, both methods use a snp-level parameter to identify significant sliding windows for the detection of the genomic region-trait associations. the significant structural variant method, however, uses the ssnp/totalsnp ratio, a parameter at the sliding window level, to measure the ssnp enrichment in a sliding window for the identification of the trait-associated genomic regions. a snp normally has fewer than reads because of cost concerns, while a sliding
window normally contains thousands of snps. thus, the significant structural variant method has much higher statistical power, which is consistent with our observation.

figure . bsa-seq data analysis using only the bulk genome sequences. the red lines/curves are the thresholds. (a) genomic distributions of ssnps (blue) and totalsnps (black). (b) genomic distributions of ssnp/totalsnp ratios. (c) genomic distributions of g-statistic values. (d) genomic distributions of Δaf values. the x-axis of all panels is the genomic position (× mb) along each chromosome.

our results revealed that the parental genome sequences did not much affect the plot patterns of the ssnp/totalsnp ratios and the g-statistic values. however, the plot patterns of the Δaf value of chromosome were altered dramatically when the parental genome sequences were not used. the significant structural variant method assesses if a snp is likely associated with the trait via fisher’s exact test. the greater the alt proportion differences between the bulks, the smaller the p-value of the fisher’s exact test, and the more likely the snp is associated with the trait. in the script, fisher’s exact test takes a numpy array or a python list as its input; the same p-value will be obtained with either [[adref, adalt], [adref, adalt]] or [[adalt, adref], [adalt, adref]] as its input, where each inner list holds the ad values of one bulk. the g-statistic method assesses if a snp is likely associated with the trait via the g-test; the greater the g-statistic value of a snp, the more likely it contributes to the trait phenotype ( ). the g-statistic values are likewise the same with either input, [[adref, adalt], [adref, adalt]] or [[adalt, adref], [adalt, adref]].
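the invariance of the g-statistic to the order of the ref and alt columns can be checked numerically. the sketch below uses the standard g-test form for a 2×2 table, g = 2 Σ o·ln(o/e), which is assumed here to match the g-statistic used in the study; the function names are hypothetical and the table is assumed to contain at least one read.

```python
from math import log


def g_statistic(table):
    """Standard G-test statistic for a 2x2 table of allele depths.

    table = [[ad_ref, ad_alt], [ad_ref, ad_alt]], one inner list per bulk.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            if observed > 0:  # a cell with zero reads contributes nothing
                g += observed * log(observed / expected)
    return 2.0 * g


def swap_columns(table):
    """Exchange the REF and ALT depths within every bulk."""
    return [[alt, ref] for ref, alt in table]
```

calling `g_statistic(t)` and `g_statistic(swap_columns(t))` on the same table `t` gives the same value up to floating-point rounding, because swapping the columns only reorders the four terms of the sum.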
the order of the ad values (ref/alt reads) in bulks does not affect the p-value of fisher’s exact test or the g-statistic value of the g-test, which is why the parental genome sequences-guided ad swapping does not alter the curve patterns of either method. therefore, theoretically, parental genome sequences are not required to identify genomic region-trait associations in either the significant structural variant method or the g-statistic method.

when the parental genome sequences were used, ad value swapping was performed for the snps in which the genotype of ld was different from the ref base, and the Δaf values of these snps were calculated from the swapped ad values using the Δaf equation given in the materials and methods. ad swapping makes the adjacent snp alleles from the same parent have similar ad values and similar Δaf values. without ad swapping, the Δaf values of such snps are effectively calculated using the first equation below. that equation can be converted to the second equation below, which produces a value opposite to that produced by the equation in the materials and methods. for two adjacent snps in ld , where one snp has the same genotype as the ref base while the other has the same genotype as the alt base, they would have opposite Δaf values if ad swapping is not performed. for the snps that do not contribute to the trait phenotype and are not linked to any trait-associated genomic regions, their Δaf values should fluctuate around zero. the parental genome sequences will have less effect on the Δaf value of the sliding windows containing such snps. however, for trait-associated snps, adjacent snps with opposite Δaf values would cancel each other out and lower the Δaf value of the sliding window significantly, which is the case observed on chromosome in figure d.

$$\Delta\mathrm{af} = \frac{\mathrm{ad}_{\mathrm{ref},2}}{\mathrm{ad}_{\mathrm{ref},2} + \mathrm{ad}_{\mathrm{alt},2}} - \frac{\mathrm{ad}_{\mathrm{ref},1}}{\mathrm{ad}_{\mathrm{ref},1} + \mathrm{ad}_{\mathrm{alt},1}}$$
$$\Delta\mathrm{af} = \frac{\mathrm{ad}_{\mathrm{alt},1}}{\mathrm{ad}_{\mathrm{ref},1} + \mathrm{ad}_{\mathrm{alt},1}} - \frac{\mathrm{ad}_{\mathrm{alt},2}}{\mathrm{ad}_{\mathrm{ref},2} + \mathrm{ad}_{\mathrm{alt},2}}$$

when the parental genome sequences were not used, the ssnp/totalsnp ratios and the g-statistic values were significantly lower. the peak ssnp/totalsnp ratio on chromosome was . in figure b, while it was . in figure b; it was similar for the peak g-statistic values. the decrease in the ssnp/totalsnp ratio and the g-statistic value is likely caused by sequencing artifacts and heterozygosity in the parental lines. there were snps in the bulk dataset when not using the parental genome sequences, while there were snps in the dataset with the aid of the parental genome sequences. comparison of the two snp datasets revealed that snps were unique to the bulks. because all the snps in the bulks are derived from the parental lines, crossing should not generate new snps; thus this category of snps was most likely caused by sequencing artifacts. the sequencing coverage in the bulk was less than eight, which is very low. higher sequencing coverage would help decrease the number of snps derived from sequencing artifacts. additionally, snps were heterozygous in the parental lines. without the parental genome sequences, this category of snps could not be filtered out from the bulk snp dataset. however, the number of these snps can be decreased by selfing the parental lines for more generations: five generations of selfing can decrease the heterozygosity of both parental lines to a maximum of . %.

to determine how parental heterozygosity and sequencing artifacts affected the detection of genomic region-trait associations, we removed the heterozygous snps or the bulk-specific snps from the bulk snp dataset, and analyzed the data separately. by removing the heterozygous snps, the peak on chromosome was shifted to .
mb for both the ssnp/totalsnp ratio and the g-statistic value, and the ssnp/totalsnp ratio of the peak was increased to . , well above the sliding window-specific threshold . . the g-statistic value of the peak was . , significantly higher than the threshold . as well. by removing bulk-specific snps, the peak on chromosome shifted to . mb for both the ssnp/totalsnp ratio and the g-statistic value. the ssnp/totalsnp ratio of the peak and the sliding window-specific threshold were . and . , respectively, and the g-statistic value of the peak and the threshold were . and . , respectively. although both the ssnp/totalsnp ratio and the g-statistic value were lower than above, they were still higher than their corresponding thresholds. while it seemed that the heterozygous snps affected the ssnp/totalsnp ratio and the g-statistic value a little more than the bulk-specific snps, it is more likely that both produced similar levels of noise for the ssnp/totalsnp ratio and the g-statistic value, considering that the former outnumbered the latter.

when using only the bulk genome sequences, the ssnp/totalsnp peak position on chromosome was shifted . mb ( . mb to . mb) due to the presence of the bulk-specific snps and the heterozygous snps in the dataset, but this is a very short distance for genetic mapping. although only a single dataset was examined here, the genome-wide similarity of the ssnp/totalsnp curve patterns in figure b and figure b suggests that the significant structural variant method is highly reproducible using only the bulk genome sequences.

conclusions

the plotting pattern of the Δaf values in the trait-associated genomic region was very different when using only the bulk genome sequences. without the aid of the parental genome sequences, the Δaf values of the sliding windows could not be correctly calculated; thus, the allele frequency method cannot be used to identify snp-trait association.
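the cancellation of Δaf values that occurs without ad swapping, as described in the discussion, can be illustrated with a toy example. the sketch below is not taken from the authors’ script: the function names are hypothetical, the read counts are invented for illustration, and Δaf is assumed to be the alt frequency of the second bulk minus that of the first.

```python
def delta_af(ad_bulk1, ad_bulk2):
    """Delta-AF of one SNP: ALT frequency in bulk 2 minus ALT frequency in bulk 1.

    Each argument is a (ad_ref, ad_alt) pair; the bulk order is an assumption.
    """
    ref1, alt1 = ad_bulk1
    ref2, alt2 = ad_bulk2
    return alt2 / (ref2 + alt2) - alt1 / (ref1 + alt1)


def swap_ad(snp):
    """Swap REF and ALT depths in both bulks of one SNP."""
    (ref1, alt1), (ref2, alt2) = snp
    return ((alt1, ref1), (alt2, ref2))


def window_mean_delta_af(snps):
    """Average Delta-AF over the SNPs of one sliding window."""
    values = [delta_af(bulk1, bulk2) for bulk1, bulk2 in snps]
    return sum(values) / len(values)


# Two tightly linked, trait-associated SNPs (made-up read counts).
# In snp_a the allele of the first parent matches the reference base;
# in snp_b the same parent's allele happens to sit in the ALT position.
snp_a = ((8, 2), (2, 8))
snp_b = ((2, 8), (8, 2))

without_swapping = window_mean_delta_af([snp_a, snp_b])
with_swapping = window_mean_delta_af([snp_a, swap_ad(snp_b)])
```

here `without_swapping` is zero because the two snps carry opposite Δaf values, while `with_swapping` stays close to 0.6, which mirrors why the Δaf peak in the trait-associated region disappears when the parental genotypes are unavailable to guide the swap.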
in contrast, the parental genome sequences do not affect the plotting patterns of either the significant structural variant method or the g-statistic method, but the ssnp/totalsnp ratios and the g-statistic values decreased significantly due to sequencing artifacts and/or heterozygosity of the parental lines. because of its high detection power, major snp-trait associations can still be reliably detected via the significant structural variant method even when the sequence coverage is very low.

acknowledgments. jz was supported by the national science foundation grant [ios- to drp]. we are grateful to lahari et al. for generating the sequencing data and making it available to the public. we thank irene e. palmer for critical review and thank nathan lynch for valuable comments. the manuscript was prepared using a modified version of the pnas latex template.

bibliography

rw michelmore, i paran, rv kesseli, identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. proc. natl. acad. sci. u.s.a. , – ( ).
jj giovannoni, ra wing, mw ganal, sd tanksley, isolation of molecular markers from specific chromosomal intervals using dna pools from existing mapping populations. nucleic acids res , – ( ).
i imerovski, et al., bsa-seq mapping reveals major qtl for broomrape resistance in four sunflower lines. mol breed. , ( ).
s arikit, et al., qtl-seq identifies cooked grain elongation qtls near soluble starch synthase and starch branching enzymes in rice (oryza sativa l.). sci rep , – ( ).
q chen, et al., identification and genetic mapping for rht-dm, a dominant dwarfing gene in mutant semi-dwarf maize using qtl-seq approach. genes genomics , – ( ).
j clevenger, et al., mapping late leaf spot resistance in peanut (arachis hypogaea) using qtl-seq reveals markers for marker-assisted selection. front plant sci , ( ).
f duveau, et al., mapping small effect mutations in saccharomyces cerevisiae: impacts of experimental design and mutational properties. g (bethesda) , – ( ).
j zhang, dr panthee, pybsaseq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data. bmc bioinforma. , ( ).
z yang, et al., mapping of quantitative trait loci underlying cold tolerance in rice seedlings via high-throughput sequencing of pooled extremes. plos one , e ( ).
h takagi, et al., qtl-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of dna from two bulked populations. plant j. , – ( ).
pm magwene, jh willis, jk kelly, the statistics of bulk segregant analysis using next generation sequencing. plos comput. biol. , e ( ).
z lahari, et al., qtl-seq reveals a major root-knot nematode resistance locus on chromosome in rice (oryza sativa l.). euphytica , ( ).

scalable bias-corrected linkage disequilibrium estimation under genotype uncertainty

david gerard
department of mathematics and statistics, american university, washington, dc, , usa

abstract

linkage disequilibrium (ld) estimates are often calculated genome-wide for use in many tasks, such as snp pruning and ld decay estimation. however, in the presence of genotype uncertainty, naive approaches to calculating ld have extreme attenuation biases, incorrectly suggesting that snps are less dependent than in reality.
these biases are particularly strong in polyploid organisms, which often exhibit greater levels of genotype uncertainty than diploids. a principled approach using maximum likelihood estimation with genotype likelihoods can reduce this bias, but is prohibitively slow for genome-wide applications. here, we present scalable moment-based adjustments to ld estimates based on the marginal posterior distributions of the genotypes. we demonstrate, on both simulated and real data, that these moment-based estimators are as accurate as maximum likelihood estimators, and are almost as fast as naive approaches based only on posterior mean genotypes. this opens up bias-corrected ld estimation to genome-wide applications. additionally, we provide standard errors for these moment-based estimators. all methods are implemented in the ldsep r package on github https://github.com/dcgerard/ldsep.

introduction

pairwise linkage disequilibrium (ld), the statistical association between alleles at two different loci, has applications in genotype imputation [wen and stephens, ], genome-wide association studies [zhu and stephens, ], genomic prediction [wientjes et al., ], population genetics [slatkin, ], and many other tasks [sved and hill, ]. ld is often estimated from next-generation sequencing technologies, where the genotypes and haplotypes are not known with certainty [gerard et al., ]. thus, researchers typically use estimated genotypes, such as posterior mean genotypes [fox et al., ], to estimate ld. however, this can cause biased ld estimates, attenuated toward zero, implying loci are less dependent than in reality. this bias is particularly strong in polyploids, and so in gerard [ ] we derived maximum likelihood estimates (mles) that have lower bias and are consistent estimates of ld. unfortunately, the mle approach is prohibitively slow. researchers typically calculate pairwise ld at genome-wide scales, and the mle approach takes on the order of a tenth of a second per pair of snps.
thus, for many genome-wide applications, containing millions of snps, ld estimation using the mle approach would take years of computation time. this is not conducive to large-scale applications.

here, we derive scalable approaches to estimate ld that account for genotype uncertainty (section ). our methods use only the first two moments of the marginal posterior genotype distribution for each individual at each locus, which are often provided or easily obtainable from many genotyping programs. we calculate sample moments from these posterior moments, and use these to multiplicatively inflate naive ld estimates. we show, through simulations (section . ) and real data (section . ), that our estimates can reduce attenuation bias and improve ld estimates when genotypes are uncertain. all calculations have computational complexities that are linear in the sample size, and so these estimates are scalable to genome-wide applications.

keywords and phrases: attenuation bias, genotype likelihood, linkage disequilibrium, polyploidy, reliability ratio.

methods

in this section, we will define moment-based estimators of the ld coefficient ∆ [lewontin and kojima, ], the standardized ld coefficient ∆′ [lewontin, ], and the pearson correlation ρ [hill and robertson, ].
we will only consider estimating the “composite” versions of these ld measures which, advantageously, are appropriate ld measures for generic autopolyploid, allopolyploid, and segmental allopolyploid populations, even in the absence of hardy-weinberg equilibrium [gerard, ]. we will also only consider biallelic loci, where the genotype for each individual is the dosage (from 0 to the ploidy) of one of the two alleles.

to define our estimators of ld, we assume the user provides the posterior means and variances for the genotypes for each individual at two loci. the full posterior genotype distribution for each individual is often provided by genotyping software [gerard et al., , gerard and ferrão, , e.g.], from which these posterior moments can be obtained. if genotype posteriors are not provided, genotype likelihoods may be normalized to posterior probabilities (assuming a uniform prior) and used in what follows.

let $x_{ia}$ and $x_{ib}$ be the posterior means at loci a and b for individual $i \in \{1, \ldots, n\}$. let $y_{ia}$ and $y_{ib}$ be the posterior variances at loci a and b for individual $i$. our estimators are based entirely on the following sample moments of these posterior moments, which may be calculated in linear time in the sample size, $n$.

$$u_{xa} := \frac{1}{n}\sum_{i=1}^{n} x_{ia}, \quad u_{xb} := \frac{1}{n}\sum_{i=1}^{n} x_{ib}, \tag{1}$$

$$v_{xa} := \frac{1}{n-1}\sum_{i=1}^{n} (x_{ia} - u_{xa})^2, \quad v_{xb} := \frac{1}{n-1}\sum_{i=1}^{n} (x_{ib} - u_{xb})^2, \tag{2}$$

$$c_x := \frac{1}{n-1}\sum_{i=1}^{n} (x_{ia} - u_{xa})(x_{ib} - u_{xb}), \tag{3}$$

$$u_{ya} := \frac{1}{n}\sum_{i=1}^{n} y_{ia}, \quad \text{and} \quad u_{yb} := \frac{1}{n}\sum_{i=1}^{n} y_{ib}. \tag{4}$$

for a k-ploid species, our ld estimators, which we derive in section s1, are as follows. the estimated ld coefficient is

$$\hat{\Delta} := \left(\frac{u_{ya} + v_{xa}}{v_{xa}}\right)\left(\frac{u_{yb} + v_{xb}}{v_{xb}}\right)\left(\frac{c_x}{k}\right). \tag{5}$$

the estimated pearson correlation is

$$\hat{\rho} := \sqrt{\frac{u_{ya} + v_{xa}}{v_{xa}}}\,\sqrt{\frac{u_{yb} + v_{xb}}{v_{xb}}}\;\frac{c_x}{\sqrt{v_{xa} v_{xb}}}. \tag{6}$$

note that $c_x/\sqrt{v_{xa} v_{xb}}$ is the sample pearson correlation between posterior mean genotypes. the estimated standardized ld coefficient is

$$\hat{\Delta}' := \hat{\Delta}/\hat{\Delta}_m, \quad \text{where} \tag{7}$$

$$\hat{\Delta}_m := \begin{cases} \min\{u_{xa} u_{xb},\ (k - u_{xa})(k - u_{xb})\}/k^2 & \text{if } c_x < 0, \text{ and} \\ \min\{u_{xa}(k - u_{xb}),\ (k - u_{xa}) u_{xb}\}/k^2 & \text{if } c_x > 0. \end{cases} \tag{8}$$

equations (5)–(7) take the naive estimators most researchers use in practice (the sample covariance/correlation of posterior means) and inflate these by a multiplicative effect. such multiplicative effects are sometimes called “reliability ratios” in the measurement error models literature [fuller, ]. due to sampling variability, this inflation could result in estimates that lie beyond the theoretical bounds of the parameters being estimated. in such cases, we apply the following truncations.

$$\tilde{\rho} := \begin{cases} \max\{\hat{\rho}, -1\} & \text{if } \hat{\rho} < 0 \\ \min\{\hat{\rho}, 1\} & \text{if } \hat{\rho} > 0 \end{cases} \tag{9}$$

$$\tilde{\Delta} := \begin{cases} \max\{\hat{\Delta},\ -\sqrt{(v_{xa} + u_{ya})(v_{xb} + u_{yb})}/k\} & \text{if } \hat{\Delta} < 0 \\ \min\{\hat{\Delta},\ \sqrt{(v_{xa} + u_{ya})(v_{xb} + u_{yb})}/k\} & \text{if } \hat{\Delta} > 0 \end{cases} \tag{10}$$

$$\tilde{\Delta}' := \begin{cases} \max\{\hat{\Delta}', -k\} & \text{if } \hat{\Delta}' < 0 \\ \min\{\hat{\Delta}', k\} & \text{if } \hat{\Delta}' > 0 \end{cases} \tag{11}$$

standard errors are important for hypothesis testing [brown, ], read-depth suggestions [maruki and lynch, ], and shrinkage [dey and stephens, ]. because estimators (5)–(7) are functions of sample moments, their standard errors can be derived by appealing to the central limit theorem, followed by an application of the delta method (section s2). additional considerations for improving our estimates of the reliability ratios, such as using hierarchical shrinkage [stephens, ], are considered in section s3.

all methods are implemented in the ldsep r package on github https://github.com/dcgerard/ldsep.

results

simulations

we compared our moment-based estimators ( )–( ) to those of the mle of gerard [ ] as well as the naive estimator that calculates the sample covariance and sample correlation between posterior mean genotypes at two loci.
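the moment-based estimators and truncations defined in the methods section can be sketched in a few lines. this is a minimal illustration and not the ldsep implementation: the function name `moment_ld` and its return keys are hypothetical, the normalization of the standardized coefficient by k squared is an assumption about the garbled formula, and missing data and degenerate (zero-variance) loci are not handled.

```python
import math


def moment_ld(x_a, x_b, y_a, y_b, k):
    """Moment-based LD estimates from posterior means (x) and variances (y).

    x_a, x_b: posterior mean genotypes at loci a and b (one entry per individual)
    y_a, y_b: posterior genotype variances at loci a and b
    k: ploidy
    """
    n = len(x_a)
    # Sample moments of the posterior moments (linear time in n).
    u_xa = sum(x_a) / n
    u_xb = sum(x_b) / n
    v_xa = sum((x - u_xa) ** 2 for x in x_a) / (n - 1)
    v_xb = sum((x - u_xb) ** 2 for x in x_b) / (n - 1)
    c_x = sum((a - u_xa) * (b - u_xb) for a, b in zip(x_a, x_b)) / (n - 1)
    u_ya = sum(y_a) / n
    u_yb = sum(y_b) / n

    # Reliability ratios: multiplicative inflation of the naive estimates.
    rr_a = (u_ya + v_xa) / v_xa
    rr_b = (u_yb + v_xb) / v_xb

    delta = rr_a * rr_b * c_x / k
    rho = math.sqrt(rr_a * rr_b) * c_x / math.sqrt(v_xa * v_xb)

    # Normalizer for the standardized coefficient (assumed to be on the k^2 scale).
    if c_x < 0:
        delta_m = min(u_xa * u_xb, (k - u_xa) * (k - u_xb)) / k ** 2
    else:
        delta_m = min(u_xa * (k - u_xb), (k - u_xa) * u_xb) / k ** 2
    delta_prime = delta / delta_m if delta_m > 0 else 0.0

    # Truncate to the theoretical bounds of each parameter.
    bound = math.sqrt((v_xa + u_ya) * (v_xb + u_yb)) / k
    return {
        "delta": delta,
        "rho": rho,
        "delta_prime": delta_prime,
        "delta_trunc": max(-bound, min(bound, delta)),
        "rho_trunc": max(-1.0, min(1.0, rho)),
        "delta_prime_trunc": max(-float(k), min(float(k), delta_prime)),
    }


# Toy tetraploid example: perfectly correlated posterior means with
# non-zero posterior variances, so the inflated estimates need truncation.
est = moment_ld([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0.5] * 5, [0.5] * 5, k=4)
```

in this toy example the naive correlation is already 1, so the inflated correlation exceeds 1 and is truncated back to the boundary, which is the situation the truncation rules are meant to handle.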
in each replication, we generated genotypes for n ∈{ , , } individuals with ploidy k ∈{ , , , } under hardy-weinberg equilibrium at two loci with major allele frequencies (pa,pb) ∈{( . , . ), ( . , . ), ( . , . )} and pearson correlation ρ ∈{ , . , . }. we then used updog’s rflexdog() function [gerard et al., , gerard and ferrão, ] to generate read-counts at read-depths of either or , a sequencing error rate of . , an overdispersion value of . , and no allele bias. updog was then used to generate genotype likelihoods and genotype posterior distributions for each individual at each snp. these were then fed into ldsep to obtain the mle, our new moment-based estimator, and the naive estimator. simulations were replicated times for each unique combination of simulation parameters.

the accuracy of estimating ρ when pa = pb = . at a read-depth of is presented in figure 1. the results for other scenarios are similar and may be found on github (https://github.com/dcgerard/ldfast_sims). we see that the moment-based estimator and the mle perform comparably, even for small read-depth and sample size. the naive estimator has a strong attenuation bias toward zero. this bias is particularly prominent for higher ploidy levels. for example, for an octoploid species where the true ρ is . , the naive estimator appears to converge to a ρ estimate of around . . this bias does not disappear with increasing sample size. estimated standard errors are reasonably well-behaved, except for ρ̂ and ρ̂² when the sample size is small and the ld is large (figure 2).

ld estimates for solanum tuberosum

we evaluated our methods on the autotetraploid potato (solanum tuberosum, n = x = ) genotyping-by-sequencing data from uitdewilligen et al. [ ]. we used updog [gerard et al., , gerard and ferrão, ] to obtain the posterior moments for each individual’s genotype at each snp on a single super scaffold (pgsc dmb ). to remove monoallelic snps, we filtered out snps with allele frequencies either greater than . or less than . , and filtered out snps with a variance of posterior means less than . . this resulted in snps. we then estimated the squared correlation between each pair of snps using either the naive approach of calculating the sample pearson correlation between posterior means, or using our new moment-based approach ( ).

our estimators are scalable. on a . ghz quad-core pc running linux with gb of memory, it took a total of . seconds to estimate all pairwise correlations using our new moment-based approach, which is a small increase over the . seconds it took to estimate all pairwise correlations using the naive approach. in gerard [ ], we found that the mle approach took about . seconds for each pair of snps for a tetraploid individual. extrapolating this to snps would indicate that the mle approach would take about . days of computation time to calculate all pairwise ld estimates on this dataset (number of snp pairs × . sec × 1 min/60 sec × 1 hr/60 min × 1 d/24 hr = . d).

the histogram of estimated reliability ratios is presented in figure 3. we see there that the reliability ratios of most snps increase their correlation estimates by less than %. however, a non-negligible portion have reliability ratios that increase the correlation estimates by more than %. to evaluate the ld estimates of high reliability ratio snps, we calculated the mles for ρ between the twenty snps with the largest reliability ratios. a pairs plot for ρ estimates between the three approaches is presented in figure 4. we see there that the mle and new moment-based approach result in very similar ρ estimates, while the naive approach using posterior means results in much smaller ρ estimates.

discussion

it has been known since at least the time of spearman that the sample correlation coefficient (or, similarly, the ordinary least squares estimator in simple linear regression) is attenuated in the presence of uncertain variables [spearman, ]. methods to adjust for this bias include assuming prior knowledge on the measurement variances or the ratio of measurement variances (resulting from, for example, repeated measurements on the same individuals) [koopmans, , degracie and fuller, ], using instrumental variables [carter and fuller, ], and using distributional assumptions [pal, ]. see fuller [ ] for a detailed introduction to this vast field. our solution was to use sample moments of marginal posterior moments which, to our knowledge, has never been proposed before.

it is natural to ask if our methods could be used to account for uncertain genotypes in genome-wide association studies. however, the moment-based techniques we used in this manuscript, when applied to simple linear regression with an additive effects model (where the snp effect is proportional to the dosage), result in the standard ordinary least squares estimates when using the posterior mean as a covariate (section s4).
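the runtime extrapolation for the mle approach above is simple arithmetic: the number of snp pairs times the per-pair time, converted from seconds to days. a minimal sketch follows. the function name `mle_runtime_days` is hypothetical, and the example inputs (1000 snps, 0.1 seconds per pair) are illustrative placeholders, not the values from this dataset.

```python
from math import comb


def mle_runtime_days(n_snps, seconds_per_pair):
    """Extrapolated time (in days) to estimate LD for every pair of SNPs."""
    n_pairs = comb(n_snps, 2)  # number of unordered SNP pairs
    seconds_per_day = 60 * 60 * 24
    return n_pairs * seconds_per_pair / seconds_per_day


# Hypothetical inputs for illustration only: roughly 0.58 days.
days = mle_runtime_days(1000, 0.1)
```

because the number of pairs grows quadratically in the number of snps, the per-pair cost dominates at genome-wide scales, which is why a fast per-pair estimator matters.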
this supports using the posterior mean as a covariate in simple linear regression with an additive effects model. this is not to say, however, that using the posterior mean is also appropriate for more complicated models of gene action [rosyara et al., ], or for non-linear models.

acknowledgments

most analyses were performed using the r statistical language [r core team, ].

data availability

all methods discussed in this manuscript are implemented in the ldsep r package, available on github (https://github.com/dcgerard/ldsep) under a gpl- license. scripts to reproduce the results of this research are available on github (https://github.com/dcgerard/ldfast_sims). all datasets used in this manuscript are publicly available [uitdewilligen et al., ] and may be downloaded from:

• https://doi.org/ . /journal.pone. .s
• https://doi.org/ . /journal.pone. .s
• https://doi.org/ . /journal.pone. .s
• https://doi.org/ . /journal.pone. .s

figures

[figure 1 image. x-axis: sample size; y-axis: ρ̂²; legend (method): mle, mom, naive.]

figure 1: estimate of ρ² (y-axis) for the maximum likelihood estimator [gerard, ] (mle), our new moment-based estimator ( ) (mom), and the naive squared sample correlation coefficient between posterior mean genotypes (naive). the x-axis indexes the sample size, the row-facets index the ploidy, and the column-facets index the true ρ², which is also presented by the horizontal dashed red line. these simulations were performed using a read-depth of , and major allele frequencies of . at each locus. the naive estimator presents a strong attenuation bias toward 0, particularly for higher ploidy regimes.

[figure 2 image. facets: ẑ, ∆̂′, ∆̂, ρ̂, ρ̂²; x-axis: mad of estimates; y-axis: median of standard errors; legend (n and ρ): other; n = , ρ = . .]

figure 2: median of estimated standard errors (y-axis) versus median absolute deviations (x-axis) of each of the moment-based ld estimators (facets). the line is the y = x line, and points above this line indicate that the estimated standard errors are typically larger than the true standard errors. estimated standard errors are reasonably unbiased except for ρ̂ and ρ̂² in scenarios with small sample sizes (n = ) and large levels of ld (ρ = . ) (color and shape).

[figure 3 image. x-axis: reliability ratio estimate; y-axis: count.]

figure 3: histogram of estimated reliability ratios (s69) using the data from uitdewilligen et al. [ ].

[figure 4 image. pairs plot with panels mle, mom, naive.]

figure 4: pairs plot for ρ estimates between the twenty snps from uitdewilligen et al. [ ] with the largest estimated reliability ratios when using either maximum likelihood estimation (mle) [gerard, ], our new moment-based approach ( ) (mom), or the naive approach using just posterior means (naive). the dashed line is the y = x line. the mle and the moment-based approach result in much more similar ld estimates.

supplementary material

s1 derivation of ld estimators

in this section, we derive estimators (5)–(7). we do this by assuming a normal model on the data and the genotypes. this is obviously not appropriate when using genotypes and sequencing data, but our simulations in section . were also accomplished using sequencing data and resulted in very good performance.

let $g_i = (g_{ia}, g_{ib})^\intercal$ be the genotype for individual $i$ at loci a and b. let $z_i = (z_{ia}, z_{ib})^\intercal$ be the data for individual $i$ at loci a and b. then we let

$$g_i \sim N(\mu, \Sigma), \text{ and} \tag{S1}$$

$$z_i \mid g_i \sim N(g_i, S), \text{ where} \tag{S2}$$

$$\mu = (\mu_1, \mu_2)^\intercal, \tag{S3}$$

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}, \text{ and} \tag{S4}$$

$$S = \begin{pmatrix} s_1^2 & 0 \\ 0 & s_2^2 \end{pmatrix}. \tag{S5}$$

to interpret these terms, $\mu_1/k$ and $\mu_2/k$ are the allele frequencies at each locus, $\sigma_1^2$ and $\sigma_2^2$ are the variances of the genotypes at each locus, $s_1^2$ and $s_2^2$ are the variances of the genotyping errors at each locus, and $\sigma_{12}$ is the covariance between genotypes. by elementary methods, we have the well-known result that, marginally,

$$z_i \sim N(\mu, \Sigma + S). \tag{S6}$$

we assume the user has provided posterior moments on the genotypes

$$x_{ia} = E[g_{ia} \mid z_{ia}],\quad x_{ib} = E[g_{ib} \mid z_{ib}],\quad y_{ia} = \mathrm{var}(g_{ia} \mid z_{ia}),\quad \text{and}\quad y_{ib} = \mathrm{var}(g_{ib} \mid z_{ib}). \tag{S7}$$

these posterior moments are marginal in that they only condition on either $z_{ia}$ or $z_{ib}$, but not both. thus, we assume they are well-approximated by the model

$$g_{ia} \sim N(\mu_1, \sigma_1^2) \tag{S8}$$

$$z_{ia} \mid g_{ia} \sim N(g_{ia}, s_1^2) \tag{S9}$$

$$g_{ib} \sim N(\mu_2, \sigma_2^2) \tag{S10}$$

$$z_{ib} \mid g_{ib} \sim N(g_{ib}, s_2^2). \tag{S11}$$

by standard methods, this results in

$$g_{ia} \mid z_{ia} \sim N\left[\left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{\mu_1}{\sigma_1^2} + \frac{z_{ia}}{s_1^2}\right),\ \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\right], \text{ and} \tag{S12}$$

$$g_{ib} \mid z_{ib} \sim N\left[\left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\left(\frac{\mu_2}{\sigma_2^2} + \frac{z_{ib}}{s_2^2}\right),\ \left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\right]. \tag{S13}$$

treating only $z_i$ as random from distribution (S6), we have

$$u_{xa} \approx E\left[\left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{\mu_1}{\sigma_1^2} + \frac{z_{ia}}{s_1^2}\right)\right] \tag{S14}$$

$$= \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{\mu_1}{\sigma_1^2} + \frac{E[z_{ia}]}{s_1^2}\right) \tag{S15}$$

$$= \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_1}{s_1^2}\right) \tag{S16}$$

$$= \mu_1. \tag{S17}$$

similarly,

$$u_{xb} \approx \mu_2. \tag{S18}$$

furthermore,

$$v_{xa} \approx \mathrm{var}\left[\left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{\mu_1}{\sigma_1^2} + \frac{z_{ia}}{s_1^2}\right)\right] \tag{S19}$$

$$= \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-2}\frac{1}{s_1^4}\,\mathrm{var}(z_{ia}) \tag{S20}$$

$$= \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-2}\frac{\sigma_1^2 + s_1^2}{s_1^4} \tag{S21}$$

$$= \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\frac{\sigma_1^2}{s_1^2}. \tag{S22}$$

similarly,

$$v_{xb} \approx \left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\frac{\sigma_2^2}{s_2^2}. \tag{S23}$$

now, using the posterior variances, we have

$$u_{ya} \approx \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}, \text{ and} \tag{S24}$$

$$u_{yb} \approx \left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}. \tag{S25}$$

finally, the expectation of the sample covariance of posterior means is

$$c_x \approx \mathrm{cov}\left[\left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{\mu_1}{\sigma_1^2} + \frac{z_{ia}}{s_1^2}\right),\ \left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\left(\frac{\mu_2}{\sigma_2^2} + \frac{z_{ib}}{s_2^2}\right)\right] \tag{S26}$$

$$= \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\frac{1}{s_1^2 s_2^2}\,\mathrm{cov}(z_{ia}, z_{ib}) \tag{S27}$$

$$= \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\frac{\sigma_{12}}{s_1^2 s_2^2}. \tag{S28}$$
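as a numerical sanity check on the approximations above, the sketch below plugs one set of parameter values into the limiting expressions for the sample moments under the normal working model. it then confirms that the naive correlation of posterior means is attenuated and that inflating it by the square roots of the two reliability ratios recovers the true correlation. the function names and the parameter values are hypothetical, chosen only for illustration.

```python
import math


def population_moments(sigma1_sq, sigma2_sq, sigma12, s1_sq, s2_sq):
    """Limiting values of the sample moments under the normal working model."""
    d1 = sigma1_sq + s1_sq  # marginal variance of the data at locus a
    d2 = sigma2_sq + s2_sq  # marginal variance of the data at locus b
    return {
        "v_xa": sigma1_sq ** 2 / d1,       # variance of posterior means, locus a
        "v_xb": sigma2_sq ** 2 / d2,       # variance of posterior means, locus b
        "u_ya": sigma1_sq * s1_sq / d1,    # posterior variance, locus a
        "u_yb": sigma2_sq * s2_sq / d2,    # posterior variance, locus b
        "c_x": sigma12 * sigma1_sq * sigma2_sq / (d1 * d2),
    }


def naive_and_corrected_rho(m):
    """Naive correlation of posterior means and its reliability-ratio correction."""
    rr_a = (m["u_ya"] + m["v_xa"]) / m["v_xa"]
    rr_b = (m["u_yb"] + m["v_xb"]) / m["v_xb"]
    naive = m["c_x"] / math.sqrt(m["v_xa"] * m["v_xb"])
    return naive, math.sqrt(rr_a * rr_b) * naive


# Illustrative parameter values (assumptions, not taken from the paper).
m = population_moments(sigma1_sq=1.0, sigma2_sq=2.0, sigma12=0.8,
                       s1_sq=0.5, s2_sq=1.0)
naive_rho, corrected_rho = naive_and_corrected_rho(m)
true_rho = 0.8 / math.sqrt(1.0 * 2.0)
```

with these values the naive correlation is about 0.38 while the true correlation is about 0.57, and the corrected value matches the true one. the sum of the posterior variance and the variance of posterior means at each locus also equals the genotype variance at that locus.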
using a method-of-moments approach, we now have a system of five equations and five unknowns:

$$v_{xa} = \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\frac{\sigma_1^2}{s_1^2}, \tag{S29}$$

$$v_{xb} = \left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\frac{\sigma_2^2}{s_2^2}, \tag{S30}$$

$$u_{ya} = \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}, \tag{S31}$$

$$u_{yb} = \left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}, \text{ and} \tag{S32}$$

$$c_x = \left(\frac{1}{\sigma_1^2} + \frac{1}{s_1^2}\right)^{-1}\left(\frac{1}{\sigma_2^2} + \frac{1}{s_2^2}\right)^{-1}\frac{\sigma_{12}}{s_1^2 s_2^2}. \tag{S33}$$

solving for $s_1^2$, $s_2^2$, $\sigma_1^2$, $\sigma_2^2$, and $\sigma_{12}$, we obtain:

$$\hat{s}_1^2 = \frac{u_{ya}(u_{ya} + v_{xa})}{v_{xa}} \tag{S34}$$

$$\hat{s}_2^2 = \frac{u_{yb}(u_{yb} + v_{xb})}{v_{xb}} \tag{S35}$$

$$\hat{\sigma}_1^2 = u_{ya} + v_{xa} \tag{S36}$$

$$\hat{\sigma}_2^2 = u_{yb} + v_{xb} \tag{S37}$$

$$\hat{\sigma}_{12} = \frac{u_{ya} + v_{xa}}{v_{xa}}\,\frac{u_{yb} + v_{xb}}{v_{xb}}\,c_x. \tag{S38}$$

using (S17)–(S18), we also have

$$\hat{\mu}_1 = u_{xa}, \text{ and} \tag{S39}$$

$$\hat{\mu}_2 = u_{xb}. \tag{S40}$$

the ld coefficient estimates (5)–(7) can be obtained by substituting these parameter estimates into the following equations [gerard, ]

$$\Delta = \sigma_{12}/k, \tag{S41}$$

$$\rho = \sigma_{12}/\sqrt{\sigma_1^2 \sigma_2^2}, \text{ and} \tag{S42}$$

$$\Delta' = \Delta/\Delta_m, \text{ where} \tag{S43}$$

$$\Delta_m = \begin{cases} \min\{\mu_1 \mu_2,\ (k - \mu_1)(k - \mu_2)\}/k^2 & \text{if } \Delta < 0, \text{ and} \\ \min\{\mu_1 (k - \mu_2),\ (k - \mu_1)\mu_2\}/k^2 & \text{if } \Delta > 0. \end{cases} \tag{S44}$$

s2 derivation of standard errors

let

$$m_i := (x_{ia},\ x_{ia}^2,\ x_{ib},\ x_{ib}^2,\ x_{ia} x_{ib},\ y_{ia},\ y_{ib})^\intercal. \tag{S45}$$

then, by the central limit theorem, we have for

$$\bar{m} := \frac{1}{n}\sum_{i=1}^{n} m_i, \tag{S46}$$

that $\sqrt{n}\,\bar{m}$ (after centering at its mean) is asymptotically multivariate normal with some limiting covariance, say, $\Omega$. finite variances are guaranteed by the finite support of the genotypes. we can estimate $\Omega$ with the sample covariance matrix

$$\hat{\Omega} := \frac{1}{n-1}\sum_{i=1}^{n} (m_i - \bar{m})(m_i - \bar{m})^\intercal. \tag{S47}$$

estimators (5)–(7) are approximately functions of $\bar{m}$. namely

$$\hat{\Delta} \approx \left(\frac{\bar{m}_6 + \bar{m}_2 - \bar{m}_1^2}{\bar{m}_2 - \bar{m}_1^2}\right)\left(\frac{\bar{m}_7 + \bar{m}_4 - \bar{m}_3^2}{\bar{m}_4 - \bar{m}_3^2}\right)\left(\frac{\bar{m}_5 - \bar{m}_1 \bar{m}_3}{k}\right) \tag{S48}$$

$$\hat{\rho} \approx \left(\sqrt{\frac{\bar{m}_6 + \bar{m}_2 - \bar{m}_1^2}{\bar{m}_2 - \bar{m}_1^2}}\right)\left(\sqrt{\frac{\bar{m}_7 + \bar{m}_4 - \bar{m}_3^2}{\bar{m}_4 - \bar{m}_3^2}}\right)\left(\frac{\bar{m}_5 - \bar{m}_1 \bar{m}_3}{\sqrt{(\bar{m}_2 - \bar{m}_1^2)(\bar{m}_4 - \bar{m}_3^2)}}\right) \tag{S49}$$

$$\hat{\Delta}' \approx \left(\frac{\bar{m}_6 + \bar{m}_2 - \bar{m}_1^2}{\bar{m}_2 - \bar{m}_1^2}\right)\left(\frac{\bar{m}_7 + \bar{m}_4 - \bar{m}_3^2}{\bar{m}_4 - \bar{m}_3^2}\right)\left(\frac{\bar{m}_5 - \bar{m}_1 \bar{m}_3}{k}\right)\Big/\hat{\Delta}_m, \text{ where} \tag{S50}$$

$$\hat{\Delta}_m = \begin{cases} \min\{\bar{m}_1 \bar{m}_3,\ (k - \bar{m}_1)(k - \bar{m}_3)\}/k^2 & \text{if } \bar{m}_5 - \bar{m}_1 \bar{m}_3 < 0, \text{ and} \\ \min\{\bar{m}_1 (k - \bar{m}_3),\ (k - \bar{m}_1)\bar{m}_3\}/k^2 & \text{if } \bar{m}_5 - \bar{m}_1 \bar{m}_3 > 0. \end{cases} \tag{S51}$$

these are smooth functions of $\bar{m}$ (except on a set of lebesgue measure zero), and so admit the following gradients, calculated in mathematica [wolfram research, inc., ]. to state them compactly, write $a := \bar{m}_2 - \bar{m}_1^2$, $b := \bar{m}_4 - \bar{m}_3^2$, $c := \bar{m}_5 - \bar{m}_1 \bar{m}_3$, $r_a := (\bar{m}_6 + a)/a$, $r_b := (\bar{m}_7 + b)/b$, $h_a := \sqrt{\bar{m}_6 + a}/a$, and $h_b := \sqrt{\bar{m}_7 + b}/b$, so that $\hat{\Delta} \approx r_a r_b c/k$ and $\hat{\rho} \approx h_a h_b c$. then

$$g_\Delta := \frac{d\hat{\Delta}}{d\bar{m}} = \frac{1}{k}\begin{pmatrix} r_b\left(2\bar{m}_1 \bar{m}_6 c/a^2 - \bar{m}_3 r_a\right) \\ -\bar{m}_6 r_b c/a^2 \\ r_a\left(2\bar{m}_3 \bar{m}_7 c/b^2 - \bar{m}_1 r_b\right) \\ -\bar{m}_7 r_a c/b^2 \\ r_a r_b \\ r_b c/a \\ r_a c/b \end{pmatrix}, \tag{S52}$$

$$g_\rho := \frac{d\hat{\rho}}{d\bar{m}} = \begin{pmatrix} h_b\left(\dfrac{\bar{m}_1 c\,(a + 2\bar{m}_6)}{a^2\sqrt{\bar{m}_6 + a}} - \bar{m}_3 h_a\right) \\ -\dfrac{c\,(a + 2\bar{m}_6)\,h_b}{2a^2\sqrt{\bar{m}_6 + a}} \\ h_a\left(\dfrac{\bar{m}_3 c\,(b + 2\bar{m}_7)}{b^2\sqrt{\bar{m}_7 + b}} - \bar{m}_1 h_b\right) \\ -\dfrac{c\,(b + 2\bar{m}_7)\,h_a}{2b^2\sqrt{\bar{m}_7 + b}} \\ h_a h_b \\ \dfrac{c\,h_b}{2a\sqrt{\bar{m}_6 + a}} \\ \dfrac{c\,h_a}{2b\sqrt{\bar{m}_7 + b}} \end{pmatrix}, \tag{S53}$$

and

$$g_{\Delta'} := \frac{d\hat{\Delta}'}{d\bar{m}} = g_\Delta/\hat{\Delta}_m - A, \text{ where} \tag{S54}$$

$$A = \left(\hat{\Delta}\,c_1(\bar{m}_1, \bar{m}_3, \bar{m}_5)/\hat{\Delta}_m^2,\ 0,\ \hat{\Delta}\,c_3(\bar{m}_1, \bar{m}_3, \bar{m}_5)/\hat{\Delta}_m^2,\ 0,\ 0,\ 0,\ 0\right)^\intercal \tag{S55}$$

$$c_1(\bar{m}_1, \bar{m}_3, \bar{m}_5) = \begin{cases} \bar{m}_3/k^2 & \text{if } \bar{m}_5 < \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 \bar{m}_3 < (k - \bar{m}_1)(k - \bar{m}_3) \\ -(k - \bar{m}_3)/k^2 & \text{if } \bar{m}_5 < \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 \bar{m}_3 > (k - \bar{m}_1)(k - \bar{m}_3) \\ -\bar{m}_3/k^2 & \text{if } \bar{m}_5 > \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 (k - \bar{m}_3) > (k - \bar{m}_1)\bar{m}_3 \\ (k - \bar{m}_3)/k^2 & \text{if } \bar{m}_5 > \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 (k - \bar{m}_3) < (k - \bar{m}_1)\bar{m}_3 \end{cases} \tag{S56}$$

$$c_3(\bar{m}_1, \bar{m}_3, \bar{m}_5) = \begin{cases} \bar{m}_1/k^2 & \text{if } \bar{m}_5 < \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 \bar{m}_3 < (k - \bar{m}_1)(k - \bar{m}_3) \\ -(k - \bar{m}_1)/k^2 & \text{if } \bar{m}_5 < \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 \bar{m}_3 > (k - \bar{m}_1)(k - \bar{m}_3) \\ (k - \bar{m}_1)/k^2 & \text{if } \bar{m}_5 > \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 (k - \bar{m}_3) > (k - \bar{m}_1)\bar{m}_3 \\ -\bar{m}_1/k^2 & \text{if } \bar{m}_5 > \bar{m}_1 \bar{m}_3 \text{ and } \bar{m}_1 (k - \bar{m}_3) < (k - \bar{m}_1)\bar{m}_3 \end{cases} \tag{S57}$$

though these gradients are rather complicated, they are not computationally intensive and may be calculated in constant time in the sample size. the asymptotic variances of $\hat{\Delta}$, $\hat{\rho}$, and $\hat{\Delta}'$ are

$$\frac{1}{n} g_\Delta^\intercal \hat{\Omega} g_\Delta, \quad \frac{1}{n} g_\rho^\intercal \hat{\Omega} g_\rho, \quad \text{and} \quad \frac{1}{n} g_{\Delta'}^\intercal \hat{\Omega} g_{\Delta'}, \tag{S58}$$

respectively.

to accommodate missing data, we use only pairwise complete observations for the sample covariance matrix (S47). this ensures that $\hat{\Omega}$ is positive definite and, thus, the resulting standard errors are non-negative. however, we use all non-missing observations for $\bar{m}$. that is, let $\Theta_a, \Theta_b \subseteq \{1, 2, \ldots, n\}$ be the index sets of non-missing values at loci a and b, respectively. then

$$\bar{m}_1 = \frac{1}{|\Theta_a|}\sum_{i \in \Theta_a} x_{ia} \tag{S59}$$

$$\bar{m}_2 = \frac{1}{|\Theta_a|}\sum_{i \in \Theta_a} x_{ia}^2 \tag{S60}$$

$$\bar{m}_3 = \frac{1}{|\Theta_b|}\sum_{i \in \Theta_b} x_{ib} \tag{S61}$$

$$\bar{m}_4 = \frac{1}{|\Theta_b|}\sum_{i \in \Theta_b} x_{ib}^2 \tag{S62}$$

$$\bar{m}_5 = \frac{1}{|\Theta_a \cap \Theta_b|}\sum_{i \in \Theta_a \cap \Theta_b} x_{ia} x_{ib} \tag{S63}$$

$$\bar{m}_6 = \frac{1}{|\Theta_a|}\sum_{i \in \Theta_a} y_{ia} \tag{S64}$$

$$\bar{m}_7 = \frac{1}{|\Theta_b|}\sum_{i \in \Theta_b} y_{ib} \tag{S65}$$

$$\bar{m}^* = \frac{1}{|\Theta_a \cap \Theta_b|}\sum_{i \in \Theta_a \cap \Theta_b} m_i \tag{S66}$$

$$\hat{\Omega} = \frac{1}{|\Theta_a \cap \Theta_b| - 1}\sum_{i \in \Theta_a \cap \Theta_b} (m_i - \bar{m}^*)(m_i - \bar{m}^*)^\intercal \tag{S67}$$

the asymptotic variances of $\hat{\Delta}$, $\hat{\rho}$, and $\hat{\Delta}'$ are then

$$\frac{1}{|\Theta_a \cap \Theta_b|} g_\Delta^\intercal \hat{\Omega} g_\Delta, \quad \frac{1}{|\Theta_a \cap \Theta_b|} g_\rho^\intercal \hat{\Omega} g_\rho, \quad \text{and} \quad \frac{1}{|\Theta_a \cap \Theta_b|} g_{\Delta'}^\intercal \hat{\Omega} g_{\Delta'}, \tag{S68}$$

respectively.

s3 adjusting the reliability ratios

s3.1 adaptive shrinkage on the reliability ratios

each snp has an estimated reliability ratio,

$$b_j := \frac{u_{yj} + v_{xj}}{v_{xj}}, \tag{S69}$$

which corresponds to the multiplicative adjustment to all ld estimates that include that snp (see ( )). these reliability ratios might have high variance due to (i) lower sequencing depth or (ii) containing fewer individuals with non-missing data. thus, some reliability ratios may be noisy.

hierarchical shrinkage is a statistical technique that allows high-variance observations to borrow strength from low-variance observations and thus improve estimation performance. adaptive shrinkage (ash) [stephens, ] is a recently proposed general-purpose hierarchical shrinkage technique that we can use to model the distribution of reliability ratios flexibly, only constraining them to be unimodal. in this section, we will use ash to improve our reliability ratio estimates.

we will now describe the procedure for applying ash to shrink the reliability ratios. our strategy will be to derive the standard errors for the log of the reliability ratios (S69) and apply ash on the log-scale using these standard errors.
to begin, let $x_{ij}$ be the posterior mean for individual $i$ at snp $j$. let $y_{ij}$ be the posterior variance for individual $i$ at snp $j$. finally, let

$$m_{ij} = (x_{ij},\ x_{ij}^2,\ y_{ij})^\intercal, \tag{S70}$$

$$\bar{m}_j = \frac{1}{n}\sum_{i=1}^{n} m_{ij}, \text{ so} \tag{S71}$$

$$\bar{m}_{j1} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \tag{S72}$$

$$\bar{m}_{j2} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2, \text{ and} \tag{S73}$$

$$\bar{m}_{j3} = \frac{1}{n}\sum_{i=1}^{n} y_{ij}. \tag{S74}$$

then the log of the reliability ratio for snp $j$ is

$$l_j := \log\left(\frac{\bar{m}_{j3} + \bar{m}_{j2} - \bar{m}_{j1}^2}{\bar{m}_{j2} - \bar{m}_{j1}^2}\right) \tag{S75}$$

$$= \log(\bar{m}_{j3} + \bar{m}_{j2} - \bar{m}_{j1}^2) - \log(\bar{m}_{j2} - \bar{m}_{j1}^2). \tag{S76}$$

let the sample covariance be

$$\hat{\Omega}_j := \frac{1}{n-1}\sum_{i=1}^{n} (m_{ij} - \bar{m}_j)(m_{ij} - \bar{m}_j)^\intercal. \tag{S77}$$

then we have by the central limit theorem that $\sqrt{n}\,\bar{m}_j$ (after centering at its mean) is asymptotically multivariate normal, and we can use $\hat{\Omega}_j$ as the estimate of the covariance matrix. the gradients for (S76) are

$$g_{j1} := \frac{dl_j}{d\bar{m}_{j1}} = \frac{-2\bar{m}_{j1}}{\bar{m}_{j3} + \bar{m}_{j2} - \bar{m}_{j1}^2} + \frac{2\bar{m}_{j1}}{\bar{m}_{j2} - \bar{m}_{j1}^2} \tag{S78}$$

$$g_{j2} := \frac{dl_j}{d\bar{m}_{j2}} = \frac{1}{\bar{m}_{j3} + \bar{m}_{j2} - \bar{m}_{j1}^2} - \frac{1}{\bar{m}_{j2} - \bar{m}_{j1}^2} \tag{S79}$$

$$g_{j3} := \frac{dl_j}{d\bar{m}_{j3}} = \frac{1}{\bar{m}_{j3} + \bar{m}_{j2} - \bar{m}_{j1}^2} \tag{S80}$$

then, with $g_j := (g_{j1}, g_{j2}, g_{j3})^\intercal$, the variance for $l_j$ is

$$\hat{s}_j^2 := \frac{1}{n} g_j^\intercal \hat{\Omega}_j g_j. \tag{S81}$$

we apply ash to $(l_1, \hat{s}_1), \ldots, (l_m, \hat{s}_m)$ to obtain shrunken log reliability ratios $\hat{l}_1, \ldots, \hat{l}_m$. because ash’s grid-based scheme for estimating the mode is not the most computationally efficient, we used the half-sample mode estimator of robertson and cryer [ ] to estimate the mode prior to running ash.

this procedure seems to result in improved performance for snps with unusually variable reliability ratios (figure s1).

s3.2 thresholding the reliability ratios

if a researcher accidentally provides a monoallelic snp, its reliability ratio could explode due to having a denominator close to zero in (S69).
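the delta-method standard error for the log reliability ratio derived in section s3.1 can be sketched as follows for a single snp with no missing data. this is a minimal illustration under those assumptions, not the ldsep code. the function name `log_reliability_ratio_se` is hypothetical, and the shrinkage step with ash is not shown.

```python
import math


def log_reliability_ratio_se(x, y):
    """Log reliability ratio of one SNP and its delta-method standard error.

    x: posterior mean genotypes (one per individual)
    y: posterior genotype variances (one per individual)
    """
    n = len(x)
    # Per-individual vectors (x, x^2, y) and their sample mean.
    rows = [(xi, xi * xi, yi) for xi, yi in zip(x, y)]
    mbar = [sum(r[j] for r in rows) / n for j in range(3)]
    # 3 x 3 sample covariance matrix of the per-individual vectors.
    omega = [
        [sum((r[p] - mbar[p]) * (r[q] - mbar[q]) for r in rows) / (n - 1)
         for q in range(3)]
        for p in range(3)
    ]
    num = mbar[2] + mbar[1] - mbar[0] ** 2  # posterior variance + variance of means
    den = mbar[1] - mbar[0] ** 2            # variance of posterior means
    log_rr = math.log(num) - math.log(den)
    # Gradient of the log reliability ratio with respect to the three means.
    grad = [
        -2 * mbar[0] / num + 2 * mbar[0] / den,
        1 / num - 1 / den,
        1 / num,
    ]
    var = sum(grad[p] * omega[p][q] * grad[q]
              for p in range(3) for q in range(3)) / n
    return log_rr, math.sqrt(max(var, 0.0))


# Toy example with five individuals.
log_rr, se = log_reliability_ratio_se([0, 1, 2, 3, 4], [0.5] * 5)
```

the denominator `den` is the variance of the posterior means, so a nearly monoallelic snp drives it toward zero and the ratio (and its log) toward very large values, which is the instability that thresholding guards against.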
for example, the right panel of figure s2 contains a monoallelic snp (potvar ) whose reliability ratio estimate (S69) is . . this can provide unstable estimates of ld as some snps will, due to sampling variability, have correlations with these monoallelic snps on the order of . . for example, the sample correlation between posterior means of potvar and potvar (left facet of figure s2) is - . . but due to the extreme reliability ratio of potvar , the genotype-error adjusted correlation estimate is - . this is, of course, unsettling. so by default, our software will take all reliability ratio estimates (S69) above a user-provided value (default of ) and set them to the median reliability ratio in the dataset.

s4 genome-wide association studies

in this section, we demonstrate that the techniques used in section s1, when applied to simple linear regression with an additive effects model [rosyara et al., ], result in the standard ordinary least squares estimate when using the posterior mean as a covariate. this indicates that for genome-wide association studies, using the posterior mean is appropriate in a linear regression context when using an additive model for gene action.

let $g_i$ be the genotype for individual $i$ at a locus. let $z_i$ be the data that lead to the genotyping for individual $i$ at the same locus. let $w_i$ be some quantitative trait of interest for individual $i$. then we let

$$w_i \mid g_i \sim N(\beta_0 + \beta_1 g_i, \sigma^2) \tag{S82}$$

$$z_i \mid g_i \sim N(g_i, s^2) \tag{S83}$$

$$g_i \sim N(\mu, \tau^2). \tag{S84}$$

we suppose the user is only provided the posterior means and variances of each $g_i \mid z_i$. let $x_i = E[g_i \mid z_i]$ and $y_i = \mathrm{var}(g_i \mid z_i)$. from elementary methods, we have

$$z_i \sim N(\mu, s^2 + \tau^2) \tag{S85}$$

$$g_i \mid z_i \sim N\left[\left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\left(\frac{\mu}{\tau^2} + \frac{z_i}{s^2}\right),\ \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\right]. \tag{S86}$$

let

$$u_w = \frac{1}{n}\sum_{i=1}^{n} w_i \tag{S87}$$

$$u_x = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{S88}$$

$$c_{wx} = \frac{1}{n-1}\sum_{i=1}^{n} (w_i - u_w)(x_i - u_x) \tag{S89}$$

$$v_x = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - u_x)^2 \tag{S90}$$

$$v_w = \frac{1}{n-1}\sum_{i=1}^{n} (w_i - u_w)^2. \tag{S91}$$

we have that

$$c_{wx} \approx \mathrm{cov}(w_i, x_i) \tag{S92}$$

$$\approx \mathrm{cov}\left(w_i,\ \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\left(\frac{\mu}{\tau^2} + \frac{z_i}{s^2}\right)\right) \tag{S93}$$

$$= \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{1}{s^2}\,\mathrm{cov}(w_i, z_i) \tag{S94}$$

$$= \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{1}{s^2}\,\beta_1 \mathrm{var}(g_i) \tag{S95}$$

$$= \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{\tau^2}{s^2}\,\beta_1. \tag{S96}$$

we also have from (S19)–(S22) that

$$v_x \approx \left(\frac{1}{\tau^2} + \frac{1}{s^2}\right)^{-1}\frac{\tau^2}{s^2}. \tag{S97}$$

using method of moments with equations (S96) and (S97), we have the following estimator for $\beta_1$

$$\hat{\beta}_1 = c_{wx}/v_x \tag{S98}$$

$$= \frac{c_{wx}}{\sqrt{v_x v_w}}\,\frac{\sqrt{v_w}}{\sqrt{v_x}}. \tag{S99}$$

equation (S99) is the sample correlation between the $w_i$’s and the $x_i$’s ($c_{wx}/\sqrt{v_x v_w}$) multiplied by the ratio of the sample standard deviations of the $w_i$’s and the $x_i$’s ($\sqrt{v_w}/\sqrt{v_x}$). this is the well-known formula for the ordinary least squares estimate of $\beta_1$ from a regression of $w_i$ on $x_i$.

s5 supplementary figures

[figure s1 image. panel (a): log of reliability ratio (x-axis) versus estimated standard error (y-axis); panels (b) and (c): alternative counts (x-axis) versus reference counts (y-axis); panel (d): raw reliability ratio (x-axis) versus shrunken reliability ratio (y-axis).]

figure s1: (a) the log of the reliability ratios (x-axis) versus their estimated standard errors (y-axis). the two highlighted points do not seem to fit the trend.
when we plot the read-counts for these highlighted points ((b) and (c)), we notice that these two snps are almost monoallelic, providing doubts on their unusually large reliability ratios. we plot the shrunken reliability ratios (y-axis) against their original values (x-axis) in (d), noting that the problem snps (color) have their reliability ratios highly adjusted. .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / potvar potvar alternative counts r e fe re n ce c o u n ts figure s : plots of read-counts of two snps (facets) from uitdewilligen et al. [ ]. alternative counts lie on the x-axis and reference counts lie on the y-axis. the right snp is monoallelic and because of this the estimated correlation between the two snps using raw reliability ratios is - , even though the sample correlation between posterior means is only - . . .cc-by-nc-nd . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . http://creativecommons.org/licenses/by-nc-nd/ . / references a. brown. sample sizes required to detect linkage disequilibrium between two or three loci. theoreti- cal population biology, ( ): – , . issn - . doi: . / - ( ) - . r. l. carter and w. a. fuller. instrumental variable estimation of the simple errors-in- variables model. journal of the american statistical association, ( ): – , . doi: . / . . . j. s. degracie and w. a. fuller. 
estimation of the slope and analysis of covariance when the concomitant variable is measured with error. journal of the american statistical association, ( ): – , . doi: . / . . .

k. k. dey and m. stephens. corshrink: empirical bayes shrinkage estimation of correlations, with applications. biorxiv, . doi: . / .

e. a. fox, a. e. wright, m. fumagalli, and f. g. vieira. ngsld: evaluating linkage disequilibrium using genotype likelihoods. bioinformatics, ( ): – , . issn - . doi: . /bioinformatics/btz .

w. a. fuller. measurement error models. john wiley & sons, .

d. gerard. pairwise linkage disequilibrium estimation for polyploids. biorxiv, . doi: . / . . . .

d. gerard and l. f. v. ferrão. priors for genotyping polyploids. bioinformatics, ( ): – , . issn - . doi: . /bioinformatics/btz . biorxiv: .

d. gerard, l. f. v. ferrão, a. a. f. garcia, and m. stephens. genotyping polyploids from messy sequencing data. genetics, ( ): – , . issn - . doi: . /genetics. . .

w. hill and a. robertson. linkage disequilibrium in finite populations. theoretical and applied genetics, ( ): – , . doi: . /bf .

t. c. koopmans. linear regression analysis of economic time series, volume . de erven f. bohn nv, .

r. lewontin. the interaction of selection and linkage. i. general considerations; heterotic models. genetics, ( ): , . url https://www.genetics.org/content/ / / .

r. c. lewontin and k.-i. kojima. the evolutionary dynamics of complex polymorphisms. evolution, ( ): – , . doi: . /j. - . .tb .x.

t. maruki and m. lynch. genome-wide estimation of linkage disequilibrium from population-level high-throughput sequencing data. genetics, ( ): – , . issn - . doi: . /genetics. . .

m. pal. consistent moment estimators of regression coefficients in the presence of errors in variables. journal of econometrics, ( ): – , . issn - . doi: . / - ( ) - .

r core team. r: a language and environment for statistical computing. r foundation for statistical computing, vienna, austria, .
url https://www.r-project.org/.

t. robertson and j. d. cryer. an iterative procedure for estimating the mode. journal of the american statistical association, ( ): – , . doi: . / . . .

u. r. rosyara, w. s. de jong, d. s. douches, and j. b. endelman. software for genome-wide association studies in autopolyploids and its application to potato. the plant genome, ( ), . doi: . /plantgenome . . .

m. slatkin. linkage disequilibrium-understanding the evolutionary past and mapping the medical future. nature reviews genetics, ( ): , . doi: . /nrg .

c. spearman. the proof and measurement of association between two things. the american journal of psychology, ( ): – , . doi: . / .

m. stephens. false discovery rates: a new deal. biostatistics, ( ): – , . issn - . doi: . /biostatistics/kxw .

j. a. sved and w. g. hill. one hundred years of linkage disequilibrium. genetics, ( ): – , . issn - . doi: . /genetics. . .

j. g. a. m. l.
uitdewilligen, a.-m. a. wolters, b. b. d’hoop, t. j. a. borm, r. g. f. visser, and h. j. van eck. a next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato. plos one, ( ): – , . doi: . /journal.pone. .

x. wen and m. stephens. using linear predictors to impute allele frequencies from summary or pooled genotype data. the annals of applied statistics, ( ): – , . issn - . doi: . / -aoas .

y. c. j. wientjes, r. f. veerkamp, and m. p. l. calus. the effect of linkage disequilibrium and family relationships on the reliability of genomic prediction. genetics, ( ): – , . issn - . doi: . /genetics. . .

wolfram research, inc. mathematica, version . , . url https://www.wolfram.com/mathematica. champaign, il.

x. zhu and m. stephens. large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across human phenotypes. nature communications, ( ): – , . doi: . /s - - -x.
struo : efficient metagenome profiling database construction for ever-expanding microbial genome datasets

nicholas d. youngblut* , , ruth e. ley

department of microbiome science, max planck institute for developmental biology, max planck ring , tübingen, germany

* corresponding author: nicholas youngblut (nicholas.youngblut@tuebingen.mpg.de)

running title: struo builds databases faster

key words: metagenome, database, profiling, gtdb

biorxiv preprint; this version posted february , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license (http://creativecommons.org/licenses/by/ . /).

abstract

mapping metagenome reads to reference databases is the standard approach for assessing microbial taxonomic and functional diversity from metagenomic data. however, public reference databases often lack recently generated genomic data such as metagenome-assembled genomes (mags), which can limit the sensitivity of read-mapping approaches. we previously developed the struo pipeline in order to provide a straightforward method for constructing custom databases; however, the pipeline does not scale well with the ever-increasing number of publicly available microbial genomes. moreover, the pipeline does not allow for efficient database updating as new data are generated. to address these issues, we developed struo , which is > .
-fold faster than struo at database generation and can also efficiently update existing databases. we also provide custom kraken , bracken, and humann databases that can be easily updated with new genomes and/or individual gene sequences. struo enables feasible database generation for continually increasing large-scale genomic datasets.

availability:
● struo : https://github.com/leylabmpi/struo
● pre-built databases: http://ftp.tue.mpg.de/ebio/projects/struo /
● utility tools: https://github.com/nick-youngblut/gtdb_to_taxdump

results

metagenome profiling involves mapping reads to reference sequence databases and is the standard approach for assessing microbial community taxonomic and functional composition via metagenomic sequencing. most metagenome profiling software includes “standard” reference databases. for instance, the popular humannn pipeline includes multiple databases for assessing both taxonomy and function from read data (franzosa et al., ). similarly, kraken includes a set of standard databases for taxonomic classification of specific clades (e.g., fungi or plants) or all taxa (wood et al., ). while such standard reference databases provide a crucial resource for metagenomic data analysis, they may not be optimal for the needs of researchers. for example, a custom database that includes newly generated mags can increase the percentage of reads mapped to references (youngblut et al., ). the process of making custom reference databases is often complicated and requires substantial computational resources, which led us to create struo for straightforward custom metagenome profiling database generation (de la cuesta-zuluaga et al., ). however, struo requires ~ . cpu hours per genome, which would necessitate > , cpu hours (> . years) if including one genome for each of the , species in release of the genome taxonomy database (gtdb) (parks et al., ).

struo generates kraken and bracken databases similarly to struo (lu et al. , ; wood et al.
, ) , but the algorithms diverge substantially for the time-consuming step of gene annotation required for humann database construction. struo performs gene annotation by clustering all gene sequences of all genomes using the mmseqs linclust algorithm, and then each gene cluster representative is annotated via mmseqs search (figure a; supplemental methods) (steinegger and söding, , ). in contrast, struo annotates all non-redundant genes of each genome with diamond (buchfink et al., ). struo utilizes snakemake and conda, which allows for easy installation of all dependencies and simplified scaling to high performance computing systems (köster and rahmann, ).

benchmarking on genome subsets from the gtdb showed that struo requires ~ . cpu hours per genome versus ~ . for struo (figure b). notably, struo annotates slightly more genes than struo, possibly due to the sensitivity of the mmseqs search iterative search algorithm (figure c). the use of mmseqs allows for efficient database updating of new genomes and/or individual gene sequences via mmseqs clusterupdate (figure s ); we show that this approach saves - % of the cpu hours relative to generating a database from scratch (figure d).

we used struo to create publicly available kraken , bracken, and humann custom databases from release of the gtdb (see supplemental methods). we will continue to publish these custom databases as new gtdb versions are released. the databases are available at http://ftp.tue.mpg.de/ebio/projects/struo / .
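the cluster-then-annotate strategy described above can be illustrated with a minimal python sketch. it is not the pipeline's implementation: the function `annotate_by_cluster`, the `gene_to_cluster` mapping, and the `annotate` callback are hypothetical names introduced here, and the step that copies each representative's annotation to the other members of its cluster is an assumption made for illustration. the point is only that the expensive annotation call runs once per cluster representative rather than once per gene.

```python
from typing import Callable, Dict, Tuple


def annotate_by_cluster(
    gene_to_cluster: Dict[str, str],
    annotate: Callable[[str], str],
) -> Tuple[Dict[str, str], int]:
    """Annotate one representative per cluster and reuse the result.

    gene_to_cluster maps each gene id to the id of its cluster
    representative (hypothetical input; a clustering tool would
    produce an equivalent table). annotate is a stand-in for an
    expensive sequence search.
    Returns (annotation per gene, number of annotate() calls made).
    """
    rep_annotation: Dict[str, str] = {}
    # Run the expensive step once per distinct representative.
    for rep in sorted(set(gene_to_cluster.values())):
        rep_annotation[rep] = annotate(rep)
    # Assumption: every cluster member inherits its representative's label.
    gene_annotation = {
        gene: rep_annotation[rep] for gene, rep in gene_to_cluster.items()
    }
    return gene_annotation, len(rep_annotation)


if __name__ == "__main__":
    clusters = {"g1": "g1", "g2": "g1", "g3": "g3", "g4": "g3", "g5": "g3"}
    labels, n_calls = annotate_by_cluster(clusters, lambda rep: "family_" + rep)
    print(n_calls, "annotation calls for", len(clusters), "genes")
    print(labels)
```

with five genes in two clusters, the sketch makes two annotation calls instead of five; the saving grows with the redundancy among the input genomes.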
we also created a set of utility tools for generating ncbi taxdump files from the gtdb taxonomy and mapping between the ncbi and gtdb taxonomies. the taxdump files are utilized by struo , but these tools can be used more generally to integrate the gtdb taxonomy into existing pipelines designed for the ncbi taxonomy (available at https://github.com/nick-youngblut/gtdb_to_taxdump ).

figure . struo can build databases faster than struo and can efficiently update the databases. a) a general outline of the struo database creation algorithm. cylinders are input or output files, squares are processes, and right-tilted rhomboids are intermediate files. the largest change from struo is the utilization of mmseqs for clustering and annotation of genes. b) benchmarking the amount of cpu hours required for struo and struo , depending on the number of input genomes. c) the number of genes annotated with a uniref identifier. d) the percent of cpu hours saved via the struo database updating algorithm versus de novo database generation. the original database was constructed from genomes. for b) and d), the grey regions represent % confidence intervals.

data availability

struo is available at https://github.com/leylabmpi/struo , the pre-built databases can be found at http://ftp.tue.mpg.de/ebio/projects/struo / , and utility tools are located at https://github.com/nick-youngblut/gtdb_to_taxdump .

acknowledgements

this study was supported by the max planck society.
we thank albane ruaud, liam fitzstevens, jacobo de la cuesta-zuluaga, and jillian waters for providing helpful comments on an earlier version of this manuscript.

references

buchfink,b. et al. ( ) fast and sensitive protein alignment using diamond. nat. methods , , – .

de la cuesta-zuluaga,j. et al. ( ) struo: a pipeline for building custom databases for common metagenome profilers. bioinformatics , , – .

franzosa,e.a. et al. ( ) species-level functional profiling of metagenomes and metatranscriptomes. nat. methods , , – .

köster,j. and rahmann,s. ( ) snakemake--a scalable bioinformatics workflow engine. bioinformatics , , – .

lu,j. et al. ( ) bracken: estimating species abundance in metagenomics data. peerj comput. sci. , , e .

parks,d.h. et al. ( ) a standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. nat. biotechnol. , , – .

steinegger,m. and söding,j. ( ) clustering huge protein sequence sets in linear time. nat. commun. , , .

steinegger,m. and söding,j. ( ) mmseqs enables sensitive protein sequence searching for the analysis of massive data sets. nat. biotechnol. , , – .

wood,d.e. et al. ( ) improved metagenomic analysis with kraken . genome biol. , , .

youngblut,n.d. et al. ( ) large-scale metagenome assembly reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other genetic diversity. msystems , .

salts – surfr (sncrna) and lagoon (lncrna) transcriptomics suite

mohan v kasukurthi ,§, dominika houserova ,§, yulong huang , addison a.
barchie , justin t. roberts , dongqi li , bin wu ,*, jingshan huang , , ,*, and glen m borchert , ,*

school of computing, university of south alabama, mobile, al, , usa
department of pharmacology, university of south alabama, mobile, al, , usa
department of biology, university of south alabama, mobile, al, , usa
department of biochemistry and molecular genetics, university of colorado school of medicine, aurora, co, , usa
first affiliated hospital, kunming medical university, kunming, yunnan, china
qilu university of technology (shandong academy of science), jinan, shandong, china

§ the authors wish it to be known that, in their opinion, the first two authors should be regarded as joint first authors.
* the authors wish it to be known that, in their opinion, the last three authors should be regarded as joint corresponding authors.

to whom correspondence should be addressed: tel: + ; email: borchert@southalabama.edu, tel: + ; email: huang@southalabama.edu, tel: + ; email: wu.bin.kmu@qq.com

biorxiv preprint; this version posted february , . doi: https://doi.org/ . / . . . the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nd . international license (http://creativecommons.org/licenses/by-nd/ . /).

abstract

the widespread utilization of high-throughput sequencing technologies has unequivocally demonstrated that eukaryotic transcriptomes consist primarily (> %) of non-coding rna (ncrna) transcripts significantly more diverse than their protein-coding counterparts. ncrnas are typically divided into two categories based on their length.
( ) ncrnas less than nucleotides (nt) long are referred to as small non-coding rnas (sncrnas) and include micrornas (mirnas), piwi-interacting rnas (pirnas), small nucleolar rnas (snornas), transfer ribonucleic rnas (trnas), etc., and the majority of these are thought to function primarily in controlling gene expression. that said, the full repertoire of sncrnas remains fairly poorly defined, as evidenced by two entirely new classes of sncrnas only recently being reported, i.e., snorna-derived rnas (sdrnas) and trna-derived fragments (trfs). ( ) ncrnas longer than nt are known as long ncrnas (lncrnas). lncrnas represent the second-largest transcriptional output of the cell (behind only ribosomal rnas), and although functional roles for several lncrnas have been reported, most lncrnas remain largely uncharacterized due to a lack of predictive tools aimed at guiding functional characterizations. importantly, whereas the cost of high-throughput transcriptome sequencing is now feasible for most active research programs, tools necessary for the interpretation of these sequencings typically require significant computational expertise and resources, markedly hindering widespread utilization of these datasets. in light of this, we have developed a powerful new ncrna transcriptomics suite, salts, which is highly accurate, markedly efficient, and extremely user-friendly. salts stands for surfr (sncrna) and lagoon (lncrna) transcriptomics suite and offers platforms for comprehensive sncrna and lncrna profiling and discovery, ncrna functional prediction, and the identification of significant differential expressions among datasets. notably, salts is accessed through an intuitive web-based interface, can be used to analyze either user-generated, standard next-generation sequencing (ngs) output file uploads (e.g., fastq) or existing ncbi sequence read archive (sra) data, and requires absolutely no dataset pre-processing or knowledge of library adapters/oligonucleotides.
salts constitutes the first publicly available, web-based, comprehensive ncrna transcriptomic ngs analysis platform designed specifically for users with no computational background, providing a much-needed, powerful new resource capable of enabling more widespread ncrna transcriptomic analyses. the salts webserver is freely available online at http://salts.soc.southalabama.edu.

general introduction

cellular metabolism and survival are greatly dependent on how quickly and efficiently the cell can respond to internal and external stimuli. this process often requires tightly orchestrated genome-wide changes in gene expression. with rapid technological advancements in both genomics and transcriptomics, particularly the development of robust deep sequencing, it is ever more apparent that many regulatory non-coding rnas (ncrnas) that help coordinate gene expression changes remain elusive and that the networks they form are far more complex than previously thought( ). as many of these are dynamic and their presence or absence is highly conditional (i.e., environmental stress, disease, tissue type, etc.), their identification poses a challenge and many remain undescribed( ). as such, we have developed a set of guidelines and parameters to help confidently identify and characterize these molecules.
importantly, by implementing alternative strategies for next-generation sequencing (ngs) analysis based on examining conditional changes in expression and/or fragmentation patterns from individual genomic loci rather than depending on pre-existing annotations, we find previously elusive ncrnas can now be readily identified via our platform. in addition to this, we have also developed an array of downstream analyses to more fully characterize identified ncrnas and predict their functional roles (e.g., molecular targets). to date several platforms aimed at either small non-coding rna (sncrna) or long non-coding rna (lncrna) characterization have been developed( ). although each of these existing platforms possesses some unique advantages, each also carries its own critical limitations (detailed herein). that said, to our knowledge, salts is the first-ever resource designed to determine ncrna expressions in both short ncrna-seq and standard rna-seq datasets and to provide functional predictions for ncrnas identified in either. perhaps most importantly, however, in addition to being highly accurate and efficient, salts has been developed to require absolutely no computational background in order to enable widespread ncrna transcriptomic analysis by a much broader community of researchers. of note, a clear, step-by-step user manual for the salts platform is provided in supplemental information file .

section . salts tool for small non-coding rna analysis: surfr

ncrnas less than nucleotides (nt) in length are referred to as small non-coding rnas (sncrnas) and include micrornas (mirnas), piwi-interacting rnas (pirnas), small nucleolar rnas (snornas), transfer ribonucleic rnas (trnas), etc.( ). one striking example of the regulatory capabilities of sncrnas comes from a group of small yet potent rnas called micrornas (mirnas).
mirnas are ~ nt rnas excised from longer pre-mirna hairpins that function through associating with the rna-induced silencing complex (risc) in order to bind to the ’ utrs of their target mrnas and repress their translational activities( ). in just the past two decades, thousands of mirnas have been identified and implicated in regulating cell growth, differentiation, and apoptosis( ), as well as contributing to tumorigenesis( ) and chemoresistance( ). as this group has been thoroughly examined due to its relevance to various types of cancer( ), it is now widely accepted that a single mirna is capable of altering the expression of whole cohorts of protein-coding genes( ). importantly, studies aimed at evaluating the transcriptomic changes of mirnas have revealed the existence of mirna-like fragments derived from other ncrna biotypes and suggest similar regulatory capacities may be associated with these novel sncrnas( – ). as such, we suggest that the surfr resource described herein represents an intuitive, high-throughput platform capable of revisiting old ngs datasets and identifying novel, relevant mirna-like fragments derived from other types of ncrnas that were previously overlooked.

comparably sized, mirna-like fragments excised from many other types of ncrnas have now been reported, and many of these have been shown to similarly regulate gene expressions and/or chromatin compaction (e.g., pirnas, rasirnas, rrnas, scrnas, snornas, snrnas, rnase p, trnas, y rnas, and vault rnas)( – ). that said, the expressions and functions of the vast majority of specific sncrna fragments excised from anything other
than annotated mirnas remain largely undefined, although fragments from snornas (sdrnas) and trnas (trfs) have recently begun to receive considerably more attention( , ). in , ender et al. were the first to report a small rna fragment originating from a snorna, aca ( ). despite the principal snorna function being long characterized as guiding rrna modifications, they showed that this snorna-derived rna (sdrna) was not only processed by dicer, like regular mirnas, but also capable of silencing the cdc l gene in a mirna-like manner. since then various other studies have described similar fragments arising from other snornas (reviewed in ( )) as well as from other types of ncrnas. notably, trna-derived fragments (trfs) have recently gained attention due to their differential abundance under highly specific conditions, such as developmental stage( ), stress( ), or viral infection( ). moreover, the regulatory capacity of some trfs has been observed; zhou et al., for example, showed that a fragment excised from the ’ end of trna-glu regulates bcar expression in ovarian cancer( ). it is now clear that ncrna-derived mirna-like fragments are precisely processed out of various types of ncrna transcripts, and that this processing is evolutionarily conserved across species( – ).

while an increasing body of evidence suggests specifically excised sncrna fragments from an array of ncrnas exist and are functionally relevant, there are currently no web-based, user-friendly resources that offer comprehensive sncrna fragment profiling and discovery, functional prediction, and the identification of significant differential expressions of fragments among datasets. to address this gap we present surfr.
surfr refers to our short uncharacterized rna fragment recognition tool that identifies all mirna, snorna, and trna fragments (as well as fragments from all other ncrnas annotated in ensembl) specifically excised in a given transcriptome provided as either a raw user-generated rna-seq dataset or ncbi srr file identifier. in addition, surfr can also compare individual fragment expressions among as many as distinct datasets (as well as compare the expressions of full length (non-fragmented) sncrnas).

surfr features

● identifies fragments specifically excised from all mirnas, trnas, rrnas, scarnas, scrnas, snornas, srnas, vault rnas, and any other ncrnas annotated in the current ensembl assembly( ) in individual small rna-seq datasets.
● ten files can be processed at once, and up to individual files can then be compared after processing for ncrna fragment differential expression analysis.
● surfr can also determine and compare the expressions of all full length (non-fragmented) sncrnas in a given transcriptome.
● surfr results are stored on the server indefinitely, protected by powerful state-of-the-art cryptographic algorithms, and can be instantly recalled by the user via entering their session key in the “get results” tab on the surfr home page.
● omnisearch-based mirna analysis of annotated mirnas( ).
● direct, intuitive ncrna visualization of individual ncrna fragmentation.
● easily downloadable excel files of results from a single rna-seq file and/or comparisons among files. these files can be filtered (if desired) and list clearly defined, readily understandable, pertinent data (e.g., fragment expression, host gene links, and the exact fragment sequence excised).
● contains prepopulated ncrna databases allowing the identification of ncrna fragments and/or ncrna expressions in unique animal, plant, fungal, protist, and bacterial species.
in addition, surfr rna fragment calls require considerably less processing time than previous ncrna fragment identification pipelines for two principal reasons. we have: ( ) developed a novel alignment strategy significantly faster than traditional methods (e.g., blast( )) and ( ) designed a novel method to locate the start and end positions of an ncrna fragment using wavelets. full details of these novel computational methodologies are described at length in supplemental information file .

surfr workflow

figure . surfr workflow. sequence input (left). the user provides up to ten unmodified small rna-seq datasets as input. these datasets can all be uploaded directly by the user or downloaded from the ncbi sra database by entering sra ids. sncrna fragment analysis (middle). surfr identifies all ncrna fragments (both annotated and novel) and their expressions in up to ten datasets per session. sncrna fragment visualization (top right). graphics of individual host ncrnas and the fragments excised (along with the expressions at each nt position) are provided. in addition, tables comparing the expressions of all fragments within individual datasets and comparing fragment expressions across all datasets are generated. surfr cross section comparison (bottom right). the user can comprehensively compare all fragment expressions identified in up to individual datasets by entering multiple surfr session ids from separate analyses.

surfr input

under “use surfr”, the user first selects the organism corresponding to the sequences.
surfr small rna databases have been prepopulated for species including metazoans, plants, and other fungi, protists, and bacteria. as indicated in figure , the user then provides one to ten small rna sequencing datasets as input. these datasets can be all uploaded directly by the user, or all downloaded from the ncbi sra database( ) by entering sra ids (e.g., srr , srr ), or any combination thereof (for example, three datasets uploaded by the user along with seven datasets downloaded from the ncbi sra database). importantly, a major strength of surfr is that users can upload most raw small rna-seq files directly as original, unmodified, compressed fastq files (as provided by commercial sequencers) with absolutely no preprocessing and with no specifics about library generation, linkers, or oligonucleotides required. allowable formats for uploading are uncompressed, standard fasta or fastq files or any major compression of either.

surfr output

after the user uploads/specifies the small rna-seq datasets and clicks the “let’s surf” button, the browser is automatically redirected to a report page. progress indicators for each uploaded dataset are provided under the “click here to choose your file” drop-down menu at the top of the page (figure a), with individual datasets that have completed analysis indicated by a checkmark. following completion of analysis, results for the individual file selected are then displayed on the report page and organized into several sections (figure ).

figure . surfr report page. surfr report example.
(a) the “click here to choose your file” drop-down menu for selecting individual rna-seq files. (b) a summary of the overall composition of the selected small rna-seq dataset. (c) the “create ncr profile” button automatically populates the derived rna profile section at the bottom of the page. (d) the “derived rna fragments” window detailing each fragment identified in the individual, selected small rna-seq dataset. (e) the user can download an excel file detailing the full set of information presented in the “derived rna fragments” window by pressing the “download results” button. (f) the “differential expression vector (dev)” window illustrates each nucleotide within a host gene and indicates the fragment called with a blue rectangle. the x-axis represents the position in the ncrna selected (e.g., mir- a), and the y-axis depicts the expression level of the ncrna at each position. (g) the “selected ncrna & called rna fragment sequences” window illustrates the full length host ncrna (mir- a), highlighting the surfr-called fragment in yellow. (h) the “derived rna profile” window details each fragment identified in any of the analyzed small rna-seq datasets and compares fragment expression across samples. (i) the “omnisearch for mirnas” window lists the top omnisearch entries (reported targets and pubmed publications) for an individual mirna selected in the “derived rna profile” window. (j) the “full length ncrna expression analyses” button in the upper center of the results page redirects the user to a surfr window detailing the expression of all full length sncrnas in the provided datasets.

a summary of the overall composition of the selected small rna-seq dataset, including the file size, total number of reads, number of mapped reads, and time taken for analysis, is included just below the file selection window at the top of the page (figure b). the user can compare fragment expression across all datasets by pressing the “create ncr profile” button, which automatically populates the derived rna profile section at the bottom of the page (figure c). the “derived rna fragments” window (figure d) details the ensembl gene id, ensembl transcript id, gene annotation (name), the type of gene a fragment was excised from, the start and end positions of a fragment within its host gene, the expression of a fragment in reads per million (rpm), and the nucleotide sequence for each fragment identified in the individual, selected small rna-seq dataset. the “derived rna fragments” window is an interactive table that allows users to view, sort, and filter small rna fragments based on any column value. users can also view host gene information available at the rnacentral browser by selecting a fragment in the table and then clicking the “rnacentral” button on the toolbar( ). the user can download an excel file detailing the full set of information presented in the “derived rna fragments” window (figure d) for each fragment identified in the individual, selected small rna-seq dataset by pressing the “download results” button (figure e); the file is then downloaded automatically to the user’s computer (figure ).

figure . derived rna fragments “download results” file.
the first few rows of an example “download results” excel file detailing the full set of information presented in the “derived rna fragments” window: ensembl “gene id”, ensembl “transcript id”, gene “annotation” (name), the “type” of gene a fragment was excised from, the start and end positions of a fragment within its host gene, the expression of a fragment in reads per million (rpm), and the nucleotide “sequence” for each fragment identified in the selected small rna-seq dataset. the “differential expression vector (dev)” window (figure f) details the expressions of each nucleotide within a host gene and indicates the fragment called with a blue rectangle. the x-axis in the graph shown in figure f represents the position in the ncrna selected (mir- a), and the y-axis represents the expression levels of the ncrna at each position. the user can also interactively view the expression at each individual nucleotide by panning over the image, zoom in or out using the buttons on the top right, and/or download dev image files and an excel file detailing expression at each nucleotide by selecting the menu button on the top right of the window. the "selected ncrna & called rna fragment sequences" window (figure g) illustrates the full length host ncrna highlighting the surfr-called fragment in yellow just as depicted in the preceding dev window (figure f). the “derived rna profile” window (figure h) details the ensembl gene id, ensembl transcript id, gene annotation (name), the type of gene a fragment was excised from, the average start and end positions of a fragment within its host gene (to be considered the same fragment start and stop positions had to agree within nts.) 
with the corresponding nucleotide sequence for each “average” fragment listed, the start and end positions of a fragment within its host gene along with the fragment’s expression (rpm) in each individual small rna-seq dataset, and finally, the % standard deviation of the expression of individual fragments( ). importantly, the full list of all fragments identified in any of the datasets is presented.

example rows of the “download results” file (figure ):
gene id | transcript id | annotation | type | fragment(start-end) | expression(rpm) | sequence
ensg . | enst . | mir - | mirna | - | | tacagtactgtgataactga
ensg . | enst . | mir a | mirna | - | | tagcaccatctgaaatcggtt
ensg . | enst . | mir a | mirna | - | | acagtagtctgcacattggtt
ensg . | enst . | mir a | mirna | - | | aacccgtagatccgatcttgt
ensg . | enst . | mir a | mirna | - | | atcacattgccagggattt
ensg . | enst . | mir | mirna | - | | tttgttcgttcggctcgcgtg
ensg . | enst . | mir a | mirna | - | | tcagtgcactacagaactttg
ensg . | enst . | mir a | mirna | - | | actggacttggagtcagaagg
ensg . | enst . | mir c | mirna | - | | taatactgccgggtaatgatgg
ensg . | enst . | scarna | scarna | - | | aggtagatagaacaggtcttg
ensg . | enst . | snord d | snorna | - | | ggagagaacgcggtctgagtggt

the “derived rna profile” window is an interactive table that allows users to view, sort, and filter small rna fragments based on any column value. users can also view host gene information available at the rnacentral browser( ) by selecting a fragment in the table and then clicking the “search ncrna in rnacentral” button on the toolbar. the user can also download an excel file detailing the full set of information presented in the “derived rna profile” window by pressing the “generate report” button at the top right of the window.
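the grouping rule described for the “derived rna profile” window (fragment calls from different datasets count as the same fragment when their start and stop positions agree within a small tolerance) can be sketched in python as follows. this is only an illustration under stated assumptions: the record layout, the function name `merge_fragment_calls`, and the `tolerance_nt` parameter are hypothetical, the tolerance surfr actually applies is not reproduced here, and comparing each call against the first member of a group is one of several reasonable choices.

```python
from statistics import mean


def merge_fragment_calls(calls, tolerance_nt):
    """Group fragment calls from different datasets into shared fragments.

    Two calls are grouped when they come from the same host transcript and
    both their start and end positions lie within `tolerance_nt` of the
    first call placed in the group (an illustrative rule, not SURFR's code).
    """
    groups = []
    ordered = sorted(calls, key=lambda c: (c["transcript"], c["start"], c["end"]))
    for call in ordered:
        for group in groups:
            ref = group[0]
            if (ref["transcript"] == call["transcript"]
                    and abs(ref["start"] - call["start"]) <= tolerance_nt
                    and abs(ref["end"] - call["end"]) <= tolerance_nt):
                group.append(call)
                break
        else:
            groups.append([call])  # no existing group matched this call

    profile = []
    for group in groups:
        profile.append({
            "transcript": group[0]["transcript"],
            "avg_start": mean(c["start"] for c in group),
            "avg_end": mean(c["end"] for c in group),
            "rpm_by_dataset": {c["dataset"]: c["rpm"] for c in group},
        })
    return profile


# toy input: two datasets ("A", "B") with made-up positions and rpm values
calls = [
    {"dataset": "A", "transcript": "T1", "start": 10, "end": 31, "rpm": 120.0},
    {"dataset": "B", "transcript": "T1", "start": 11, "end": 30, "rpm": 95.0},
    {"dataset": "A", "transcript": "T1", "start": 50, "end": 70, "rpm": 8.0},
]
for row in merge_fragment_calls(calls, tolerance_nt=2):
    print(row)
```

with a tolerance of 2 nt in this toy input, the first two calls collapse into one “average” fragment and the third stays separate; a fragment absent from a dataset simply has no entry for that dataset in `rpm_by_dataset`.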
an excel file containing the derived rna profile information in its entirety will be automatically downloaded to the user’s computer (figure ). in addition, excel file reports can be downloaded following the application of specific filters in the “derived rna profile” window (e.g., only snorna fragments can be included or excluded).

figure . derived rna profile “generate report” file. the first few rows of an example “generate report” excel file detailing the full set of information presented in the “derived rna profile” window.

the “omnisearch for mirnas” window (figure i) returns the top omnisearch entries( ) (reported targets and pubmed entries) for an individual mirna selected in the preceding “derived rna profile” window. finally, when desired, the “full length ncrna expression analyses” button (figure j) redirects the user to a surfr window detailing the expression of all full length sncrnas in the provided datasets regardless of fragmentation. importantly, all pertinent features (e.g., expression table downloads) described above are similarly available for full length sncrna analyses via this resource.

surfr example use/case study

surfr allows users to profile and compare the expression of sncrna fragments (both annotated and novel) across multiple small rna-seq experiments in order to identify the top sncrna fragments significantly differentially expressed in a particular disease, tissue, developmental stage, etc. our group’s interest in fragments excised from ncrnas other than mirnas initially arose from an attempt to identify novel mirna contributors to breast cancer( ). for this work, we performed small rna sequencing on several breast cancer cell lines, and while we failed to identify any (traditional) mirnas of interest, we did identify a snorna fragment (which we termed sdrna- ) that was specifically and significantly overexpressed in mda-mb- cells, a widely studied model of a highly invasive and metastatic human cancer.
next, as we found sdrna- to be significantly overexpressed in these cells (≥ x compared to controls), we decided to determine if sdrna- functionally contributed to the malignant phenotype. stringently testing sdrna- inhibitors and mimics in mda-mb- cells across multiple time points revealed that sdrna- gain- and loss-of-function had profound effects on invasion in standard matrix-based (matrigel) chemoattractant assays. remarkably, sdrna- loss-of-function reduced cell invasion by > % at hours compared to control cells, whereas sdrna- gain-of-function enhanced cell invasion by > %. thus, we showed that a single sdrna (sdrna- ) strongly and selectively regulates invasion of mda-mb- cells. these findings link a specific sdrna (sdrna- ) to an aggressive malignant phenotype (invasion) within an established cancer cell model that is widely used to study invasive behavior. we next employed a blast-based methodology to determine sdrna- expression across small rna-seq datasets corresponding to unique breast cancer patients and detected strong overexpression of sdrna- in . % of tumors classified as luminal b her +, compared to normal tissue controls (extremely low expression) and other breast cancer subtypes (modest expression levels of - %). thus, this work represented the first evidence demonstrating that sdrnas that regulate specific malignant properties are differentially expressed within divergent molecular subtypes of human breast cancer( ).
importantly, our initial blast-based identification of sdrna- as being significantly overexpressed in mda-mb- cells was highly labor intensive, taking days to complete. in contrast, when we uploaded our original unmodified fastq sequencing files to surfr, sdrna was readily identified as the most highly differentially expressed snorna fragment between our two cancer cell lines, taking just . minutes (figure ).

figure . surfr identification of sdrna- . (a) “derived rna fragments” window showing snord derived sdrna- was identified as the second most highly expressed sdrna in the highly invasive breast cancer cell line mda-mb- . (b) alignment among the human genome (grch ch : : : ) (top), snorna- (ensg ) (middle), and next generation small rna sequence read (bottom) obtained by illumina sequencing of mda-mb- rna as originally described in( ). all sequences are in the ′ to ′ direction. an asterisk indicates base identity between the snorna and genome. vertical lines indicate identity across all three sequences. (c) “derived rna profile” window comparing small rna-seq results for mcf- and mda-mb- cells. note snord derived sdrna- was identified as the most significantly differentially expressed sdrna between the weakly and highly invasive breast cancer cell lines.

surfr comparison to other existing tools

numerous characterizations of significant regulatory roles for sncrna fragments excised from various types of ncrnas other than mirnas have now been reported( – ).
as new high-throughput small rna sequencing strategies( ) continue to make small rna-seq faster and less expensive, there is a clear need for tools capable of digesting large amounts of small rna-seq data in order to detect and characterize all small rna genes, including specifically excised small rna fragments. most existing tools focus almost exclusively on mirnas (e.g., mirdeep( ), mirspring( ), miranalyzer( ), etc.) and/or only evaluate existing sncrna annotations, and so cannot fully define small rna-seq ncrna fragment profiles and differences among these datasets (e.g., srnanalyzer( ), oasis . ( ), spar( ), etc.). that said, most existing tools capable of characterizing novel ncrna fragments and their expression, such as flaimapper( ), sports( ), and deus( ), require fairly extensive computational expertise for utilization, support only pre-aligned file inputs (bam), and/or require standalone installation (table ). as such, we have designed surfr to address the need for a user-friendly, web-based, comprehensive small rna fragment tool requiring no computational expertise to utilize. in stark contrast to most existing platforms, surfr identifies fragments excised from all types of ncrnas annotated in ensembl( ) in a given transcriptome provided as either a raw user-generated rna-seq dataset or ncbi sra file. in addition, surfr can compare individual fragment expressions among as many as distinct datasets, and we have included ncrna databases for unique animal, plant, fungal, protist, and bacterial species. importantly, there are currently no web-based, user-friendly resources that offer comprehensive sncrna fragment profiling and discovery, functional prediction, and the identification of significant differential expression among datasets comparable to surfr.
although two platforms, srna toolbox( ) and srnatools( ), do offer many of surfr’s features, surfr distinguishes itself by providing significantly more intuitive, versatile, and user-friendly results generated in less than % of the time required for data upload and processing by these tools. that said, because surfr was developed specifically for ncrna fragment identification, it does not provide expression analysis for full length ncrnas.

table . sncrna analysis platform feature comparison. various features offered by surfr were compared to other existing tools including srna toolbox( ), oasis . ( ), srnatools( ), cpss . ( ), spar( ), srnanalyzer, sports . ( ), deus( ), flaimapper( ), and featurecounts( ). features examined were: “online,” if the tool is available online; “input,” form of input rna-seq dataset, either raw (direct ngs output) or pre-processed (e.g., requires bam file); “clear, user-friendly results/output,” if interactive and user-friendly results are generated directly; “library oligo sequences req,” if user knowledge of ngs oligo sequences is required; “tcga, sra, geo, or encode input,” if publicly available rna-seq datasets can be specified for examination based on identifier alone; “known full length sncrna expressions,” detection and quantification of known sncrnas; “novel full length sncrna expressions,” detection and quantification of novel sncrnas; “novel sncrna fragment discovery,” detection and quantification of novel ncrna fragments; “differential expression,” ability of the tool to integrate expression data from multiple files (“srnade” denotes that expression analyses can be performed in parallel); and “species,” number of species available for analysis. “user” denotes that the tool has the capacity to perform the given task but requires additional user input or user-directed changes to the program’s code and/or advanced settings.

notably, as a verification of surfr’s accuracy, we recreated an analysis of ten prostate cancer small rna-seq files previously performed using flaimapper( ). importantly, flaimapper-based ncrna fragment discovery of these ten files originally identified snorna-derived fragments that were to nt in length and expressed at > rpm. similarly, surfr analysis of the same files identified snorna-derived fragments expressed at > rpm, and strikingly, of these fragments were identified nearly identically (+/- nts) by both methods. notably, we find the majority of the flaimapper-identified sdrna fragments not present in the surfr calls were excluded based on surfr’s % sequence identity requirement (in contrast to flaimapper’s nt mismatch allowance).

section . salts tool for long non-coding rna analysis: lagoon

ncrnas longer than nt in length are known as long ncrnas (lncrnas). this distinction, while somewhat arbitrary and based on technical aspects of rna isolation methods, serves to distinguish lncrnas from mirnas and other sncrnas. lncrna loci are present in large numbers in eukaryotic genomes, typically comparable to or exceeding the number of protein-coding genes. many lncrnas possess features reminiscent of protein-coding genes, such as having a ′ cap and undergoing alternative splicing( ). in fact, many lncrna genes have two or more exons( ), and about % of lncrnas have polya+ tails. in addition, although numerous long intergenic rnas (lincrnas)( ), including ernas from gene-distal enhancers, have recently been reported( ), the majority of lncrna genes identified to date are located within kb of protein-coding genes and are typically found to be antisense to coding genes or intronic( ).
that said, many lncrnas are expressed at relatively low levels in highly specific cell types( ), which both explains why the majority of lncrnas were thought to be “transcriptional noise” until quite recently and represents perhaps the single largest challenge in terms of lncrna discovery and characterization. ngs has now identified tens of thousands of lncrna loci in humans alone, with the number of lncrnas linked to human diseases quickly increasing. that said, lncrna functionality is highly contentious, and the number of experimentally characterized and/or disease-associated lncrnas remains in the low hundreds, or ≤ % of identified loci( ). this has led to a burgeoning focus on elucidating the molecular mechanisms that underlie lncrna functions( ). although only a minority of identified lncrnas have been functionally characterized, several distinct modes of action for lncrnas have now been described, including functioning as signals, decoys, scaffolds, guides, enhancer rnas, and short peptide messages( )( ). importantly, however, there are currently no web-based, user-friendly resources that offer comprehensive lncrna profiling, functional prediction, and the identification of significant differential expression among datasets. to address this gap, we present lagoon. lagoon refers to our long-noncoding and antisense gene occurrence and ontology tool, which identifies all lncrnas expressed in a given human transcriptome from either a user-provided rna-seq dataset or publicly available sra file( ). in addition, lagoon can also compare lncrna expression among datasets and predict likely functional roles for individual lncrnas.

lagoon features

• direct, intuitive visualization of significant lncrna expressions. determines the expression of all lncrnas annotated in the current ensembl assembly( ) in individual human rna-seq datasets.
• identifies differentially utilized lncrna exons.
• up to three files can be processed at once, and up to individual files can then be compared after processing for lncrna differential expression analysis.
• lagoon results are stored on the server indefinitely, protected by powerful state-of-the-art cryptographic algorithms, and can be instantly recalled by entering a previous session key in “access your results” on the lagoon home page.
• easily downloadable excel files of results profiling a single rna-seq file and/or comparisons among various files. these files can be filtered (if desired) and list clearly defined, readily understandable, and pertinent data (e.g., expression, lncrna ensembl id, etc.).
• detailed, comprehensive lncrna functional prediction detailing:
  o if a lncrna serves as a host for a sncrna( ).
  o significant potentials for a lncrna to serve as a specific mirna sponge( ).
  o all overlaps between a given lncrna and annotated enhancers( ).
  o significant potentials for lncrnas to serve as naturally occurring antisense silencers for genes located on the strand opposite to themselves( ).
  o associations between individual lncrnas and ribosomes suggesting microprotein production( ).

importantly, lagoon is the first web-based, user-friendly resource that offers real-time lncrna profiling, the identification of significant differential expression among datasets, and an array of functional prediction assessments beyond standard mrna interaction characterizations. full details of these novel computational methodologies are described at length in supplemental information file .

lagoon workflow

figure . lagoon workflow.
sequence input (left). the user provides up to two unmodified rna-seq files and one ribo-seq dataset (optional) as input. these datasets can all be uploaded directly by the user or downloaded from the ncbi sra database by entering sra ids. lncrna exon analysis (middle). lagoon enumerates all annotated lncrna expressions in up to three datasets per session. lncrna expression and functional prediction visualization (top right). an interactive table is generated comparing the expression of all exons within individual datasets and across all datasets. tables indicating putative lncrna functions are also depicted. lagoon cross section comparison (bottom right). the user can comprehensively compare all exon expressions identified in up to individual datasets by entering multiple lagoon session ids from separate analyses.

lagoon input

as summarized in figure , after selecting “start new analysis” on the lagoon homepage, the browser is redirected to the “data transfer options” page, where the user provides one or two rna sequencing datasets as input and may optionally provide an additional ribo-seq dataset for determining microprotein coding potentials. these datasets can all be uploaded directly by the user, all downloaded from the ncbi sra database( ) by entering sra ids (e.g., srr , srr ), or supplied as any combination thereof. importantly, a major strength of lagoon is that users can upload most raw rna-seq files directly as original, unmodified, compressed fastq files (as provided by commercial sequencers) with no preprocessing and with no specifics about library generation, linkers, or oligonucleotides required. there is no limit on the size of sra files, whereas individual user-uploaded files are limited to gb regardless of format; extremely large sequencing files exceeding this size can be converted to fasta format and then compressed prior to upload if necessary. allowable upload formats are uncompressed, standard fasta or fastq files or any major compression of either. in addition, the lagoon homepage provides links to: (1) “access your results,” where users can retrieve results from previous sessions by providing a session key and then compare results from up to five separate sessions; (2) “lagoon search,” where users can obtain detailed, comprehensive functional predictions for individual lncrnas; and (3) “download our databases,” where users can download databases containing all the lncrnas and/or lncrna exons employed by lagoon.

lagoon output

after the user uploads/specifies the rna-seq datasets, the browser is automatically redirected to the lagoon report page (figure ). initially, a summary of the size and composition of individual rna-seq datasets, the number of lncrnas expressed in a dataset, and the top ten most highly expressed lncrnas in the specified dataset are shown. following selection of either one or all of the rna-seq files and the ribo-seq file (if included) from the file selection toolbar (figure a), results for the file(s) selected are displayed on the report page under the “results” tab (figure b) and organized into several distinct sections.

figure . lagoon report page. lagoon report example. (a) the file selection toolbar contains drop-down menus for selecting individual rna-seq and ribo-seq files. (b) the toolbar allowing selection of either the “summary” or “results” tab. (c) the lncrna expression window displays a filterable table of all lncrna exons expressed in any of the user-provided files.
full length lncrna sequence, individual exon sequence, or ensembl lncrna gene information is obtained by selecting an exon in the table and then clicking the “lncrna sequence,” “exon sequence,” or “search lncrna in ensembl” button on the toolbar. (d) the “generate report” button creates and automatically downloads an excel file detailing the full set of information presented in the expression table window. (e) the “exon sponge to (mirna)” window lists all mirna complementarities of ten base pairs or greater occurring within the selected lncrna exon. (f) the “lncrna host to” window lists all full length ncrnas contained in any of the selected lncrna’s exons. (g) the “enhancer” window lists all overlaps between a selected lncrna and genehancer-annotated enhancers (as well as genes with expression linked to individual enhancers). (h) the “lncrna overlapping genes” window lists all genes even partially overlapping a lncrna locus on either strand.

the table presented in figure c details the ensembl gene id, the ensembl exon id along with gene annotation (name), the expression (rpm) of all lncrna exons in each individual rna-seq dataset, and finally, the % standard deviation of the expression of individual exons( ). importantly, the full list of all exons found to be expressed in any of the datasets is presented. in addition, the expression table is interactive and allows users to view, sort, and filter based on any column value by clicking the “filter table” button on the toolbar.
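as a rough illustration of the numeric columns in the expression table, the python sketch below computes reads per million (rpm), a % standard deviation across datasets, and a simple column filter. it assumes that rpm is normalized to the total number of mapped reads and that % standard deviation means the population standard deviation divided by the mean, times 100; neither definition is stated in this section, and the function names are hypothetical rather than part of lagoon.

```python
from statistics import mean, pstdev


def rpm(read_count, total_mapped_reads):
    """Reads per million: scale a raw count by library size (assumed: mapped reads)."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return read_count * 1_000_000 / total_mapped_reads


def percent_sd(values):
    """Standard deviation as a percentage of the mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        return 0.0  # avoid dividing by zero when nothing is expressed
    return 100.0 * pstdev(values) / m


def filter_rows(rows, column, minimum):
    """Keep table rows whose value in `column` is at least `minimum`."""
    return [row for row in rows if row[column] >= minimum]


# toy example: one exon quantified in two hypothetical datasets
exon_rpm = [rpm(50, 2_000_000), rpm(300, 4_000_000)]
print(exon_rpm, percent_sd(exon_rpm))
```

a high % standard deviation under this definition flags exons whose expression differs strongly between datasets, which is the kind of row a user would sort or filter for in the table.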
users can also obtain a full length lncrna sequence, a specific exon sequence, or view the lncrna gene information available at ensembl by selecting an exon in the table and then clicking the “lncrna sequence,” “exon sequence,” or “search lncrna in ensembl” button on the toolbar. the user can also download an excel file detailing the full set of information presented in the expression table window by pressing the “generate report” button at the top right of the window (figure d). an excel file containing the expression table window information in its entirety will be automatically downloaded to the user’s computer (figure ). in addition, refined excel file reports can be downloaded following the application of specific filters (e.g., lncrnas with rpm > in the ribo-seq dataset). figure . lncrna expression table “generate report” file. the first few rows of an example “generate report” excel file detailing the full set of information presented in the lncrna expression window. finally, putative functional roles for lncrnas/lncrna exons selected in the expression table are depicted in figure e-h. as lncrnas frequently function as mirna sponges that directly basepair with and effectively inactivate mature mirnas( ), the “exon sponge to (mirna)” window lists all mirna complementarities of ten base pairs or greater occurring within the selected lncrna exon (figure e). next, as numerous lncrnas have been shown to encode sncrnas (e.g., mirnas and snornas) in their exonic sequences, and sncrna expression often relies on excision from the host lncrna transcript( ), the “lncrna host to” window lists all full length ncrnas contained in any of the selected lncrna’s exons (figure f). 
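the criterion used by the “exon sponge to (mirna)” window, a run of at least ten complementary bases between a mirna and an exon, can be illustrated with a short python sketch. it assumes perfect watson-crick pairing only (no g:u wobble) and antiparallel alignment, so the search reduces to finding exact matches between the exon and the reverse complement of the mirna. the function names are hypothetical, the sequence is a made-up toy, and how lagoon actually scores complementarity is not described in this section.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement in the DNA alphabet (U is treated as T)."""
    return seq.upper().replace("U", "T").translate(_COMPLEMENT)[::-1]


def complementary_sites(exon, mirna, min_len=10):
    """Return (exon_start, length) for maximal runs of perfect complementarity.

    A run is a stretch of the exon that exactly matches part of the miRNA's
    reverse complement; only runs of at least `min_len` bases are reported.
    """
    exon = exon.upper().replace("U", "T")
    target = reverse_complement(mirna)
    sites = set()
    prev = [0] * (len(target) + 1)  # run lengths ending at the previous exon base
    for i, base in enumerate(exon, start=1):
        cur = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            if base != t:
                continue
            cur[j] = prev[j - 1] + 1
            # the run is maximal if the next pair of bases does not match
            ends_here = (i == len(exon) or j == len(target)
                         or exon[i] != target[j])
            if ends_here and cur[j] >= min_len:
                sites.add((i - cur[j], cur[j]))
        prev = cur
    return sorted(sites)


# toy data: embed 12 bases complementary to a made-up miRNA inside an exon
toy_mirna = "UAGCACCAUCUGAAAUCGGUU"
toy_exon = "GGG" + reverse_complement(toy_mirna)[:12] + "CCC"
print(complementary_sites(toy_exon, toy_mirna))
```

the dynamic-programming table keeps only two rows, so memory stays proportional to the mirna length even for long exons; the same exact-match idea could also be used to test whether a full length sncrna sequence is contained in an exon, as in the “lncrna host to” window.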
in addition, as several lncrnas have been reported to function through regulating the accessibility of transcriptional enhancers overlapping their genomic loci( ), all overlaps between a selected lncrna and genehancer( )-annotated enhancers (and genes with expression linked to individual enhancers) are detailed in the “enhancer” window (figure g). finally, in addition to lncrna exonic sequences serving as sncrna hosts, many sncrnas are processed from lncrna introns( ). furthermore, many lncrnas serve as naturally occurring antisense silencers of genes located on the strand opposite to themselves( ). for both of these reasons, as well as other potential regulatory relationships, all genes overlapping a lncrna locus on either the positive or negative strand are detailed in the “lncrna overlapping genes” window (figure h). importantly, a comprehensive report detailing each of the functional predictions is also available for individual lncrnas by selecting the “lagoon search” button on the homepage after entering a lncrna ensembl gene identifier. notably, this search functionality does not require full lagoon analysis.

example rows of the lncrna expression table “generate report” file (figure ):
lncrna exon | srr (rpm) | srr (rpm) | srr (rpm) | % standard deviation
ensg ense _ftx_- _ftx transcript, xist regulator [hgnc: ]_lncrna .
ensg ense _ftx_- _ftx transcript, xist regulator [hgnc: ]_lncrna .
ensg ense _ftx_- _ftx transcript, xist regulator [hgnc: ]_lncrna .
ensg ense _jpx_ _jpx transcript, xist activator [hgnc: ]_lncrna .
ensg ense _ftx_- _ftx transcript, xist regulator [hgnc: ]_lncrna .
ensg ense _ftx_- _ftx transcript, xist regulator [hgnc: ]_lncrna .
ensg ense _ftx_- _ftx transcript, xist regulator [hgnc: ]_lncrna .
ensg ense _ftx_- _ftx transcript, xist regulator [hgnc: ]_lncrna .
ensg ense _ac . _- _novel transcript_lncrna .
ensg ense _ap . _- _novel transcript_lncrna .
ensg ense _ankrd c-as _- _ankrd c antisense rna [hgnc: ]_lncrna .
ensg ense _ankrd c-as _- _ankrd c antisense rna [hgnc: ]_lncrna .
ensg ense _ankrd c-as _- _ankrd c antisense rna [hgnc: ]_lncrna .
ensg ense _ankrd c-as _- _ankrd c antisense rna [hgnc: ]_lncrna .
ensg ense _ankrd c-as _- _ankrd c antisense rna [hgnc: ]_lncrna .
ensg ense _lipe-as _ _lipe antisense rna [hgnc: ]_lncrna .
ensg ense _lipe-as _ _lipe antisense rna [hgnc: ]_lncrna .
ensg ense _lipe-as _ _lipe antisense rna [hgnc: ]_lncrna .
ensg ense _lipe-as _ _lipe antisense rna [hgnc: ]_lncrna .
ensg ense _ac . _- _novel transcript_lncrna .
ensg ense _linc _ _long intergenic non-protein coding rna [hgnc: ]_lncrna .
ensg ense _al . _ _novel transcript_lncrna .
ensg ense _al . _ _novel transcript_lncrna .
ensg ense _al . _ _novel transcript_lncrna .
ensg ense _malat _ _metastasis associated lung adenocarcinoma transcript [hgnc: ]_lncrna .
ensg ense _malat _ _metastasis associated lung adenocarcinoma transcript [hgnc: ]_lncrna .
ensg ense _malat _ _metastasis associated lung adenocarcinoma transcript [hgnc: ]_lncrna .
ensg ense _malat _ _metastasis associated lung adenocarcinoma transcript [hgnc: ]_lncrna .

lagoon example use/case study

ncrnas are emerging as major players in the pathogenesis of diseases such as cancer. metastasis associated lung adenocarcinoma transcript (malat ) is a nuclear-enriched lncrna that is generally overexpressed in patient tumors and metastases. overexpression of malat has been shown to be positively correlated with tumor progression and metastasis in a large number of tumor types, including breast tumors.
furthermore, an earlier study evaluating breast cancer patient samples showed that malat expression is higher in breast tumors as compared to adjacent normal tissues (reviewed in ( )). as such, we elected to compare lncrna expression in a breast cancer cell line (mda-mb- ) rna-seq dataset (srr ) with that in a human bone tissue rna-seq dataset (srr ) in order to identify significantly differentially expressed lncrnas and their putative functions. we also screened a ribo-seq dataset of the brx- cell line (srr ), established from circulating tumor cells collected from a woman with advanced her -negative breast cancer( ), for potential malat microprotein production. strikingly, the total time for download and analysis of these three ngs datasets by lagoon was only min sec. more importantly, however, lagoon identified malat as the most highly expressed lncrna in mda-mb- breast cancer cells (figure ). in agreement with previous demonstrations that malat functions (in part) as a mir- - p sponge in numerous malignancies including breast cancer( ), lagoon identified malat as a probable mir- - p sponge (figure a, top right). in addition, lagoon found that malat overlaps with, and may therefore be involved in regulating, several distinct genomic enhancers and sncrnas (figure a, lower windows). finally, also in agreement with previous analyses( ), lagoon identified malat as one of three lncrnas significantly represented in the brx- cell ribo-seq dataset, strongly suggesting malat encodes at least one micropeptide.

figure .
lagoon identification of malat overexpression in breast cancer. (a) the “results” window showing malat was identified as the most highly expressed lncrna in the highly invasive breast cancer cell line mda-mb- (srr ). (b) the “generate report” excel file showing malat (yellow) was identified as the most highly expressed lncrna in mda-mb- cells. both windows indicate malat is present in the breast cancer ribo-seq dataset (srr ).

lagoon comparison to other existing tools

lncrnas represent the largest single class of ncrnas. however, unlike sncrnas, which are thought to mostly function in gene regulation through complementary basepairing with other rnas, the mechanisms through which lncrnas function are highly diverse. the relatively low expression and tissue specificity of lncrnas have significantly hindered lncrna discovery, our understanding of lncrna regulation, and the characterization of lncrna functional mechanisms to date( )( )( )( ). that said, initiatives such as encode( ), fantom( ), and gencode( ) have now predicted over , human lncrnas and identified associations between many of these and specific diseases. thus far, however, only a handful of these lncrnas have been examined in the literature, with even fewer being assigned any specific mechanistic function. expression data often constitute the first level of information used in studying lncrnas, as differential expression analysis is clearly of value in prioritizing candidates for further examination. differential expression, however, provides little in the way of functional insights. nevertheless, the majority of computational platforms currently available are primarily aimed at either detecting and quantifying lncrnas (e.g., lncrna-screen( ), rna-code( ), lncrscan( )) or predicting lncrna:mrna and/or lncrna:protein interactions (e.g., plaidoh( ), lncrna function( ), circlncrnanet( )) (table ).
in contrast, lagoon was designed to comprehensively evaluate lncrna expression as well as the potential for lncrnas to function through other characterized mechanisms including serving as sncrna hosts, mirna sponges, antisense rnas, microprotein transcripts, and/or regulators of genomic enhancers (as well as providing links to predicted lncrna:mrna and/or lncrna:protein interactions). in short, lagoon distinguishes itself from available tools by filling a major gap in available lncrna functional prediction platforms and eliminating the need for the user to switch platforms during the analysis process.

table . lncrna analysis platform feature comparison. various features offered by lagoon were compared to other existing tools including lncrna-screen( ), rna-code( ), lncrscan( ), iseerna( ), annocript( ), uclncr( ), lncrna function( ), and circlncrnanet( ). features examined were: “online”, if tool is available online; “input”, form of input rna-seq dataset, either raw (direct ngs output) or pre-processed (e.g., requires bam file); “tcga, sra, or geo”, if publicly available rna-seq datasets can be specified for examination based on identifier alone; “known lncrna”, detection and quantification of known lncrnas; “novel lncrna”, detection and quantification of novel lncrnas; “differential expression”, ability of the tool to integrate expression data from multiple files; “chip-seq / ribo-seq”, if identified lncrna occurrences in chip-seq and/or ribo-seq datasets can be determined; “functional prediction”, if potential functional roles of identified lncrnas are assessed; and “interactive results”, if interactive and user-friendly results are generated directly.
tool | online | input | tcga, sra, or geo input | known lncrna | novel lncrna | differential expression | chip-seq / ribo-seq | functional prediction | interactive results
lagoon | yes | raw | yes | yes | no | yes | yes | yes | yes
lncrna-screen | no | raw | yes | yes | yes | yes | yes | no | yes
rna-code | no | raw | no | yes | no | yes | no | no | no
lncrscan | no | raw | no | no | yes | yes | no | no | no
iseerna | yes | raw | no | no | yes | yes | no | no | yes
annocript | no | raw | no | no | yes | no | no | no | no
uclncr | no | pre-processed | no | no | no | no | no | no | no
lncrna function | yes | pre-processed | no | yes | no | limited | no | yes | yes
circlncrnanet | yes | pre-processed | no | yes | no | yes | no | yes | yes

discussion

despite a mounting body of evidence supporting the physiological relevance of ncrnas, most studies performed to date have focused primarily on proteins themselves or on deciphering the pathways associated with annotated ncrnas. moreover, due to the perceived insurmountability of the sheer amount of data generated by ngs/tgs analyses, the full extent of regulatory networks created by ncrnas often gets overlooked( ). in addition, whereas the cost of rna-seq is now reasonable for most active research programs, the tools necessary for the interpretation of these sequencing datasets typically require significant computational expertise and resources, markedly hindering their widespread use. as such, there is a clear need for real-time, user-friendly platforms that make the identification and characterization of the ncrnaome accessible to biologists lacking significant computational expertise.
in light of this, we have developed salts, a highly accurate, highly efficient, and extremely user-friendly one-stop shop for ncrna transcriptomics. notably, salts is accessed through an intuitive web-based interface, can analyze either user-generated, standard ngs file uploads (e.g., fastq) or existing ncbi sra datasets, and requires no dataset pre-processing or knowledge of library adapters/oligonucleotides. in short, salts constitutes the first publicly available, web-based, comprehensive ncrna transcriptomic ngs analysis platform designed specifically for users with no computational background, providing a much needed, powerful new resource enabling more widespread ncrna transcriptomic analysis. that said, an array of platforms and pipelines, each geared towards a specific type of transcript/ncrna class, has previously been developed. regardless of the platform, the core of ncrna transcriptome expression analysis consists of two main steps: transcript detection and expression quantification( )( ). the first step in this process involves aligning, or mapping, the ngs reads to one or more reference sequences, which can be either an ncrna sequence library or an entire reference genome. most standard pipelines use alignment programs such as bowtie ( ), bwa( ), ncbi’s blast( ) or other implementations of existing alignment algorithms like smith-waterman (sw)( ), needleman-wunsch (nw)( ), and burrows wheeler transform (bwt)( ). these aligners often differ in how alignment mismatches and gaps are scored, and these differences need to be taken into account when dealing with data containing high sequence variability between the individual transcripts originating from the same genomic locus or between the reads and the reference. in the second step, aligned reads are further analyzed to determine the expression, or the number of reads assigned to individual loci or library entries.
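the quantification step can be sketched in a few lines. the example below is a minimal illustration rather than the salts code: it assumes each aligned read has already been assigned to exactly one locus, and it normalizes counts to reads per million (rpm) using the number of aligned reads as the denominator. the function names and the locus identifiers are hypothetical.

```python
from collections import Counter


def count_reads(assignments):
    """Count reads per locus.

    `assignments` holds one locus identifier per read; None marks a read
    that did not align to any locus and is therefore skipped.
    """
    return Counter(locus for locus in assignments if locus is not None)


def reads_per_million(counts):
    """Scale raw counts to reads per million aligned reads (RPM).

    Assumption: the denominator is the total number of aligned reads;
    some pipelines divide by the total number of sequenced reads instead.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    scale = 1_000_000 / total
    return {locus: n * scale for locus, n in counts.items()}


if __name__ == "__main__":
    # toy input: hypothetical locus ids, one entry per read
    reads = ["lnc_a", "lnc_a", "lnc_b", None, "lnc_a"]
    counts = count_reads(reads)
    print(counts)
    print(reads_per_million(counts))
```

rpm corrects only for sequencing depth. comparing loci of different lengths, or testing for differential expression between samples, needs further normalization and statistics.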
this step often includes or is followed by various statistical analyses to determine differential expression and/or variance between replicates (e.g., bayseq( ) or deseq ( )). that said, the strikingly high accuracy and efficiency achieved by our tools as compared to existing platforms are primarily due to a novel computational approach to rna-seq alignment and an innovative analysis based on hilbert and vector spaces developed in the course of this work. brief overviews of the primary constructs critical to toolkit implementation are described below with more in-depth descriptions detailed in supplemental information files and .

salts toolkit implementation. of note, both surfr and lagoon were developed into real-time processing systems using the following technology stack:
programming languages used: python . , visual c++ , erlang, javascript, php, and sparql.
database engines: mongo db .
servers: apache web server, + background servers composed using master-worker model to parallelize the workload, and apache jena fuseki.
other tools and supporting technologies: rabbit mq, flask, redis, vue js, dropzone js, apexcharts js, bootstrap , ibm aspera, axios js, moment js, tabulator, matplotlib, numpy, scipy, and html .
architecture: microservices.
hardware specs: intel® xeon® cpu-e - v @ . ghz, gb ram, tb hard disk, windows server .

surfr implementation. with surfr, users with no computational background can quickly and easily analyze, visualize, and compare small rna-seq datasets in order to generate clear, informative results.
with an interactive, user-friendly interface, surfr is the first web-based resource that provides users the ability to upload unmodified ngs datasets and/or provide sra identifiers to perform comprehensive novel ncrna and ncrna fragment identifications and expression analyses in real-time. this is achieved through employing the following three key components:

(1) hilbert space (hs). in mathematics, a hs is a complete inner-product vector space (of finite or infinite dimension); it is routinely applied in quantum mechanics, where its elements represent the states of a physical system. hss are highly useful in describing the relationship among vector spaces, wavelets, and wave functions( )( ). for our analyses, the term “gene expression” is considered a higher dimensional function representing the activity of the rna across its length where, within a rna, expression is represented using four vectors (for a, c, t, and g) and understood using hss.

(2) movak alignment. based on utilization of the aforementioned hss, we introduced two new data structures, namely, similarity vectors (svs) and differential expression vectors (devs). movak alignment combines svs and devs to profile the exact transcriptomic activity of a given rna-seq dataset and then retrieves a hs for each rna that is expressed in a sample.

(3) surfr algorithm. by defining the changes in the gene expression using the above hs interpretation, we assign a wavelet function with scales of to to each sncrna micro-like behavior, i.e., mirna-like rnas with lengths ranging from to nt.

importantly, our novel methodology carries several advantages over existing computational methods:

1. compared to current, purely string comparison methods, devs take significantly less time to obtain.
2. better visualization of ncrna processing.
3. surfr data structures consume very little memory, thus allowing real-time calculations.
4. calculus-based modeling can be directly applied to devs to understand ncrna behavior, thus providing a mathematical means to study transcriptomic functionality.
5. our methodology is highly effective and accurate. to be more specific, our wavelet-based analysis on hs typically identifies ncrna-derived rna start and end positions with >= % identity (within nt) to experimentally validated databases like mirbase, as opposed to the state-of-the-art methods based on bam files such as flaimapper, which have been reported to correctly predict % of mirna start positions and % of mirna end positions( ).
6. we have extended our computational methodology to + organisms and all of their sncrnas without the necessity to change any algorithmic criteria.
7. our method can address the dynamism associated with transcriptomic analysis using topological interpretation.

lagoon implementation. similar to surfr, with lagoon, users with no computational background can quickly and easily analyze and compare raw rna-seq datasets to comprehensively evaluate lncrna expression as well as the potential for lncrnas to function as sncrna hosts, mirna sponges, antisense rnas, microprotein transcripts, and/or regulators of genomic enhancers. in short, lagoon distinguishes itself from existing platforms through offering parallel, real-time expression analysis and functional prediction. of note, lagoon is essentially based on an extended version of movak alignment that similarly employs svs to perform sequence alignments. in lagoon, however, the algorithm was modified during extension in order to trade off time and space complexity within the alignment. a detailed explanation regarding these modifications is provided in supplemental information file .
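to make the four-vector description in the surfr implementation more concrete, the sketch below builds one per-position count vector for each of a, c, g, and t along a single reference rna from ungapped read alignments. this is only one possible reading of that representation. the function names, the input format, and the counting rule are assumptions made for illustration, and the code does not reproduce the similarity vectors, devs, or wavelet analysis used by surfr.

```python
def nucleotide_expression_vectors(reference_length, alignments):
    """Build per-position read-base counts for A, C, G and T.

    `alignments` is an iterable of (start, read_sequence) pairs, where
    `start` is the 0-based reference position of the first read base and
    the read is assumed to align without gaps (illustrative assumption).
    """
    vectors = {base: [0] * reference_length for base in "ACGT"}
    for start, read in alignments:
        for offset, base in enumerate(read.upper().replace("U", "T")):
            position = start + offset
            # ignore bases that fall off the reference or are ambiguous (e.g. N)
            if 0 <= position < reference_length and base in vectors:
                vectors[base][position] += 1
    return vectors


def total_coverage(vectors):
    """Sum the four nucleotide vectors into a single coverage profile."""
    return [sum(column) for column in zip(*vectors.values())]


if __name__ == "__main__":
    # toy example: an 8-nt reference and two ungapped reads
    vectors = nucleotide_expression_vectors(8, [(0, "ACGT"), (2, "GTAC")])
    for base, counts in vectors.items():
        print(base, counts)
    print("coverage", total_coverage(vectors))
```

a profile of this kind is a plain numeric signal along the rna, which is the sort of input that signal-processing methods such as wavelet transforms operate on.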
references

. veneziano,d., nigita,g. and ferro,a. ( ) computational approaches for the analysis of ncrna through deep sequencing techniques. front. bioeng. biotechnol., . . uchida,s. and bolli,r. ( ) short and long noncoding rnas regulate the epigenetic status of cells. antioxidants redox signal., , – . . wolfien,m., brauer,d.l., bagnacani,a. and wolkenhauer,o. ( ) workflow development for the functional characterization of ncrnas. in methods in molecular biology. humana press inc., vol. , pp. – . . ulitsky,i. ( ) interactions between short and long noncoding rnas. febs lett., , – . . nakahara,k. and carthew,r.w. ( ) expanding roles for mirnas and sirnas in cell regulation. curr. opin. cell biol., , – . . cheng,a.m., byrom,m.w., shelton,j. and ford,l.p. ( ) antisense inhibition of human mirnas and indications for an involvement of mirna in cell growth and apoptosis. nucleic acids res., , – . . hwang,h.w. and mendell,j.t. ( ) micrornas in cell proliferation, cell death, and tumorigenesis. br. j. cancer, , – . . singh,s., chitkara,d., mehrazin,r., behrman,s.w., wake,r.w. and mahato,r.i. ( ) chemoresistance in prostate cancer cells is regulated by mirnas and hedgehog pathway. plos one, . . visone,r. and croce,c.m. ( ) mirnas and cancer. am. j. pathol., , – . . rother,s. and meister,g. ( ) small rnas derived from longer non-coding rnas. biochimie, , – . . martens-uzunova,e.s., olvedy,m. and jenster,g. ( ) beyond microrna--novel rnas derived from small non-coding rna and their implication in cancer. cancer lett, , – . . patterson,d.g., roberts,j.t., king,v.m., houserova,d., barnhill,e.c., crucello,a., polska,c.j., brantley,l.w., kaufman,g.c., nguyen,m., et al. ( ) human snorna- is processed into a microrna-like rna that promotes breast cancer cell invasion.
npj breast cancer, , . . olvedy,m., scaravilli,m., hoogstrate,y., visakorpi,t., jenster,g. and martens-uzunova,e.s. ( ) a comprehensive repertoire of trna-derived fragments in prostate cancer. oncotarget, , – . . ender,c., krek,a., friedländer,m.r., beitzinger,m., weinmann,l., chen,w., pfeffer,s., rajewsky,n. and meister,g. ( ) a human snorna with microrna-like functions. mol. cell, , – . . martens-uzunova,e.s., olvedy,m. and jenster,g. ( ) beyond microrna--novel rnas derived from small non-coding rna and their implication in cancer. cancer lett, , – . . hirose,y., ikeda,k.t., noro,e., hiraoka,k., tomita,m. and kanai,a. ( ) precise mapping and dynamics of trna-derived fragments (trfs) in the development of triops cancriformis (tadpole shrimp). bmc genet., . . durdevic,z. and schaefer,m. ( ) trna modifications: necessary for correct trna-derived fragments during the recovery from stress? bioessays, , – . . wu,w., choi,e.j., lee,i., lee,y.s. and bao,x. ( ) non-coding rnas and their role in respiratory syncytial virus (rsv) and human metapneumovirus (hmpv) infections. viruses, . . zhou,k., diebel,k.w., holy,j., skildum,a., odean,e., hicks,d.a., schotl,b., abrahante,j.e., spillman,m.a. and bemis,l.t. ( ) a trna fragment, trf -glu, regulates bcar expression and proliferation in ovarian cancer cells. oncotarget, , – . . yates,a., akanni,w., amode,m.r., barrell,d., billis,k., carvalho-silva,d., cummins,c., clapham,p., fitzgerald,s., gil,l., et al. ( ) ensembl . nucleic acids res., , d - . .
huang,j., gutierrez,f., strachan,h.j., dou,d., huang,w., smith,b., blake,j.a., eilbeck,k., natale,d.a., lin,y., et al. ( ) omnisearch: a semantic search system based on the ontology for microrna target (omit) for microrna-target gene interaction data. j. biomed. semantics, , . . camacho,c., coulouris,g., avagyan,v., ma,n., papadopoulos,j., bealer,k. and madden,t.l. ( ) blast+: architecture and applications. bmc bioinformatics, , . . leinonen,r., sugawara,h. and shumway,m. ( ) the sequence read archive. nucleic acids res., . . kalvari,i., argasinska,j., quinones-olvera,n., nawrocki,e.p., rivas,e., eddy,s.r., bateman,a., finn,r.d. and petrov,a.i. ( ) rfam . : shifting to a genome-centric resource for non-coding rna families. nucleic acids res, , d –d . . desgranges,e., caldelari,i., marzi,s. and lalaouna,d. ( ) navigation through the twists and turns of rna sequencing technologies: application to bacterial regulatory rnas. biochim. biophys. acta - gene regul. mech., . . friedländer,m.r., chen,w., adamidi,c., maaskola,j., einspanier,r., knespel,s. and rajewsky,n. ( ) discovering micrornas from deep sequencing data using mirdeep. nat. biotechnol., , – . . humphreys,d.t. and suter,c.m. ( ) mirspring: a compact standalone research tool for analyzing mirna-seq data. nucleic acids res., . . hackenberg,m., rodríguez-ezpeleta,n. and aransay,a.m. ( ) miranalyzer: an update on the detection and analysis of micrornas in high-throughput sequencing experiments. nucleic acids res., . . wu,x., kim,t.k., baxter,d., scherler,k., gordon,a., fong,o., etheridge,a., galas,d.j. and wang,k. ( ) srnanalyzer-a flexible and customizable small rna sequencing data analysis pipeline. nucleic acids res., , – . . rahman,r.u., gautam,a., bethune,j., sattar,a., fiosins,m., magruder,d.s., capece,v., shomroni,o. and bonn,s. ( ) oasis : improved online analysis of small rna-seq data. bmc bioinformatics, . . kuksa,p.p., amlie-wolf,a., katanić,Ž., valladares,o., wang,l.s. and leung,y.y. 
( ) spar: small rna-seq portal for analysis of sequencing experiments. nucleic acids res., , w –w . . hoogstrate,y., jenster,g. and martens-uzunova,e.s. ( ) flaimapper: computational annotation of small ncrna-derived fragments using rna-seq high-throughput data. bioinformatics, , – . . shi,j., ko,e.a., sanders,k.m., chen,q. and zhou,t. ( ) sports . : a tool for annotating and profiling non-coding rnas optimized for rrna- and trna-derived small rnas. genomics, proteomics bioinforma., , – . . jeske,t., huypens,p., stirm,l., höckele,s., wurmser,c.m., böhm,a., weigert,c., staiger,h., klein,c., beckers,j., et al. ( ) deus: an r package for accurate small rna profiling based on differential expression of unique sequences. bioinformatics, , – . . aparicio-puerta,e., lebrón,r., rueda,a., gómez-martín,c., giannoukakos,s., jaspez,d., medina,j.m., zubkovic,a., jurak,i., fromm,b., et al. ( ) srnabench and srnatoolbox : intuitive fast small rna profiling and differential expression. nucleic acids res., , w –w . . liu,q., ding,c., lang,x., guo,g., chen,j. and su,x. ( ) small noncoding rna discovery and profiling with srnatools based on high-throughput sequencing. brief. bioinform., . /bib/bbz . . wan,c., gao,j., zhang,h., jiang,x., zang,q., ban,r., zhang,y. and shi,q. ( ) cpss . : a computational platform update for the analysis of small rna sequencing data. bioinformatics, , – . . liao,y., smyth,g.k. and shi,w. ( ) featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. bioinformatics, , – . .
martens-uzunova,e.s., hoogstrate,y., kalsbeek,a., pigmans,b., vredenbregt-van den berg,m., dits,n., nielsen,s.j., baker,a., visakorpi,t., bangma,c., et al. ( ) c/d-box snorna-derived rna production is associated with malignant transformation and metastatic progression in prostate cancer. oncotarget, , – . . derrien,t., johnson,r., bussotti,g., tanzer,a., djebali,s., tilgner,h., guernec,g., martin,d., merkel,a., knowles,d.g., et al. ( ) the gencode v catalog of human long noncoding rnas: analysis of their gene structure, evolution, and expression. genome res., , – . . ulitsky,i. and bartel,d.p. ( ) xlincrnas: genomics, evolution, and mechanisms. cell, , . . lam,m.t.y., li,w., rosenfeld,m.g. and glass,c.k. ( ) enhancer rnas and regulated transcriptional programs. trends biochem. sci., , – . . rinn,j.l. and chang,h.y. ( ) genome regulation by long noncoding rnas. annu. rev. biochem., , – . . uszczynska-ratajczak,b., lagarde,j., frankish,a., guigó,r. and johnson,r. ( ) towards a complete map of the human long non-coding rna transcriptome. nat. rev. genet., , – . . mercer,t.r., dinger,m.e. and mattick,j.s. ( ) long non-coding rnas: insights into functions. nat. rev. genet., , – . . li,x., wu,z., fu,x. and han,w. ( ) lncrnas: insights into their function and mechanics in underlying disorders. mutat. res. - rev. mutat. res., , – . . moran,v.a., perera,r.j. and khalil,a.m. ( ) emerging functional and mechanistic paradigms of mammalian long non-coding rnas. nucleic acids res., , – . . wang,j., liu,x., wu,h., ni,p., gu,z., qiao,y., chen,n., sun,f. and fan,q. ( ) creb up-regulates long non-coding rna, hulc expression through interaction with microrna- in liver cancer. nucleic acids res., , – . . chen,h., du,g., song,x. and li,l. ( ) non-coding transcripts from enhancers: new insights into enhancer activity and gene expression regulation. genomics, proteomics bioinforma., , – . . malecová,b. and morris,k. v. 
( ) transcriptional gene silencing through epigenetic changes mediated by non-coding rnas. curr. opin. mol. ther., , – . . stein,c.s., jadiya,p., zhang,x., mclendon,j.m., abouassaly,g.m., witmer,n.h., anderson,e.j., elrod,j.w. and boudreau,r.l. ( ) mitoregulin: a lncrna-encoded microprotein that supports mitochondrial supercomplexes and respiratory efficiency. cell rep., , - .e . . fishilevich,s., nudel,r., rappaport,n., hadar,r., plaschkes,i., iny stein,t., rosen,n., kohn,a., twik,m., safran,m., et al. ( ) genehancer: genome-wide integration of enhancers and target genes in genecards. database (oxford)., . . arun,g. and spector,d.l. ( ) malat long non-coding rna and breast cancer. rna biol., , – . . jordan,n.v., bardia,a., wittner,b.s., benes,c., ligorio,m., zheng,y., yu,m., sundaresan,t.k., licausi,j.a., desai,r., et al. ( ) her expression identifies dynamic functional states within circulating breast cancer cells. nature, , – . . huang,x.j., xia,y., he,g.f., zheng,l.l., cai,y.p., yin,y. and wu,q. ( ) malat promotes angiogenesis of breast cancer. oncol. rep., , – . . ruiz-orera,j., messeguer,x., subirana,j.a. and alba,m.m. ( ) long non-coding rnas as a source of new peptides. elife, , . . davis,c.a., hitz,b.c., sloan,c.a., chan,e.t., davidson,j.m., gabdank,i., hilton,j.a., jain,k., baymuradov,u.k., narayanan,a.k., et al. ( ) the encyclopedia of dna elements (encode): data portal update. nucleic acids res., , d –d . . lizio,m., abugessaisa,i., noguchi,s., kondo,a., hasegawa,a., hon,c.c., de hoon,m., severin,j., oki,s., hayashizaki,y., et al.
( ) update of the fantom web resource: expansion to provide additional transcriptome atlases. nucleic acids res., , d –d . . gong,y., huang,h.t., liang,y., trimarchi,t., aifantis,i. and tsirigos,a. ( ) lncrna-screen: an interactive platform for computationally screening long non-coding rnas in large genomics datasets. bmc genomics, . . yuan,c. and sun,y. ( ) rna-code: a noncoding rna classification tool for short reads in ngs data lacking reference genomes. plos one, . . sun,l., liu,h., zhang,l. and meng,j. ( ) incrscan-svm: a tool for predicting long non-coding rnas using support vector machine. plos one, . . pyfrom,s.c., luo,h. and payton,j.e. ( ) plaidoh: a novel method for functional prediction of long non-coding rnas identifies cancer-specific lncrna activities. bmc genomics, . . jiang,q., ma,r., wang,j., wu,x., jin,s., peng,j., tan,r., zhang,t., li,y. and wang,y. ( ) lncrna function: a comprehensive resource for functional investigation of human lncrnas based on rna-seq data. bmc genomics, . . wu,s.m., liu,h., huang,p.j., chang,i.y.f., lee,c.c., yang,c.y., tsai,w.s. and tan,b.c.m. ( ) circlncrnanet: an integrated web-based resource for mapping functional networks of long or circular forms of noncoding rnas. gigascience, , – . . sun,k., chen,x., jiang,p., song,x., wang,h. and sun,h. ( ) iseerna: identification of long intergenic non-coding rna transcripts from transcriptome sequencing data. bmc genomics, . . musacchia,f., basu,s., petrosino,g., salvemini,m. and sanges,r. ( ) annocript: a flexible pipeline for the annotation of transcriptomes able to identify putative long noncoding rnas. bioinformatics, , – . . sun,z., nair,a., chen,x., prodduturi,n., wang,j. and kocher,j.p. ( ) uclncr: ultrafast and comprehensive long non-coding rna detection from rna-seq. sci. rep., . . sun,y.-m. and chen,y.-q. ( ) principles and innovative technologies for decrypting noncoding rnas: from discovery and functional prediction to clinical application. j. hematol. oncol., , . . 
langmead,b. and salzberg,s.l. ( ) fast gapped-read alignment with bowtie . nat. methods, , – . . li,h. and durbin,r. ( ) fast and accurate long-read alignment with burrows-wheeler transform. bioinformatics, , – . . bucher,p. and hofmann,k. ( ) a sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. proc. int. conf. intell. syst. mol. biol., , – . . phillips,a.j. ( ) homology assessment and molecular sequence alignment. j. biomed. inform., , – . . lippert,r.a. ( ) space-efficient whole genome comparisons with burrows-wheeler transforms. j. comput. biol., , – . . kvam,v.m., liu,p. and yaqing,s. ( ) a comparison of statistical methods for detecting differentially expressed genes from rna-seq data. am. j. bot., , – . . costa-silva,j., domingues,d. and lopes,f.m. ( ) rna-seq differential expression analysis: an extended review and a software tool. plos one, . . steeb, w.-h. ( ). hilbert spaces, wavelets, generalised functions and modern quantum mechanics. springer science & business media. . debnath, l., & mikusinski, p. ( ). introduction to hilbert spaces with applications. academic press. . y. hoogstrate, g. jenster, and e. s. martens-uzunova, “flaimapper: computational annotation of small ncrna-derived fragments using rna-seq high-throughput data,” bioinformatics, vol. , no. , pp. – , mar. .
ember: multi-label prediction of kinase-substrate phosphorylation events through deep learning

kathryn e. kirchoff and shawn m. gomez

department of computer science, the university of north carolina at chapel hill, chapel hill, nc, usa
department of pharmacology, the university of north carolina at chapel hill, chapel hill, nc, usa
joint department of biomedical engineering at the university of north carolina at chapel hill and north carolina state university, chapel hill, nc, usa

abstract

kinase-catalyzed phosphorylation of proteins forms the backbone of signal transduction within the cell, enabling the coordination of numerous processes such as the cell cycle, apoptosis, and differentiation. while on the order of phosphorylation events have been described, we know the specific kinase performing these functions for less than % of cases. the ability to predict which kinases initiate specific individual phosphorylation events has the potential to greatly enhance the design of downstream experimental studies, while simultaneously creating a preliminary map of the broader phosphorylation network that controls cellular signaling. to this end, we describe ember, a deep learning method that integrates kinase-phylogeny information and motif-dissimilarity information into a multi-label classification model for the prediction of kinase-motif phosphorylation events. unlike previous deep learning methods that perform single-label classification, we restate the task of kinase-motif phosphorylation prediction as a multi-label problem, allowing us to train a single unified model rather than a separate model for each of the kinase families.
we utilize a siamese network to generate novel vector representations, or an embedding, of motif sequences, and we compare our novel embedding to a previously proposed peptide embedding. our motif vector representations are used, along with one-hot encoded motif sequences, as input to a classification network, while kinase phylogenetic relationships are incorporated into our model via a kinase phylogeny-weighted loss function. results suggest that this approach holds significant promise for improving our map of phosphorylation relations that underlie kinome signaling.

availability: https://github.com/gomezlab/ember
correspondence: smgomez@unc.edu

introduction

phosphorylation is the most abundant post-translational modification of protein structure, affecting from one-third to two-thirds of eukaryotic proteins. in humans, the number of kinases catalyzing this reaction hints at its importance, with kinases being one of the largest gene families with roughly members distributed among families ( – ). during phosphorylation, a kinase facilitates the addition of a phosphate group at serine, threonine, tyrosine, or histidine residues, though other sites exist. phosphorylation of a substrate at any of these residues occurs within the context of specific consensus phosphorylation sequences, which we refer to here as “motifs”. additional substrate binding sequences within the kinase or substrate, as well as protein scaffolds that facilitate structural orientation and downstream catalysis of the reaction, modify the efficacy of motif phosphorylation. typically, the net effect of kinase phosphorylation is to switch the downstream target into an “on” or “off” state, enabling the transmission of information throughout the cell. kinase activity touches nearly all aspects of cellular behavior, and the alteration of kinase behavior underlies many diseases while simultaneously establishing the basis for therapeutic interventions ( – ).
although the importance of phosphorylation in cell information processing and its dysregulation as a driver of disease are well recognized, the map of kinase-motif phosphorylation interactions is mostly unknown. so, while upwards of , motifs are known to be phosphorylated, less than % of these have an associated kinase identified as the catalyzing agent ( ). this knowledge gap provides a considerable impetus for the development of methods aimed at predicting kinase-motif phosphorylation events that, at a minimum, could help focus experimental efforts.

as a result, a number of computational tools have been developed, spanning a myriad of methodological approaches including random forests ( ), support vector machines ( ), logistic regression ( ), and bayesian decision theory ( ). advances in deep learning have similarly spawned new approaches, with two methods recently described. the first, musitedeep, utilizes a convolutional neural network (cnn) with attention to generate single predictions ( ). the second deep learning method, deepphos, exploits densely connected cnn (dc-cnn) blocks for its predictions ( ). both of these approaches train an individual model for each kinase family. in addition to the practical challenge of training many individual models, a further disadvantage of these two deep learning approaches is the potential loss of gains from transfer learning, as models trained independently do not directly incorporate knowledge of motif phosphorylation by kinases from different kinase families.

here, we describe ember (embedding-based multi-label prediction of phosphorylation events), a deep learning approach for predicting multi-label kinase-motif phosphorylation relationships. in our approach, we utilize a siamese neural network, modified for our multi-label prediction task, to generate a high-dimensional embedding of motif vectors. we further utilize one-hot encoded motif sequences. these two representations are leveraged together as a dual input into our classifier, improving learning and prediction. we also find that our siamese embedding generally outperforms a previously proposed protein embedding, protvec, which is trained on significantly more data ( ). we further integrate information regarding evolutionary relationships between kinases into our classification network loss function, informing predictions in light of the sparsity associated with these data, and we find that this information improves prediction accuracy. as ember utilizes transfer learning across families, we expect its accuracy to improve more than that of other deep learning approaches as more data describing kinase-substrate relationships are collected. together, these results suggest that ember holds significant promise for improving our map of phosphorylation relationships that underlie the kinome and broader cellular signaling.

kirchoff et al. | biorxiv | february. available under a cc-by international license. biorxiv preprint (which was not certified by peer review); the copyright holder is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity.

methods

kinase-motif interaction data. as documented kinase-motif interactions are sparse in relation to the total number of known phosphorylation events, we attempted to maximize the number of examples of such interactions for training. to do this, we integrated multiple datasets describing kinase-motif relationships across multiple vertebrate species.
our data were sourced from phosphositeplus, phosphonetworks, and phospho.elm, all of which are collections of annotated and experimentally verified kinase-motif relationships ( – ). from these data sources, non-redundant kinase-motif relationships were extracted and integrated into a single set of interactions.

we used the standard single-letter amino acid code for representation of amino acids, with an additional ’x’ symbol to represent an ambiguous amino acid. we defined our motifs as peptides composed of a central phosphorylatable amino acid, namely serine (s), threonine (t), or tyrosine (y), flanked by seven amino acids on either side. therefore, each motif is a 15-amino acid peptide or “15-mer”. as a phosphorylatable amino acid may not have seven flanking amino acids to either side if it is located near the end of a substrate sequence, we used ‘-’ to represent the absence of an amino acid in order to maintain a consistent motif length of 15 amino acids across all instances.

deep learning models are known to generally require large numbers of examples per class in order to achieve adequate performance. our original dataset was considerably imbalanced in that every label had a very low ratio of positive instances (verified kinase-motif interactions) to negative instances. for example, the tlk kinase family only has nine positive labels (verified tlk-motif interactions) and more than , negative labels (lack of evidence for a tlk-motif interaction). to maximize our ability to learn from our data, we utilized only kinases that had a relatively large number of experimentally validated motif interactions, reducing the number of kinase-motif relationships to be used as input for our model. this filtering also served to considerably mitigate the label imbalances in our data. from the remaining motifs, we set aside motifs for the independent test set, leaving for the training set. then we removed any sequences from the training set that met a % similarity threshold with any sequence in the test set, based on hamming distance scores. this process removed motifs from the training set. kinase labels were then grouped into their respective kinase families based on data collected from the regphos ( ) database, resulting in eight kinase families. our resulting data set comprises phosphorylatable motifs and their reaction-associated kinase families (table ). furthermore, our data are multi-label in that a single motif may be phosphorylated by multiple kinases, including those from other families, resulting in a data point with potentially multiple positive labels.

table . summary of our kinase-motif phosphorylation dataset. shown are the number of kinases per family along with the number of motifs phosphorylated by each kinase family in the training and test sets.
columns: family, kinases, training motifs, testing motifs
families: akt, cdk, ck, mapk, pikk, pka, pkc, src

motif embeddings.

protvec embedding. we chose to investigate two methods to achieve our motif embedding. first, we considered protvec, a learned embedding of amino acids, originally intended for protein function classification ( ). protvec is the result of a word2vec algorithm trained on a corpus of , sequences obtained from swiss-prot, which were broken up into three amino acid-long subsequences, or "3-grams". as a result of this approach, protvec provides a -dimensional distributed representation, analogous to a natural language "word embedding", that establishes coordinates for each possible amino acid 3-gram. this results in a × matrix of coordinates, one -dimensional coordinate for each 3-gram. in a preliminary investigation, we found that averaging the protvec coordinates resulted in a higher-quality embedding compared to the original protvec coordinates. comparisons between the two embeddings are provided in supplemental material.
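returning to the motif definition and the train/test de-duplication described in the data section above, the two steps can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, every s/t/y residue is windowed (whereas the source datasets annotate specific sites), similarity is taken as the fraction of matching positions, and the threshold is left as a parameter because its value is not recoverable from the text.

```python
from typing import List

FLANK = 7               # seven residues on either side, as stated in the text
PAD = "-"               # symbol the text uses for an absent residue
PHOSPHO = set("STY")    # serine, threonine, tyrosine


def motif_windows(substrate: str, flank: int = FLANK) -> List[str]:
    """Return one padded (2*flank+1)-mer per S/T/Y residue (hypothetical helper)."""
    seq = substrate.upper()
    padded = PAD * flank + seq + PAD * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue in PHOSPHO:
            # residue i of the substrate sits at index i + flank of the padded string
            windows.append(padded[i:i + 2 * flank + 1])
    return windows


def hamming_similarity(a: str, b: str) -> float:
    """Fraction of aligned positions carrying the same symbol."""
    if len(a) != len(b):
        raise ValueError("motifs must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_training(train: List[str], test: List[str], threshold: float) -> List[str]:
    """Drop training motifs whose similarity to any test motif meets the threshold."""
    return [m for m in train
            if all(hamming_similarity(m, t) < threshold for t in test)]
```

padding before windowing keeps every motif at the same length, which is what allows a position-wise hamming comparison in the filtering step.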
we averaged the embedding coordinates, per amino acid, in the following fashion: we define $t = [aaa, ala, laa, \dots, unknown]$, the vector of amino acid 3-grams provided by the authors of protvec. we also define $a = [a, l, s, \dots, -]$, the alphabet comprising the amino acid symbols. we equate “-” to the “unknown” character defined by protvec. then, we compute the matrix of averaged protvec coordinates, $c^{(avg)}$, which will be × dimensions, with one row per symbol $a_i$ of the alphabet and one column per protvec dimension $j$:

\[ c^{(avg)} = \big[\, c^{(avg)}_{ij} \,\big] \]

we solve for each element of $c^{(avg)}$ based on the values of $c^{(raw)}$, the original ( × ) protvec matrix:

\[ c^{(avg)}_{ij} = \frac{1}{|q_i|} \sum_{k \in q_i} c^{(raw)}_{kj} \]

where $c^{(avg)}_{ij}$ belongs to $c^{(avg)}$, $c^{(raw)}_{kj}$ belongs to $c^{(raw)}$, and

\[ q_i = \{\, q : a_i \in t_q \,\} \]

note that the original protvec matrix was × dimensions, thus each $j$ corresponds to the index of one of the original protvec dimensions along the second tensor dimension.

siamese embedding. we aimed to produce a final model, composed of an embedding technique and a classification method, that was specific to our motif dataset. to this end, we implemented a siamese network to provide a novel learned representation of our motifs (figure ). the siamese network is composed of two identical "twin" networks, deemed as such due to their identical hyperparameters as well as their identical learned weights and biases ( ).
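the per-symbol averaging of protvec coordinates defined in the protvec subsection above can be sketched as below. this is an illustrative reading of the definition of q_i, not the authors' code: the function name, the argument layout (a list of grams alongside a list of coordinate rows), and the default spelling of the unknown token are assumptions.

```python
from typing import Dict, List


def average_per_symbol(grams: List[str],
                       coords: List[List[float]],
                       alphabet: List[str],
                       unknown: str = "unknown",
                       pad: str = "-") -> Dict[str, List[float]]:
    """Average the raw coordinates of every gram that contains a given symbol.

    grams[k] is the k-th n-gram and coords[k] its raw coordinate vector.
    The pad symbol is mapped to the 'unknown' token, as the text describes.
    """
    dim = len(coords[0])
    averaged: Dict[str, List[float]] = {}
    for symbol in alphabet:
        if symbol == pad:
            rows = [k for k, g in enumerate(grams) if g == unknown]
        else:
            # skip the unknown token so its letters are not read as residues
            rows = [k for k, g in enumerate(grams)
                    if g != unknown and symbol in g]
        if not rows:
            continue  # symbol never occurs in the gram vocabulary
        averaged[symbol] = [sum(coords[k][j] for k in rows) / len(rows)
                            for j in range(dim)]
    return averaged
```

for example, with grams ["aaa", "ala", "laa", "unknown"], the symbol "l" is averaged over the rows for "ala" and "laa" only, while "-" takes the row of the unknown token.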
during training, each twin network receives a separate motif sequence that is represented as a one-hot encoding, denoted either as $a$ or $b$ in figure . motifs are processed through the network until reaching the final fully-connected layers, $h_a$ and $h_b$, which provide the resultant embeddings for the original motif sequences. next, the layers are joined by calculating the pairwise euclidean distance, $d_w$, between $h_a$ and $h_b$. $d_w$ can be interpreted as the overall dissimilarity between the original motif sequences, $a$ and $b$. the loss function operates on the final layer, striving to embed relatively more similar data points closer to each other, and relatively more different data points farther away from each other. in this way, the network amplifies the similarities and differences between motifs, and it translates such relationships into a semantically meaningful vector representation for each motif in the embedding space.

fig. . siamese network architecture, composed of twin convolutional neural networks (cnns). the twin networks are joined at the final layer. $a$ and $b$ represent a pair of motifs from the training set, while $h_a$ and $h_b$ represent the respective hidden layers output by either cnn. the difference between the hidden layers is calculated to obtain the distance layer, $d_w$. $d_w$ is input into the loss along with $y$, a variable indicating the dissimilarity, regarding kinase interactions, between $a$ and $b$. after training is complete, the "twin" architecture is no longer necessary; each motif is input into a single twin and the output of the embedding layer gives the resultant representation of the given motif.

we utilized a contrastive loss as described in hadsell et al. ( ), but we sought to modify the function to account for the multi-label aspect of our task. the canonical siamese loss between a pair of samples, $a$ and $b$, is defined as

\[ l(a, b, y) = (1 - y)\,\tfrac{1}{2}\,(d_w)^2 + (y)\,\tfrac{1}{2}\,[\max(0,\, m - d_w)]^2, \]

where $d_w$ is the euclidean distance between the outputs of the embedding layer, $m$ is the margin, which is a hyperparameter defined prior to training, and $y \in \{0, 1\}$. the value of $y$ is determined by the label of each data point in the pair. if a pair of samples has identical labels, they are declared “same” ($y = 0$). conversely, if a pair of samples has different labels, they are declared “different” ($y = 1$). this definition relies on the assumption that each sample may only have one true label. to adapt the original siamese loss to account for the multi-label aspect of our task, we replaced the discrete variable $y$ with a continuous variable, namely, the jaccard distance between kinase-label set pairs. thus, our modified loss function is defined as

\[ l_j(a, b, y) = (1 - j_{a,b})\,\tfrac{1}{2}\,(d_w)^2 + (j_{a,b})\,\tfrac{1}{2}\,[\max(0,\, m - d_w)]^2, \]

where $j_{a,b}$ is shorthand for $j(k_a, k_b)$, which is the jaccard distance between the kinase-label set $k_a$ and the kinase-label set $k_b$, associated with motif sample $a$ and motif sample $b$, respectively. formally,

\[ j(k_a, k_b) = 1 - \frac{|k_a \cap k_b|}{|k_a \cup k_b|} \]

and consequently,

\[ 0 \le j(k_a, k_b) \le 1. \]

in this way we have defined a continuous metric by which to compare a pair of motifs, rather than the usual “0” or “1” distinction.

the siamese network was trained for , iterations on the training set, excluding the data points in the independent test set. when composing a mini-batch, we alternated between "similar" and "dissimilar" motif pairs during training. similar pairs were defined as motifs whose $j(k_a, k_b) > $ . , and dissimilar pairs were defined as motifs whose $j(k_a, k_b) \le $ . . after training, we must produce the final embedding space to be used in training of our subsequent classification network. to obtain the final embedding, we input each motif into a single arbitrary twin of the original network (because both twins learn the same weights and biases), producing a high-dimensional ( -dimensional) vector representation of the original motif sequence. the resultant motif embedding produced by the single siamese twin is further discussed in the results section. we used k-nearest neighbors (k-nn) classification on each family to quantitatively compare the predictive capabilities of protvec and siamese embeddings in the coordinate-only space. for our k-nn computation, we used a k of .

fig. . ember model architecture. here, the previously-trained siamese network is colored pink, and the classifier architecture is colored orange. the 15 amino acid-length motif, $a$, is converted into a one-hot encoded matrix, $v$. the one-hot encoded matrix is then fed into a single twin from the siamese network. the -d embedding, $e$, is output by the siamese network. here, we reduce $e$ to a -dimensional space for illustrative purposes using umap. then, $e$ is fed into a multilayer perceptron (mlp) alongside $v$, which is fed into a convolutional neural network (cnn). then, the last layers of the separate networks are concatenated, followed by a series of fully-connected layers. the final output is a vector, $k$, of length eight, where each value corresponds to the probability that the motif $a$ was phosphorylated by one of the kinase families indicated in $k$.

predictive model framework.

ember architecture. an overview of the architecture of ember is shown in figure . ember takes as input raw motif sequences and the coordinates of each respective motif in the embedding space. we use one-hot encoded motifs as the second type of input into our model.
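for reference, the jaccard-weighted contrastive loss from the siamese embedding subsection above can be written out for a single pair of motifs as follows. this is a minimal sketch rather than the authors' training code: it assumes the half-squared terms of the usual contrastive loss of hadsell et al., treats two empty label sets as identical, and uses hypothetical function names and an illustrative margin.

```python
import math
from typing import Sequence, Set


def jaccard_distance(ka: Set[str], kb: Set[str]) -> float:
    """1 - |intersection| / |union| of two kinase-label sets."""
    union = ka | kb
    if not union:
        return 0.0  # assumption: two unlabeled motifs count as identical
    return 1.0 - len(ka & kb) / len(union)


def euclidean(ha: Sequence[float], hb: Sequence[float]) -> float:
    """Distance d_w between the two twin embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ha, hb)))


def jaccard_contrastive_loss(ha: Sequence[float], hb: Sequence[float],
                             ka: Set[str], kb: Set[str],
                             margin: float = 1.0) -> float:
    """Pull pairs with overlapping labels together, push disjoint pairs apart."""
    d = euclidean(ha, hb)
    j = jaccard_distance(ka, kb)
    pull = (1.0 - j) * 0.5 * d ** 2                 # dominates when label sets overlap
    push = j * 0.5 * max(0.0, margin - d) ** 2      # dominates when label sets differ
    return pull + push
```

with j fixed at 0 or 1 the function reduces to the canonical single-label loss, so partially overlapping label sets simply interpolate between the "same" and "different" cases.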
each motif sequence is represented by a × matrix. in addition, we utilize the embedding provided by our siamese network, which creates a latent space of dimensions $m$ × where $m$ is the number of motifs. the inputs into our classifier network, one-hot sequences and embeddings, are fed through a convolutional neural network (cnn) and a multilayer perceptron (mlp), respectively. the outputs of the two networks are then concatenated, and the concatenated layer is fed through a series of fully-connected layers (an mlp), followed by a sigmoid activation function. we performed five-fold cross validation to assess the accuracy of our model when trained on different training-validation folds. we averaged the performance on the independent test set across the five folds to compute our final performance on the classification task.

evaluation metrics. in order to quantify the performance of our models, we computed the area under the receiver operating characteristic curve (auroc) and the area under the precision-recall curve (auprc). these metrics were evaluated per kinase family. we also show the micro-average and macro-average for both auroc and auprc. we define $\Lambda = \{\lambda_j : j = 1, \dots, q\}$ as the set of all labels. the micro-average, $e_{micro}$, aggregates the label-wise contributions of each class:

\[ e_{micro} = e\Big(\sum_{\lambda=1}^{q} tp_\lambda,\; \sum_{\lambda=1}^{q} tn_\lambda,\; \sum_{\lambda=1}^{q} fp_\lambda,\; \sum_{\lambda=1}^{q} fn_\lambda\Big), \]

where $e$ is an evaluation metric, in our case, either auroc or auprc. alternatively, the macro-average, $e_{macro}$, takes into account the score for each respective class and averages those scores together, thus treating all classes equally:

\[ e_{macro} = \frac{1}{q} \sum_{\lambda=1}^{q} e(tp_\lambda,\, tn_\lambda,\, fp_\lambda,\, fn_\lambda), \]

where $e$ is once again an evaluation metric, in our case, either auroc or auprc. both $e_{macro}$ and $e_{micro}$ are calculated based on $tp_\lambda$, $tn_\lambda$, $fp_\lambda$, and $fn_\lambda$, which are, respectively, the number of true positives, the number of true negatives, the number of false positives, and the number of false negatives of label $\lambda$.
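the difference between the two averaging schemes defined above can be made concrete with a small sketch. because auroc and auprc require ranked scores rather than a single set of counts, the example substitutes precision as a stand-in count-based metric; that substitution, the function names, and the counts in the usage note are assumptions made for illustration only.

```python
from typing import Callable, List, Tuple

Counts = Tuple[int, int, int, int]            # (tp, tn, fp, fn) for one label
Metric = Callable[[int, int, int, int], float]


def precision(tp: int, tn: int, fp: int, fn: int) -> float:
    """Stand-in metric e; any function of the four counts would do."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def micro_average(metric: Metric, per_label: List[Counts]) -> float:
    """Pool the counts over all labels, then evaluate the metric once."""
    totals = [sum(counts[i] for counts in per_label) for i in range(4)]
    return metric(*totals)


def macro_average(metric: Metric, per_label: List[Counts]) -> float:
    """Evaluate the metric per label, then take the unweighted mean."""
    return sum(metric(*counts) for counts in per_label) / len(per_label)
```

with made-up counts [(8, 90, 2, 0), (1, 95, 3, 1)], the macro-average of precision is (0.8 + 0.25) / 2 = 0.525, while the micro-average is 9 / 14, since pooling lets the label with more predicted positives dominate.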
kinase phylogenetic distances. we sought to leverage the phylogenetic relationships between kinases to improve predictions of kinase-motif interactions. specifically, we considered the dissimilarity of a pair of kinase families in conjunction with the dissimilarity of the two respective groups of motifs that either kinase family phosphorylates (i.e., “kinase-family dissimilarity” vs. “motif-group dissimilarity”). note that the terms “distance” and “dissimilarity” are interchangeable. as the phylogenetic distances given by manning et al. ( ) do not provide distances between typical and atypical kinase families, we established a proxy phylogenetic distance that allows us to define distances between these two families. we define this proxy phylogenetic distance through the levenshtein edit distance, $lev(k_a, k_b)$, between kinase-domain sequences. kinase-domain sequences are the specific subsequences of kinases that are directly involved in phosphorylation. these kinase-domain sequences were obtained from an online source provided by manning et al. ( ). distances between kinase-domain sequences were calculated by performing local alignment, utilizing the blosum substitution matrix to weight indels and substitutions. to calculate overall kinase-family dissimilarity, we took the average of the levenshtein edit distances between each kinase-domain pair, per family,

\[ d(f_a, f_b) = \frac{\sum_{k_a \in f_a} \sum_{k_b \in f_b} lev(k_a, k_b)}{|f_a| \cdot |f_b|} \]

where $d(f_a, f_b)$ is the dissimilarity metric (distance) between kinase family $a$ and kinase family $b$.
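the family-level average above can be sketched in a few lines. note the simplification: the text weights indels and substitutions with a blosum matrix under local alignment, whereas this sketch uses the plain unit-cost levenshtein distance as a stand-in, so the numbers it produces are not comparable to the paper's. function names and the toy sequences in the usage note are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance via the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def family_dissimilarity(family_a: List[str], family_b: List[str]) -> float:
    """Mean pairwise distance between the domain sequences of two families."""
    if not family_a or not family_b:
        raise ValueError("both families need at least one sequence")
    total = sum(levenshtein(ka, kb) for ka in family_a for kb in family_b)
    return total / (len(family_a) * len(family_b))
```

for instance, family_dissimilarity(["ab", "abc"], ["ab"]) averages the distances 0 and 1 and gives 0.5; applying the function to every ordered pair of families fills the family-by-family dissimilarity matrix.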
$k_a$ is the kinase-domain sequence of a kinase belonging to family $a$, $k_b$ is the kinase-domain sequence of a kinase belonging to family $b$, and the levenshtein distance between kinase domain $k_a$ and kinase domain $k_b$ is determined by $lev(k_a, k_b)$. this formula was applied per kinase family pair and stored in an $a \times b$ kinase-family dissimilarity matrix. we will refer to this proxy metric for evolutionary dissimilarity between kinase families as the “phylogenetic distance” between kinase families.

fig. . heatmap matrix depicting pairwise kinase-domain distances. levenshtein distances were normalized, with the yellow end of the color bar representing far distances (less similar) and the pink end representing close distances (more similar).

kinase-family dissimilarity vs. motif-group dissimilarity. for our (kinase-family dissimilarity)-(motif-group dissimilarity) correlation, we defined motif-group dissimilarity in the same manner as kinase-family dissimilarity, finding the levenshtein distance between motifs based on local alignment using blosum . then, we sought to find the correlation between kinase-family dissimilarity and motif-group dissimilarity. therefore, calculation of motif-group dissimilarity, per kinase family pair, was defined identically as in equation , but based on the motifs specific to each kinase family, resulting in an $a \times b$ motif-group dissimilarity matrix.

kinase phylogenetic loss. to incorporate evolutionary relationships between kinase families into our predictions, we weighted the original binary cross entropy (bce) loss by a kinase phylogenetic metric. specifically, our weighted bce loss per minibatch is defined as:
\[ pbce(\hat{y}, y) = -\frac{1}{n} \sum_{i}^{n} p_i^{\top}\, y_i \log(\hat{y}_i), \]

where $n$ is the size of the mini batch, $y_i$ is the one-hot actual label vector for sample $i$, $\hat{y}_i$ is the predicted label vector for sample $i$, and $p_i$ is the phylogenetic weight vector for sample $i$ given by

\[ p_i = \big[\, w_{1,i}, \dots, w_{|k|,i} \,\big]^{\top}, \]

with $w_{k,i}$ being the average phylogenetic weight scalar of label $k$ for sample $i$:

\[ w_{k,i} = \frac{1}{|l_i|} \sum_{j \in l_i} f_{k,j}, \]

and $f_{k,j}$ is the vector of family weights of label $k$. finally, $l_i$ is the set of indices corresponding to positive labels for sample $i$

\[ l_i = \{\, i \in [0, \dots, m - 1] : y_i = 1 \,\}, \]

where $m$ is the length of the one-hot true label vector for sample $i$.

results

correlation between kinase phylogenetic dissimilarity and phosphorylated motif dissimilarity. we sought to illuminate the relationship between kinase-family dissimilarity and phosphorylated motif-group dissimilarity described in the methods section. that is, we wanted to determine if “similar” kinases tend to phosphorylate “similar” motifs based on some quantitative metric. to this end, we calculated the correlation between average kinase-family dissimilarities and motif-group dissimilarities based on normalized pairwise alignment scores. from this, we found a pearson correlation of . , indicating a moderate positive relationship between kinase dissimilarity and that of their respective phosphorylated motifs.
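as an illustration of the correlation reported above, the sketch below computes a pearson coefficient between the off-diagonal entries of two symmetric family-by-family dissimilarity matrices. whether the original analysis excluded the diagonal or used both triangles is not stated, so that choice, the function names, and the toy matrices in the usage note are assumptions.

```python
import math
from typing import List, Sequence


def upper_triangle(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Entries above the diagonal, one per unordered family pair."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (norm_x * norm_y)
```

given a kinase-family matrix and a motif-group matrix of the same shape, pearson(upper_triangle(kinase_matrix), upper_triangle(motif_matrix)) returns a value between -1 and 1; values near 1 would mean that closely related families tend to phosphorylate similar motifs.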
while moderate, this correlation between kinase dissimilarity and motif dissimilarity suggests a potential signal in the phylogenetic relationships that could be leveraged to improve predictions.

using our normalized distances as a proxy for phylogenetic distance (see methods), the dissimilarity between kinases is displayed as a heatmap in figure . the akt and pkc families have the greatest similarity (lowest dissimilarity) of all pairwise comparisons, with pka-akt and mapk-cdk following as the next most similar family pairs. together, these results provide motivation to incorporate both motif dissimilarity and kinase relatedness into the predictive model, as achieved through our custom phylogenetic loss function described in methods. the effects of this approach are described later in results.

motif embedding via siamese network. we sought to develop a novel learned representation of motifs using a siamese neural network. siamese networks were first introduced in the early s as a method to solve signature verification, posed as an image-to-image matching problem ( ). siamese networks perform metric learning by exploiting the dissimilarity between a pair of data points. training a siamese network yields a function whose goal is to produce a meaningful embedding, capturing semantic similarity in the form of a distance metric. we hypothesized that incorporating high-dimensional vector representations of motifs (i.e., an embedding) into the input of a classification network would provide more predictive power than methods that do not utilize such information. in our siamese model, we opted to use convolutional layers as described in methods. we performed k-nn on both the protvec and siamese embeddings of motifs and found that the siamese embedding produced better predictions, on average, than the protvec embedding (see table ). more specifically, the siamese embedding resulted in a macro-average auroc of . compared to protvec’s .
and a micro-average auroc of . com- pared to protvec’s . . likewise, the siamese embedding had better auprc, with a macro-average auprc of . compared to protvec’s . and a micro-average auprc of . compared to protvec’s . . furthermore, we cal- culated the silhouette scores of both embeddings and found our siamese embedding to have a significantly better mean silhouette score of . compared to protvec’s . . we performed dimensionality reduction for visualization of the siamese embeddings using uniform manifold approxi- mation and projection (umap) ( ). for our umap im- plementation, we used neighbors, a minimum distance of . , and euclidean distance for our metric. the resulting -dimensional umap motif embeddings derived from the siamese network are shown in figure . as can be seen, the motifs phosphorylated by a given kinase family have a dis- tinctive distribution in the embedding space, with some distri- butions being highly unique, and with some significant over- lap between certain families. more specifically, our siamese embedding shows that motifs phosphorylated by either pkc, pka, or akt appear to occupy a similar latent space. sim- ilarly, motifs phosphorylated by either cdk or mapk also fig. . siamese embedding of motifs. each point represents one of the mo- tifs, and each panel displays kinase family-specific phosphorylation patterns. each colored point corresponds to a motif in the test set phosphorylated by a member of the specified kinase family. highlighted points are slightly enlarged in size to enhance readability. | bior‰iv kirchoff et al. | ember: kinase-substrate multi-label prediction .cc-by . international licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made the copyright holder for this preprintthis version posted february , . ; https://doi.org/ . / . . . doi: biorxiv preprint https://doi.org/ . / . . . 
http://creativecommons.org/licenses/by/ . / table . area under the receiver operating characteristic curve (auroc) and area under the precision recall curve (auprc) scores on independent test set prediction, given by k -nn performed on the protvec and siamese embedding. precision recall family protvec siamese protvec siamese akt . . . . cdk . . . . ck . . . . mapk . . . . pikk . . . . pka . . . . pkc . . . . src . . . . macro-average . . . . micro-average . . . . occupy a similar space. these observations mirror the phylo- genetic relationships shown in figure , where the mapk and cdk families have a relatively short mean evolution- ary distance between them, and the pkc-pka distance, even shorter still. in addition to these overlapping families, we also observe that src-phosphorylated motifs form a distinct cluster. this is likely driven by the fact that src is the only tyrosine ki- nase family among the eight kinase families we investigated, with its motifs invariably having a tyrosine (y) at the eighth position in the -amino acid sequence, compared to the other families whose motifs have either a serine (s) or a (t) in this position. this effects a significant sequence dis- crepancy between src-phosphorylated motifs and remaining motifs. the fact that src-phosphorylated motifs cluster so precisely serves as a sanity check that our siamese embed- ding is capturing sequence (dis)similarity information despite being trained through comparison of kinase-motif phospho- rylation events in lieu of motif sequence comparisons. we note that the embedding produced by our siamese network is quite qualitatively similar to the protvec embedding in terms of these kinase-label clusters indicated in the umap projec- tions. the umap projections of the protvec embeddings are included in supplementary material. prediction of phosphorylation events. 
following training of ember on both motif sequences and motif vector representations as input, we conducted an ablation test in which we removed the motif vector representation (or coordinate) input along with its respective mlp; this was achieved by applying a dropout rate of . on the final layer of the coordinate-associated mlp. this ablation test allowed us to observe how our novel motif sequence-coordinate model compares to a canonical deep learning model whose input consists solely of one-hot encoded motif sequences (such as in the methods utilized by wang et al. ( ) and luo et al. ( )). we also compared ember trained with the standard bce loss to ember trained with our kinase phylogenetic loss. all predictive models, as described in table , were trained on identical training-validation splits and evaluated on the same independent test set.

fig. . confusion matrix for ember predictions on the test set. the numbers inside each box represent the raw number of predictions per box. the color scale is based on the ratio of predictions (in the corresponding box) to total predictions, per label. a lighter color corresponds to a larger ratio of predictions to total predictions.

comparisons between the predictive capability of the models described here are quantified by auroc and auprc, and these metrics are presented for each of the three models in table . as indicated by table , ember, utilizing both sequence and coordinate information, outperforms the canonical sequence model in both auroc and auprc. in addition, integration of phylogenetic information into the loss provides a generally small but consistent additional boost in performance, showing the best overall results out of the three models for auroc and auprc. individual performance metric curves for each kinase label, produced by ember trained via the phylogenetic loss, are shown in figure .
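the tables in this section report both macro- and micro-averaged scores. as a reminder of how the two averages differ, the following is a minimal sketch and not the authors' evaluation code. for brevity it uses precision at a fixed threshold as a stand-in metric rather than auroc or auprc, and the per-family counts are hypothetical values chosen only for illustration.

```python
from typing import Dict, Tuple


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); returns 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def macro_micro_precision(counts: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Return (macro, micro) precision.

    counts maps a label name to (true positives, false positives).
    Macro: compute the metric per label, then take the unweighted mean.
    Micro: pool the counts over all labels, then compute the metric once.
    """
    per_label = [precision(tp, fp) for tp, fp in counts.values()]
    macro = sum(per_label) / len(per_label)

    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    micro = precision(total_tp, total_fp)
    return macro, micro


# Hypothetical counts: one large family and one small family.
toy_counts = {"mapk": (90, 10), "src": (5, 5)}
macro, micro = macro_micro_precision(toy_counts)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

because the micro-average pools counts, it is dominated by families with many motifs, whereas the macro-average weights every family equally; this is why the two rows of a results table can differ noticeably when family sizes are unbalanced.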
a confusion matrix providing greater detail and illustrating the relative effectiveness of our model for prediction of different kinase families is shown in figure . in order to compute the confusion matrix, we set a prediction threshold of . , declaring any prediction above . as "positive" and any prediction equal to or less than . as "negative". as indicated by the confusion matrix, the model often confuses motifs that are phosphorylated by closely related kinase families, for example, mapk and cdk. this is presumably due to the close phylogenetic relationship between mapk and cdk, as indicated by their relatively low phylogenetic distance of . (figure ). furthermore, our siamese network embeds motifs of these respective families into the same relative space, as shown in figure , further illustrating the confounding nature of these motifs. a similar trend is found for motifs phosphorylated by pkc, pka, and akt. this trio is also shown to be closely related, as indicated by the correlations in figure and the embeddings in figure .

comparison to existing methods. we sought to compare ember’s performance to the two existing deep learning methods, musitedeep and deepphos, which adopt single-label models. however, this is not a straightforward comparison because ember was trained on sequences amino acids in length while musitedeep and deepphos were trained on sequences of and amino acids in length, respectively.
thus, we must elongate our -mers to lengths of and in order for musitedeep and deepphos to accept those sequences into their architectures as input. to accomplish this, we queried the uniprot database to find complete protein sequences of which our test set motifs were subsequences. for instances in which a motif was a subsequence of multiple proteins, we chose a protein at random from the set. by referencing the original complete protein sequence, we were able to elongate our motifs by adding nine (in the case of musitedeep) or (in the case of deepphos) amino acids to each flank of the original -mer motif. this resulted in a test set of -mers for musitedeep and -mers for deepphos. we note that of the eight kinase families for which our model produces predictions, deepphos has functioning models for only four of the families (cdk, ck , mapk, and pkc), and musitedeep has models for only five of the families (cdk, ck , mapk, pka, and pkc). we show auroc and auprc results per kinase label from each of the three methods in figure . ember outperforms musitedeep and deepphos on all four averaged metrics, indicating that our multi-label approach may be better equipped to solve the problem of kinase-motif prediction compared to the single-label approaches.

discussion

illuminating the map of kinase-substrate interactions has the potential to enhance our understanding of basic cellular signaling as well as drive health applications, for example, by facilitating the development of novel kinase inhibitor-based therapies that disrupt kinase signaling pathways. here, we have presented a deep learning-based approach that aims to predict which substrates are likely to be phosphorylated by a specific kinase family. in particular, our multi-label approach establishes a unified model that utilizes all available kinase-motif data to learn broader structures within the data and improve predictions across all kinase families in tandem.
this approach avoids challenges in hyperparameter tuning inherent in the development of an individual model for each kinase. we believe that this approach will enable continuing improvement in predictions, as newly generated data describing any kinase-motif phosphorylation event can assist in improving predictions for all kinases. that is, a kinase-motif interaction discovered for pka will improve the predictions not just for pka, but also for akt, pkc, mapk, etc. through the transfer learning capabilities inherent in our multi-label model.

we showed that incorporation of a learned vector representation of motifs, namely the motifs’ coordinates in the siamese embedding space, serves to improve performance over a model that utilizes only one-hot encoded motif sequences as input. not only did the siamese embedding improve prediction of phosphorylation events through a neural network architecture, but it also outperformed protvec, a previously developed embedding, in a coordinate-based k-nn task. this improvement over protvec was in spite of the fact that the siamese network utilized fewer than , training sequences of amino acids in length compared to protvec’s , sequences of approximately amino acids in average length. the siamese embedding was further generated through direct comparison of kinase-motif phosphorylation events rather than simply the sequence-derived data used by protvec. furthermore, protvec is a generalized protein embedding while the siamese embedding described here has the potential to be customized. for example, the use of the jaccard distance in the siamese loss allows the network to be trained on any number of multi-label datasets such as acetylation, methylation, and carbonylation reactions.

fig. . auroc and auprc results achieved on the independent test set by deepphos, musitedeep, and ember. the auroc and auprc of each kinase family label is shown in the respective legends.

table . auroc and auprc results achieved on the independent test set across deep learning classification models. the auroc and auprc are presented per kinase family for each model. from left to right, we include results for the ablated sequence-only cnn, ember trained using a canonical bce loss, and ember trained using the kinase phylogeny-weighted loss as described in methods.

                 auroc                                    auprc
family           seq-cnn   ember (bce)   ember (pbce)     seq-cnn   ember (bce)   ember (pbce)
akt              .         .             .                .         .             .
cdk              .         .             .                .         .             .
ck               .         .             .                .         .             .
mapk             .         .             .                .         .             .
pikk             .         .             .                .         .             .
pka              .         .             .                .         .             .
pkc              .         .             .                .         .             .
src              .         .             .                .         .             .
macro-average    .         .             .                .         .             .
micro-average    .         .             .                .         .             .

we also found that there is a small though meaningful relationship between the evolutionary distance between kinases and the motifs they phosphorylate, supporting the concept that closely related kinases will tend to phosphorylate similar motifs. when encoded in the form of our phylogenetic loss function, this relationship was able to slightly improve prediction accuracies. together, these results suggest that ember holds significant promise towards the task of illuminating the currently unknown relationships between kinases and the substrates they act on.

acknowledgements

we would like to acknowledge members of the gomezlab for helpful comments and feedback. this work was supported by grants through the national institutes of health (grant #s ca , ca , ca , dk ).

bibliography

tzong-yi lee, justin bo-kai hsu, wen-chi chang, and hsien-da huang.
regphos: a system to explore the protein kinase-substrate phosphorylation network in humans. nucleic acids res., (database issue):d – , january .
g manning, d b whyte, r martinez, t hunter, and s sudarsanam. the protein kinase complement of the human genome. science, ( ): – , december .
panayotis vlastaridis, pelagia kyriakidou, anargyros chaliotis, yves van de peer, stephen g oliver, and grigoris d amoutzias. estimating the total number of phosphoproteins and phosphorylation sites in eukaryotic proteomes. gigascience, ( ): – , february .
leah j wilson, adam linley, dean e hammond, fiona e hood, judy m coulson, david j macewan, sarah j ross, joseph r slupsky, paul d smith, patrick a eyers, and ian a prior. new perspectives, opportunities, and challenges in exploring the human protein kinome. cancer res., december .
g l johnson and razvan lapadat. mitogen-activated protein kinase pathways mediated by erk , jnk , and p protein kinases. science, ( ): , .
gayathri k perera, chrysanthi ainali, ekaterina semenova, christian hundhausen, guillermo barinaga, deepika kassen, andrew e williams, muddassar m mirza, mercedesz balazs, xiaoting wang, robert sanchez rodriguez, andrej alendar, jonathan barker, sophia tsoka, wenjun ouyang, and frank o nestle. integrative biology approach identifies cytokine targeting strategies for psoriasis. sci. transl. med., ( ): ra , february .
nicole tegtmeyer, matthias neddermann, carmen isabell asche, and steffen backert. subversion of host kinases: a key network in cellular signaling hijacked by helicobacter pylori caga. mol. microbiol., may .
amandine charras, pinelopi arvaniti, christelle le dantec, marina i arleevskaya, kaliopi zachou, george n dalekos, anne bordon, and yves renaudineau. jak inhibitors suppress innate epigenetic reprogramming: a promise for patients with sjögren’s syndrome. clin. rev. allergy immunol., june .
alessia alunno, ivan padjen, antonis fanouriakis, and dimitrios t boumpas. pathogenic and therapeutic relevance of jak/stat signaling in systemic lupus erythematosus: integration of distinct inflammatory pathways and the prospect of their inhibition with an oral agent. cells, ( ), august .
ya nan deng, joseph a bellanti, and song guo zheng. essential kinases and transcriptional regulators and their roles in autoimmunity. biomolecules, ( ), april .
kyla a l collins, timothy j stuhlmiller, jon s zawistowski, michael p east, trang t pham, claire r hall, daniel r goulet, samantha m bevill, steven p angus, sara h velarde, noah sciaky, tudor i oprea, lee m graves, gary l johnson, and shawn m gomez. proteomic analysis defines kinase taxonomies specific for subtypes of breast cancer. oncotarget, ( ): – , march .
elise j needham, benjamin l parker, timur burykin, david e james, and sean j humphrey. illuminating the dark phosphoproteome. sci. signal., ( ), january .
wenwen fan, xiaoyi xu, yi shen, huanqing feng, ao li, and minghui wang. prediction of protein kinase-specific phosphorylation sites in hierarchical structure using functional information and random forest. amino acids, ( ): – , april .
shu-yun huang, shao-ping shi, jian-ding qiu, and ming-chu liu. using support vector machines to identify protein phosphorylation sites in viruses. j. mol. graph. model., : – , march .
fuyi li, chen li, tatiana t marquez-lago, andré leier, tatsuya akutsu, anthony w purcell, a ian smith, trevor lithgow, roger j daly, jiangning song, and kuo-chen chou. quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. bioinformatics, ( ): – , december .
yu xue, ao li, lirong wang, huanqing feng, and xuebiao yao. ppsp: prediction of pk-specific phosphorylation site with bayesian decision theory. bmc bioinformatics, : , march .
duolin wang, shuai zeng, chunhui xu, wangren qiu, yanchun liang, trupti joshi, and dong xu.
musitedeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. bioinformatics, ( ): – , december .
fenglin luo, minghui wang, yu liu, xing-ming zhao, and ao li. deepphos: prediction of protein phosphorylation sites with deep learning. bioinformatics, ( ): – , august .
ehsaneddin asgari and mohammad r k mofrad. continuous distributed representation of biological sequences for deep proteomics and genomics. plos one, ( ):e , november .
peter v hornbeck, jon m kornhauser, sasha tkachev, bin zhang, elzbieta skrzypek, beth murray, vaughan latham, and michael sullivan. phosphositeplus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. nucleic acids res., (database issue): d – , january .
jianfei hu, hee-sool rho, robert h newman, jin zhang, heng zhu, and jiang qian. phosphonetworks: a database for human phosphorylation networks. bioinformatics, ( ): – , january .
holger dinkel, claudia chica, allegra via, cathryn m gould, lars j jensen, toby j gibson, and francesca diella. phospho.elm: a database of phosphorylation sites–update . nucleic acids res., (database issue):d – , january .
jane bromley, isabelle guyon, yann lecun, eduard säckinger, and roopak shah. signature verification using a “siamese” time delay neural network. in j d cowan, g tesauro, and j alspector, editors, advances in neural information processing systems , pages – . morgan-kaufmann, .
r hadsell, s chopra, and y lecun. dimensionality reduction by learning an invariant mapping. in ieee computer society conference on computer vision and pattern recognition (cvpr’ ), volume , pages – , june .
leland mcinnes, john healy, and james melville. umap: uniform manifold approximation and projection for dimension reduction. february .

supplementary material

ember: multi-label prediction of kinase-substrate phosphorylation events through deep learning

metrics

definitions of the metrics that characterize the area under the receiver operator curve (auroc) and the area under the precision-recall curve (auprc) as described in the main manuscript:

the auroc is the integral of the receiver operator curve, which is found by plotting the true positive rate (tpr) against the false positive rate (fpr) at various decision thresholds. the tpr is defined as tp / (tp + fn), where tp are the true positive predictions and fn are the false negative predictions. the fpr is defined as fp / (fp + tn), where fp are the false positive predictions and tn are the true negative predictions.

the auprc is the integral of the precision-recall curve, which is found by plotting the precision against the recall (i.e. tpr) at various decision thresholds. precision is defined as tp / (tp + fp), where tp are the true positive predictions and fp are the false positive predictions.
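the definitions above can be made concrete with a short sketch. this is an illustration under stated assumptions, not the evaluation code used for ember: the scores and labels are hypothetical, a score at or above the threshold is treated as a positive prediction, thresholds are taken from the observed scores, and the area is computed with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall (TPR) = TP / (TP + FN)."""
    tp, fp, _, fn = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs, one per distinct score used as a threshold.

    Assumes labels contain at least one positive and one negative example.
    """
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def trapezoid_area(points: Sequence[Tuple[float, float]]) -> float:
    """Integrate y over x with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predictions for a single kinase-family label.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auroc = trapezoid_area(roc_points(toy_scores, toy_labels))
print(f"auroc={toy_auroc:.4f}")
```

the same `trapezoid_area` helper could be applied to (recall, precision) pairs to approximate the auprc, although library implementations often use a step-wise sum for that curve instead of linear interpolation.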
protvec embedding figures

here, we show a qualitative comparison, via a umap reduction, between the original protvec embedding and the averaged protvec embedding.

original: [figure: umap projection of the original protvec embedding]

averaged: [figure: umap projection of the averaged protvec embedding]

protvec embedding knn results

in the table below we show the auroc and auprc results of the knn classification task on the original protvec embedding and the averaged protvec embedding. for our knn calculation, we used k = .

                 auroc                                    auprc
kinase           original protvec   averaged protvec     original protvec   averaged protvec
akt              .                  .                    .                  .
cdk              .                  .                    .                  .
ck               .                  .                    .                  .
mapk             .                  .                    .                  .
pikk             .                  .                    .                  .
pka              .                  .                    .                  .
pkc              .                  .                    .                  .
src              .                  .                    .                  .
macro-average    .                  .                    .                  .
micro-average    .                  .                    .                  .

hardware

training and testing of ember occurred on a linux system with the following configuration:
- pop!_os linux .
- intel xeon e - v with cores @ . ghz
- gb ram
- nvidia titan xp

on this system, the siamese network took around minutes to train, and the classification network took around minutes to train.
simulating the outcome of amyloid treatments in alzheimer's disease from imaging and clinical data

clément abi nader, nicholas ayache, giovanni b. frisoni, philippe robert, marco lorenzi, for the alzheimer’s disease neuroimaging initiative*

in this study we investigate a novel quantitative instrument for the development of intervention strategies for disease modifying drugs in alzheimer's disease. our framework is based on the modeling of the spatio-temporal dynamics governing the joint evolution of imaging and clinical biomarkers along the history of the disease, and allows the simulation of the effect of intervention time and drug dosage on the biomarkers' progression. when applied to multi-modal imaging and clinical data from the alzheimer's disease neuroimaging initiative, our method enables the generation of hypothetical scenarios of amyloid lowering interventions. the results quantify the crucial role of intervention time, and provide a theoretical justification for testing amyloid modifying drugs in the pre-clinical stage. our experimental simulations are compatible with the outcomes observed in past clinical trials, and suggest that anti-amyloid treatments should be administered at least years earlier than what is currently being done in order to obtain statistically powered improvement of clinical endpoints.

université côte d'azur, inria sophia antipolis, epione research project, france.
memory clinic and lanvie-laboratory of neuroimaging of aging, hospitals and university of geneva, geneva, switzerland.
université côte d'azur, cobtek lab, mnc program, france.

* data used in preparation of this article were obtained from the alzheimer’s disease neuroimaging initiative (adni) database (adni.loni.usc.edu).
as such, the investigators within the adni contributed to the design and implementation of adni and/or provided data but did not participate in analysis or writing of this report. a complete listing of adni investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted biorxiv a license to display the preprint in perpetuity. it is made available under a cc-by . international license. this version posted february , . ; doi: https://doi.org/ . / . . .

correspondence to: clément abi nader, epione research project, inria sophia-antipolis, , route des lucioles, sophia-antipolis, france. e-mail: clement.abi-nader@inria.fr

keywords: alzheimer’s disease; clinical trials; disease progression; amyloid hypothesis; biomarkers

abbreviations: dpm = disease progression model; ode = ordinary differential equations; adni = alzheimer's disease neuroimaging initiative; nl = healthy; mci = mild cognitive impairment; ad = alzheimer's dementia; av = ( )f-florbetapir amyloid; fdg = ( )f-fluorodeoxyglucose; adas = alzheimer's disease assessment scale; mmse = mini-mental state examination; faq = functional assessment questionnaire; ravlt = rey auditory verbal learning test; cdrsb = clinical dementia rating scale sum of boxes

introduction

the number of people affected by alzheimer's disease has recently exceeded millions and is expected to double every years (prince et al., ), thus posing significant healthcare challenges. yet, while the disease mechanisms remain in large part unknown, there are still no effective pharmacological treatments leading to tangible improvements of patients' clinical progression.
one of the main challenges in understanding alzheimer's disease is that its progression goes through a silent asymptomatic phase that can stretch over decades before a clinical diagnosis can be established based on cognitive and behavioral symptoms. to help design appropriate intervention strategies, hypothetical models of the disease history have been proposed, characterizing the progression by a cascade of morphological and molecular changes affecting the brain, ultimately leading to cognitive impairment (jack et al., ; jack & holtzman, ). the dominant hypothesis is that disease dynamics along the asymptomatic period are driven by the deposition in the brain of the amyloid β peptide, triggering the so-called “amyloid cascade” (bateman et al., ; braak & braak, ; delacourte et al., ; murphy & levine, ; villemagne et al., ). based on this rationale, clinical trials have been focusing on the development and testing of disease modifiers targeting amyloid β aggregates (cummings, lee, et al., ), for example by increasing its clearance or blocking its accumulation. although the amyloid hypothesis has been recently invigorated by a post-hoc analysis of the aducanumab trial (howard & liu, ), clinical trials have so far failed to show efficacy of this kind of treatment (schwarz et al., ), as the clinical primary endpoints were not met (egan et al., ; honig et al., ; wessels et al., ), or because of unacceptable adverse effects (henley et al., ).
in the past years, a growing consensus has emerged about the critical importance of intervention time, and about the need to start anti-amyloid treatments during the pre-symptomatic stages of the disease (aisen et al., ). nevertheless, the design of optimal intervention strategies is currently not supported by quantitative analysis methods that allow the effect of intervention time and dosing to be modeled and assessed (klein et al., ). the availability of models of the pathophysiology of alzheimer’s disease would offer great potential to test and analyze clinical hypotheses characterizing alzheimer’s disease mechanisms, progression, and intervention scenarios.

within this context, quantitative models of disease progression, referred to as disease progression models (dpms), have been proposed (fonteijn et al., ; jedynak et al., ; nader et al., ; oxtoby et al., ; schiratti et al., ) to quantify the dynamics of the changes affecting the brain during the whole disease span. these models rely on the statistical analysis of large datasets of different data modalities, such as clinical scores, or brain imaging measures derived from mri, amyloid- and fluorodeoxyglucose-pet (bilgel et al., ; burnham et al., ; donohue et al., ; y iturria-medina et al., ; koval et al., ). in general, dpms estimate a long-term disease evolution from the joint analysis of multivariate time-series acquired on a short-term time-scale. due to the temporal delay between the disease onset and the appearance of the first symptoms, dpms rely on the identification of an appropriate temporal reference to describe the long-term disease evolution (lorenzi et al., ; marinescu et al., ). these tools are promising approaches for the analysis of clinical trials data, as they allow the longitudinal evolution of multiple biomarkers to be represented through a global model of disease progression.
such a model can be subsequently used as a reference in order to stage subjects and quantify their relative progression speed (insel et al., ; li et al., ; oxtoby et al., ; young et al., ). however, these approaches remain purely descriptive as they do not account for causal relationships among biomarkers. therefore, they generally do not allow the simulation of progression scenarios based on hypothetical intervention strategies, thus providing a limited interpretation of the pathological dynamics. this latter capability is of utmost importance for planning and assessment of disease modifying treatments.

to fill this gap, recent works such as (hao & friedman, ; petrella et al., ) proposed to model alzheimer’s disease progression based on specific assumptions on the biochemical processes of pathological protein propagation. these approaches explicitly define biomarker interactions through the specification of sets of ordinary differential equations (odes), and are ideally suited to simulate the effect of drug interventions (yasser iturria-medina et al., ). however, these methods are mostly based on the arbitrary choices of pre-defined evolution models, which are not inferred from data. this issue was recently addressed by (garbarino & lorenzi, ), where the authors proposed a hybrid modeling method combining traditional dpms with dynamical models of alzheimer’s disease progression.
still, since this approach requires the design of suitable models of protein propagation across brain regions, extending this method to jointly account for spatio-temporal interactions between several processes, such as amyloid propagation, glucose metabolism, and brain atrophy, is considerably more complex. finally, these methods are usually designed to account for imaging data only, which prevents the joint simulation of heterogeneous measures (antelmi et al., ), such as image-based biomarkers and clinical outcomes, the latter remaining the reference markers for patients and clinicians.

in this work we present a novel computational model of alzheimer’s disease progression that allows the simulation of intervention strategies across the history of the disease. the model is here used to quantify the potential effect of amyloid modifiers on the progression of brain atrophy, glucose metabolism, and ultimately on the clinical outcomes for different scenarios of intervention. to this end, we model the joint spatio-temporal variation of different modalities along the history of alzheimer’s disease by identifying a system of odes governing the pathological progression. this latent ode system is specified within an interpretable low-dimensional space relating multi-modal information, and combines clinically-inspired constraints with unknown interactions that we wish to estimate. the interpretability of the relationships in the latent space is ensured by mapping each data modality to a specific latent coordinate. the model is formulated within a bayesian framework, where the latent representation and dynamics are efficiently estimated through stochastic variational inference. to generate hypothetical scenarios of amyloid lowering interventions, we apply our approach to multi-modal imaging and clinical data from the alzheimer’s disease neuroimaging initiative (adni).
our results provide a meaningful quantification of different intervention strategies, compatible with findings previously reported in clinical studies. for example, we estimate that in a study with individuals per arm, statistically powered improvement of clinical endpoints can be obtained by completely arresting amyloid accumulation at least years before alzheimer's dementia. the minimum intervention time decreases to years for studies based on individuals per arm.

materials and methods

in the following sections, healthy individuals will be denoted as nl stable, subjects with mild cognitive impairment as mci stable, and subjects diagnosed with alzheimer's dementia as ad. we define conversion as the change of diagnosis towards a more pathological state. therefore, nl converters are subjects who were diagnosed as cognitively normal at baseline and whose diagnosis changed to either mci or ad during their follow-up visits. mci converters are subjects who were diagnosed as mci at baseline and subsequently progressed to ad. diagnosis was established using the dx column from the adnimerge file (https://adni.bitbucket.io/index.html), which reflects the standard adni clinical assessment based on wechsler memory scale, mini-mental state examination, and clinical dementia rating. amyloid concentration and glucose metabolism are respectively measured by ( )f-florbetapir amyloid (av )-pet and ( )f-fluorodeoxyglucose (fdg)-pet imaging.
cognitive and functional abilities are assessed by the following neuro-psychological tests: alzheimer's disease assessment scale (adas ), mini-mental state examination (mmse), functional assessment questionnaire (faq), rey auditory verbal learning test (ravlt) immediate, ravlt learning, ravlt forgetting, and clinical dementia rating scale sum of boxes (cdrsb).

study cohort and biomarkers' changes across clinical groups

our study is based on a cohort of amyloid positive individuals composed of nl stable subjects, nl converters, subjects diagnosed with mci, mci converters, and ad patients. among the mci subjects, were early mci and were late mci. concerning the group of mci converters, subjects were late mci at baseline and were early mci. the term "amyloid positive" refers to subjects whose amyloid level in the csf was below the nominal cutoff of pg/ml (gamberger et al., ) either at baseline or during any follow-up visit, and conversion to ad was determined using the last available follow-up information. this preliminary selection of patients aims at constituting a cohort of subjects for whom it is more likely to observe “alzheimer’s pathological changes” (jack et al., ). the length of follow-up varies between and years. further information about the data is available at https://adni.bitbucket.io/reference/, while details on data acquisition and processing are provided in section data acquisition and preprocessing. table a shows socio-demographic information for the training cohort across the different clinical groups.
table b shows baseline values and annual rates of change across clinical groups for amyloid burden (average normalized av uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), glucose metabolism (average normalized fdg uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), hippocampal and medial temporal lobe volumes, and cognitive ability as measured by adas . consistent with previously reported results (cash et al., ; schuff et al., ), we observe that while regional atrophy, glucose metabolism and cognition show an increasing rate of change when moving from healthy to pathological conditions, the change of av is maximum in nl stable, nl converters and mci stable subjects. we also notice the increased magnitude of adas in ad as compared to the other clinical groups. finally, we note that glucose metabolism and regional atrophy show comparable magnitudes of change.

the observations presented in table provide us with a coarse representation of the biomarkers' trajectories characterizing alzheimer’s disease. the complexity of the dynamical changes we may infer is limited, as the clinical stages only roughly approximate a temporal scale describing the disease history, while very little insight can be obtained about the biomarkers' interactions. within this context, our model allows the quantification of the fine-grained dynamical relationships across biomarkers at stake during the history of the disease. intervention scenarios can subsequently be investigated by suitably modulating the estimated dynamics parameters according to a specific intervention hypothesis (e.g. amyloid lowering at a certain time).

model overview

we provide in figure an overview of the presented method.
baseline multi-modal imaging and clinical information for a given subject are transformed into a latent variable composed of four z-scores quantifying respectively the overall severity of atrophy, glucose metabolism, amyloid burden, and cognitive and functional assessment. the model estimates the dynamical relationships across these z-scores to optimally describe the temporal transitions between follow-up observations. these transition rules are here mathematically defined by the parameters of a system of odes, which is estimated from the data. this dynamical system allows us to compute the evolution of the z-scores over time from any baseline observation, and to predict the associated multi-modal imaging and clinical measures. it is important to note that this modelling choice requires at least one visit per patient for which all the measures are available, in order to compute the temporal evolution of the z-scores.

table a: baseline socio-demographic information for training cohort ( subjects for data points, follow-up from to years depending on subjects). average values, standard deviation in parenthesis. b: baseline values (bl) and annual rates of change (% change / year) of amyloid burden (average normalized av uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), glucose metabolism (average normalized fdg uptake in frontal cortex, anterior cingulate, precuneus and parietal cortex), hippocampus volume, medial temporal lobe volume, and adas score for the different clinical groups. median values, interquartile range below.
the volumes of the hippocampus and the medial temporal lobe are averaged across left and right hemispheres. nl: healthy individuals, mci: individuals with mild cognitive impairment, ad: patients with alzheimer's dementia. apoe : apolipoprotein e ε . fdg: ( )f-fluorodeoxyglucose positron emission tomography (pet) imaging. av : ( )f-florbetapir amyloid pet imaging. suvr: standardized uptake value ratio. mtl: medial temporal lobe. adas : alzheimer's disease assessment scale-cognitive subscale, items.

a: socio-demographics. columns: nl stable, nl converters, mci stable, mci converters, ad. rows: n; age (yrs); education (yrs); apoe -carrier (%).

b: biomarkers and rates of change. columns: bl and % change / year for each of nl stable, nl converters, mci stable, mci converters, ad. rows: global av (suvr); global fdg (suvr); hippocampus (ml); mtl (ml); adas .

the model thus enables the simulation of the pathological progression of biomarkers across the entire history of the disease. once the model is estimated, we can modify the ode parameters to simulate different evolution scenarios according to a specific hypothesis.
for example, by reducing the parameters associated with the progression rate of amyloid, we can investigate the relative change in the evolution of the other biomarkers. this setup thus provides us with a data-driven system enabling the exploration of hypothetical intervention strategies, and their effect on the pathological cascade.

figure: overview of the method. a) high-dimensional multi-modal measures are projected into a 4-dimensional latent space. each data modality is transformed in a corresponding z-score zamy, zmet, zatr, zcli. b) the dynamical system describing the relationships between the z-scores allows to compute their transition across the evolution of the disease. c) given the latent space and the estimated dynamics, the follow-up measurements can be reconstructed to match the observed data.

data modelling

we consider observations $\mathbf{X}_i(t) = [\mathbf{x}_i^1(t), \mathbf{x}_i^2(t), \ldots, \mathbf{x}_i^M(t)]^T$, which correspond to multivariate measures derived from $M$ different modalities (e.g. clinical scores, mri, av , or fdg measures) at time $t$ for subject $i$. each vector $\mathbf{x}_i^m(t)$ has dimension $D_m$. we postulate the following generative model, in which the modalities are assumed to be independently generated by a common latent representation of the data $\mathbf{z}_i(t)$:

$$
\begin{aligned}
p(\mathbf{X}_i(t) \mid \mathbf{z}_i(t), \boldsymbol{\sigma}^2, \boldsymbol{\psi}) &= \prod_m p(\mathbf{x}_i^m(t) \mid \mathbf{z}_i(t), \sigma_m^2, \psi_m) = \prod_m \mathcal{N}\big(\mu_m(\mathbf{z}_i(t), \psi_m), \sigma_m^2 \mathbf{I}\big), \\
\mathbf{z}_i(t) &= \Sigma(\mathbf{z}_i(t_0), t), \\
\mathbf{z}_i(t_0) &\sim p(\mathbf{z}_i(t_0)),
\end{aligned}
\qquad (1)
$$

where $\sigma_m^2$ is measurement noise, while $\psi_m$ are the parameters of the function $\mu_m$ which maps the latent state to the data space for the modality $m$. for simplicity of notation we denote $\mathbf{z}_i(t)$ by $\mathbf{z}(t)$. we assume that each coordinate of $\mathbf{z}$ is associated to a specific modality $m$, leading to an $M$-dimensional latent space. the $\Sigma$ operator, which gives the value of the latent representation at a given time $t$, is defined by the solution of the following system of odes:

$$
\frac{dz^m(t)}{dt} = k_m z^m(t)\big(1 - z^m(t)\big) + \sum_{j \neq m} \alpha_{m,j}\, z^j(t), \quad m = 1, \ldots, M. \qquad (2)
$$

for each coordinate, the first term of the equation enforces a sigmoidal evolution with a progression rate $k_m$, while the second term accounts for the relationship between modalities $m$ and $j$ through the parameters $\alpha_{m,j}$. this system can be rewritten as:

$$
\frac{d\mathbf{z}(t)}{dt} = \mathbf{W}\mathbf{z}(t) - \mathbf{V}\mathbf{z}^2(t) = g(\mathbf{z}(t), \theta_{ODE}), \quad \text{where} \quad
\mathbf{W}_{i,j} = \begin{cases} k_i & \text{if } i = j \\ \alpha_{i,j} & \text{otherwise,} \end{cases} \quad
\mathbf{V}_{i,j} = \begin{cases} k_i & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases} \qquad (3)
$$

here $\mathbf{z}^2(t)$ denotes the element-wise square of $\mathbf{z}(t)$, and $\theta_{ODE}$ denotes the parameters of the system of odes, which correspond to the entries of the matrices $\mathbf{W}$ and $\mathbf{V}$. according to equation (3), for each initial condition $\mathbf{z}(0)$, the latent state at time $t$ can be computed through integration, $\mathbf{z}(t) = \mathbf{z}(0) + \int_0^t g(\mathbf{z}(x), \theta_{ODE})\, dx$.
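the latent dynamics described above can be sketched numerically as follows. this is a minimal numpy illustration, not the authors' pytorch implementation: the helper names `build_matrices`, `g` and `integrate`, the explicit euler scheme and the step size are all assumptions made for the example. it builds $\mathbf{W}$ and $\mathbf{V}$ from the progression rates $k_m$ and the coupling parameters $\alpha_{m,j}$, and integrates the system forward (or backward, for a negative end time) from an initial condition.

```python
import numpy as np


def build_matrices(k, alpha):
    """Build W (k on the diagonal, alpha off the diagonal) and V = diag(k)."""
    k = np.asarray(k, dtype=float)
    W = np.array(alpha, dtype=float)  # copy, so the caller's array is untouched
    np.fill_diagonal(W, k)            # diagonal entries of alpha are ignored
    V = np.diag(k)
    return W, V


def g(z, W, V):
    """Right-hand side dz/dt = W z - V z^2 (element-wise square)."""
    return W @ z - V @ z**2


def integrate(z0, W, V, t_end, dt=1e-3):
    """Explicit Euler integration from t = 0 to t_end.

    A negative t_end integrates backward in time. Returns an array of shape
    (n_steps + 1, M) whose first row is the initial condition.
    """
    n_steps = int(round(abs(t_end) / dt))
    step = dt if t_end >= 0 else -dt
    z = np.array(z0, dtype=float)
    trajectory = [z.copy()]
    for _ in range(n_steps):
        z = z + step * g(z, W, V)
        trajectory.append(z.copy())
    return np.array(trajectory)
```

with all couplings set to zero each coordinate follows an independent logistic curve, which gives a simple sanity check for the integrator. a higher-order scheme would be a natural replacement for the euler step when accuracy matters.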
we resort to variational inference and stochastic gradient descent in order to optimize the parameters of the model. the procedure is detailed in sections variational inference and model optimization of the supplementary material.

simulating the long-term progression of alzheimer’s disease

to simulate the long-term progression of alzheimer’s disease we first project the ad subjects in the latent space via the encoding functions. we can subsequently follow the trajectories of these subjects backward and forward in time, in order to estimate the associated trajectory from the healthy to their respective pathological condition. in practice, a gaussian mixture model is used to fit the empirical distribution of the ad subjects' latent projection. the number of components and the covariance type of the gaussian mixture model are selected using the akaike information criterion (akaike, ). the fitted gaussian mixture model allows us to sample pathological latent representations $\mathbf{z}_i(t_0)$ that can be integrated forward and backward in time thanks to the estimated set of latent odes, to finally obtain a collection of latent trajectories $\mathbf{Z}(t) = [\mathbf{z}_1(t), \ldots, \mathbf{z}_N(t)]$ summarizing the distribution of the long-term alzheimer’s disease evolution.

simulating intervention

in this section we assume that we computed the average latent progression of the disease $\mathbf{z}(t)$. thanks to the modality-wise encoding (cf. supplementary section variational inference) each coordinate of the latent representation can be interpreted as representing a single data modality.
therefore, we propose to simulate the effect of a hypothetical intervention on the disease progression, by modulating the vector $\frac{d\mathbf{z}(t)}{dt}$ after each integration step such that:

$$
\left(\frac{d\mathbf{z}(t)}{dt}\right)^{*} = \boldsymbol{\Gamma}\, \frac{d\mathbf{z}(t)}{dt}, \quad \text{where} \quad \boldsymbol{\Gamma} = \mathrm{diag}(\gamma_1, \ldots, \gamma_M). \qquad (4)
$$

the values $\gamma_m$ are fixed between 0 and 1, allowing us to control the influence of the corresponding modalities on the system evolution, and to create hypothetical scenarios of evolution. for example, for a % (resp. %) amyloid lowering intervention we set $\gamma_{amy}$ = (resp. $\gamma_{amy}$ = . ).

evaluating disease severity

given an evolution $\mathbf{z}(t)$ describing the disease progression in the latent space, we propose to consider this trajectory as a reference and to use it in order to quantify the individual disease severity of a subject $\mathbf{x}$. this is done by estimating a time-shift $\tau$ defined as:

$$
\tau = \operatorname*{argmin}_t \, \| f(\mathbf{x}, \phi) - \mathbf{z}(t) \|_1 = \operatorname*{argmin}_t \sum_m | f(\mathbf{x}^m, \phi^m) - z^m(t) |. \qquad (5)
$$

this time-shift quantifies the pathological stage of a subject with respect to the disease progression along the reference trajectory $\mathbf{z}(t)$. moreover, the time-shift can still be estimated even in the case of missing data modalities, by only encoding the available measures of the observed subject.

statistical analysis

the model was implemented using the pytorch library (paszke et al., ). the estimated disease severity was compared group-wise via two-sided wilcoxon-mann-whitney test (p < . ). differences between the clinical outcomes distribution after simulation of intervention were compared via two-sided student’s t-test (p < . ).
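the intervention and staging steps of the methods can be sketched together in a few lines. the code below is an illustrative numpy sketch under stated assumptions: the function names `simulate` and `time_shift`, the euler scheme, the two-dimensional toy parameters and the discrete time grid are hypothetical, and the subject is assumed to be already encoded in the latent space (the encoding function itself is not reproduced). the diagonal matrix of scaling factors is represented by a vector `gamma` with entries assumed to lie in [0, 1], applied to the derivative at every integration step.

```python
import numpy as np


def simulate(z0, W, V, gamma, t_grid):
    """Euler integration of dz/dt = Gamma (W z - V z^2) along t_grid.

    gamma holds the diagonal of Gamma: 1 leaves a coordinate untouched,
    0 freezes it from the first time point of t_grid onward.
    """
    z = np.array(z0, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    trajectory = [z.copy()]
    for dt in np.diff(t_grid):
        dz = W @ z - V @ z**2
        z = z + dt * gamma * dz  # modulate the derivative at each step
        trajectory.append(z.copy())
    return np.array(trajectory)


def time_shift(z_subject, trajectory, t_grid, observed=None):
    """Time on t_grid minimising the L1 distance to the reference trajectory.

    observed is an optional boolean mask selecting the latent coordinates
    whose modality is available for this subject.
    """
    z_subject = np.asarray(z_subject, dtype=float)
    if observed is None:
        observed = np.ones(z_subject.shape, dtype=bool)
    observed = np.asarray(observed, dtype=bool)
    dist = np.abs(trajectory[:, observed] - z_subject[observed]).sum(axis=1)
    return float(t_grid[int(np.argmin(dist))])


# toy system: coordinate 0 plays the role of amyloid and drives coordinate 1
k = np.array([1.0, 0.5])
W = np.array([[k[0], 0.0], [0.3, k[1]]])
V = np.diag(k)
t_grid = np.linspace(0.0, 10.0, 1001)
z0 = [0.1, 0.05]

natural = simulate(z0, W, V, gamma=[1.0, 1.0], t_grid=t_grid)
treated = simulate(z0, W, V, gamma=[0.0, 1.0], t_grid=t_grid)  # full arrest
tau = time_shift(natural[300], natural, t_grid)
```

in this toy setting, freezing the first coordinate slows the second one through the coupling term, and a point taken from the reference trajectory is staged at its own time. restricting `observed` to the available coordinates mimics staging with missing modalities.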
shadowed areas in the different figures show ± standard deviation of the mean.

data availability

the data used in this study are available from the adni database (adni.loni.usc.edu).

results

in the following, mri, fdg-pet, and av -pet images are processed in order to respectively extract regional gray matter density, glucose metabolism and amyloid load from a brain parcellation. the z-scores of gray matter atrophy (zatr), glucose metabolism (zmet), and amyloid burden (zamy) are computed using the measures obtained by this pre-processing step. the clinical z-score zcli is derived from neuro-psychological scores: adas , mmse, faq, ravlt immediate, ravlt learning, ravlt forgetting and cdrsb. this panel of scores was chosen to provide a comprehensive representation of cognitive, memory and functional abilities.

data acquisition and preprocessing

data used in the preparation of this article were obtained from the adni database. the adni was launched in as a public-private partnership, led by principal investigator michael w. weiner, md. for up-to-date information, see www.adni-info.org. we considered four types of biomarkers, related to clinical scores, gray matter atrophy, amyloid load and glucose metabolism, and respectively denoted by cli, atr, amy and met. mri images were processed following the longitudinal pipeline of freesurfer (reuter et al., ), to obtain gray matter volumes in a standard anatomical space. av -pet and fdg-pet images were aligned to the closest mri in time and normalized to the cerebellum uptake.
regional gray matter density, amyloid load and glucose metabolism were extracted from the desikan-killiany parcellation (desikan et al., ). we discarded white-matter, ventricular, and cerebellar regions, thus obtaining regions that were averaged across hemispheres. therefore, for a given subject, xatr, xamy and xmet are each -dimensional vectors. the variable xcli is composed of the neuro-psychological scores adas , mmse, ravlt immediate, ravlt learning, ravlt forgetting, faq, and cdrsb. the total number of measures is of longitudinal data points. we recall that the model estimation requires a visit for which all the measures are available in order to obtain the z-scores evolution of a given subject, but can handle missing data in the follow-up by finding the parameters that best match the available measures.

progression model and latent relationships

we show in figure panel i) the dynamical relationships across the different z-scores estimated by the model, where direction and intensity of the arrows quantify the estimated increase of one variable with respect to the other. since the scores are dimensionless, they have been rescaled to the range [ , ], with higher values indicating increasing pathological levels. these relationships extend the summary statistics reported in table to a much finer temporal scale and wider range of possible biomarkers' values. we observe in figure a, b and c that large values of the amyloid score zamy trigger the increase of the remaining ones: zmet, zatr, and zcli.
figure d shows that a large increase of the atrophy score zatr is associated with pathological glucose metabolism, indicated by large values of zmet. moreover, we note that high zmet values also contribute to an increase of zcli (figure e). finally, figure f shows that high atrophy values lead to an increase mostly along the clinical dimension zcli. this chain of relationships is in agreement with the cascade hypothesis of ad (jack et al., ; jack & holtzman, ).

relying on the dynamical relationships shown in figure panel i), starting from any initial set of biomarkers values we can estimate the relative trajectories over time. figure panel ii) (left) shows the evolution obtained by extrapolating backward and forward in time the trajectory associated to the z-scores of the ad group. the x-axis represents the years from conversion to ad, where the instant t= corresponds to the average time of diagnosis estimated for the group of mci progressing to dementia. as observed in figure panel i) and table , the amyloid score zamy increases and saturates first, followed by the zmet and zatr scores, whose progression slows down when reaching clinical conversion, while the clinical score exhibits strong acceleration in the latest progression stages. figure panel ii) (right) shows the group-wise distribution of the disease severity estimated for each subject relative to the modelled long-term latent trajectories. the difference of disease severity across groups is statistically significant and increases when going from healthy to pathological stages (wilcoxon-mann-whitney test p < . for each comparison). the reliability of the estimation of disease severity was further assessed through testing on an independent cohort, and by comparison with a previously proposed state-of-the-art disease progression modeling method (lorenzi et al., ). the results are provided in section time-shift comparison and
validation of the supplementary material and show positive generalization results as well as a favorable comparison with the benchmark method.

from the z-score trajectories of figure panel ii) (left) we predict the progression of imaging and clinical measures shown in figure . we observe that amyloid load globally increases and saturates early, consistent with the amyloid positive condition of the study cohort. abnormal glucose metabolism and gray matter atrophy are delayed with respect to amyloid, and tend to map prevalently temporal and parietal regions. finally, the clinical measures exhibit a non-linear pattern of change, accelerating during the latest progression stages. these dynamics are compatible with the summary measures on the raw data reported in table .

figure: dynamical relationships, z-scores evolution and disease staging. panel i: estimated dynamical relationships across the different z-scores (a to f). given the values of two z-scores, the arrow at the corresponding coordinates indicates how one score evolves with respect to the other. the intensity of the arrow gives the strength of the relationship between the two scores. panel ii, left: estimated long-term latent dynamics (time is relative to conversion to alzheimer's dementia). shadowed areas represent the standard deviation of the average trajectory. panel ii, right: distribution of the estimated disease severity across clinical stages, relative to the long-term dynamics on the left. nl: normal individuals, mci: mild cognitive impairment, ad: alzheimer's dementia.

figure: model-based progression of alzheimer’s disease. estimated long-term evolution of cortical measurements for the different types of imaging markers, and clinical scores. shadowed areas represent the standard deviation of the average trajectory. brain images were generated using the software provided in (marinescu et al., ).

simulating clinical intervention

this experimental section is based on two intervention scenarios: a first one in which amyloid is lowered by %, and a second one in which it is reduced by % with respect to the estimated natural progression. in figure we show the latent z-scores evolution resulting from either % or % amyloid lowering performed at the time t=- years. according to these scenarios, intervention results in a noticeable reduction of the pathological progression for atrophy, glucose metabolism and clinical scores, albeit with a stronger effect in case of total blockage. we further estimated the resulting clinical endpoints associated with the two amyloid lowering scenarios, at increasing time points and for different sample sizes.
clinical endpoints consisted of the simulated adas , mmse, faq, ravlt immediate, ravlt learning, ravlt forgetting and cdrsb scores at the reference conversion time (t= ). the placebo case indicates the scenario where clinical values were computed at conversion time from the estimated natural progression shown in figure panel ii) (left). figure shows the change in statistical power depending on intervention time and sample sizes. for large sample sizes ( subjects per arm) a power greater than . can be obtained around years before conversion, depending on the outcome score, where in general we observe that ravlt forgetting exhibits a higher power than the other scores. when sample size is lower than subjects per arm, a power greater than . is reached if intervention is performed at the latest years before conversion, with a mild variability depending on the considered clinical score. we notice that in the case of % amyloid lowering, in order to reach the same power intervention needs to be consistently performed earlier compared to the scenario of % amyloid lowering for the same sample size and clinical score. for instance, if we consider adas with a sample size of subjects per arm, a power of . is obtained for a % amyloid lowering intervention performed . years before conversion, while in case of a % amyloid lowering the equivalent effect would be obtained by intervening years before conversion. we provide in table the estimated improvement for each clinical score at conversion with a sample size of subjects per arm for both % and % amyloid lowering depending on the intervention time. we observe that for the same intervention time, % amyloid lowering always results in a larger improvement of clinical endpoints compared to % amyloid lowering. we also note that in the case of % lowering, clinical endpoints obtained for intervention at t=- years correspond to typical cutoff values for inclusion into alzheimer’s disease trials (adas = . ± . , mmse = . ± . , see supplementary table ) (gamberger et al., ; kochhann et al., ).

figure: simulation of amyloid lowering intervention on the z-scores evolution. hypothetical scenarios of irreversible amyloid lowering interventions at t=- years from alzheimer's dementia diagnosis, with a rate of % (left) or % (right). shadowed areas represent the standard deviation of the average trajectory.

discussion

we presented a framework to jointly model the progression of multi-modal imaging and clinical data, based on the estimation of latent biomarkers' relationships governing alzheimer’s disease progression. the model is designed to simulate intervention scenarios in clinical trials, and in this study we focused on assessing the effect of anti-amyloid drugs on biomarkers' evolution, by quantifying the effect of intervention time and drug efficacy on clinical outcomes. our results underline the critical importance of intervention time: intervention should take place sufficiently early in the pathological history for the effect of disease modifiers to be appreciable. the results obtained with our model are compatible with findings reported in recent clinical studies (egan et al., ; honig et al., ; wessels et al., ). for example, if we consider patients per arm and perform a % amyloid lowering intervention for years to reproduce the conditions of the recent trial of verubecestat (egan et al., ), the average improvement of mmse predicted by our model is . , falling in the % confidence interval measured during that study ([- . ; . ]).
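to make the dependence of statistical power on sample size and effect size concrete, the sketch below approximates the power of a two-sided, two-sample comparison between a placebo and a treated arm. it is only an illustration: it uses a normal approximation instead of the student's t-test reported in the methods, the function name `two_sample_power` is hypothetical, and the significance level used in the example call is an arbitrary illustrative value rather than the threshold of the study.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect, sd, n_per_arm, alpha):
    """Approximate power of a two-sided two-sample test.

    effect:    difference between the mean outcomes of the two arms
    sd:        common standard deviation of the outcome in each arm
    n_per_arm: number of subjects in each arm
    alpha:     two-sided significance level
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    # standardised mean difference between arms under the alternative
    shift = abs(effect) / (sd * sqrt(2.0 / n_per_arm))
    # probability of exceeding either rejection threshold
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)


# illustrative values only: half a standard deviation of improvement
power_small = two_sample_power(effect=0.5, sd=1.0, n_per_arm=20, alpha=0.05)
power_large = two_sample_power(effect=0.5, sd=1.0, n_per_arm=100, alpha=0.05)
```

for a fixed improvement, the power grows with the number of subjects per arm; conversely, a later intervention yields a smaller improvement at conversion and therefore requires more subjects to reach the same power.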
while recent anti-amyloid trials such as (egan et al., ; honig et al., ; wessels et al., ) included between and mild ad subjects per arm and were conducted over a period of two years at most, our analysis suggests that clinical trials performed with less than subjects with mild ad may be consistently under-powered. indeed, we see in figure that with a sample size of subjects per arm and a total blockage of amyloid production, a power of . can be obtained only if intervention is performed at least years before conversion.

figure: evolution of the statistical power in different intervention scenarios. statistical power of the student t-test comparing the estimated clinical outcomes at conversion time between placebo and treated scenarios, according to the year of simulated intervention ( % and % amyloid lowering) and sample size.

these results quantify the crucial role of intervention time, and provide a theoretical justification for testing amyloid modifying drugs in the pre-clinical stage (aisen et al., ; sperling et al., ). this is for example illustrated in table , in which we notice that clinical endpoints are close to placebo even when the simulated intervention takes place years before conversion, while stronger cognitive and functional changes happen when amyloid is lowered by % or % earlier. these findings may be explained by considering that amyloid accumulates over more than a decade, and that when amyloid clearance occurs the pathological cascade is already entrenched (rowe et al., ).

table : estimated mean (standard deviation) improvement of clinical outcomes at predicted conversion time for the normal progression case by year of simulated intervention ( % and % amyloid lowering interventions). results in bold indicate a statistically significant difference between placebo and treated scenarios (p< . , two-sided t-test, cases per arm). ad: alzheimer's dementia, adas : alzheimer's disease assessment scale, mmse: mini-mental state examination, faq: functional assessment questionnaire, ravlt: rey auditory verbal learning test, cdrsb: clinical dementia rating scale sum of boxes.

table layout: two panels, one per amyloid lowering intervention ( % and %), each reporting the point improvement per intervention time. columns: intervention times. rows: adas ; mmse; faq; ravlt immediate; ravlt learning; ravlt forgetting; cdrsb. each cell gives the mean improvement with the standard deviation in parentheses.

our results thus support the need to identify subjects at the pre-clinical stage, that is to say while still cognitively normal, which is a challenging task. currently, one of the main criteria to enroll subjects into clinical trials is the presence of amyloid in the brain, and blood-based markers are considered as potential candidates for identifying patients at risk for alzheimer’s disease (zetterberg & burnham, ). moreover, recent works such as (blennow et al., ; westwood et al., ) have proposed more complex entry criteria to constitute cohorts based on multi-modal measurements. within this context, our model could also be used as an enrichment tool by quantifying the disease severity based on multi-modal data as shown in figure panel ii) (right). similarly, the method could be applied to predict the evolution of a single patient given their currently available measurements.

an additional critical aspect of anti-amyloid trials is the effect of dose exposure on the production of amyloid (klein et al., ). currently, β-site amyloid precursor protein cleaving enzyme (bace) inhibitors can suppress amyloid production from % to %.
in this study we showed that lowering amyloid by % consistently decreases the treatment effect compared to a % lowering at the same time. for instance, if we consider a sample size of subjects per arm in the case of a % amyloid lowering intervention, % power can be reached only years before conversion instead of years for a % amyloid lowering intervention. this ability of our model to control the rate of amyloid progression is fundamental in order to provide realistic simulations of anti-amyloid trials. in figure panel i) we showed that amyloid triggers the pathological cascade affecting the other markers, thus confirming its dominant role in disease progression. assuming that the data used to estimate the model is sufficient to completely depict the history of the pathology, our model can be interpreted from a causal perspective. however, we cannot exclude the existence of other mechanisms driving amyloid accumulation, which our model cannot infer from the existing data. therefore, our findings should be considered with care, while the integration of additional biomarkers of interest will be necessary to account for multiple drivers of the disease. it is worth noting that recent works have proposed combining drugs targeting multiple mechanisms at the same time (gauthier et al., ).
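the interplay between lowering level and intervention time discussed above can be pictured with a toy simulation. the sketch below is not the model of this study: it assumes a one-dimensional logistic accumulation curve, and the function name `simulate_amyloid`, the rate, the initial value and the intervention times are all hypothetical choices made only for illustration.

```python
def simulate_amyloid(lowering, t_intervention, t_end=30.0, dt=0.01,
                     rate=0.3, a0=0.05):
    """Euler integration of a toy logistic accumulation da/dt = r * a * (1 - a).

    From t_intervention onward the rate r is scaled by (1 - lowering), so
    lowering=1.0 blocks further accumulation and lowering=0.0 is placebo.
    All parameter values are illustrative assumptions, not fitted quantities.
    """
    a = a0
    steps = int(round(t_end / dt))
    for k in range(steps):
        t = k * dt
        # Apply the reduced accumulation rate once the intervention has started.
        r = rate * (1.0 - lowering) if t >= t_intervention else rate
        a += dt * r * a * (1.0 - a)
    return a


# Hypothetical scenarios: placebo, early full/partial lowering, late full lowering.
placebo = simulate_amyloid(lowering=0.0, t_intervention=0.0)
full_early = simulate_amyloid(lowering=1.0, t_intervention=5.0)
half_early = simulate_amyloid(lowering=0.5, t_intervention=5.0)
full_late = simulate_amyloid(lowering=1.0, t_intervention=20.0)
print(placebo, half_early, full_early, full_late)
```

in this toy setting a partial lowering leaves the final load closer to placebo than a full lowering started at the same time, and a full lowering started late leaves it closer to placebo than one started early, which is the qualitative pattern described in the text.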
for instance, pathologists have shown tau deposition in brainstem nuclei in adolescents and children (kaufman et al., ), and clinicians are currently investigating the pathological effect of early tau spreading on alzheimer’s disease progression (pontecorvo et al., ), raising crucial questions about its relationship with amyloid accumulation, and the impact on cognitive impairment (cummings, blennow, et al., ). in this study, subjects underwent at least one tau-pet scan. however, when considering the subjects for whom there exists one visit in which all the data modalities were available, the number of patients in the study cohort decreased to . this low sample size prevented us from estimating reliable trajectories for this biomarker. it is also important to note that among the subjects with at least one tau-pet scan, only of them had one follow-up visit. this means that the dynamics of tau markers cannot be reliably estimated. including tau data will require studies on larger cohorts with complete sets of pet imaging acquisitions. this could be part of future extensions of this work, where the inclusion of tau markers will allow us to simulate scenarios of production blockage of both amyloid and tau at different rates or intervention times. lately, disappointing results of clinical studies have led researchers to hypothesize specific treatments targeting ad sub-populations based on their genotype (safieh et al., ). while in our work we describe a global progression of alzheimer’s disease, in the future we will account for sub-trajectories due to genetic factors, such as the presence of allele of apolipoprotein (apoe ), which is a major risk factor for developing alzheimer’s disease influencing both disease onset and progression (kim et al., ). this could be done by estimating dynamical systems specific to the genetic condition of each patient. this was not possible in this study due to a strong imbalance between the number of carriers and non-carriers across the different clinical groups (cf.
table ). indeed, we observe that the number of adni non-carriers is much lower than the number of carriers, especially in the later stages of the disease (mci converters and ad). on the contrary, the majority of nl stable subjects are non-carriers. therefore, applying the model in such conditions would lead to a bias towards more represented groups during the different stages of the disease progression (apoe - at early stages and apoe + at late ones), thus preventing us from differentiating the biomarker dynamics based on the genetic status. yet, simulating dynamical relationships specific to genetic factors is a crucial avenue of improvement of our approach, as it would allow us to evaluate the effect of apoe on intervention time or drug dosage. in addition to this example, there exist numerous non-genetic aggravating factors that may also affect disease evolution, such as diabetes, obesity or smoking. extending our model to account for panels of risk factors would ultimately allow in silico testing of personalized intervention strategies.

moreover, a key aspect of clinical trials is their economic cost. our model could be extended to help design clinical trials by optimizing intervention with respect to the available funding. given a budget, we could simulate scenarios based on different sample sizes and trial durations, while estimating the expected cognitive outcome.
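as a rough illustration of such a budget-constrained search, the sketch below enumerates candidate designs and keeps the one with the highest approximate power. everything in it is an assumption made for the example: the cost model (two arms, cost proportional to subjects times years), the linear growth of the treatment effect with trial duration, the normal approximation to the two-sided two-sample test, and all numerical values. the names `approx_power` and `best_design` are hypothetical and do not come from this study.

```python
from statistics import NormalDist


def approx_power(n_per_arm, effect, sd, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample comparison."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Standardized shift of the test statistic under the alternative.
    shift = effect / (sd * (2.0 / n_per_arm) ** 0.5)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def best_design(budget, cost_per_subject_year, effect_per_year, sd,
                sizes, durations):
    """Return (power, n_per_arm, years) with the highest power within budget.

    Hypothetical cost model: two arms, cost = 2 * n * years * unit cost.
    Hypothetical effect model: treatment effect grows linearly with duration.
    """
    best = None
    for n in sizes:
        for years in durations:
            cost = 2 * n * years * cost_per_subject_year
            if cost > budget:
                continue  # Design exceeds the available funding.
            power = approx_power(n, effect_per_year * years, sd)
            if best is None or power > best[0]:
                best = (power, n, years)
    return best


# Illustrative numbers only.
design = best_design(budget=4_000_000, cost_per_subject_year=5_000,
                     effect_per_year=0.5, sd=4.0,
                     sizes=range(50, 501, 50), durations=range(1, 6))
print(design)
```

in a realistic setting the expected effect for each design would come from simulated trajectories rather than from the linear assumption used here.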
results presented in this work are based on a model estimated by relying solely on a subset of subjects and measures from the adni cohort, and therefore they may not be fully representative of the general alzheimer’s disease progression. indeed, subjects included in this cohort were either amyloid-positive at baseline, or became amyloid-positive during their follow-up visits. this was motivated by the consideration that evidence of pathological amyloid levels is a necessary condition for diagnosing ad as it puts subjects within the “alzheimer’s disease continuum” (jack et al., ). by narrowing the list of subjects to a subgroup of amyloid-positive individuals we increase the chances of selecting a set of patients likely to develop the disease. moreover, the inclusion of subjects at various clinical stages allows us to span the entire spectrum of morphological and physiological changes affecting the brain. through the joint analysis of markers of amyloid, neurodegeneration and cognition, our model estimates the average trajectory that best describes the progression of the observed measures when going from nl individuals towards ad patients. the selection of amyloid-positive patients aims at increasing the signal of alzheimer’s pathological changes within this cohort, in order to estimate long-term dynamics for the biomarkers that can be associated with the disease. we believe that this modeling choice is based on a clinically plausible rationale, and allows us to perform our study on a cohort large enough to enable the estimation of our model. bearing this in mind, we acknowledge the potential presence of bias due to the specific inclusion criterion adopted in this work. indeed, the present results may provide a limited representation of the pathological temporal window captured by the model. for example, applying the model to a cohort containing amyloid-negative subjects may provide additional insights on the overall disease history.
however, this is a challenging task as it would require identifying sub-trajectories dissociated from normal ageing (lorenzi et al., ; sivera et al., ). another potential bias affecting the results may come from the choice of the clinical scores used to estimate our model. in this study, we relied on a panel of neuro-psychological assessments providing a comprehensive representation of cognitive, memory and functional abilities: adas , mmse, ravlt immediate, ravlt learning, ravlt forgetting, faq, and cdrsb. the choice of these particular scores is consistent with previous literature on dpm (donohue et al., ; lorenzi et al., ). however, it is important to note that our model can handle any type of clinical assessment. therefore, investigating the effect of adding supplementary clinical scores on the model’s findings would be an interesting future application of our approach, and could be done without any modification of its current formulation. finally, in addition to these specific characteristics of the cohort, there exist additional biases impacting the model estimation. for instance, the fact that gray matter atrophy and glucose metabolism become abnormal approximately at the same time in figure can be explained by the high atrophy rate of change in some key regions in normal elders, such as in the hippocampus, compared to the rate of change of fdg (see table ). we note that this stronger change of atrophy with respect to glucose metabolism can already be appreciated in the clinically healthy group.
conclusion
in this study we investigated a novel quantitative instrument for the development of intervention strategies for disease modifying drugs in ad. our framework enables the simulation of the effect of intervention time and drug dosage on the evolution of imaging and clinical biomarkers in clinical trials. the proposed data-driven approach is based on the modeling of the spatio-temporal dynamics governing the joint evolution of imaging and clinical measurements throughout the disease. the model is formulated within a bayesian framework, where the latent representation and dynamics are efficiently estimated through stochastic variational inference. to generate hypothetical scenarios of amyloid lowering interventions, we applied our approach to multi-modal imaging and clinical data from adni. the results quantify the crucial role of intervention time, and provide a theoretical justification for testing amyloid modifying drugs in the pre-clinical stage. our experimental simulations are compatible with the outcomes observed in past clinical trials and suggest that anti-amyloid treatments should be administered at least years earlier than what is currently being done in order to obtain statistically powered improvement of clinical endpoints.

funding
this work has been supported by the french government, through the ucajedi and ia côte d'azur investments in the future project managed by the national research agency (ref.n anr- -idex- and anr- -p ia- ), the grant aap santé - dga-dsh, and by the inria sophia-antipolis-méditerranée, "nef" computation cluster.
acknowledgements
data collection and sharing for this project was funded by the alzheimer's disease neuroimaging initiative (adni) and dod adni. adni is funded by the national institute on aging, the national institute of biomedical imaging and bioengineering, and through generous contributions from the following: abbvie; alzheimer’s association; alzheimer’s drug discovery foundation; araclon biotech; bioclinica, inc.; biogen; bristol-myers squibb company; cerespir, inc.; cogstate; eisai inc.; elan pharmaceuticals, inc.; eli lilly and company; euroimmun; f. hoffmann-la roche ltd and its affiliated company genentech, inc.; fujirebio; ge healthcare; ixico ltd.; janssen alzheimer immunotherapy research & development, llc.; johnson & johnson pharmaceutical research & development llc.; lumosity; lundbeck; merck & co., inc.; meso scale diagnostics, llc.; neurorx research; neurotrack technologies; novartis pharmaceuticals corporation; pfizer inc.; piramal imaging; servier; takeda pharmaceutical company; and transition therapeutics. the canadian institutes of health research is providing funds to support adni clinical sites in canada. private sector contributions are facilitated by the foundation for the national institutes of health (www.fnih.org). the grantee organization is the northern california institute for research and education, and the study is coordinated by the alzheimer’s therapeutic research institute at the university of southern california. adni data are disseminated by the laboratory for neuro imaging at the university of southern california.

competing interests
the authors declare no competing interests.

references
aisen, p. s., siemers, e., michelson, d., salloway, s., sampaio, c., carrillo, m. c., sperling, r., doody, r., scheltens, p., bateman, r., weiner, m., & vellas, b. ( ). what have we learned from expedition iii and epoch trials? perspective of the ctad task force. j prev alzheimers dis, ( ), – .
akaike, h. ( ). information theory and an extension of the maximum likelihood principle. in selected papers of hirotugu akaike (pp. – ). springer new york. https://doi.org/ . / - - - - _
antelmi, l., ayache, n., robert, p., & lorenzi, m. ( , june). sparse multi-channel variational autoencoder for the joint analysis of heterogeneous data. icml - th international conference on machine learning.
bateman, r. j., xiong, c., benzinger, t. l. s., fagan, a. m., goate, a., fox, n. c., marcus, d. s., cairns, n. j., xie, x., blazey, t. m., holtzman, d. m., santacruz, a., buckles, v., oliver, a., moulder, k., aisen, p. s., ghetti, b., klunk, w. e., mcdade, e., … morris, j. c. ( ). clinical and biomarker changes in dominantly inherited alzheimer’s disease. new england journal of medicine, ( ), – .
bilgel, m., jedynak, b., wong, d. f., resnick, s. m., & prince, j. l. ( ). temporal trajectory and progression score estimation from voxelwise longitudinal imaging measures: application to amyloid imaging. inf process med imaging, , – .
blennow, k., hampel, h., weiner, m., & zetterberg, h. ( ). cerebrospinal fluid and plasma biomarkers in alzheimer disease. nat rev neurol, ( ), – .
braak, h., & braak, e. ( ). neuropathological stageing of alzheimer-related changes. acta neuropathol., ( ), – .
burnham, s. c., fandos, n., fowler, c., pérez-grijalba, v., dore, v., doecke, j. d., shishegar, r., cox, t., fripp, j., rowe, c., sarasa, m., masters, c. l., pesini, p., & villemagne, v. l. ( ). longitudinal evaluation of the natural history of amyloid-β in plasma and brain. brain communications, ( ). https://doi.org/ . /braincomms/fcaa
cash, d.
m., frost, c., iheme, l. o., Ünay, d., kandemir, m., fripp, j., salvado, o., bourgeat, p., reuter, m., fischl, b., lorenzi, m., frisoni, g. b., pennec, x., pierson, r. k., gunter, j. l., senjem, m. l., jack, c. r., guizard, n., fonov, v. s., … ourselin, s. ( ). assessing atrophy measurement techniques in dementia: results from the miriad atrophy challenge. neuroimage, , – .
cummings, j., blennow, k., johnson, k., keeley, m., bateman, r. j., molinuevo, j. l., touchon, j., aisen, p., & vellas, b. ( ). anti-tau trials for alzheimer’s disease: a report from the eu/us/ctad task force. j prev alzheimers dis, ( ), – .
cummings, j., lee, g., ritter, a., sabbagh, m., & zhong, k. ( ). alzheimer’s disease drug development pipeline: . alzheimers dement (n y), , – .
delacourte, a., david, j. p., sergeant, n., buée, l., wattez, a., vermersch, p., ghozali, f., fallet-bianco, c., pasquier, f., lebert, f., petit, h., & di menza, c. ( ). the biochemical pathway of neurofibrillary degeneration in aging and alzheimer’s disease. neurology, ( ), – .
desikan, r. s., ségonne, f., fischl, b., quinn, b. t., dickerson, b. c., blacker, d., buckner, r. l., dale, a. m., maguire, r. p., hyman, b. t., albert, m. s., & killiany, r. j. ( ). an automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest. neuroimage, ( ), – .
donohue, m. c., jacqmin-gadda, h., goff, m. le, thomas, r. g., raman, r., gamst, a. c., beckett, l. a., jack, c. r., weiner, m. w., dartigues, j.-f., & aisen, p. s. ( ). estimating long-term multivariate progression from short-term data.
alzheimer’s & dementia, ( , supplement), s –s . https://doi.org/ . /j.jalz. . .
egan, m. f., kost, j., voss, t., mukai, y., aisen, p. s., cummings, j. l., tariot, p. n., vellas, b., van dyck, c. h., boada, m., zhang, y., li, w., furtek, c., mahoney, e., harper mozley, l., mo, y., sur, c., & michelson, d. ( ). randomized trial of verubecestat for prodromal alzheimer’s disease. n. engl. j. med., ( ), – .
fonteijn, h. m., modat, m., clarkson, m. j., barnes, j., lehmann, m., hobbs, n. z., scahill, r. i., tabrizi, s. j., ourselin, s., fox, n. c., & alexander, d. c. ( ). an event-based model for disease progression and its application in familial alzheimer’s disease and huntington’s disease. neuroimage, ( ), – .
gamberger, d., lavrač, n., srivatsa, s., tanzi, r. e., & doraiswamy, p. m. ( ). identification of clusters of rapid and slow decliners among subjects at risk for alzheimer’s disease. sci rep, ( ), .
garbarino, s., & lorenzi, m. ( ). modeling and inference of spatio-temporal protein dynamics across brain networks. ipmi - th international conference on information processing in medical imaging, , – . https://hal.inria.fr/hal-
gauthier, s., alam, j., fillit, h., iwatsubo, t., liu-seifert, h., sabbagh, m., salloway, s., sampaio, c., sims, j. r., sperling, b., sperling, r., welsh-bohmer, k. a., touchon, j., vellas, b., & aisen, p. ( ). combination therapy for alzheimer’s disease: perspectives of the eu/us ctad task force. j prev alzheimers dis, ( ), – .
hao, w., & friedman, a. ( ). mathematical model on alzheimer’s disease. bmc syst biol, ( ), .
henley, d., raghavan, n., sperling, r., aisen, p., raman, r., & romano, g. ( ). preliminary results of a trial of atabecestat in preclinical alzheimer’s disease. n. engl. j. med., ( ), – .
honig, l. s., vellas, b., woodward, m., boada, m., bullock, r., borrie, m., hager, k., andreasen, n., scarpini, e., liu-seifert, h., case, m., dean, r. a., hake, a., sundell, k., poole hoffmann, v., carlson, c., khanna, r., mintun, m., demattos, r., … siemers, e. ( ). trial of solanezumab for mild dementia due to alzheimer’s disease. n. engl. j. med., ( ), – .
howard, r., & liu, k. y. ( ). questions emerge as biogen claims aducanumab turnaround. nat rev neurol, ( ), – .
insel, p. s., mormino, e. c., aisen, p. s., thompson, w. k., & donohue, m. c. ( ). neuroanatomical spread of amyloid β and tau in alzheimer’s disease: implications for primary prevention. brain communications, ( ). https://doi.org/ . /braincomms/fcaa
iturria-medina, y., sotero, r. c., toussaint, p. j., mateos-pérez, j. m., evans, a. c., & initiative, a. d. n. ( ). early role of vascular dysregulation on late-onset alzheimer’s disease based on multifactorial data-driven analysis. nat commun, , .
iturria-medina, yasser, carbonell, f. m., sotero, r. c., chouinard-decorte, f., & evans, a. c. ( ). multifactorial causal model of brain (dis)organization and therapeutic intervention: application to alzheimer’s disease. neuroimage, , – . https://doi.org/ . /j.neuroimage. . .
jack, c. r., bennett, d. a., blennow, k., carrillo, m. c., dunn, b., haeberlein, s. b., holtzman, d.
m., jagust, w., jessen, f., karlawish, j., liu, e., molinuevo, j. l., montine, t., phelps, c., rankin, k. p., rowe, c. c., scheltens, p., siemers, e., snyder, h. m., … silverberg, n. ( ). nia-aa research framework: toward a biological definition of alzheimer’s disease. alzheimers dement, ( ), – .
jack, c. r., & holtzman, d. m. ( ). biomarker modeling of alzheimer’s disease. neuron, ( ), – .
jack, c. r., knopman, d. s., jagust, w. j., petersen, r. c., weiner, m. w., aisen, p. s., shaw, l. m., vemuri, p., wiste, h. j., weigand, s. d., lesnick, t. g., pankratz, v. s., donohue, m. c., & trojanowski, j. q. ( ). tracking pathophysiological processes in alzheimer’s disease: an updated hypothetical model of dynamic biomarkers. lancet neurol, ( ), – .
jedynak, b. m., lang, a., liu, b., katz, e., zhang, y., wyman, b. t., raunig, d., jedynak, c. p., caffo, b., & prince, j. l. ( ). a computational neurodegenerative disease progression score: method and results with the alzheimer’s disease neuroimaging initiative cohort. neuroimage, ( ), – .
kaufman, s. k., del tredici, k., thomas, t. l., braak, h., & diamond, m. i. ( ). tau seeding activity begins in the transentorhinal/entorhinal regions and anticipates phospho-tau pathology in alzheimer’s disease and part. acta neuropathologica, ( ), – . https://doi.org/ . /s - - -
kim, j., basak, j. m., & holtzman, d. m. ( ). the role of apolipoprotein e in alzheimer’s disease. neuron, ( ), – .
klein, g., delmar, p., voyle, n., rehal, s., hofmann, c., abi-saab, d., andjelkovic, m., ristic, s., wang, g., bateman, r., kerchner, g. a., baudler, m., fontoura, p., & doody, r. ( ). gantenerumab reduces amyloid-β plaques in patients with prodromal to moderate alzheimer’s disease: a pet substudy interim analysis. alzheimer’s research & therapy, ( ), . https://doi.org/ . /s - - -z
kochhann, r., varela, j. s., lisboa, c. s. m., & chaves, m. l. f. ( ).
the mini mental state examination: review of cutoff points adjusted for schooling in a large southern brazilian sample. dement neuropsychol, ( ), – .
koval, i., schiratti, j.-b., routier, a., bacci, m., colliot, o., allassonnière, s., & durrleman, s. ( ). spatiotemporal propagation of the cortical atrophy: population and individual patterns. frontiers in neurology, , . https://doi.org/ . /fneur. .
li, d., iddi, s., thompson, w. k., & donohue, m. c. ( ). bayesian latent time joint mixed effect models for multicohort longitudinal data. stat methods med res, ( ), – .
lorenzi, m., filippone, m., frisoni, g. b., alexander, d. c., & ourselin, s. ( ). probabilistic disease progression modeling to characterize diagnostic uncertainty: application to staging and prediction in alzheimer’s disease. neuroimage. https://doi.org/ . /j.neuroimage. . .
lorenzi, m., pennec, x., frisoni, g. b., & ayache, n. ( ). disentangling normal aging from alzheimer’s disease in structural magnetic resonance images. neurobiology of aging, , s –s . https://doi.org/ . /j.neurobiolaging. . .
marinescu, r. v., eshaghi, a., lorenzi, m., young, a. l., oxtoby, n. p., garbarino, s., crutch, s. j., & alexander, d. c. ( ). dive: a spatiotemporal progression model of brain pathology in neurodegenerative disorders. neuroimage, , – .
murphy, m. p., & levine, h. ( ). alzheimer’s disease and the amyloid-beta peptide. j. alzheimers dis., ( ), – .
nader, c. a., ayache, n., robert, p., lorenzi, m., & initiative, a. d. n. ( ).
monotonic gaussian process for spatio-temporal disease progression modeling in brain imaging data. neuroimage, . https://doi.org/ . /j.neuroimage. .
oxtoby, n. p., garbarino, s., firth, n. c., warren, j. d., schott, j. m., & alexander, d. c. ( ). data-driven sequence of changes to anatomical brain connectivity in sporadic alzheimer’s disease. front neurol, , .
oxtoby, n. p., young, a. l., cash, d. m., benzinger, t. l. s., fagan, a. m., morris, j. c., bateman, r. j., fox, n. c., schott, j. m., & alexander, d. c. ( ). data-driven models of dominantly-inherited alzheimer’s disease progression. brain, ( ), – .
paszke, a., gross, s., massa, f., lerer, a., bradbury, j., chanan, g., killeen, t., lin, z., gimelshein, n., antiga, l., desmaison, a., kopf, a., yang, e., devito, z., raison, m., tejani, a., chilamkurthy, s., steiner, b., fang, l., … chintala, s. ( ). pytorch: an imperative style, high-performance deep learning library. in h. wallach, h. larochelle, a. beygelzimer, f. d'alché-buc, e. fox, & r. garnett (eds.), advances in neural information processing systems (pp. – ). curran associates, inc. http://papers.neurips.cc/paper/ -pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
petrella, j. r., hao, w., rao, a., & doraiswamy, p. m. ( ). computational causal modeling of the dynamic biomarker cascade in alzheimer’s disease. comput math methods med, , .
pontecorvo, m. j., devous, m. d., kennedy, i., navitsky, m., lu, m., galante, n., salloway, s., doraiswamy, p. m., southekal, s., arora, a. k., mcgeehan, a., lim, n. c., xiong, h., truocchio, s.
p., joshi, a. d., shcherbinin, s., teske, b., fleisher, a. s., & mintun, m. a. ( ). a multicentre longitudinal study of flortaucipir ( f) in normal ageing, mild cognitive impairment and alzheimer’s disease dementia. brain, ( ), – .
prince, m. j., wimo, a., guerchet, m. m., ali, g. c., wu, y.-t., & prina, m. ( ). world alzheimer report - the global impact of dementia: an analysis of prevalence, incidence, cost and trends. alzheimer’s disease international.
reuter, m., schmansky, n. j., rosas, h. d., & fischl, b. ( ). within-subject template estimation for unbiased longitudinal image analysis. neuroimage, ( ), – .
rowe, c. c., ellis, k. a., rimajova, m., bourgeat, p., pike, k. e., jones, g., fripp, j., tochon-danguy, h., morandeau, l., o’keefe, g., price, r., raniga, p., robins, p., acosta, o., lenzo, n., szoeke, c., salvado, o., head, r., martins, r., … villemagne, v. l. ( ). amyloid imaging results from the australian imaging, biomarkers and lifestyle (aibl) study of aging. neurobiol. aging, ( ), – .
safieh, m., korczyn, a. d., & michaelson, d. m. ( ). apoe : an emerging therapeutic target for alzheimer’s disease. bmc medicine, .
schiratti, j.-b., allassonnière, s., colliot, o., & durrleman, s. ( ). learning spatiotemporal trajectories from manifold-valued longitudinal data. nips, – .
schuff, n., woerner, n., boreta, l., kornfield, t., shaw, l. m., trojanowski, j. q., thompson, p. m., jack, c. r., & weiner, m. w. ( ). mri of hippocampal volume loss in early alzheimer’s disease in relation to apoe genotype and biomarkers. brain, (pt ), – .
schwarz, a. j., sundell, k.
l., charil, a., case, m. g., jaeger, r. k., scott, d., bracoud, l., oh, j., suhy, j., pontecorvo, m. j., dickerson, b. c., & siemers, e. r. ( ). magnetic resonance imaging measures of brain atrophy from the expedition trial in mild alzheimer’s disease. alzheimer’s & dementia: translational research & clinical interventions, ( ), – . https://doi.org/ . /j.trci. . .
sivera, r., capet, n., manera, v., fabre, r., lorenzi, m., delingette, h., pennec, x., ayache, n., & robert, p. ( ). voxel-based assessments of treatment effects on longitudinal brain changes in the multidomain alzheimer preventive trial cohort. neurobiology of aging, , – . https://doi.org/ . /j.neurobiolaging. . .
sperling, r. a., jack, c. r., & aisen, p. s. ( ). testing the right target and right drug at the right stage. sci transl med, ( ), cm .
villemagne, v. l., burnham, s., bourgeat, p., brown, b., ellis, k. a., salvado, o., szoeke, c., macaulay, s. l., martins, r., maruff, p., ames, d., rowe, c. c., & masters, c. l. ( ). amyloid β deposition, neurodegeneration, and cognitive decline in sporadic alzheimer’s disease: a prospective cohort study. lancet neurol, ( ), – .
wessels, a. m., tariot, p. n., zimmer, j. a., selzler, k. j., bragg, s. m., andersen, s. w., landry, j., krull, j. h., downing, a. m., willis, b. a., shcherbinin, s., mullen, j., barker, p., schumi, j., shering, c., matthews, b. r., stern, r. a., vellas, b., cohen, s., … sims, j. r. ( ). efficacy and safety of lanabecestat for treatment of early and mild alzheimer disease: the amaranth and daybreak-alz randomized clinical trials. jama neurol.
westwood, s., leoni, e., hye, a., lynham, s., khondoker, m. r., ashton, n. j., kiddle, s. j., baird, a. l., sainz-fuertes, r., leung, r., graf, j., hehir, c. t., baker, d., cereda, c., bazenet, c., ward, m., thambisetty, m., & lovestone, s. ( ). blood-based biomarker candidates of cerebral amyloid using pib pet in non-demented elderly. j. alzheimers dis., ( ), – .
young, a. l., oxtoby, n. p., daga, p., cash, d. m., fox, n. c., ourselin, s., schott, j. m., & alexander, d. c. ( ). a data-driven model of biomarker changes in sporadic alzheimer’s disease. brain, (pt ), – .
zetterberg, h., & burnham, s. c. ( ). blood-based molecular biomarkers for alzheimer’s disease. molecular brain, ( ), . https://doi.org/ . /s - - -