Interpretable detection of novel human viruses from genome sequencing data

Published online DD MM YYYY
Preprint, YYYY, Vol. xx, No. xx, 1–14

Jakub M. Bartoszewicz 1,2,3,4,∗, Anja Seidel 1,2 and Bernhard Y. Renard 1,3,4,∗

1 Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany
2 Department of Mathematics and Computer Science, Free University of Berlin, Berlin, Germany
3 Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, Potsdam, Brandenburg, Germany
4 Digital Engineering Faculty, University of Potsdam, Potsdam, Brandenburg, Germany
5 Current address: Central Research Institute of Ambulatory Health Care, Berlin, Germany

Received YYYY-MM-DD; Revised YYYY-MM-DD; Accepted YYYY-MM-DD

ABSTRACT

Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision.
Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the SARS-CoV-2 coronavirus, unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.

∗To whom correspondence should be addressed. Tel: +49 331 5509 4960; Email: jakub.bartoszewicz@hpi.de, bernhard.renard@hpi.de

INTRODUCTION

Background

Within a globally interconnected and densely populated world, pathogens can spread more easily than ever before. As the recent outbreaks of Ebola and Zika viruses have shown, the risks posed even by these previously known agents remain unpredictable and their expansion hard to control (1). What is more, it is almost certain that more unknown pathogen species and strains are yet to be discovered, given their constant, extremely fast-paced evolution and unexplored biodiversity, as well as increasing human exposure (2, 3). Some of those novel pathogens may cause epidemics (similar to the SARS and MERS coronavirus outbreaks in 2002 and 2012) or even pandemics (e.g. SARS-CoV-2 and the "swine flu" H1N1/09 strain). Many have more than one host or vector, which makes assessing and predicting the risks even more difficult. For example, Ebola has its natural reservoir most likely in fruit bats (4), but causes deadly epidemics in both humans and chimpanzees. As the state-of-the-art approach for the open-view detection of pathogens is genome sequencing (5, 6), it is crucial to develop automated pipelines for characterizing the infectious potential of currently unidentifiable sequences. In practice, clinical samples are dominated by host reads and contaminants, with often less than a hundred reads of the pathogenic virus (7).
Metagenomic assembly is challenging, especially in time-critical applications. This creates a need for read-based approaches complementing or substituting assembly where needed. Screening against potentially dangerous subsequences before their synthesis may also be used as a way of ensuring responsible research in synthetic biology. While potentially useful in some applications, engineering of viral genomes could also pose a biosecurity and biosafety threat. Two controversial studies modified the influenza A/H5N1 ("bird flu") virus to be airborne transmissible in mammals (8, 9). A possibility of modifying coronaviruses to enhance their virulence triggered calls for a moratorium on this kind of research (10). Synthesis of an infectious horsepox virus closely related to the smallpox-causing Variola virus (11) caused a public uproar and calls for intensified discussion on risk control in synthetic biology (12).

© YYYY The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

bioRxiv preprint, doi: https://doi.org/10.1101/2020.01.29.925354; this version posted January 4, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license (http://creativecommons.org/licenses/by-nd/4.0/).

Current tools for host range prediction

Several computational, genome-based methods make it possible to predict the host range of a bacteriophage (a bacteria-infecting virus).
A selection of composition-based and alignment-based approaches has been presented in an extensive review by Edwards et al. (13). Prediction of eukaryotic host tropism (including humans) based on known protein sequences was shown for the influenza A virus (14). Support-vector machines based on word2vec representations were shown to outperform homology searches with BLAST and HMMs in the same task, but lost their advantage when applied to nucleic acid sequences directly (15). Two recent studies employ k-mer-based k-NN classifiers (16) and deep learning (17) to predict host range for a small set of three well-studied species directly from viral sequences. While those approaches are limited to those particular species and do not scale to viral host-range prediction in general, the Host Taxon Predictor (HTP) (18) uses logistic regression and support vector machines to predict if a novel virus infects bacteria, plants, vertebrates or arthropods. Yet, the authors argue that it is not possible to use HTP in a read-based manner; it requires long sequences of at least 3,000 nucleotides. This is incompatible with modern metagenomic next-generation sequencing (NGS) workflows, where the DNA reads obtained are at least 10-20 times shorter. Another study used gradient boosting machines to predict reservoir hosts and transmission via arthropod vectors for known human-infecting viruses (19). Zhang et al. (20) designed several classifiers explicitly predicting whether a new virus can potentially infect humans. Their best model, a k-NN classifier, uses k-mer frequencies as features representing the query sequence and can yield predictions for sequences as short as 500 base pairs (bp). It also worked with 150bp-long reads from real DNA sequencing runs, although in this case the reads also originated from the viruses present in the training set (and were therefore not "novel").

Deep Learning for DNA sequences

While DNA sequences mapped to a reference genome may be represented as images (21), the majority of studies use a distributed orthographic representation, where each nucleotide {A,C,G,T} in a sequence is represented by a one-hot encoded vector of length 4. An "unknown" nucleotide (N) can be represented as an all-zero vector. Chaos game representation (CGR) and its extension, the frequency matrix CGR (FCGR), are promising alternatives able to encode an arbitrary sequence in an image-like format. FCGR has been used to encode genomic inputs for deep learning approaches, including full bacterial genomes (22) and coding sequences of HIV for the drug resistance prediction task (23). In this study, we use one-hot encoding with Ns as zeroes, which was previously shown to perform well for raw NGS reads (24) and abstract phenotype labels. CNNs and LSTMs have been successfully used for a variety of DNA-based prediction tasks. Early works focused mainly on regulation of gene expression in humans (25, 26, 27, 28, 29), which is still an area of active research (30, 31, 32). In the field of pathogen genomics, deep learning models trained directly on DNA sequences were developed to predict host ranges of three multi-host viral species (33) and to predict pathogenic potentials of novel bacteria (24). DeepVirFinder (34) and ViraMiner (35) can detect viral sequences in metagenomic samples, but they cannot predict the host and focus on previously known species. For a broader view on deep learning in genomics we refer to a recent review by Eraslan et al. (36). Interpretability and explainability of deep learning models for genomics are crucial for their widespread adoption, as they are necessary for delivering trustworthy and actionable results.
Convolutional filters can be visualized by forward-passing multiple sequences through the network and extracting the most-activating subsequences (25) to create a position weight matrix (PWM) which can be visualized as a sequence logo (37, 38). Direct optimization of input sequences is problematic, as it results in generating a dense matrix even though the input sequences are one-hot encoded (39, 40). This problem can be alleviated with Integrated Gradients (41, 42) or DeepLIFT, which propagates activation differences relative to a selected reference back to the input, reducing the computational overhead of obtaining accurate gradients (43). If the bias terms are zero and a reference of all-zeros is used, the method is analogous to Layer-wise Relevance Propagation (44). DeepLIFT is an additive feature attribution method, and may be used to approximate Shapley values if the input features are independent (45). TF-MoDISco (46) uses DeepLIFT to discover consolidated, biologically meaningful DNA motifs (transcription factor binding sites).

Contributions

In this paper, we first improve the performance of read-based predictions of the viral host (human or non-human) from next-generation sequencing reads. We show that reverse-complement (RC) neural networks (24) significantly outperform both the previous state-of-the-art (20) and BLAST (47, 48), the traditional alignment-based algorithm, which constitutes a gold standard in homology-based bioinformatics analyses. We show that defining the negative (non-human) class is non-trivial and compare different ways of constructing the training set. Strikingly, a model trained to distinguish between viruses infecting humans and viruses infecting other chordates (a phylum of animals including vertebrates) generalizes well to evolutionarily distant non-human hosts, including even bacteria.
This suggests that the host-related signal is strong and the learned decision boundary separates human viruses from other DNA sequences surprisingly well. Next, we propose a new approach for convolutional filter visualization using partial Shapley values to differentiate between simple nucleotide information content and the contribution of each sequence position to the final classification score. To test the biological plausibility of our models, we generate genome-wide maps of "infectious potential" and nucleotide contributions. We show that those maps can be used to visualize and detect virulence-related regions of interest (e.g. genes) in novel genomes. As a proof of concept, we analyzed one of the viruses randomly assigned to the test set: the Taï Forest ebolavirus, which has a history of host-switching and can cause a serious disease. To show that the method can also be used for other biological problems, we investigated the networks trained by Bartoszewicz et al. (24) and their predictions on a genome of a pathogenic bacterium, Staphylococcus aureus. The authors used this particular species to assess the performance of their method on real sequencing data. Finally, we studied the SARS-CoV-2 coronavirus, which emerged in December 2019, causing the COVID-19 pandemic (49).

MATERIALS AND METHODS

Data collection and preprocessing

VHDB dataset

We accessed the Virus-Host Database (50) on July 31, 2019 and downloaded all the available data.
We note that all the reference genomes from NCBI Viral Genomes are present in VHDB, as well as their curated annotations from RefSeq. Additional, manually curated records in VHDB extend the metadata available in NCBI. More non-reference genomes are available, but considering multiple genomes per virus would skew the classifiers’ performance towards the more frequently resequenced ones. The original dataset contained 14,380 records comprising RefSeq IDs for viral sequences and associated metadata. Some viruses are divided into discontiguous segments, which are represented as separate records in VHDB; in those cases the segments were treated as contigs of a single genome in the further analysis. We removed records with unspecified host information and those confusing the highly pathogenic Variola virus with a similarly named genus of fish. Following Zhang et al. (20), we filtered out viroids and satellites, which are classified as subviral agents and not bona fide viruses (51, 52). Note that even though they require helper viruses for replication, this step did not affect ubiquitous adeno-associated viruses and large virophages, which are well established within the viral taxonomy in the families Parvoviridae and Lavidaviridae, respectively. Human-infecting viruses were extracted by searching for records containing "Homo sapiens" in the "host name" field. Note that VHDB contains information about multiple possible hosts for a given virus where appropriate. Any virus infecting humans was assigned to the positive class, even if other, non-human hosts exist. In total, the dataset contained 9,496 viruses (grouped in 7,503 species), including 1,309 human viruses (393 species). We considered both DNA and RNA viruses; RNA sequences were encoded in the DNA alphabet, as in RefSeq.

Defining the negative class

While defining a human-infecting class is relatively straightforward, the reference negative class may be conceptualized in a variety of ways.
The broadest definition takes all non-human viruses into account, including bacteriophages (bacterial viruses). This is especially important, as most known bacteriophages are DNA viruses, while many important human (and animal) viruses are RNA viruses. One could expect that the multitude of available bacteriophage genomes dominating the negative class could lower the prediction performance on viruses similar to those infecting humans. This offers an open-view approach covering a wider part of the sequence space, but may lead to misclassification of potentially dangerous mammalian or avian viruses. As they are often involved in clinically relevant host-switching events, a stricter approach must also be considered. In this case, the negative class comprises only viruses infecting Chordata (a group containing vertebrates and closely related taxa). Two intermediate approaches consider all eukaryotic viruses (including plant and fungal viruses), or only animal-infecting viruses. This amounts to four nested host sets: "All" (8,187 non-human viruses, 7,110 species), "Eukaryota" (5,114 viruses, 4,275 species), "Metazoa" (2,942 viruses, 2,351 species) and "Chordata" (2,078 viruses, 1,530 species). Auxiliary sets containing only non-eukaryotic viruses ("non-Eukaryota"), non-animal eukaryotic viruses ("non-Metazoa Eukaryota") etc. can be easily constructed by set subtraction. For the positive class, we randomly generated a training set containing 80% of the genomes, and validation and test sets with 10% of the genomes each. Importantly, the nested structure was also kept during the training-validation-test split: for example, the species assigned to the smallest test set ("Chordata") were also present in all the bigger test sets. The same applied to other taxonomic levels, as well as the training and validation sets wherever applicable.
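
One simple way to obtain a split that respects the nested host sets is to assign every virus to a subset exactly once and then derive each host-specific subset by filtering. The sketch below illustrates this idea on toy identifiers; the function names (`assign_splits`, `subset`) and the toy sets are hypothetical, and the code is not taken from the authors' pipeline. Note that this simplification fixes the 80/10/10 proportions only for the largest set, not within each nested set.

```python
import random


def assign_splits(virus_ids, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Assign each virus to 'train', 'val' or 'test' exactly once."""
    ids = sorted(virus_ids)              # fixed order, so the seed is reproducible
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    split = {}
    for rank, virus in enumerate(ids):
        if rank < n_train:
            split[virus] = "train"
        elif rank < n_train + n_val:
            split[virus] = "val"
        else:
            split[virus] = "test"
    return split


def subset(split, host_set, part):
    """Members of a host set that fall into the given part of the split."""
    return {virus for virus in host_set if split[virus] == part}


# Toy nested host sets (hypothetical identifiers): Chordata < Metazoa < Eukaryota < All.
chordata = {f"chordata_{i}" for i in range(20)}
metazoa = chordata | {f"metazoa_{i}" for i in range(20)}
eukaryota = metazoa | {f"eukaryota_{i}" for i in range(20)}
all_hosts = eukaryota | {f"phage_{i}" for i in range(40)}

split = assign_splits(all_hosts)
chordata_test = subset(split, chordata, "test")
all_test = subset(split, all_hosts, "test")
```

Because each virus carries a single label, any virus in the "Chordata" test subset is automatically part of every larger test subset, which is the nesting property described above.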

Read simulation

We simulated 250bp-long Illumina reads following a modification of a previously described protocol (24) and using the Mason read simulator (53). First, we only generated the reads from the genomes of human-infecting viruses. Then, the same steps were applied to each of the four negative class sets. Finally, we also generated a fifth set, "Stratified", containing an equal number of reads drawn from genomes of the following disjoint host classes: "Chordata" (25%), "non-Chordata Metazoa" (25%), "non-Metazoa Eukaryota" (25%) and "non-Eukaryota" (25%). In each of the evaluated settings, we used a total of 20 million (80%) reads for training, 2.5 million (10%) reads for validation and 2.5 million (10%) paired reads as the held-out test set. Read number per genome was proportional to genome length, keeping the coverage uniform on average. Viruses with longer genomes were therefore represented by more reads than shorter viruses. On the other hand, their sequence diversity was covered at a similar level. This length-balancing step was previously shown to work well for bacterial genomes of different lengths (24, 54). While the original datasets are heavily imbalanced, we generated the same number of negative and positive data points (reads) regardless of the negative class definition used. This protocol allowed us to test the impact of defining the negative class, while using exactly the same data as representatives of the positive class. We used three training and validation sets ("All", "Stratified", and "Chordata"), representing the fully open-view setting, a setting more balanced with regard to the host taxonomy, and a setting focused on cases most likely to be clinically relevant. In each setting, the validation set matched the composition of the training set. The evaluation was performed using all five test sets to gain a more detailed insight into the effects of negative class definition on the prediction performance.
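
The length-balancing step can be illustrated with a short calculation: if the number of reads per genome is proportional to genome length, the expected coverage is the same for every genome. The following is a minimal sketch under that assumption; `reads_per_genome`, `mean_coverage` and the toy genome names are hypothetical, and the actual invocation of the Mason simulator is not shown.

```python
def reads_per_genome(genome_lengths, total_reads):
    """Split a read budget across genomes proportionally to their lengths.

    Uses integer arithmetic and hands out the few reads lost to rounding
    to the genomes with the largest remainders, so counts sum exactly.
    """
    total_length = sum(genome_lengths.values())
    counts, remainders = {}, {}
    for name, length in genome_lengths.items():
        counts[name], remainders[name] = divmod(total_reads * length, total_length)
    leftover = total_reads - sum(counts.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


def mean_coverage(n_reads, read_length, genome_length):
    """Average number of reads covering each position of a genome."""
    return n_reads * read_length / genome_length


# Toy genomes (lengths in bp); the names are placeholders.
toy_lengths = {"virus_a": 10_000, "virus_b": 30_000, "virus_c": 60_000}
toy_counts = reads_per_genome(toy_lengths, total_reads=1000)
toy_coverage = {
    name: mean_coverage(toy_counts[name], 250, toy_lengths[name])
    for name in toy_lengths
}
```

In this toy case the longest genome receives six times as many reads as the shortest one, yet all three end up with the same mean coverage.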

Human blood virome dataset

Similarly to Zhang et al. (20), we used the human blood DNA virome dataset (55) to test the selected classifiers on real data. We obtained 14,242,329 reads of 150bp and searched all of VHDB using blastn (with default parameters) to obtain high-quality reference labels. If a read’s best hit was a human-infecting virus, we assigned it to the positive class; otherwise, we assigned it to the negative class. This procedure yielded 14,012,665 "positive" and 229,664 "negative" reads.

Virus-level and species-level predictions

In this study, we focus on predicting labels for reads originating from novel viruses. What constitutes a "novel" biological entity is an open question: a novel virus does not necessarily belong to a novel species (56). If a given viral isolate clusters with a known group of isolates, it is considered to be the same virus; if it does not, it may be assigned a distinct name and considered novel (56). This is separate from its putative taxonomical assignment. Assigning a novel virus to a novel or a previously established species is performed pursuing a wider set of criteria, and the criteria for delineating distinct species differ between viral families (51, 52, 56, 57).
In most cases, species are perceived as human constructs rather than biological entities and host range often is explicitly one of the defining features (56, 58), rendering reasoning based on cross-species homology searches inherently difficult. The most prominent example of this problem is the SARS-CoV-2 virus, which is a novel virus within a previously known species (Severe acute respiratory syndrome–related coronavirus). Other members of this species include the human-infecting SARS-CoV-1, but also multiple related bat SARSr-CoV viruses (e.g. SARSr-CoV RaTG13 or Bat SARS-like coronavirus WIV1). Importantly, SARS-CoV-2 is not a strain of SARS-CoV-1; those two viruses share a common ancestor (56). This echoes similar problems related to pathogenic potential prediction for novel bacterial pathogens. A novel bacterium may be defined as a novel strain or a novel species (24), and the classifiers must be trained according to the desired definition. As the 2020 pandemic has shown, different viruses of the same species can differ wildly in their infectious potential and the broader impact on human societies. Therefore, threat assessment must be performed for novel viruses, not only novel taxa; different related viruses are non-redundant. At the same time, redundancy below this level (i.e. multiple instances of the same virus) must be eliminated from the dataset to ensure reliability of the trained classifier. VHDB tackles this problem by collecting and annotating reference genomes: each virus in the database is a separate entity with its own ID in NCBI Taxonomy. This virus-level approach was previously used by Zhang et al. (20). We show that homology-based algorithms underperform in this setting already, suggesting that machine learning is indeed required to accurately predict labels for novel viruses even if other members of the same species are present in the training database.
Nevertheless, a more difficult alternative, namely predictions for reads of viruses belonging to completely novel species, is a related and potentially equally important task. For bacterial datasets, species novelty can be modelled by selecting a single representative genome per species (24). As the SARS-CoV-2 example shows, this is often not possible for viruses. To assess our approach in this stricter setup, we re-divided the VHDB dataset into training, validation and test sets ensuring that all viruses of a given species were assigned to only one of those subsets. This effectively models a "novel species" scenario while also reflecting within-species phenotype diversity. We recreated the species-wide versions of the "All" and "Chordata" datasets by assigning 80%, 10% and 10% of the species to the training, validation and test datasets, respectively. We resimulated the reads as outlined above and compared the performance of the machine learning and homology-based approaches achieving the highest accuracy in the simpler "novel virus" setting (see Section Prediction performance).

Training

We used the DeePaC package (24) to investigate RC-CNN and RC-LSTM architectures, which guarantee identical predictions for both forward and reverse-complement orientations of any given nucleotide sequence, and have been previously shown to accurately predict bacterial pathogenicity. Here, we employ an RC-CNN with two convolutional layers with 512 filters of size 15 each, average pooling and 2 fully connected layers with 256 units each. The LSTM used has 384 units (Fig. S1). We use dropout regularization in both cases, together with aggressive input dropout at the rate of 0.2 or 0.25 (tuned for each model). Input dropout may be interpreted as a special case of noise injection, where a fraction of input nucleotides is turned to Ns. Representations of forward and reverse-complement strands are summed before the fully connected layers.
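
The input representation and the two ideas just described (input dropout as "turning nucleotides into Ns", and summing the representations of both strands) can be sketched in plain Python. This is only an illustration of the interpretation given in the text, not the DeePaC implementation: all function names are hypothetical, the A, C, G, T channel order is an assumption, `toy_score` stands in for the layers before the fully connected part, and the rescaling usually applied by dropout layers is omitted.

```python
import random

# Assumed channel order: A, C, G, T. Any other symbol (e.g. N) maps to zeros.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def encode(sequence):
    """One-hot encode a DNA string; unknown nucleotides become all-zero vectors."""
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in sequence.upper()]


def input_dropout(encoded, rate, rng):
    """Replace each position with an 'N' (all zeros) with probability `rate`."""
    return [[0, 0, 0, 0] if rng.random() < rate else row for row in encoded]


def reverse_complement(encoded):
    """Reverse the positions and swap A<->T, C<->G.

    With the A, C, G, T channel order this is a flip along both axes.
    """
    return [row[::-1] for row in encoded[::-1]]


def rc_invariant(score_fn, encoded):
    """Sum the scores of both strands, so either orientation gives the same result."""
    return score_fn(encoded) + score_fn(reverse_complement(encoded))


def toy_score(encoded):
    """Hypothetical strand-specific score: position-weighted count of A's."""
    return sum((pos + 1) * row[0] for pos, row in enumerate(encoded))


read = encode("AACGN")
noisy_read = input_dropout(read, rate=0.25, rng=random.Random(0))
forward_score = rc_invariant(toy_score, read)
reverse_score = rc_invariant(toy_score, reverse_complement(read))
```

Although `toy_score` itself depends on the strand, the summed score is identical for a read and its reverse complement, which is the property the RC architectures guarantee by construction.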
As two mates in a read pair should originate from the same virus, predictions obtained for them can be averaged for a boost in performance. If a contig or genome is available, averaging predictions for constituting reads yields a prediction for the whole sequence. We used Tesla P100 and Tesla V100 GPUs for training and an RTX 2080 Ti for visualizations. We wanted the networks to yield accurate predictions for both 250bp (our data, modelling a sequencing run of an Illumina MiSeq device) and 150bp-long reads (as in the Human Blood Virome dataset). As shorter reads are padded with zeros, we expected the CNNs trained using average pooling to misclassify many of them. Therefore, we prepared a modified version of the datasets, in which the last 100bp of each read were turned to zeros, mimicking a shorter sequencing run while preserving the error model. Then, we retrained the CNN which had performed best on the original dataset. Since in principle, the Human Blood Virome dataset should not contain viruses infecting non-human Chordata, a "Chordata"-trained classifier was not used in this setting.

Benchmarking

We compare our networks to the k-NN classifier proposed by Zhang et al. (20), the only other approach explicitly tested on raw NGS reads and detecting human viruses in a fully open-view setting (not focusing on a limited number of species). We use the real sequencing data that they used (55) for an unbiased comparison. We trained the classifier on the "All" dataset as described by the authors, i.e. using non-overlapping, 500bp-long contigs generated from the training genomes (retraining on simulated reads is computationally prohibitive). We also tested the performance of using BLAST to search against an indexed database of labeled genomes. We constructed the database from the "All" training set and used discontiguous megablast to achieve high inter-species sensitivity.
For NGS mappers (BWA-MEM (59) and Bowtie2 (60)), the indices were constructed analogously. Kraken (61) was previously shown to perform worse than both BLAST and machine learning when faced with read-based pathogenic potential prediction for novel bacterial species (54). Its major advantage, assigning reads to lowest common ancestor (LCA) nodes in ambiguous cases, turns into a problem in the infectivity prediction task, as transferring labels to LCAs is often impossible (54). Therefore, we focus on alignment-based approaches as the most accurate alternative to machine learning in this context. Note that both alignment and k-NN can yield conflicting predictions for the individual mates in a read pair. What is more, BLAST and the mappers yield no prediction at all if no match is found. Therefore, similarly to Bartoszewicz et al. (24), we used the "accept anything" operator to integrate binary predictions for read pairs and genomes. At least one match is needed to predict a label, and conflicting predictions are treated as if no match was found at all. Missing predictions lower both true positive and true negative rates.

Filter visualization

Substring extraction

In order to visualize the learned convolutional filters, we downsample a matching test set to 125,000 reads and pass it through the network. This is modelled after the method presented by Alipanahi et al. (25).
For each filter and each input sequence, the authors extracted a subsequence leading to the highest activation, and created sequence logos from the obtained sequence sets ("max-activation"). We used the DeepSHAP implementation (45) of DeepLIFT (43) to extract score-weighted subsequences with the highest contribution score ("max-contrib") or all score-weighted subsequences with non-zero contributions ("all-contrib"). Computing the latter was costly and did not yield better quality logos. We use an all-zero reference. As reads from real sequencing runs are usually not equally long, shorter reads must be padded with Ns; the "unknown" nucleotide is also called whenever there is not enough evidence to assign any other to the raw sequencing signal. Therefore, Ns are "null" nucleotides and are a natural candidate for the reference input. We do not consider alternative solutions based on GC content or dinucleotide shuffling, as the input reads originate from multiple different species, and the sequence composition may itself be a strong marker of both virus and host taxonomy (13). We also avoid weight-normalization suggested for zero-references (43), as it implicitly models the expected GC content of all possible input sequences, and assumes no Ns present in the data. Finally, we calculate average filter contributions to obtain a crude ranking of feature importance with regard to both the positive and negative class.

Partial Shapley values

Building sequence logos involves calculating information content (IC) of each nucleotide at each position in a prospective DNA motif. This can then be interpreted as a measure of evolutionary sequence conservation. However, high IC does not necessarily imply that a given nucleotide is relevant in terms of its contribution to the classifier’s output. Some sub-motifs may be present in the sequences used to build the logo, even if they do not contribute to the final prediction (or even a given filter’s activation).
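
For reference, the per-position information content that determines letter heights in a standard DNA sequence logo is commonly computed as 2 bits minus the Shannon entropy of the nucleotide frequencies at that position. The sketch below uses this textbook definition on a toy set of equally weighted subsequences; it omits the small-sample correction and the score weighting used for the "max-contrib" logos, and the function name `logo_information` is hypothetical.

```python
import math


def logo_information(subsequences):
    """Information content (in bits) at each position of aligned DNA subsequences.

    IC = log2(4) - H, where H is the Shannon entropy of the observed
    A/C/G/T frequencies. Ns are ignored; a position with no called
    nucleotides gets an IC of 0.
    """
    length = len(subsequences[0])
    information = []
    for pos in range(length):
        counts = {base: 0 for base in "ACGT"}
        for seq in subsequences:
            if seq[pos] in counts:
                counts[seq[pos]] += 1
        total = sum(counts.values())
        if total == 0:
            information.append(0.0)
            continue
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values() if c > 0
        )
        information.append(2.0 - entropy)
    return information


# Toy subsequences: position 0 is fully conserved, position 3 is uniform.
toy_ic = logo_information(["ACGT", "ACGA", "ACCC", "AGTG"])
```

A fully conserved position reaches the maximum of 2 bits regardless of whether it matters for the prediction, which is why IC alone cannot show a nucleotide's relevance to the classifier's output.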
To test whether high-IC nucleotides actually contribute to the prediction, we introduce partial Shapley values. Intuitively speaking, we capture the contributions of a nucleotide to the network’s output, but only in the context of a given intermediate neuron of the convolutional layer. More precisely, for any given feature $x_i$, intermediate neuron $y_j$ and the output neuron $z$, we aim to measure how $x_i$ contributes to $z$ while regarding only the fraction of the total contribution of $x_i$ that influences how $y_j$ contributes to $z$. Although similarly named concepts were mentioned before as intermediate computation steps in a different context (62, 63), we define and use partial Shapley values to visualize contribution flow through convolutional filters. This differs from recently introduced contribution weight matrices (32), where feature attributions are used as a representation of an identified transcription factor binding site irreducible to a given intermediate neuron. Using the formalism of DeepLIFT’s multipliers (43) and their reinterpretation in SHAP (45), we backpropagate the activation differences only along the paths "passing through" $y_j$. In Eq. 1, we define partial multipliers $\mu^{(y_j)}_{x_i z}$ and express them in terms of Shapley values $\varphi$ and activation differences w.r.t. the expected activation values (reference activation). Calculating partial multipliers is equivalent to zeroing out the multipliers $m_{y_k z}$ for all $k \neq j$ before backpropagating $m_{y_j z}$ further.

$$\mu^{(y_j)}_{x_i z} = m_{x_i y_j}\, m_{y_j z} = \frac{\varphi_i(y_j, x)\,\varphi_j(z, y)}{(x_i - E[x_i])\,(y_j - E[y_j])} \tag{1}$$

We define partial Shapley values $\phi^{(y_j)}_i(z, x)$ analogously to how Shapley values can be approximated by a product of multipliers and input differences w.r.t. the reference (Eq. 2):

$$\phi^{(y_j)}_i(z, x) = \mu^{(y_j)}_{x_i z}\,(x_i - E[x_i]) = \frac{\varphi_i(y_j, x)\,\varphi_j(z, y)}{y_j - E[y_j]} \tag{2}$$

From the chain rule for multipliers (43), it follows that standard multipliers are a sum over all partial multipliers for a given layer $y$. Therefore, Shapley values as approximated by DeepLIFT are a sum of partial Shapley values for the layer $y$ (Eq. 3).
$$\varphi_i(z, x) = m_{x_i z}\,(x_i - E[x_i]) = \sum_j \phi^{(y_j)}_i(z, x) \tag{3}$$

Once we calculate the contributions of convolutional filters for the first layer, $\phi^{(y_j)}_i(z, x)$ for the first convolutional layer of a network with one-hot encoded inputs and an all-zero reference can be efficiently calculated using weight matrices and filter activation differences (Eq. 4-5). First, in this case we do not traverse any non-linearities and can directly use the linear rule (43) to calculate the contributions of $x_i$ to $y_j$ as a product of the weight $w_i$ and the input $x_i$. Second, the input values may only be 0 or 1.

$$\varphi_i(y_j, x) = w_i x_i = \begin{cases} w_i, & \text{if } x_i = 1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$$\phi^{(y_j)}_i(z, x) = \frac{w_i\,\varphi_j(z, y)}{y_j - E[y_j]} \tag{5}$$

Resulting partial contributions can be visualized along the IC of each nucleotide of a convolutional kernel. To this end, we design extended sequence logos, where each nucleotide is colored according to its contribution. Positive contributions are shown in red, negative contributions are blue, and near-zero contributions are gray. Therefore, no information is lost compared to standard sequence logos, but the relevance of individual nucleotides and the filter as a whole can be easily seen. Color saturation is limited by the reciprocal of a user-defined gain parameter, here set to $nm$, where $n$ equals the number of input features $x_i$ (sequence length) and $m$ equals the number of convolutional filters $y_j$ in a given layer.
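The first-layer shortcut in Eq. 4-5 can be illustrated with a minimal sketch. It assumes a single linear unit (one filter at one position) over a flattened one-hot window, an all-zero reference (so the bias cancels in the activation difference), and a contribution of that unit to the output that has already been computed elsewhere, for example with DeepLIFT. The function name, the toy weights and the value 0.8 are illustrative assumptions and are not taken from the DeePaC codebase.

```python
def partial_shapley_first_layer(weights, one_hot, unit_contribution, eps=1e-12):
    """Partial contributions of one-hot inputs to the output z through one unit y_j.

    weights           -- weights w_i of the unit, flattened like the input window
    one_hot           -- flattened one-hot window (values 0 or 1; an N is all zeros)
    unit_contribution -- contribution of y_j to z, assumed to be precomputed
    """
    # Eq. 4: with an all-zero reference, each input contributes w_i * x_i to y_j.
    input_contribs = [w * x for w, x in zip(weights, one_hot)]
    # Activation difference y_j - E[y_j]; the bias appears in both terms and cancels.
    delta = sum(input_contribs)
    if abs(delta) < eps:
        # The unit does not differ from its reference, so nothing flows through it.
        return [0.0] * len(one_hot)
    # Eq. 5: rescale each input's share by the unit's contribution to the output.
    return [c * unit_contribution / delta for c in input_contribs]


# Toy filter of width 2 over the alphabet A, C, G, T (8 weights, hypothetical values).
weights = [0.5, -0.2, 0.1, 0.0,   # position 1: A, C, G, T
           0.3, 1.5, -0.4, 0.2]   # position 2: A, C, G, T
window_ac = [1, 0, 0, 0, 0, 1, 0, 0]  # one-hot encoding of "AC"
partials = partial_shapley_first_layer(weights, window_ac, unit_contribution=0.8)
print(partials)       # non-zero only at the two observed nucleotides
print(sum(partials))  # adds up to the unit's contribution (up to rounding)
```

By construction, the partial values of a single unit add up to that unit's own contribution, which mirrors the sum in Eq. 3 restricted to one intermediate neuron. A nucleotide with a negative weight inside an otherwise activating window would receive a partial value of the opposite sign.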
Genome-wide phenotype analysis

We create genome-wide phenotype analysis (GWPA) plots to analyse which parts of a viral genome are associated with the infectious phenotype. We scramble the genome into overlapping, 250 bp long subsequences (pseudo-reads) without adding any sequencing noise. For the highest resolution, we use a stride of one nucleotide. For S. aureus, we used a stride of 125 bp. We predict the infectious potential of each pseudo-read and average the obtained values at each position of the genome. Analogously, we calculate average contributions of each nucleotide to the final prediction of the convolutional network. Finally, we normalize raw infectious potentials into the [−0.5,0.5] interval for a more intuitive graphical representation. We visualize the resulting nucleotide-resolution maps with IGV (64). For protein structures, we average the scores codon-wise to obtain contribution scores per amino acid and visualize them with PyMOL (65). For well-annotated genomes, we compile a ranking of genes (or other genomic features) sorted by the average infectious potential within a given region. In addition to that, we scan the genome with the learned filters of the first convolutional layer to find genes enriched in subsequences yielding non-zero filter activations. We use Gene Ontology to connect the identified genes of interest with their molecular functions and biological processes they are engaged in.

RESULTS

Negative class definition

Choosing which viruses should constitute the negative class is application dependent and influences the performance of the trained models. Table S1 summarizes the prediction accuracy for different combinations of the training and test set composition. The models trained only on human and Chordata-infecting viruses maintain similar, or even better performance when evaluated on viruses infecting a much broader host range, including bacteria.
This suggests that the learned decision boundary separates human viruses from all the others surprisingly well. We hypothesize that the human host signal must be relatively strong and contained within the Chordata host signal. A dropout rate of 0.2 resulted in the highest validation accuracy for CNNStr-150 and LSTMStr. A rate of 0.25 was selected for the other models. Adding more diversity to the negative class may still boost performance on more diverse test sets, as in the case of the CNN trained on the "All" dataset (CNNAll). This model performs slightly worse on viruses infecting hosts related to humans, but achieves higher accuracy than the "Chordata"-trained models and the best recall overall. Rebalancing the negative class using the "Stratified" dataset helps to achieve higher performance on animal viruses while maintaining high overall accuracy. The LSTMs are outperformed by the CNNs, but they can be used for shorter reads without retraining (see Sections Training and Prediction performance).

Prediction performance

We selected LSTMAll and CNNAll for further evaluation. We used a single consumer-grade RTX 2080 Ti GPU to measure inference speed. The CNN classifies 5000 reads/s and the LSTM 1855 reads/s. Analyzing ten million reads takes only 33 minutes using the faster model; linear speed-ups are possible if more GPUs are available. Therefore, the trained models achieve the high throughput necessary to analyze NGS datasets. Table 1 presents the results of a benchmark using the "All" test set. Low performance of the k-NN classifier (20) is caused by frequent conflicting predictions for each read in a read pair. In a single-read setting it achieves 75.5% accuracy, while our best model achieves 87.8% (Table S2). Although BLAST achieves high precision, it yields no predictions for over 10% of the samples. CNNAll is the most sensitive and accurate.
As expected, standard mapping approaches (BWA-MEM and Bowtie2) struggle with analysing novel pathogens – they are the most precise but the least sensitive. Our approach outperforms them by 15-30%. Although we focus on the extreme case of read-based predictions, our method can also be used on assembled contigs and full genomes if they are available, as well as on read sets from pure, single-virus samples. We note that assembly itself does not yield any labels and a follow-up analysis (via alignment, machine learning or other approaches) is required to correctly classify metagenomic contigs in any case. We ran predictions on contigs without any size filtering with both k-NN and BLAST (Table 2). We present performance measures for both individual contigs and whole genome predictions based on contig-wise majority vote. We compare them to BLAST with read-wise majority vote (54) and to read-wise average predictions of our networks, analogous to those presented previously for bacteria (24). Our method outperforms BLAST by 1.2% and k-NN by 8.9%, even though they have access to the full biological context (full sequences of all contigs in a genome), while we simply average outputs for short reads originating from the contigs.

Table 1. Classification performance in the fully open-view setting (all virus hosts), read pairs. Acc. – accuracy, Prec. – precision, Rec. – recall, Spec. – specificity. Bowtie2, BWA-MEM and BLAST yield no predictions for over 35%, 19% and 10% of the samples, respectively. Best performance in bold.

| Model | Acc. | Prec. | Rec. | Spec. |
|---|---|---|---|---|
| CNNAll (ours) | 89.9 | 93.9 | 85.4 | 94.4 |
| LSTMAll (ours) | 86.4 | 89.0 | 83.0 | 89.8 |
| k-NN | 57.1 | 57.8 | 52.1 | 62.0 |
| Bowtie2 | 58.6 | 99.2 | 59.2 | 58.0 |
| BWA-MEM | 72.8 | 98.9 | 73.9 | 71.8 |
| BLAST | 80.6 | 98.4 | 79.1 | 82.2 |

We benchmarked our models against the human blood virome dataset used by Zhang et al. (20). Our models outperform their k-NN classifier. As the positive class massively outnumbers the negative class, all models achieve over 99% precision. CNNAll-150 performs best (Table 3). However, the positive class is dominated by viruses which are not necessarily novel. The CNN was more accurate on training data, so we expected it to detect those viruses easily.

Finally, we repeated the analysis in the "novel species" scenario. Classifying novel viral species when restricted to Chordata-infecting viruses is too challenging for practical purposes (Table S3). Read-wise predictions are not much better than random guesses for both BLAST and CNNs. Low precision of BLAST shows that it often recovers wrong labels even when it does find a match – sequence similarity is not a reliable predictor of the infectious potential in this setting. Even if a whole genome is available, overall accuracy is low. The picture is very different in the fully open-view scenario (Table 4). The CNN trained on the species-wise division of the "All" dataset (CNNSP-All) outperforms BLAST by a wide margin on both reads and genomes. Strikingly, CNNSP-All predictions based on a single read pair achieve higher accuracy than BLAST predictions using whole genomes, mainly due to their significantly higher recall. What is more, pooling predictions from all the reads originating from a given genome does not improve overall CNNSP-All accuracy any further.
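Pooling read-level outputs into a genome-level call, as in the comparisons above, can be sketched in a few lines. The snippet is only an illustration under stated assumptions: scores are infectious potentials in [0, 1], the decision threshold is 0.5, and ties in the majority vote fall to the negative class. The function names are hypothetical and do not correspond to the DeePaC API.

```python
def average_call(read_scores, threshold=0.5):
    """Average per-read scores and compare the mean with the threshold."""
    mean_score = sum(read_scores) / len(read_scores)
    return mean_score, mean_score > threshold


def majority_call(read_scores, threshold=0.5):
    """Label each read separately, then take a majority vote over the labels."""
    positive_votes = sum(1 for score in read_scores if score > threshold)
    # Assumption: a tie is resolved in favour of the negative class.
    return 2 * positive_votes > len(read_scores)


scores = [0.9, 0.9, 0.4, 0.4, 0.4]  # made-up read scores for one genome
print(average_call(scores))   # the mean (about 0.6) lies above the threshold
print(majority_call(scores))  # only 2 of 5 reads vote for the positive class
```

The toy input shows that the two rules can disagree when a minority of reads is scored with high confidence, which is one reason to report which aggregation rule a benchmark uses.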
As CNNSP-All does not reliably outperform its Chordata-trained analog on the "Chordata" dataset (CNNSP-Cho, Table S3), we suspect that its relatively high accuracy on the "All" dataset is caused by its high sensitivity while maintaining good specificity on non-Chordata viruses.

Table 2. Classification performance, all hosts. Whole available genomes. Negative class is the majority class. BAcc. – balanced accuracy, Rec. – recall, Spec. – specificity. BLAST (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. k-NN (genome) and BLAST (genome) use contig-wise majority vote. k-NN (contigs) and BLAST (contigs) represent performance on individual contigs treated as separate entities. k-NN (reads) was not used, as high conflicting prediction rates made read-wise aggregation impracticable.

| Model | BAcc. | AUPR | Rec. | Spec. |
|---|---|---|---|---|
| CNNAll (ours) | 91.7 | 91.2 | 89.3 | 94.2 |
| LSTMAll (ours) | 86.3 | 85.8 | 96.2 | 76.4 |
| BLAST (reads) | 90.3 | N/A | 85.5 | 95.1 |
| k-NN (genome) | 82.8 | N/A | 93.9 | 71.6 |
| BLAST (genome) | 90.5 | N/A | 86.3 | 94.6 |
| k-NN (contigs) | 83.0 | N/A | 94.3 | 71.6 |
| BLAST (contigs) | 88.4 | N/A | 87.1 | 89.7 |

Table 3. Classification performance on the human blood virome dataset. Positive class is the majority class. BAcc. – balanced accuracy, Rec. – recall, Spec. – specificity.

| Model | BAcc. | AUPR | Rec. | Spec. |
|---|---|---|---|---|
| CNNAll-150 (ours) | 96.8 | >99.9 | 97.3 | 96.2 |
| LSTMAll (ours) | 91.8 | >99.9 | 88.2 | 95.5 |
| k-NN | 83.1 | 99.5 | 80.9 | 85.4 |

Table 4. Classification performance, novel species. Top: paired reads (see Table 1). BLAST yields predictions for only 64.3% of the pairs. Bottom: whole available genomes or contigs – negative class is the majority class (see Table 2). BAcc. – balanced accuracy (equal to accuracy for the balanced paired-read dataset), Rec. – recall, Spec. – specificity. BLAST (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. BLAST (genome) uses contig-wise majority vote. BLAST (contigs) represents performance on individual contigs treated as separate entities. Note that low precision is heavily affected by class imbalance.

| Input | Model | BAcc. | Prec. | Rec. | Spec. |
|---|---|---|---|---|---|
| Paired reads | CNNSP-All (ours) | 77.3 | 86.2 | 65.0 | 89.6 |
| Paired reads | BLAST | 47.1 | 94.1 | 17.8 | 76.4 |
| Genomes or contigs | CNNSP-All (ours) | 76.8 | 34.2 | 69.8 | 83.8 |
| Genomes or contigs | BLAST (reads) | 61.8 | 46.8 | 30.2 | 93.5 |
| Genomes or contigs | BLAST (genome) | 64.0 | 44.9 | 36.5 | 91.5 |
| Genomes or contigs | BLAST (contigs) | 57.9 | 37.9 | 33.6 | 82.1 |

Filter visualization

Over 84% of all contributing first-layer filters in CNNAll have positive average contribution scores. We comment more on this fact in Section Nucleotide contribution logos. For CNNAll, the average information content of our motifs is strongly correlated nucleotide-wise with IC of DeepBind-like logos (Spearman’s ρ>0.95, p<10^-15 for all contributing filter pairs except one). The difference in average IC is negligible (0.04 bit higher for "max-contrib", Wilcoxon test, p<10^-15). Therefore, our contribution logos represent analogous "motifs", while extracting additional, nucleotide-level interpretations.

For exactly one filter, "max-contrib" and "max-activation" scores are not correlated. A deeper analysis reveals that this particular filter is activated by stretches of 0s (Ns) – it is the only filter with a positive bias, and almost all of its weights are negative (with one near-zero positive). Therefore, an overwhelming majority of its maximum activations are in fact padding artifacts. On the other hand, regions of unambiguous nucleotide sequences result in high positive contributions, since they correspond to a lack of filter activation, where an activation is present for the all-N reference. In fact, for over 99.9% of the reads, positive contributions occur at every single position. We suspect that the filter works as an "ambiguity detector". Since Ns are modelled as all-zero vectors in the one-hot encoding scheme used here, the network represents "meaningful" (i.e. unambiguous) regions of the input as a missing activation of the filter. This is supported by the fact that the filter lacks any further preference for the specific non-zero nucleotide type. Since sequence logos presented here ignore ambiguous (i.e. noninformative) nucleotides, their ICs for this filter are near-zero, preventing meaningful visualization. On the other hand, this ambiguity seems to play a role in the final classification decision, as contribution distributions are well-separated for both classes (Fig. S2). We speculate that this could be caused by lower quality of the non-pathogen reference genomes, but understanding how exactly this information is used would require further investigation, including feature interactions at all layers of the network. Importantly, only the contribution analysis reveals the relevance of the filter beyond simple activation and nucleotide overrepresentation. The choice of the reference input is crucial.

In Fig. 1 we present example filters, visualized as "max-contrib" sequence logos based on mean partial Shapley values for each nucleotide at each position. All nucleotides of the filters with the second-highest (Fig. 1a) and the lowest (Fig. 1b) score have relatively strong contributions in accordance with the filters’ own contributions. However, we observe that some nucleotides consistently appear in the activating subsequences, but the sign of their contributions is opposite to the filter’s (low-IC nucleotides of a different color, Fig. 1c).
Those "counter-contributions" may arise if a nucleotide with a negative weight forms a frequent motif with others with positive weights strong enough to activate the filter. We comment on this fact in the Section Nucleotide contribution logos. Some filters seem to learn gapped motifs resembling a codon structure (Fig. 1c). We extracted this filter from the original DeePaC network predicting bacterial pathogenicity (24) where the counter-contributions are common, but we find similar filters in our networks as well (Fig. S3). We scanned a genome of S. aureus subsp. aureus 21200 (RefSeq assembly accession: GCF_000221825.1) with this filter and discovered that the learned motif is indeed significantly enriched in coding sequences (Fisher exact test with Benjamini-Hochberg correction, q<10^-15). It is also enriched in a number of specific genes. The one with the most hits (sraP, q<10^-15) is a serine-rich adhesin involved in the pathogenesis of infective endocarditis and mediating binding to human platelets (66). The filter seems to detect serine and glycine repeats in this particular gene (Fig. S5), but a broader, cross-species, multi-gene analysis would be required to fully understand its activation patterns. An analogous analysis revealed that the second-highest contributing filter (Fig. 1a) is overall enriched in coding sequences in both Taï Forest ebolavirus (q<10^-15, RefSeq accession: NC_014372) and SARS-CoV-2 coronavirus (q=5.6×10^-5, RefSeq accession: NC_045512.2). The top hits are the nucleocapsid (N) protein gene of SARS-CoV-2 and the VP35 ebolavirus gene encoding a polymerase cofactor suppressing innate immune signaling (q<10^-15).

Genome-wide phenotype analysis

We created a GWPA plot for the Taï Forest ebolavirus genome. Most genes (6 out of 7) can be detected with visual inspection by finding peaks of elevated infectious potential score predicted by at least one of the models (Fig. 2a). Intergenic regions are characterized by lower mean scores.
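The GWPA procedure behind such plots (tiling a genome into overlapping pseudo-reads, scoring each one and averaging the scores per position) can be sketched as follows. This is a simplified illustration, not the DeePaC implementation: `predict` stands for any read-level classifier returning a score in [0, 1], the toy predictor is a made-up placeholder, and centering by subtracting 0.5 is an assumption about how scores are mapped to the [−0.5, 0.5] interval.

```python
def gwpa_scores(genome, predict, read_length=250, stride=1):
    """Average pseudo-read scores at every genome position, centered around zero."""
    n = len(genome)
    totals = [0.0] * n
    counts = [0] * n
    # Slide a window over the genome; a genome shorter than one read yields one window.
    for start in range(0, max(n - read_length, 0) + 1, stride):
        read = genome[start:start + read_length]
        score = predict(read)
        for pos in range(start, start + len(read)):
            totals[pos] += score
            counts[pos] += 1
    # Positions never covered (possible near the end when stride > 1) are left as None.
    return [t / c - 0.5 if c else None for t, c in zip(totals, counts)]


def toy_predict(read):
    """Hypothetical stand-in for a trained model: flags reads containing 'GG'."""
    return 1.0 if "GG" in read else 0.0


profile = gwpa_scores("AAAAGGAAAA", toy_predict, read_length=4, stride=1)
print(profile)  # scores peak around the 'GG' dinucleotide
```

Positions near the genome ends are covered by fewer pseudo-reads, so their averages rest on less evidence than those in the middle of the sequence.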
Noticeably, most nucleotide contributions are positive, and low non-negative contributions coincide with regions of negative predictions. Taken together with the surprisingly good generalization of Chordata-trained classifiers and a dominance of positive filters discussed above, this suggests that our networks work as positive class detectors, treating all other sequences as “negative” by default. Indeed, the reference sequence of all Ns is predicted to be "non-pathogenic" with a score of 0. We ran a similar analysis of S. aureus using the built-in DeePaC models (24) and our interpretation workflow. While a viral genome usually contains only a handful of genes, by compiling a ranking of 870 annotated genes of the analyzed S. aureus strain we could test if the high-ranking regions are indeed associated with pathogenicity (Table S4). Indeed, out of three top-ranking genes with known biological names and Gene Ontology terms, sarR and sspB are directly engaged in virulence, while hupB regulates expression of virulence-involved genes in many pathogens (67). In contrast to the viral models, both negative and positive contributions are present (Fig. S6), and the model’s output for the all-N reference is slightly above the decision threshold (0.58). Even though the viral and the bacterial models share the same network architecture, the latter learns a "two-sided" view of the data. We assume this must be a feature of the dataset itself. Fig. 2b presents a GWPA plot for the whole genome of the SARS-CoV-2 coronavirus, successfully predicted to infect humans, even though the data was collected at least 5 months before its emergence. Interestingly, its mean infectious potential (0.57 as scored by CNNAll) is relatively close to the decision threshold, while its closest known relative, a bat-infecting SARSr-CoV RaTG13, is actually falsely classified as a human virus with a slightly lower mean infectious potential (0.55).
What is more, the gene encoding the spike protein, which plays a significant role in host entry (68), has a mean score slightly above the threshold for SARS-CoV-2 (0.52) and below the threshold for RaTG13 (0.49). As shown in the GWPA plots of both viruses (Fig. 2b and Fig. S4), regions that the network has learned to associate with the infectious phenotype are distributed non-uniformly and tend to cluster together. This suggests that the low-confidence mean prediction for those viruses is not a result of random guessing, but of genuine ambiguity present in the data – and the misclassification of RaTG13 could be indicative of a general zoonotic potential of SARS-related coronaviruses. In Fig. 2b, we highlighted the score peaks aligning with the spike protein gene (S), as well as the E and N genes, which were scored the highest (apart from an unconfirmed ORF10 of just 38 aa downstream of N) by the CNN and the LSTM, respectively. Correlation between the CNN and LSTM outputs is significant, but species-dependent and moderate (0.28 for Ebola, 0.48 for SARS-CoV-2), which suggests they capture complementary signals. Fig. 2c shows the nucleotide-level contributions in a small peak within the receptor-binding domain (RBD) of the S protein, crucial for recognizing the host cell. The domain location was predicted with CD-search (69) using the default parameters. The maximum score of this peak is noticeably higher for SARS-CoV-2 (0.87) than for its analog in RaTG13 (0.67). Fig. 3 presents the RBD in the structural context of the whole S protein (PDB ID: 6VSB, (70)), as well as in complex with a SARS-neutralizing antibody CR3022 (PDB ID: 6W41, (71)). The high score peak roughly corresponds to one of the regions associated with reduced expression of the RBD (72), located in the core-RBD subdomain. It covers over 71% of the CR3022 epitope, as well as the neighbouring site of the N343 glycan. The latter is present in the epitope of another core-RBD targeting antibody, S309 (73). All the per-residue average contributions in the region are positive (Fig. S7), even in the regions of lower pathogenicity score, in accordance with the results presented in Fig. 2c.

Figure 1. Nucleotide contribution logos of example filters. 1a: Second-highest mean contribution score (CNNAll). Error bars correspond to Bayesian 95% confidence intervals. 1b: Lowest mean contribution score (CNNAll). 1c: Gaps resembling a codon structure, extracted from Bartoszewicz et al. (24). Consensus sequence: CAWCNNCNNCNNCNN. 1d-1f: Analogous logos created with the DeepBind-like "max-activation" approach. Our "max-contrib" logos visualize contributions of individual nucleotides, including counter-contributions.

DISCUSSION

Accurate predictions from short DNA reads

Compared to the previous state-of-the-art in viral host prediction directly from next-generation sequencing reads (20), our models drastically reduce the error rates. This holds also for novel viruses not present in the training set. Generalization of virus-level Chordata models to other host groups is a sign of a strong, “human” signal. We suspect our classifiers detect the positive class treating all other regions of the sequence space as “negative” by default, exhibiting traits of a one-class classifier even without being explicitly trained to do so.
We find further support for this hypothesis: the networks learn many more “positive” than “negative” filters and regions of near-zero nucleotide contributions (including the null reference sample) result in negative predictions. As this effect does not occur for bacteria, we expect it to be task- and data-dependent. While we ignore the simulated quality information here, investigating the role of sequencing noise will be an interesting follow-up study. Although the data setup is crucial in general, the modelling step is also important, as shown by our comparison to the baseline k-NN model. The RC-nets are relatively simple, but they are invariant to reverse-complementarity and perform better than random forests, naïve Bayes classifiers and standard NN architectures in another NGS task (24). In the paired read scenario, the previously described k-NN approach fails, and standard, alignment-based homology testing algorithms cannot find any matches in more than 10% of the cases, resulting in relatively low accuracy. On a real human virome sample, where a main source of negative class reads is most likely contamination (55), our method filters out non-human viruses with high specificity. In this scenario, the BLAST-derived ground-truth labels were mined using the complete database (as opposed to just a training set). In all cases, our results are only as good as the training data used; high quality labels and sequences are needed to develop trustworthy models. Ideally, sources of error should be investigated with an in-depth analysis of a model’s performance on multiple genomes covering a wide selection of taxonomic units. This is especially important as the method assumes no mechanistic link between an input sequence and the phenotype of interest, and the input sequence constitutes only a small fraction of the target genome without a wider biological context. Still, it is possible to predict a label even from those small, local fragments.
A similar effect was also observed for image classification with CNNs (74). Virulence arises as a complex interplay between the host and the virus, so the predictions reflect only an estimated potential of the infectious phenotype. This mirrors the caveats of bacterial pathogenic potential prediction (24), including the considerations of balancing computational cost, reliability of error estimates, size and composition of the reference database. Even though deep learning outperforms the standard homology-based methods, it is still an open question whether it captures "functional" signals, or just a more flexible sequence similarity function. By the very nature of machine learning and sequence comparison in general, we expect similar viruses to yield similar predictions; in principle this could be used to assess a risk of a host-switching event. The interpretability suite presented here aims at shedding some light on this question, but more research is needed.

Figure 2. Taï Forest ebolavirus and SARS-CoV-2 coronavirus genomes. Top: score predicted by LSTMAll. Middle: score predicted by CNNAll. Heatmap: nucleotide contributions of CNNAll. Bottom, in blue: reference sequence. 2a: Taï Forest ebolavirus. Genes that can be detected by at least one model are highlighted in black. 2b: Whole genome and sequences encoding the spike protein (S), envelope protein (E) and nucleocapsid protein (N). 2c: Spike protein gene, a small peak (positions 22,595-22,669, dashed line in Fig. 2b) within the receptor-binding domain (predicted by CD-search, positions 22,517-23,185). Binding to the receptor is crucial for entry to the host cell. Local host adaptation could help switch hosts between the animal reservoir and humans.

Dual-use research and biosecurity

While we focused on the NGS-based prediction scenario, our models could in principle be used to screen DNA synthesis orders for potentially dangerous sequences in the context of cyberbiosecurity in synthetic biology. Since standard, homology-based approaches like BLAST are not enough to guarantee accurate screening at a reasonable cost (75, 76, 77), machine learning methods are a promising solution. This has been suggested before for the bacterial DeePaC models (24), and is applicable to the viral networks presented here as well. However, this line of research can raise questions about possible dual-use. O’Brien and Nelson (78) suggested that while the intended purpose of pathogenicity potential prediction is to mitigate biosecurity threats, it could actually enable designing new pathogens to cause maximal harm. The importance of this concern is difficult to overstate and it must be addressed. If an ML-guided, genome-wide phenotype optimization tool existed, it would indeed be a classical dual-use technology not unlike more established computer-aided design approaches for synthetic biology – potentially dangerous, but offering tremendous benefits (e.g. in agriculture, medicine or manufacturing) as well. However, the models presented here do not allow biologically sensible optimization of target sequences. For example, we find meaningless, low-complexity sequences of mononucleotide repeats corresponding to global maxima (infectious potential of 1.0).
These artifacts highlight the fact that only some generally undefined regions of the theoretically possible sequence space are biologically relevant. What is more, we operate on short sequences constituting minuscule fractions of the whole genome with all its complexity. Although successful deep learning approaches for both protein (79, 80, 81) and regulatory sequence design (82, 83, 84, 85) do exist, moving from read-based classification to genome-wide phenotype optimization would require considerable research effort, if possible at all. This would entail capturing a wealth of biological contexts well beyond the capabilities of even the best classification models currently available.

Figure 3. Predicted infectious potentials plotted over the SARS-CoV-2 spike glycoprotein receptor-binding domain. 3a-3c: Top and side view of the spike protein. Three receptor-binding domains (RBDs) are colored in blue, white and red according to the predicted infectious potential of the corresponding genomic sequence. One of the domains is in the "up" conformation. Red regions corresponding to the peak in Fig. 2c are located in the core-RBD subdomain. 3d: RBD in complex with a SARS-neutralizing antibody CR3022 (green). The red region covers over 71% of the CR3022 epitope, but spans also to the neighbouring fragments, including the site of the N343 glycan (carbohydrate in red stick representation). This is a part of the epitope of another neutralizing antibody, S309. 3e: Cartoon representation of Fig. 3d. The red region is centered on two exposed α-helices surrounding the core β-sheet (lower score, white).

Nucleotide contribution logos

Visualizing convolutional filters may help to identify more complex filter structures and disentangle the contributions of individual nucleotides from their "conservation" in contributing sequences. Counter-contributions suggest that the information content and the contribution of a nucleotide are not necessarily correlated. Visualizing learned motifs by aligning the activating sequences (25) would not fully describe how the filter reacts to presented data. It seems that the assumption of nucleotide independence – which is crucial for treating DeepLIFT as a method of estimating Shapley values for input nucleotides (45) – does not hold in full. Indeed, k-mer distribution profiles are frequently used features for modelling DNA sequences (as shown also by the dimer-shuffling method of generating reference sequences proposed by Shrikumar et al. (43)). However, DeepLIFT’s multiple successful applications in genomics indicate that the assumption probably holds approximately. We see information content and DeepLIFT’s contribution values as two complementary channels that can be jointly visualized for better interpretability and explainability of CNNs in genomics. Filter enrichment analysis enables even deeper insight into the inner workings of the networks. We generate activation data for hundreds to thousands of species, genes and filters. Yet, aggregation and interpretation of those results beyond case studies is non-trivial, and a promising avenue for further research.
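The enrichment tests used in the filter analysis (Fisher exact test with Benjamini-Hochberg correction) can be approximated with the standard library alone. The sketch below assumes a one-sided test on a 2×2 table (filter hits inside versus outside a region of interest) followed by Benjamini-Hochberg adjustment. The sidedness, the function names and the toy counts are assumptions made for illustration, and exact binomial coefficients may become slow for genome-scale counts.

```python
from math import comb


def fisher_enrichment_p(hits_inside, size_inside, hits_outside, size_outside):
    """One-sided Fisher exact test: P(at least hits_inside hits fall inside the region)."""
    total = size_inside + size_outside
    hits = hits_inside + hits_outside
    max_inside = min(hits, size_inside)
    # Hypergeometric tail: sum over all tables at least as extreme as the observed one.
    tail = sum(comb(hits, k) * comb(total - hits, size_inside - k)
               for k in range(hits_inside, max_inside + 1))
    return tail / comb(total, size_inside)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Toy counts: 3 of 4 windows inside a gene are hit, 1 of 4 windows outside.
p_gene = fisher_enrichment_p(3, 4, 1, 4)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_gene, q_values)
```

In a real analysis, one p-value would be computed per gene (or per filter and gene pair) before the adjustment is applied across all tests.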
Genome-scale interpretability

Mapping predictions back to a target genome can be used both as a way of investigating a given model’s performance and as a method of genome analysis. GWPA plots of well-annotated genomes highlight the sequences with erroneous and correct phenotype predictions at both genome and gene level, and nucleotide-resolution contribution maps help track those regions down to individual amino acids. On the other hand, once a trusted model is developed, it can be used on newly emerging pathogens, such as the SARS-CoV-2 virus briefly analyzed in this work. Therefore, we see GWPA applications in both probing the behaviour of artificial neural networks in pathogen genomics and finding regions of interest in weakly annotated genomes. What is more, the approach could be easily co-opted for genome-wide activation analyses of any arbitrary, intermediate neuron. The methods presented here may also be applied to other biological problems, and extending them to other hosts and pathogen groups, multi-class classification or gene identification is possible. However, experimental work and traditional sequence analysis are required to truly understand the biology behind host adaptation and distinguish true hits from false positives.

Conclusion

We presented a new approach for predicting the host of a novel virus based on a single DNA read or a read pair, cutting the error rates in half compared to the previous state of the art.
For convolutional filters, we jointly visualize nucleotide contributions and information content. Finally, we use GWPA plots to gain insights into the models’ behaviour and analyze the recently emerged SARS-CoV-2 virus. The approach presented here is implemented as a Python package (see Data availability) and a command line tool easily installable with Bioconda (86).

DATA AVAILABILITY

The datasets of simulated reads with associated metadata are hosted at https://doi.org/10.5281/zenodo.4312525. The tool can be installed with Bioconda (conda install deepacvir, requires setting up Bioconda), Docker (docker pull dacshpi/deepac) or pip (pip install deepacvir). Detailed installation instructions, user guide and the main codebase (including the interpretability workflows presented here) are available at https://gitlab.com/dacs-hpi/DeePaC. Source code of the plugin shipping the trained models, config files describing the architectures used and the models themselves are available at https://gitlab.com/dacs-hpi/DeePaC-vir.

ACKNOWLEDGEMENTS

We gratefully acknowledge Yong-Zhen Zhang and the scientists at the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, who shared the sequence of the SARS-CoV-2 virus ahead of publication. We thank Melania Nowicka (Max Planck Institute for Molecular Genetics) for inspiring discussions on efficient calculations of partial Shapley values, Vitor C. Piro (Hasso Plattner Institute) for discussions on traversing taxonomy graphs, Lothar H. Wieler (Robert Koch Institute) for useful comments on the first draft of the manuscript and the Anonymous Reviewers for their suggestions and feedback.

FUNDING This work was supported by the German Academic Scholarship Foundation (JMB), the BMBF Computational Life Sciences initiative (project DeePath, to BYR) and the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B). REFERENCES 1. Calvignac-Spencer, S., Schulze, J. M., Zickmann, F., and Renard, B. Y. (2014) Clock rooting further demonstrates that Guinea 2014 EBOV is a member of the Zaïre lineage. PLoS currents, 6. 2. Vouga, M. and Greub, G. (January, 2016) Emerging bacterial pathogens: the past and beyond. Clinical Microbiology and Infection, 22(1), 12–21. 3. Trappe, K., Marschall, T., and Renard, B. Y. (September, 2016) Detecting horizontal gene transfer by mapping sequencing reads across species boundaries. Bioinformatics, 32(17), i595–i604. 4. Leendertz, S. A. J., Gogarten, J. F., Düx, A., Calvignac-Spencer, S., and Leendertz, F. H. (Mar, 2016) Assessing the Evidence Supporting Fruit Bats as the Primary Reservoirs for Ebola Viruses. EcoHealth, 13(1), 18– 25. 5. Lecuit, M. and Eloit, M. (2014) The diagnosis of infectious diseases by whole genome next generation sequencing: a new era is opening. Frontiers in Cellular and Infection Microbiology, 4, 25. 6. Calistri, A. and Palù, G. (2015) Editorial commentary: Unbiased next-generation sequencing and new pathogen discovery: undeniable advantages and still-existing drawbacks. Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America, 60(6), 889–891. 7. Andrusch, A., Dabrowski, P. W., Klenner, J., Tausch, S. H., Kohl, C., Osman, A. A., Renard, B. Y., and Nitsche, A. (2018) PAIPline: pathogen identification in metagenomic and clinical next generation sequencing samples. Bioinformatics, 34(17), i715–i721. 8. Herfst, S., Schrauwen, E. J. A., Linster, M., Chutinimitkul, S., Wit, E. d., Munster, V. J., Sorrell, E. M., Bestebroer, T. M., Burke, D. 
F., Smith, D. J., Rimmelzwaan, G. F., Osterhaus, A. D. M. E., and Fouchier, R. A. M. (June, 2012) Airborne Transmission of Influenza A/H5N1 Virus Between Ferrets. Science, 336(6088), 1534–1541. 9. Imai, M., Watanabe, T., Hatta, M., Das, S. C., Ozawa, M., Shinya, K., Zhong, G., Hanson, A., Katsura, H., Watanabe, S., Li, C., Kawakami, E., Yamada, S., Kiso, M., Suzuki, Y., Maher, E. A., Neumann, G., and Kawaoka, Y. (June, 2012) Experimental adaptation of an influenza H5 HA confers respiratory droplet transmission to a reassortant H5 HA/H1N1 virus in ferrets. Nature, 486(7403), 420–428. 10. Lipsitch, M. and Inglesby, T. V. (December, 2014) Moratorium on Research Intended To Create Novel Potential Pandemic Pathogens. mBio, 5(6). 11. Noyce, R. S., Lederman, S., and Evans, D. H. (January, 2018) Construction of an infectious horsepox virus vaccine from chemically synthesized DNA fragments. PLOS ONE, 13(1), e0188453. 12. Thiel, V. (2018) Synthetic viruses-Anything new?. PLoS pathogens, 14(10), e1007019. 13. Edwards, R. A., McNair, K., Faust, K., Raes, J., and Dutilh, B. E. (2016) Computational approaches to predict bacteriophage-host relationships. FEMS microbiology reviews, 40(2), 258–272. 14. Eng, C. L., Tong, J. C., and Tan, T. W. (2014) Predicting host tropism of influenza A virus proteins using random forest. BMC Medical Genomics, 7(3), S1. 15. Xu, B., Tan, Z., Li, K., Jiang, T., and Peng, Y. (July, 2017) Predicting the host of influenza viruses based on the word vector. PeerJ, 5, e3579. 16. Li, H. and Sun, F. (2018) Comparative studies of alignment, alignment- free and SVM based approaches for predicting the hosts of viruses based on viral sequences. Scientific Reports, 8(1), 10032. 17. Mock, F., Viehweger, A., Barth, E., and Marz, M. (08, 2020) VIDHOP, viral host prediction with Deep Learning. Bioinformatics, btaa705. 18. Gałan, W., Bąk, M., and Jakubowska, M. (2019) Host Taxon Predictor - A Tool for Predicting Taxon of the Host of a Newly Discovered Virus. 
Scientific Reports, 9(1), 3436. 19. Babayan, S. A., Orton, R. J., and Streicker, D. G. (November, 2018) Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science, 362(6414), 577–580. 20. Zhang, Z., Cai, Z., Tan, Z., Lu, C., Jiang, T., Zhang, G., and Peng, Y. (2019) Rapid identification of human-infecting viruses. Transboundary and Emerging Diseases, 66(6), 2517–2522. 21. Poplin, R., Chang, P.-C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P. T., Gross, S. S., Dorfman, L., McLean, C. Y., and DePristo, M. A. (2018) A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10), 983–987. 22. Rizzo, R., Fiannaca, A., La Rosa, M., and Urso, A. (June, 2016) Classification Experiments of DNA Sequences by Using a Deep Neural Network and Chaos Game Representation. In Proceedings of the 17th International Conference on Computer Systems and Technologies 2016 New York, NY, USA: Association for Computing Machinery CompSysTech ’16 pp. 222–228. 23. Löchel, H. F., Eger, D., Sperlea, T., and Heider, D. (January, 2020) Deep learning on chaos game representation for proteins. Bioinformatics, 36(1), 272–279. 24. Bartoszewicz, J. M., Seidel, A., Rentzsch, R., and Renard, B. Y.
(07, 2019) DeePaC: predicting pathogenic potential of novel DNA with reverse-complement neural networks. Bioinformatics, 36(1), 81–89. 25. Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8), 831–838. 26. Zhou, J. and Troyanskaya, O. G. (2015) Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10), 931–934. 27. Zeng, H., Edwards, M. D., Liu, G., and Gifford, D. K. (2016) Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics, 32(12), i121–i127. 28. Quang, D. and Xie, X. (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Research, 44(11), e107–e107. 29. Kelley, D. R., Snoek, J., and Rinn, J. L. (2016) Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7), 990–999. 30. Greenside, P., Shimko, T., Fordyce, P., and Kundaje, A. (2018) Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics, 34(17), i629–i637. 31. Nair, S., Kim, D. S., Perricone, J., and Kundaje, A. (July, 2019) Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics, 35(14), i108–i116. 32. Avsec, Ž., Weilert, M., Shrikumar, A., Alexandari, A., Krueger, S., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., and Zeitlinger, J. (August, 2019) Deep learning at base-resolution reveals motif syntax of the cis-regulatory code. bioRxiv, p. 737981. 33. Mock, F., Viehweger, A., Barth, E., and Marz, M. (2019) Viral host prediction with Deep Learning. bioRxiv, p. 575571. 34. Ren, J., Song, K., Deng, C., Ahlgren, N. A., Fuhrman, J. A., Li, Y., Xie, X., and Sun, F. 
(June, 2018) Identifying viruses from metagenomic data by deep learning. arXiv:1806.07810 [q-bio], arXiv: 1806.07810. 35. Tampuu, A., Bzhalava, Z., Dillner, J., and Vicente, R. (September, 2019) ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples. PLOS ONE, 14(9), e0222271. 36. Eraslan, G., Avsec, Ž., Gagneur, J., and Theis, F. J. (July, 2019) Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20(7), 389–403. 37. Schneider, T. D. and Stephens, R. M. (October, 1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18(20), 6097–6100. 38. Crooks, G. E., Hon, G., Chandonia, J.-M., and Brenner, S. E. (June, 2004) WebLogo: a sequence logo generator. Genome Research, 14(6), 1188– 1190. 39. Lanchantin, J., Singh, R., Lin, Z., and Qi, Y. (2016) Deep Motif: Visualizing Genomic Sequence Classifications. CoRR, abs/1605.01133. 40. Lanchantin, J., Singh, R., Wang, B., and Qi, Y. (2017) Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, 22, 254–265. 41. Sundararajan, M., Taly, A., and Yan, Q. (2016) Gradients of Counterfactuals. CoRR, abs/1611.02639. 42. Jha, A., Aicher, J. K., Singh, D., and Barash, Y. (2019) Improving interpretability of deep learning models: splicing codes as a case study. bioRxiv,. 43. Shrikumar, A., Greenside, P., and Kundaje, A. (August, 2017) Learning Important Features Through Propagating Activation Differences. In Precup, D. and Teh, Y. W., (eds.), Proceedings of the 34th International Conference on Machine Learning, International Convention Centre, Sydney, Australia: PMLR Vol. 70 of Proceedings of Machine Learning Research, pp. 3145–3153. 44. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. 
(July, 2015) On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE, 10(7), e0130140. 45. Lundberg, S. M. and Lee, S.-I. (2017) A Unified Approach to Interpreting Model Predictions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., (eds.), Advances in Neural Information Processing Systems 30, pp. 4765–4774 Curran Associates, Inc. 46. Shrikumar, A., Tian, K., Shcherbina, A., Avsec, Ž., Banerjee, A., Sharmin, M., Nair, S., and Kundaje, A. (March, 2019) TF-MoDISco v0.4.2.2-alpha: Technical Note. arXiv:1811.00416 [cs, q-bio, stat], arXiv: 1811.00416. 47. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. 48. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T. L. (December, 2009) BLAST+: architecture and applications. BMC Bioinformatics, 10(1), 421. 49. Wu, F., Zhao, S., Yu, B., Chen, Y.-M., Wang, W., Hu, Y., Song, Z.- G., Tao, Z.-W., Tian, J.-H., Pei, Y.-Y., Yuan, M.-L., Zhang, Y.-L., Dai, F.-H., Liu, Y., Wang, Q.-M., Zheng, J.-J., Xu, L., Holmes, E. C., and Zhang, Y.-Z. (January, 2020) Complete genome characterisation of a novel coronavirus associated with severe human respiratory disease in Wuhan, China. bioRxiv, p. 2020.01.24.919183. 50. Mihara, T., Nishimura, Y., Shimizu, Y., Nishiyama, H., Yoshikawa, G., Uehara, H., Hingamp, P., Goto, S., and Ogata, H. (2016) Linking Virus Genomes with Host Taxonomy. Viruses, 8(3), 66. 51. King, A. M. Q., Adams, M. J., Carstens, E. B., and Lefkowitz, E. J., (eds.) (2012) Virus Taxonomy: Ninth Report of the International Committee on Taxonomy of Viruses, Academic Press, London; Waltham. 52. Lefkowitz, E. J., Dempsey, D. M., Hendrickson, R. C., Orton, R. J., Siddell, S. G., and Smith, D. B. 
(January, 2018) Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Research, 46(D1), D708–D717. 53. Holtgrewe, M. (2010) Mason – A Read Simulator for Second Generation Sequencing Data. Technical Report FU Berlin,. 54. Deneke, C., Rentzsch, R., and Renard, B. Y. (2017) PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data. Scientific Reports, 7, 39194. 55. Moustafa, A., Xie, C., Kirkness, E., Biggs, W., Wong, E., Turpaz, Y., Bloom, K., Delwart, E., Nelson, K. E., Venter, J. C., and Telenti, A. (March, 2017) The blood DNA virome in 8,000 humans. PLOS Pathogens, 13(3), e1006292. 56. Gorbalenya, A. E., Baker, S. C., Baric, R. S., de Groot, R. J., Drosten, C., Gulyaeva, A. A., Haagmans, B. L., Lauber, C., Leontovich, A. M., Neuman, B. W., Penzar, D., Perlman, S., Poon, L. L. M., Samborskiy, D. V., Sidorov, I. A., Sola, I., Ziebuhr, J., and Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (April, 2020) The species Severe acute respiratory syndrome-related coronavirus : classifying 2019-nCoV and naming it SARS-CoV-2. Nature Microbiology, 5(4), 536–544. 57. Simmonds, P. and Aiewsakun, P. (August, 2018) Virus classification – where do you draw the line?. Archives of Virology, 163(8), 2037–2046. 58. Van Regenmortel, M. H. V. (January, 2018) Chapter One - The Species Problem in Virology. In Kielian, M., Mettenleiter, T. C., and Roossinck, M. J., (eds.), Advances in Virus Research, Vol. 100, pp. 1–18 Academic Press. 59. Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. 60. Langmead, B. and Salzberg, S. L. (2012-03) Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357–359. 61. Wood, D. E. and Salzberg, S. L. (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), R46. 62. Nix, R. and Kantarciouglu, M. 
(July, 2012) Incentive Compatible Privacy-Preserving Distributed Classification. IEEE Transactions on Dependable and Secure Computing, 9(4), 451–462. 63. Matejczyk, S. and Michalak, T. (2015) Solving Influence Maximization Problem Using Methods from Cooperative Game Theory., Instytut Podstaw Informatyki PAN, Publication Title: k 20533. 64. Thorvaldsdóttir, H., Robinson, J. T., and Mesirov, J. P. (March, 2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2), 178–192. 65. DeLano, W. L. and others (2002) Pymol: An open-source molecular graphics tool. CCP4 Newsletter on protein crystallography, 40(1), 82–92. 66. Yang, Y.-H., Jiang, Y.-L., Zhang, J., Wang, L., Bai, X.-H., Zhang, S.-J., Ren, Y.-M., Li, N., Zhang, Y.-H., Zhang, Z., Gong, Q., Mei, Y., Xue, T., Zhang, J.-R., Chen, Y., and Zhou, C.-Z. (June, 2014) Structural Insights into SraP-Mediated Staphylococcus aureus Adhesion to Host Cells. PLOS Pathogens, 10(6), e1004169. 67. Stojkova, P., Spidlova, P., and Stulik, J. (2019) Nucleoid-Associated Protein HU: A Lilliputian in Gene Regulation of Bacterial Virulence. Frontiers in Cellular and Infection Microbiology, 9, 159. 68. Li, F. (2016) Structure, Function, and Evolution of Coronavirus Spike Proteins. Annual Review of Virology, 3(1), 237–261. 69.
Marchler-Bauer, A., Bo, Y., Han, L., He, J., Lanczycki, C. J., Lu, S., Chitsaz, F., Derbyshire, M. K., Geer, R. C., Gonzales, N. R., Gwadz, M., Hurwitz, D. I., Lu, F., Marchler, G. H., Song, J. S., Thanki, N., Wang, Z., Yamashita, R. A., Zhang, D., Zheng, C., Geer, L. Y., and Bryant, S. H. (2017) CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Research, 45(D1), D200–D203. 70. Wrapp, D., Wang, N., Corbett, K. S., Goldsmith, J. A., Hsieh, C.-L., Abiona, O., Graham, B. S., and McLellan, J. S. (March, 2020) Cryo- EM structure of the 2019-nCoV spike in the prefusion conformation. Science, 367(6483), 1260–1263 Publisher: American Association for the Advancement of Science Section: Report. 71. Yuan, M., Wu, N. C., Zhu, X., Lee, C.-C. D., So, R. T. Y., Lv, H., Mok, C. K. P., and Wilson, I. A. (May, 2020) A highly conserved cryptic epitope in the receptor binding domains of SARS-CoV-2 and SARS- CoV. Science, 368(6491), 630–633 Publisher: American Association for the Advancement of Science Section: Report. 72. Starr, T. N., Greaney, A. J., Hilton, S. K., Crawford, K. H., Navarro, M. J., Bowen, J. E., Tortorici, M. A., Walls, A. C., Veesler, D., and Bloom, J. D. (June, 2020) Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. bioRxiv, p. 2020.06.17.157982 Publisher: Cold Spring Harbor Laboratory Section: New Results. 73. Pinto, D., Park, Y.-J., Beltramello, M., Walls, A. C., Tortorici, M. A., Bianchi, S., Jaconi, S., Culap, K., Zatta, F., De Marco, A., Peter, A., Guarino, B., Spreafico, R., Cameroni, E., Case, J. B., Chen, R. E., Havenar-Daughton, C., Snell, G., Telenti, A., Virgin, H. W., Lanzavecchia, A., Diamond, M. S., Fink, K., Veesler, D., and Corti, D. (May, 2020) Cross-neutralization of SARS-CoV-2 by a human monoclonal SARS-CoV antibody. Nature, pp. 1–10 Publisher: Nature Publishing Group. 74. Brendel, W. and Bethge, M. 
(2019) Approximating CNNs with Bag- of-local-Features models works surprisingly well on ImageNet. In International Conference on Learning Representations. 75. National Research Council (2010) Sequence-Based Classification of Select Agents: A Brighter Line, The National Academies Press, . 76. National Academies of Sciences, Engineering, and Medicine (2018) Biodefense in the Age of Synthetic Biology, The National Academies Press, . 77. Diggans, J. and Leproust, E. (2019) Next Steps for Access to Safe, Secure DNA Synthesis. Frontiers in Bioengineering and Biotechnology, 7. 78. O’Brien, J. T. and Nelson, C. (June, 2020) Assessing the Risks Posed by the Convergence of Artificial Intelligence and Biotechnology. Health Security, 18(3), 219–227. 79. Brookes, D., Park, H., and Listgarten, J. (May, 2019) Conditioning by adaptive sampling for robust design. In International Conference on Machine Learning pp. 773–782. 80. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G. M. (December, 2019) Unified rational protein engineering with sequence- based deep representation learning. Nature Methods, 16(12), 1315–1322. 81. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M., and Church, G. M. (January, 2020) Low-N protein engineering with data-efficient deep learning. bioRxiv, p. 2020.01.23.917682. 82. Gupta, A. and Zou, J. (February, 2019) Feedback GAN for DNA optimizes protein functions. Nature Machine Intelligence, 1(2), 105–111. 83. Gupta, A. and Kundaje, A. (July, 2019) Targeted optimization of regulatory DNA sequences with neural editing architectures. bioRxiv, p. 714402. 84. Linder, J., Bogard, N., Rosenberg, A. B., and Seelig, G. (December, 2019) Deep exploration networks for rapid engineering of functional DNA sequences. bioRxiv, p. 864363. 85. Schreiber, J., Lu, Y. Y., and Noble, W. S. (May, 2020) Ledidi: Designing genomic edits that induce functional activity. bioRxiv, p. 2020.05.21.109686. 86. Grüning, B., Dale, R., Sjödin, A., Chapman, B. 
A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., and Köster, J. (July, 2018) Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7), 475–476.

10_1101-2020_04_17_043323 ---- linus: Conveniently explore, share, and present large-scale biological trajectory data from a web browser

Authors: Johannes Waschke 1,2, Mario Hlawitschka 2, Kerim Anlas 3, Vikas Trivedi 3,4, Ingo Roeder 5,6, Jan Huisken 7, and Nico Scherf 1,6*
1 Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstr. 1a, 04103 Leipzig, Germany
2 Faculty of Computer Science and Media, Leipzig University of Applied Sciences, 04277 Leipzig, Germany
3 EMBL Barcelona, C/ Dr. Aiguader 88, 08003 Barcelona, Spain
4 EMBL Heidelberg, Developmental Biology Unit, 69117 Heidelberg, Germany
5 National Center of Tumor Diseases (NCT), Partner Site Dresden, 01307 Dresden, Germany
6 Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, School of Medicine, TU Dresden, 01307 Dresden, Germany
7 Morgridge Institute for Research, Madison, Wisconsin 53715, USA
* Correspondence to: nscherf@cbs.mpg.de

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.17.043323; this version posted January 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Abstract

In biology, we are often confronted with information-rich, large-scale trajectory data, but exploring and communicating patterns in such data is often a cumbersome task. Ideally, the data should be wrapped with an interactive visualisation in one concise package that makes it straightforward to create and test hypotheses collaboratively. To address these challenges, we have developed a tool, linus, which makes the process of exploring and sharing 3D trajectories as easy as browsing a website. We provide a Python script that reads trajectory data, enriches them with additional features such as edge bundling or custom axes, and generates an interactive web-based visualisation that can be shared offline and online. The goal of linus is to facilitate the collaborative discovery of patterns in complex trajectory data.

Introduction

In biology, we often face large-scale trajectory data from dense spatial pathways, such as the brain connectivity obtained from diffusion MRI (Liu et al., 2020), or tracking data such as cell trajectories or animal trails (Romero-Ferrero et al., 2018).
Although this type of data is becoming increasingly prominent in biomedical research (Kwok, 2019; McDole et al., 2018; Wallingford, 2019), exploring, sharing, and communicating patterns in such data are often cumbersome tasks requiring a set of different software tools that are often complex to install, learn and use. Recently, new tools have become available for efficiently visualising 3D volumetric data (Pietzsch et al., 2015; Royer et al., 2015; Schmid et al., 2019), and some of those allow the user to overlay tracking data to cross-check the quality of the results or to visualise simple predefined features (such as speed or time). However, given the more general-purpose design of such software, these are not ideal solutions for efficiently and collaboratively exploring and sharing the visualisations. An interactive, scriptable, and easily shareable visualisation (Shneiderman 1996) would open up novel ways of communicating and discussing experimental results and findings (Callaway 2016). The analysis of complex and large-scale trajectory data and the creation and testing of hypotheses could then be done collaboratively. Importantly, since such bioinformatics tools would be right at the interface of computational and life sciences, they need to be accessible and usable for scientists with little or no background in programming. Ideally, the data should be bundled with a guided, interactive presentation in one concise visualisation packet that can be passed to a collaborator. To address these challenges, we have developed our visualisation tool linus, making it easier to explore 3D trajectory data from any device without a local installation of specialised software. linus creates interactive visualisation packets that can be explored in a web browser, while keeping data presentation straightforward and shareable, both offline and online (Fig. 1a).
We began to develop this tool when we struggled to find adequate software to explore cell trajectories during zebrafish gastrulation from large-scale fluorescence microscopy datasets (Shah et al., 2019). linus allowed us to interactively visualise and analyse the tracks of around 11,000 cells (starting number) as they moved across the zebrafish embryo over 16 hours. More importantly, it enabled us to share and discuss visualisations with collaborators across disciplines.

Results and Discussion

linus is a Python-based tool that is easy to install and use for scientists at the interface between disciplines.

Our overall goal when developing linus was to create a versatile and lightweight visualisation tool that runs on a wide range of devices. To this end, we based the visualisation part on web technologies. Specifically, we used TypeScript (which compiles to JavaScript) together with WebGL. However, a core component of the visualisation process, the data preparation, requires local file access and fast computations, both of which are limited in JavaScript. For that reason, we also created a Python (> v3.0) script that handles the computationally demanding parts of data processing and automatically generates the web-based visualisation packages.

Creating a visualisation package with linus is done in a few simple steps (Fig. 1a): the user imports trajectory data from a generic, plain CSV format (see Methods) or from a variety of established trajectory formats such as SVF (McDole et al., 2018), TGMM XML (Amat et al., 2014), or the community standard biotracks (Gonzalez-Beltran et al., 2019), which itself supports import from a wide variety of cell tracking tools such as CellProfiler (McQuin et al., 2018) or TrackMate (Tinevez et al., 2016). During the data conversion, linus can enrich the trajectory data with additional attributes or spatial context.
For example, we declutter dense trajectories by highlighting the major “highways” through edge bundling (Fig. 1b). linus can automatically add generic attributes that are useful in a range of applications, such as the local angle of the trajectories or a timestamp. The user can add custom numerical attributes for specific applications by providing these measurements as extra columns in CSV files (see Methods). The data attributes form the basis for advanced rendering effects. If users want to give a spatial context, linus can generate axes automatically, or users can define custom axes.

For more efficient computing, the preprocessing script uses established and optimised packages from Python’s rich ecosystem, like NumPy and (Py)OpenCL. In particular, the edge bundling algorithm runs highly parallel on the graphics card and is thus about 10-100 times faster than a CPU-based calculation (with OpenCL-enabled hardware). However, only the creator of a linus-based visualisation package needs to run this preprocessor script. The target audience requires only a web browser to view and explore the data. The result of the preprocessing is a ready-to-use visualisation package that can be opened in a web browser on any device with WebGL support. The package is a folder containing HTML, JavaScript, and related files.

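As an illustration of how a custom numerical attribute could be prepared before import, the sketch below computes a turning angle for every point of a trajectory and writes it as an extra CSV column. The column names (track_id, t, x, y, z, angle) and the angle definition (the angle between consecutive displacement vectors) are assumptions made for this example; the input layout actually expected by linus, and the way it derives its own local angle attribute, are described in the Methods.

```python
import csv
import io
import math


def turning_angles(points):
    """Angle in radians between consecutive displacement vectors.

    points is a list of (x, y, z) tuples. The first and last point have
    only one neighbouring segment, so they are assigned an angle of 0.0.
    """
    angles = [0.0] * len(points)
    for i in range(1, len(points) - 1):
        a = [q - p for p, q in zip(points[i - 1], points[i])]
        b = [q - p for p, q in zip(points[i], points[i + 1])]
        norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
        if norm > 0:
            cos = sum(u * v for u, v in zip(a, b)) / norm
            # Clamp to [-1, 1] to guard against rounding errors.
            angles[i] = math.acos(max(-1.0, min(1.0, cos)))
    return angles


def to_csv(tracks):
    """Serialise {track_id: [(x, y, z), ...]} to CSV text with an angle column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical header: not necessarily the layout linus expects.
    writer.writerow(["track_id", "t", "x", "y", "z", "angle"])
    for track_id, points in tracks.items():
        for t, (point, angle) in enumerate(zip(points, turning_angles(points))):
            writer.writerow([track_id, t, *point, round(angle, 4)])
    return buffer.getvalue()


# Toy trajectory with two right-angle turns.
tracks = {0: [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0)]}
csv_text = to_csv(tracks)
```

Any other per-point measurement, such as a marker intensity, could be appended as a further column in the same way.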
bioRxiv preprint doi: https://doi.org/10.1101/2020.04.17.043323

Interactive visualisation with configurable filters allows in-depth data exploration for a variety of applications across sciences.

After configuring and creating the visualisation package with the Python toolkit, further adjustments are possible within the web browser. Opening the index.html file starts the visualisation and shows the trajectories with baseline render settings (semi-transparent, single-coloured rendering on a grey background). The browser renders an interactive visualisation of the trajectories and an interface for the user to update and adapt the visualisation to their needs (e.g. colour scales, projections, clipping planes) (Fig. 1b). The user interface itself is adapted to each dataset: for each data attribute, the preprocessing script generates a separate property with the corresponding sliders (filters and colour mapping) in the user interface. If more than one state is available for the dataset (e.g. an edge-bundled copy of the data, or custom projections), the interface automatically offers the functionality to fade between the states (see Methods).
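The fade between states can be pictured as a per-point interpolation between two versions of the same trajectories. The sketch below assumes a plain linear blend controlled by a factor t in [0, 1]; the text does not state which interpolation linus actually uses, and `blend_states` is a hypothetical helper written in Python for illustration only. It also assumes that both states hold the same number of trajectories and points.

```python
def blend_states(state_a, state_b, t):
    """Linearly interpolate between two states of the same trajectories.

    state_a, state_b: lists of trajectories; each trajectory is a list of
    (x, y, z) points. Both states must have identical shapes.
    t: blend factor, 0.0 gives state_a and 1.0 gives state_b.
    """
    t = min(max(t, 0.0), 1.0)  # clamp the slider value to [0, 1]
    if len(state_a) != len(state_b):
        raise ValueError("both states need the same number of trajectories")
    blended = []
    for traj_a, traj_b in zip(state_a, state_b):
        if len(traj_a) != len(traj_b):
            raise ValueError("matching trajectories need the same number of points")
        blended.append([
            tuple((1.0 - t) * a + t * b for a, b in zip(point_a, point_b))
            for point_a, point_b in zip(traj_a, traj_b)
        ])
    return blended


# One trajectory with two points, in an original and a (made-up) bundled state.
original = [[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]]
bundled = [[(0.0, 1.0, 0.0), (2.0, 1.0, 0.0)]]
print(blend_states(original, bundled, 0.5))
```

Moving a slider from 0 to 1 would then correspond to calling the blend with increasing t, so intermediate frames show the tracks drifting from one state to the other.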
The user can carve out patterns from the original “hairball” of lines by setting general visualisation parameters like shading and colour maps (Fig. 2a). To focus on particular parts of the dataset, the user filters the data by attributes such as specific time intervals or user-specified numerical properties such as marker expression in cell tracking (Fig. 2b). Alternatively, the user can select spatial regions of interest (ROIs) either with cutting planes or with progressively refinable selections (Fig. 2c). The visual attributes can then be defined separately for the selected in-focus areas and the (non-selected) context regions (Fig. 2c) to create a focused visualisation. Beyond qualitative visualisation, the selected trajectories can also be downloaded as CSV files for subsequent quantitative analysis (see Methods).

One important problem with large-scale trajectory data is the sheer density of tracks, which often leads to extreme visual clutter. To tackle this problem, one prominent feature of linus is the ability to blend seamlessly between different data transformations. We provide two main sorts of transformations out of the box: the user can smoothly transition between the original and bundled state to focus on major “highways” (Fig. 2d, Fig. 1b), or between the original (3D Cartesian) view and different 2D projections (e.g. a Mercator map) to provide a global, less cluttered perspective on the trajectories (Fig. 2e,f). If other, application-specific transformations are needed, such as a spatial transformation or any form of trajectory clustering, the user can provide such an alternative state during preprocessing and then interactively blend between those states.

However, the choice of a web-based visualisation solution brings some drawbacks.
The amount of data that can be fluently visualised depends on the underlying hardware (smartphones: >2,000 trajectories; notebooks and desktop computers: >10,000 trajectories). Another limitation is the reduced feature set that common web browsers offer for graphics card access: compared to the OpenGL API, the browser-based WebGL API offers fewer shader features. These restrictions lead to some limitations for the rendering process. A drawback of our rendering approach is that it creates artifacts related to the rendering order when we rotate the camera. Thus, we have to order the line fragments offline (i.e. not on the graphics card, but in JavaScript), which is a time-consuming process. To maintain high framerates, we only sort line fragments within a second after a user interaction has finished, leading to artifacts during camera motions (see Methods). Furthermore, we cannot provide correct render order when rendering two datasets in the same view, and thus linus works best when rendering only one dataset at a time.

Data and visualisations are easily shareable with collaborators via interactive visualisation packets.

As a straightforward solution to share the results, the user directly exports the visualisations from the web view as static images and videos (e.g. Supplementary Video 1). But sharing the visualisation of the data can go a step beyond image or video data. The user can conveniently record all these visualisation properties directly in the web interface of linus to create information-rich, interactive tours. The user adjusts these tours on a detailed level using a timeline-based editor (Supplementary Fig. 3). Each action is represented by an icon that can be moved along the time axis to develop a visual storyline. Smooth transitions and precisely timed textual markers facilitate understanding and storytelling.
To communicate and distribute new findings, these tours can easily be shared online or offline with the community (colleagues, readers of a manuscript, audience of a real or virtual presentation). The tours are copied into the source code of the visualisation package or, if they consist of a limited number of actions (see Methods for details), they can be shared by a dynamically created URL or a QR code. Fig. 3 shows examples of visualisations that have been created with linus, ranging from dynamic trajectories in 2D (Fig. 3a) or on surfaces (Fig. 3b) to static (Fig. 3c) or dynamic 3D (Fig. 3d) tracks across applications from ethology, neuroscience, and developmental biology. An interactive version of each example can be found online by scanning the respective QR code in the figure.

We tested linus visualisation packages across various devices and found that performance is the most important aspect of the user experience that varies between devices. Desktop computers with mid-range graphics cards (e.g. the graphics processors that are built into current CPUs) can easily handle more than 10,000 trajectories at smooth framerates. Mid-range smartphones handle the same data at low framerates (ca. 10 fps), which is still usable but does not feel as smooth. For virtual reality applications, we also tested linus on the Oculus Go VR goggles.
Here, a high frame rate is essential, as the user experience would otherwise be quite uncomfortable, and we recommend reducing the number of trajectories further to about 1,000 in this use case. Due to the differences in performance and user experience, we recommend creating dedicated visualisation packages (or tours) for the intended type of output device.

In the future, we would like to support further advanced preprocessing options such as trajectory clustering, more generic transforms, or feature extraction. We would also like to extend the visualisation part of linus so that the user can interactively annotate the data. Here, we envision that the user can easily label subsets of trajectories and then use this information for downstream analysis (such as building a trajectory classifier).

Our experience with linus shows that sharing relatively complex data visualisations in this interactive way makes it much more efficient to collaboratively find patterns in data and to create and discuss figures or videos for presentations and manuscripts. More generally, interactive data sharing is helpful when collaborations, presentations, or teaching occur remotely, as has been common during the current pandemic. At the same time, during an in-person event such as a talk or poster session at a conference, the target audience can explore the data instantly on their computers, tablets, or smartphones. In any case, touch screens or even virtual reality goggles increase the immersion with more natural controls and true 3D rendering, helping to grasp the trajectories’ spatial relation. With these features, we are convinced that approaches like linus will considerably improve how we collectively explore, communicate, and teach the spatio-temporal patterns from information-rich, multi-dimensional, experimental data.
Methods

Our software consists conceptually of two parts: a Python-based preprocessor and a web-based visualisation tool. We aimed to move all static and computationally expensive adjustments to the preprocessor, whereas dynamic adjustments to tweak the visualisations are all performed directly in the web browser later. Running the preprocessor creates a folder containing HTML, CSS, and JavaScript files (called a visualisation packet). These files are opened directly or uploaded to a web server.

Types of input data

We currently support several trajectory file types directly: TGMM (Amat et al., 2014), biotracks (Gonzalez-Beltran et al., 2019), SVF (McDole et al., 2018), and custom CSV. Most formats are designed primarily to store 3D coordinates plus a timestamp, but no other custom data. However, linus supports additional numerical attributes that can then be used to filter or colour the trajectories accordingly. We therefore offer a generic CSV format which can be supplemented with custom numerical data: each CSV file contains the data for a single trajectory; the first three columns represent the coordinates (x, y, z), and any further column is interpreted as another attribute. The columns are delimited by semicolons, and the number of columns must be identical for all CSV files. By default, linus reads the first line of a CSV file as the header and uses this information to automatically name the respective properties in the user interface. The data converter script then expects a folder that exclusively contains CSV files as input.

Implementation of data preprocessing

The trajectory data are then converted to a custom JSON format by our Python-based preprocessor. Python has the advantage of being executable on a wide range of operating systems and hardware. The preprocessor is used with a command-line interface or by calling the respective commands directly.
The command-line interface is easier to use, and it covers the most common cases (e.g. visualising a dataset with custom attributes, and automatically adding an edge-bundled version). For more complex cases, e.g. visualising two datasets at once, or using multiple custom states of the data (e.g. custom projections), users can write their own Python script. We provide detailed and up-to-date documentation in our repository at https://gitlab.com/imb-dev/linus.

Time-consuming operations are implemented using NumPy, and the most demanding process (edge bundling) is handled by an OpenCL script, which increases calculation speed 10- to 100-fold. All trajectories are resampled to equal length during the preprocessing step, enabling us to use NumPy’s fast matrix-based algorithms (we use n × m matrices, storing n trajectories with m points in each trajectory). The resulting JSON file then contains a list of datasets. Each dataset holds a set of trajectories that optionally can be further organised into several states, for example, the original data and a projected version. At this point, all data are organised in the same structure as is required by WebGL (Supplementary Fig. 1), which allows faster loading of the data in the next step.

Implementation of the web-based tool

The visualisation part runs in web languages (HTML, JavaScript, CSS, WebGL).
The JSON file containing the preprocessed data is directly loaded as an object by JavaScript. This part of the software copies the numeric arrays from the JSON file into WebGL's data buffers, such as the position buffer, index buffers, and attribute buffers. If a dataset contains more than one state (e.g. an original state and a projected state), these states are stored in additional attribute buffers. Depending on the provided data, we also adjust the shader source code dynamically. For example, we inject variables and specific statements into the shader source code before it is compiled by WebGL. With the dynamic creation of buffers as well as code statements and variables, we pre-build a shader program that is directly tailored to the properties of the respective data. As a result, the visualisation (e.g. colour mapping or projections) can be changed quickly during rendering without the need to update the datasets on the graphics card, which results in higher frame rates and smoother transitions compared to approaches where data are transformed offline.

In principle, linus supports an arbitrary number of attributes and states. In practice, however, this number is limited by the particular device’s abilities (i.e. its graphics card) and by WebGL in general. Typically, eight attribute arrays are available on smartphones and sixteen or more on desktop computers. Our software requires four such attribute arrays for internal purposes, plus one more array for each state or attribute. Thus, for a dataset containing original data, bundled data, and two custom attributes (that are shared between the states), we would need eight attribute buffers in total, which can still be managed by a smartphone. Visualising additional states or attributes requires devices with more capabilities, like a desktop computer.

The graphical user interface (GUI)

The user interface (see Fig. 1 and Supplementary Fig.
2) consists of a general part that includes options to change the size of the GUI, the background colour, and camera controls. Furthermore, the user can choose how often the render order should be restored (see section "Additional technical limitations"). Additionally, several data-specific settings are shown, and this section is further divided into:
● Filters for each attribute to only show data within a defined range; if window is a positive value, it will be used to automatically display a range [min, min+window] (while max is ignored).
● Render settings, including colour mapping, shading, and transparency, which can be set independently for selected and unselected trajectories.
● Mercator projection plus rotations that are applied to the 3D positions before the 2D transformation, and mapping of the "free" z component to attributes for 2D + feature plots (e.g. space-time trajectories).
● Cutting planes, which can be used to generate a generic 2D projection. Here, the projection plane can be defined by selecting a centre point and a normal direction. Everything above the projection plane is then mapped onto the plane.
● The last part of the GUI offers options to export selected trajectories and also shows a list of available tours. This list is used to start a tour or to load it into the tour editor.

Sharing visualisations and tours

As explained above, the user receives a self-contained package. This package can be opened with any web browser that supports WebGL and can be distributed in multiple ways: it can be shared locally (e.g. sent by email or copied using a USB stick) or made easily accessible to a broad audience by uploading it to a web server (as done, e.g., on our companion website for this manuscript, https://imb-dev.gitlab.io/linus-manuscript/).

The method of sharing the actual visualisation package also influences how an interactive tour can be distributed.
To make tours reproducible, they are internally represented by a textual list of actions. This script can be copied directly into the source code of the file main.html of the visualisation package. This method works both for server-based and for file-based distribution of the package. If the visualisation package is hosted on a web server, the tours can also be shared simply with a custom URL and QR code that encodes a tour’s actions. However, the length of such tours is restricted: QR codes are limited in the amount of information they can store, and URLs are usually limited as well (but typically this limit can be configured in the web server's settings). The commands for camera motion and parameter adjustment (e.g. changing the colour) are concise and only require a few bytes of the URL or QR code. In contrast, textual annotations and especially spatial selections require considerably more space. Thus, sharing a tour by QR code or URL usually works for tours without selections and without extensive text annotations.

Specific considerations for virtual reality devices

The virtual reality mode works only when the visualisation package is hosted on a web server. Further, the way of navigation changes slightly because the head position takes over the task of the camera. For convenience, we introduce the possibility to adjust the height of the dataset and to rotate the data horizontally. Inside the VR environment, no GUI is rendered.
To allow controlling the GUI, the user can switch between "2D mode" and "VR mode" instantly.

Export of trajectories

The user can select trajectories and download this selection. The download may take several minutes as the data must internally be converted into CSV format. The result is a ZIP archive containing one folder for each dataset (usually a single folder), each containing a separate folder for each state of the data (e.g. "original" and "bundled"). Each trajectory is saved as a separate CSV file. It should be noted, however, that the user can only download the resampled trajectories and not trajectories in the raw (temporal or spatial) resolution before the data preprocessing.

Screenshots and videos

At any time, the user can take screenshots and record videos with the respective buttons in the bottom left corner. Video recording requires an up-to-date Chrome-based browser (Chrome version 52 or later; other browsers might support it as well but only with enabled experimental features). The output format is WebM, which is currently the only file type that can be directly saved from WebGL.

Additional technical limitations

To offer the tool on a broader range of platforms, we decided to utilise WebGL 1.0. This web standard provides the feature set of OpenGL ES 2.0 (https://www.khronos.org/webgl/), which is limited compared to regular OpenGL versions. WebGL 1.0 is implemented by a wide range of browsers, such as Chrome 9, Firefox 4.0, Safari 8.0, iOS 8, and Chrome mobile 30 (or newer, respectively).

When rendering a scene containing both trajectories and context, our application must render two different types of geometric primitives (lines and triangles) simultaneously. This can only be performed by two consecutive draw calls: the program first renders all triangles and then renders the line segments.
Since we need to support transparent rendering, we cannot rely on the z-buffer for determining the spatial order of the segments, as this works only for non-transparent geometries (the z-buffer usually tells us whether a segment should be drawn by checking whether a closer segment that would cover the new segment has already been drawn). Thus, we use an alternative to the z-buffer: we sort the geometry first and render it starting with the most distant element. Step by step, we draw elements that are closer to the observer over more distant ones, ensuring the correct depth ordering of elements. However, we cannot use this idea to compute the overlap between the set of triangles and the set of line segments since they are different types of primitives and, as such, require separate draw calls. As WebGL currently does not have a geometry shader, we cannot mix triangles and lines in one draw call. A consequence is that context can only be rendered as a background silhouette.

Our internal resorting procedure can require a noticeable amount of time (e.g. around 0.5 s for 10,000 trajectories). To ensure a fluent user experience, we use an adaptive strategy and only sort the data when the user stops moving the camera. This can lead to some visual artifacts during the rotation of the camera, but after stopping the motion, the correct rendering order is established quickly. For huge amounts of data, or for devices with low CPU performance (the sorting happens on the CPU, not on the GPU), it is also possible to disable the sorting completely. In that case, we shuffle the rendering order, which at least avoids distracting global patterns introduced by these artifacts.
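The back-to-front ordering described above follows the idea of the painter's algorithm. The Python sketch below illustrates it (linus itself performs this sorting in JavaScript) under one assumption that the text does not specify: depth is measured as the Euclidean distance from the camera position to the midpoint of each line segment. The function names are hypothetical, and the shuffled fallback mirrors the option of disabling the sorting.

```python
import math
import random


def sort_back_to_front(segments, camera):
    """Return line segments ordered from most distant to closest.

    segments: list of ((x0, y0, z0), (x1, y1, z1)) tuples.
    camera: (x, y, z) position of the observer.
    Depth measure (an assumption): distance from camera to segment midpoint.
    """
    def distance(segment):
        start, end = segment
        midpoint = tuple((a + b) / 2.0 for a, b in zip(start, end))
        return math.dist(midpoint, camera)

    # Drawing in this order paints closer segments over more distant ones.
    return sorted(segments, key=distance, reverse=True)


def shuffled_order(segments, seed=None):
    """Fallback when sorting is disabled: a random but stable draw order."""
    rng = random.Random(seed)
    order = list(segments)
    rng.shuffle(order)
    return order


segments = [
    ((0, 0, 1), (1, 0, 1)),  # near
    ((0, 0, 5), (1, 0, 5)),  # far
    ((0, 0, 3), (1, 0, 3)),  # middle
]
for segment in sort_back_to_front(segments, camera=(0, 0, 0)):
    print(segment)
```

Sorting costs O(n log n) per camera pose, which is consistent with re-sorting only once the camera has stopped moving rather than on every frame.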
Data availability

Exemplary visualisations are available by scanning the QR codes in Fig. 1 directly or by visiting https://imb-dev.gitlab.io/linus-manuscript/.

Code availability

The linus software, including source code and documentation, is freely available at our repository at https://gitlab.com/imb-dev/linus.

Acknowledgments

The authors are grateful to Gopi Shah and Konstantin Thierbach for sharing data and contributing useful feedback. J.W. received funding from the International Max Planck Research School on Neuroscience of Communication: Function, Structure, and Plasticity (Leipzig, Germany; https://imprs-neurocom.mpg.de). K.A. and V.T. acknowledge funding from the European Molecular Biology Laboratory (EMBL) Barcelona and thank the Mesoscopic Imaging Facility, EMBL Barcelona, for help with imaging.

Author contributions

N.S., J.H., and I.R. conceived the project. J.W. wrote the software code. M.H. and N.S. supervised the project. N.S. and J.W. wrote the manuscript. K.A. and V.T. generated the dataset on zebrafish blastoderm explants. All authors read, edited, and approved the manuscript.

References

Amat F, Lemon W, Mossing DP, McDole K, Wan Y, Branson K, Myers EW, Keller PJ. 2014. Fast, accurate reconstruction of cell lineages from large-scale fluorescence microscopy data. Nat Methods 11:951–958.
Bailey H, Mate BR, Palacios DM, Irvine L. 2009.
Behavioural estimation of blue whale movements in the Northeast Pacific from state-space model analysis of satellite tracks. Endanger Species Res.
Callaway E. 2016. The visualizations transforming biology. Nature News 535:187.
Egevang C, Stenhouse IJ, Phillips RA, Petersen A, Fox JW, Silk JRD. 2010. Tracking of Arctic terns Sterna paradisaea reveals longest animal migration. Proc Natl Acad Sci U S A 107:2078–2081.
Gonzalez-Beltran AN, Masuzzo P, Ampe C, Bakker G-J, Besson S, Eibl RH, Friedl P, Gunzer M, Kittisopikul M, Le Dévédec SE, Leo S, Moore J, Paran Y, Prilusky J, Rocca-Serra P, Roudot P, Schuster M, Sergeant G, Strömblad S, Swedlow JR, van Erp M, Van Troys M, Zaritsky A, Sansone S-A, Martens L. 2019. Community Standards for Open Cell Migration Data. bioRxiv. doi:10.1101/803064
Imirzian N, Zhang Y, Kurze C, Loreto RG, Chen DZ, Hughes DP. 2019. Automated tracking and analysis of ant trajectories shows variation in forager exploration. Sci Rep 9:13246.
Kwok R. 2019. Deep learning powers a motion-tracking revolution. Nature 574:137–138.
Liu C, Ye FQ, Newman JD, Szczupak D, Tian X, Yen CC-C, Majka P, Glen D, Rosa MGP, Leopold DA, Silva AC. 2020. A resource for the detailed 3D mapping of white matter pathways in the marmoset brain. Nat Neurosci 23:271–280.
McDole K, Guignard L, Amat F, Berger A, Malandain G, Royer LA, Turaga SC, Branson K, Keller PJ. 2018. In Toto Imaging and Reconstruction of Post-Implantation Mouse Development at the Single-Cell Level. Cell 0. doi:10.1016/j.cell.2018.09.031
McQuin C, Goodman A, Chernyshev V, Kamentsky L, Cimini BA, Karhohs KW, Doan M, Ding L, Rafelski SM, Thirstrup D, Wiegraebe W, Singh S, Becker T, Caicedo JC, Carpenter AE. 2018. CellProfiler 3.0: Next-generation image processing for biology. PLoS Biol 16:e2005970.
Pietzsch T, Saalfeld S, Preibisch S, Tomancak P. 2015. BigDataViewer: visualization and processing for large image data sets. Nat Methods 12:481–483.
Romero-Ferrero F, Bergomi MG, Hinz R, Heras FJH, de Polavieja GG. 2018. idtracker.ai: Tracking all individuals in large collectives of unmarked animals. arXiv [csCV].
Royer LA, Weigert M, Günther U, Maghelli N, Jug F, Sbalzarini IF, Myers EW. 2015. ClearVolume: open-source live 3D visualization for light-sheet microscopy. Nat Methods 12:480–481.
Schmid B, Tripal P, Fraaß T, Kersten C, Ruder B, Grüneboom A, Huisken J, Palmisano R. 2019. 3Dscript: animating 3D/4D microscopy data using a natural-language-based syntax. Nat Methods 16:278–280.
Shah G, Thierbach K, Schmid B, Waschke J, Reade A, Hlawitschka M, Roeder I, Scherf N, Huisken J. 2019. Multi-scale imaging and analysis identify pan-embryo cell dynamics of germlayer formation in zebrafish. Nat Commun 10:5753.
Shneiderman B. 1996. The eyes have it: a task by data type taxonomy for information visualizations. Proceedings 1996 IEEE Symposium on Visual Languages. pp. 336–343.
Tinevez J-Y, Perry N, Schindelin J, Hoopes GM, Reynolds GD, Laplantine E, Bednarek SY, Shorte SL, Eliceiri KW. 2016. TrackMate: An open and extensible platform for single-particle tracking. Methods. doi:10.1016/j.ymeth.2016.09.016
Trivedi V, Fulton T, Attardi A, Anlas K, Dingare C, Martinez-Arias A, Steventon B. 2019. Self-organised symmetry breaking in zebrafish reveals feedback from morphogenesis to pattern formation. bioRxiv. doi:10.1101/769257
Wallingford JB. 2019. The 200-year effort to see the embryo. Science 365:758–759.
Figures

Figure 1 Browser-based exploration and sharing of trajectory visualizations with linus. (a) Control workflow of linus.
Starting with the data, a Python converter is used to enrich the data with further features (e.g. numeric metrics, an edge-bundled version of the data, visual context) and to prepare the visualisation package. (b) Within minutes, the data can be visualised and explored in the browser, and different aspects of the data can be interactively highlighted (example shows the effect of changing the degree of trajectory bundling).

Figure 2 Configurable filters allow deep data exploration. The user can choose from several visualisation methods directly in the browser interface to highlight aspects of interest in the data (zebrafish tracking results from (Shah et al., 2019) as an example). (a) The line data is visualized using a range of options for shading and colour mapping. (b-d) The user can filter parts of the data with respect to specific attributes, such as (b) time intervals or (c) a specific range of signals (marker expression in cells in this case). (d) The user can further create subselections of the tracks in space using cutting planes or refinable spatial selection. The visual attributes can be defined separately for the selected focus region and the non-selected context region. (e-g) The web interface can blend seamlessly between different states of the data.
This feature can be used to map between (e) original tracks and their edge-bundled version, to visualize planar projections of the 3D data (f) locally on a definable (oblique) plane or (g) globally using a Mercator projection (with definable parameters).

Figure 3 Sharable interactive visualization packets for a multitude of applications ranging across a variety of sciences. The user can combine the visualization methods, annotations, and camera motion paths in a scheduled tour that can be shared by a custom URL or QR code generated directly in the browser interface. Panels (a)-(d) demonstrate use cases for real-world datasets with different characteristics and dimensionality. (a) Ant trails (2D+t) from (Imirzian et al., 2019). Bundling and colour-coding (spatial orientation by mapping (x,y,z) to (R,G,B) values) indicate the major trails running in opposing directions. (b) GPS animal tracking data for two species (blue whales (Bailey et al., 2009) in blue and arctic tern (Egevang et al., 2010) in red) shown on a Mercator projection of the earth's surface. For better orientation, the outlines of the continents are included as axes in the visualization; they adapt dynamically to the projections and viewpoint changes (2D surface data + t). (c) Cell movements during the elongation process of zebrafish blastoderm explants (3D+t) (Trivedi et al., 2019).
Bundling, colour coding, and spatial selection highlight collective cell movements as the explant starts elongating, focusing on a subpopulation of cells driving this process. Colour code shows time from early (yellow) to late (red) for selected tracks. (d) Brain tractography data showing major white matter connectivity from diffusion MRI (3D). The spatial selection highlights the left hemisphere while anatomical context is provided by the outline of the entire brain (from mesh data) and the defocused tracts of the right hemisphere.

Supplementary Figures

Supplementary Figure 1: Overview of data structure. The coordinate list holds the x/y/z values for each supporting point of the trajectories. For each such point, an arbitrary number of attributes (limited only by the graphics card's capabilities) can be stored. The attributes must be provided in the same order as the points. To create trajectories from the point set, an index list is provided as well. Each pair of indices describes one segment of a trajectory. The number of such segments is not restricted, as any point (and its respective attributes) can be used multiple times.
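The layout described for Supplementary Figure 1 can be illustrated with a small container. This is a minimal sketch under stated assumptions: the class name `TrajectorySet` and its fields are hypothetical and are not taken from the linus code base. It only shows how a coordinate list, per-point attributes in matching order, and an index list of segment pairs fit together.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectorySet:
    """Point-based trajectory container (hypothetical names, not the linus API)."""
    coordinates: list   # one (x, y, z) tuple per supporting point
    indices: list       # pairs (i, j); each pair is one segment between two points
    attributes: dict = field(default_factory=dict)  # name -> values, same order as points

    def validate(self):
        """Check that attributes match the points and that segments reference real points."""
        n = len(self.coordinates)
        for name, values in self.attributes.items():
            if len(values) != n:
                raise ValueError(f"attribute {name!r} has {len(values)} values, expected {n}")
        for i, j in self.indices:
            if not (0 <= i < n and 0 <= j < n):
                raise ValueError(f"segment ({i}, {j}) references a missing point")

    def segments(self):
        """Yield each segment as a pair of coordinates."""
        for i, j in self.indices:
            yield self.coordinates[i], self.coordinates[j]


# Two short trajectories that reuse point 1: 0 -> 1 -> 2 and 3 -> 1.
tracks = TrajectorySet(
    coordinates=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (1.0, -1.0, 0.0)],
    indices=[(0, 1), (1, 2), (3, 1)],
    attributes={"time": [0.0, 1.0, 2.0, 0.5]},
)
tracks.validate()
segment_list = list(tracks.segments())
```

Because segments are stored as index pairs, a point and its attributes can be shared by several segments without being duplicated, which matches the caption's remark that the number of segments is not restricted.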
Supplementary Figure 2: Overview of settings. An overview of the different visualisation settings available to the user from the GUI (two screenshots merged). For explanations regarding different settings, see text or documentation at https://gitlab.com/imb-dev/linus.

Supplementary Figure 3: Tour editor. The tour actions can be organised by drag and drop (reading order: from left to right, top to bottom). Every action can be scheduled with a time delay with respect to the end of the previous action. Some actions use transitions (e.g. camera motions or the adjustment of numeric values) whose duration can be configured as well. Eventually, a URL or a QR code can be created.
10_1101-2020_03_27_012757 ---- 67941284

Evaluating the transcriptional fidelity of cancer models

Da Peng1*, Rachel Gleyzer2*, Wen-Hsin Tai2, Pavithra Kumar2, Qin Bian2, Bradley Issacs2, Edroaldo Lummertz da Rocha3, Stephanie Cai1, Kathleen DiNapoli4,5, Franklin W Huang6, Patrick Cahan1,2,7

1 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA
2 Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA
3 Department of Microbiology, Immunology and Parasitology, Federal University of Santa Catarina, Florianópolis SC, Brazil
4 Department of Cell Biology, Johns Hopkins University School of Medicine, Baltimore, MD 21205 USA
5 Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore MD 21218 USA
6 Division of Hematology/Oncology, Department of Medicine; Helen Diller Family Cancer Center; Bakar Computational Health Sciences Institute; Institute for Human Genetics; University of California, San Francisco, San Francisco, CA
7 Department of Molecular Biology and Genetics, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA

* These authors made equal contributions.

Correspondence to: patrick.cahan@jhmi.edu

Article type: Research

Website: http://www.cahanlab.org/resources/cancerCellNet_web

Code: https://github.com/pcahan1/cancerCellNet

bioRxiv preprint, doi: https://doi.org/10.1101/2020.03.27.012757; this version posted January 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
ABSTRACT

Background: Cancer researchers use cell lines, patient derived xenografts, engineered mice, and tumoroids as models to investigate tumor biology and to identify therapies. The generalizability and power of a model derive from the fidelity with which it represents the tumor type under investigation; however, the extent to which this is true is often unclear. The preponderance of models and the ability to readily generate new ones have created a demand for tools that can measure the extent and ways in which cancer models resemble or diverge from native tumors.

Methods: We developed a machine learning based computational tool, CancerCellNet, that measures the similarity of cancer models to 22 naturally occurring tumor types and 36 subtypes, in a platform and species agnostic manner. We applied this tool to 657 cancer cell lines, 415 patient derived xenografts, 26 distinct genetically engineered mouse models, and 131 tumoroids. We validated CancerCellNet by application to independent data, and we tested several predictions with immunofluorescence.

Results: We have documented the cancer models with the greatest transcriptional fidelity to natural tumors, we have identified cancers underserved by adequate models, and we have found models with annotations that do not match their classification. By comparing models across modalities, we report that, on average, genetically engineered mice and tumoroids have higher transcriptional fidelity than patient derived xenografts and cell lines in four out of five tumor types. However, several patient derived xenografts and tumoroids have classification scores that are on par with native tumors, highlighting both their potential as faithful model classes and their heterogeneity.
Conclusions: CancerCellNet enables the rapid assessment of transcriptional fidelity of tumor models. We have made CancerCellNet available as freely downloadable software and as a web application that can be applied to new cancer models and that allows for direct comparison to the cancer models evaluated here.

INTRODUCTION

Models are widely used to investigate cancer biology and to identify potential therapeutics. Popular modeling modalities are cancer cell lines (CCLs) [1], genetically engineered mouse models (GEMMs) [2], patient derived xenografts (PDXs) [3], and tumoroids [4]. These classes of models differ in the types of questions that they are designed to address. CCLs are often used to address cell intrinsic mechanistic questions [5], GEMMs to chart progression of molecularly defined disease [6], and PDXs to explore patient-specific response to therapy in a physiologically relevant context [7]. More recently, tumoroids have emerged as relatively inexpensive, physiological, in vitro 3D models of tumor epithelium with applications ranging from measuring drug responsiveness to exploring tumor dependence on cancer stem cells. Models also differ in the extent to which they represent specific aspects of a cancer type [8]. Even with this intra- and inter-class model variation, all models should represent the tumor type or subtype under investigation, and not another type of tumor, and not a non-cancerous tissue.
Therefore, cancer models should be selected not only based on the specific biological question but also based on the similarity of the model to the cancer type under investigation [9,10].

Various methods have been proposed to determine the similarity of cancer models to their intended subjects. Domcke et al devised a 'suitability score' as a metric of the molecular similarity of CCLs to high grade serous ovarian carcinoma based on a heuristic weighting of copy number alterations, mutation status of several genes that distinguish ovarian cancer subtypes, and hypermutation status [11]. Other studies have taken analogous approaches by focusing on either transcriptomic or ensemble molecular profiles (e.g. transcriptomic and copy number alterations) to quantify the similarity of cell lines to tumors [12–14]. These studies were tumor-type specific, focusing on CCLs that model, for example, hepatocellular carcinoma or breast cancer. Notably, Yu et al compared the transcriptomes of CCLs to The Cancer Genome Atlas (TCGA) by correlation analysis, resulting in a panel of CCLs recommended as most representative of 22 tumor types [15]. Most recently, Najgebauer et al [16] and Salvadores et al [17] have developed methods to assess CCLs using molecular traits such as copy number alterations (CNA), somatic mutations, DNA methylation and transcriptomics. While all of these studies have provided valuable information, they leave two major challenges unmet.
The first challenge is to determine the fidelity of GEMMs, PDXs, and tumoroids, and whether there are stark differences between these classes of models and CCLs. The other major unmet challenge is to enable the rapid assessment of new, emerging cancer models. This challenge is especially relevant now as technical barriers to generating models have been substantially lowered [18,19], and because new models such as PDXs and tumoroids can be derived on a patient-specific basis and therefore should each be considered a distinct entity requiring individual validation [4,20].

To address these challenges, we developed CancerCellNet (CCN), a computational tool that uses transcriptomic data to quantitatively assess the similarity between cancer models and 22 naturally occurring tumor types and 36 subtypes in a platform- and species-agnostic manner. Here, we describe CCN's performance, and the results of applying it to assess 657 CCLs, 415 PDXs, 26 GEMMs, and 131 tumoroids. This has allowed us to identify the most faithful models currently available, to document cancers underserved by adequate models, and to find models with inaccurate tumor type annotation. Moreover, because CCN is open-source and easy to use, it can be readily applied to newly generated cancer models as a means to assess their fidelity.

RESULTS

CancerCellNet classifies samples accurately across species and technologies

Previously, we had developed a computational tool using the Random Forest classification method to measure the similarity of engineered cell populations to their in vivo counterparts based on transcriptional profiles [21,22]. More recently, we elaborated on this approach to allow for classification of single cell RNA-seq data in a manner that allows for cross-platform and cross-species analysis [23].
Here, we used an analogous approach to build a platform that would allow us to quantitatively compare cancer models to naturally occurring patient tumors (Fig 1A). In brief, we used TCGA RNA-seq expression data from 22 solid tumor types to train a top-pair multi-class Random Forest classifier (Fig 1B). We combined training data from Rectal Adenocarcinoma (READ) and Colon Adenocarcinoma (COAD) into one COAD_READ category because READ and COAD are considered to be virtually indistinguishable at a molecular level [24]. We included an 'Unknown' category trained using randomly shuffled gene-pair profiles generated from the training data of 22 tumor types to identify query samples that are not reflective of any of the training data. To estimate the performance of CCN and how it is impacted by parameter variation, we performed a parameter sweep with a 5-fold 2/3 cross-validation strategy (i.e. 2/3 of the data sampled across each cancer type was used to train, 1/3 was used to validate) (Fig 1C). The performance of CCN, as measured by the mean area under the precision recall curve (AUPRC), did not fall below 0.945 and remained relatively stable across parameter sets (Supp Fig 1A). The optimal parameters resulted in 1,979 features. The mean AUPRCs exceeded 0.95 in most tumor types with this optimal parameter set (Fig 1D, Supp Fig 1B).
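To make the feature construction more concrete, the following is a minimal sketch of a gene-pair encoding and of shuffled 'Unknown' profiles. It rests on two assumptions that are not spelled out in this section: that each gene pair becomes a binary feature indicating whether the first gene is expressed more highly than the second, and that an 'Unknown' profile is a training profile whose feature values have been randomly permuted. The function names, gene names, and expression values are illustrative only and are not the CCN implementation.

```python
import random


def top_pair_transform(expression, gene_pairs):
    """Encode one sample as one binary feature per gene pair.

    Assumption: the feature is 1 when the first gene of the pair is
    expressed more highly than the second, and 0 otherwise.
    """
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]


def shuffled_profiles(profiles, n_profiles, seed=0):
    """Build 'Unknown' profiles by permuting the features of sampled training profiles.

    Assumption: this is one plausible reading of "randomly shuffled
    gene-pair profiles"; the exact procedure may differ.
    """
    rng = random.Random(seed)
    unknown = []
    for _ in range(n_profiles):
        profile = list(rng.choice(profiles))  # copy so the training data is untouched
        rng.shuffle(profile)
        unknown.append(profile)
    return unknown


# Illustrative expression values for a single sample (invented numbers).
expression = {"KRT5": 8.2, "ESR1": 1.1, "ALB": 0.3, "CDX2": 5.0}
gene_pairs = [("KRT5", "ESR1"), ("ALB", "CDX2"), ("CDX2", "ESR1")]
features = top_pair_transform(expression, gene_pairs)

training = [[1, 0, 1], [0, 0, 1]]
unknown = shuffled_profiles(training, n_profiles=2)
```

Under this reading, a feature depends only on the relative order of two genes within one sample, not on absolute expression units, which is one plausible reason such an encoding can transfer across platforms and species.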
The AUPRCs of CCN applied to independent RNA-seq data from 725 tumors across five tumor types from the International Cancer Genome Consortium (ICGC) [25] ranged from 0.93 to 0.99, supporting the notion that the platform is able to accurately classify tumor samples from diverse sources (Fig 1E).

As one of the central aims of our study is to compare distinct cancer models, including GEMMs, our method needed to be able to classify mouse and human samples equivalently. We used the Top-Pair transform [23] to achieve this, and we tested the feasibility of this approach by assessing the performance of a normal (i.e. non-tumor) cell and tissue classifier trained on human data as applied to mouse samples. Consistent with prior applications [23], we found that the cross-species classifier performed well, achieving a mean AUPRC of 0.97 when applied to mouse data (Supp Fig 1C).

To evaluate cancer models at a finer resolution, we also developed an approach to perform tumor subtype classifications (Supp Fig 1D). We constructed 11 different cancer subtype classifiers based on the availability of expression or histological subtype information [24,26–36]. We also included non-cancerous, normal tissues as categories for several subtype classifiers when sufficient data was available: breast invasive carcinoma (BRCA), COAD_READ, head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC) and uterine corpus endometrial carcinoma (UCEC).
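The AUPRC values reported above summarize, per class, how well classification scores rank true members of that class ahead of all other samples. As a hedged illustration, the sketch below computes a step-wise estimate of the area under the precision recall curve (often called average precision) in a one-vs-rest fashion and averages it over classes. Whether CCN uses this particular estimator is an assumption; the function names and scores are invented, and tied scores are not treated specially.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve for one class.

    labels: 1 for samples of the class, 0 otherwise.
    scores: classification scores for that class, one per sample.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            area += true_positives / rank  # precision at the rank of each positive
    return area / total_positives


def mean_auprc(true_classes, score_table):
    """Average the one-vs-rest AUPRC over all classes in score_table.

    score_table maps a class name to a list of scores, one per sample.
    """
    values = []
    for cls, scores in score_table.items():
        labels = [1 if t == cls else 0 for t in true_classes]
        values.append(average_precision(labels, scores))
    return sum(values) / len(values)


# Invented example: the second positive is ranked below one negative.
example = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In this toy example the precision is 1 at the first positive and 2/3 at the second, so the estimate is their mean, 5/6.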
The 11 subtype classifiers all achieved high overall average AUPRCs ranging from 0.80 to 0.99 (Supp Fig 1E).

Fidelity of cancer cell lines

Having validated the performance of CCN, we then used it to determine the fidelity of CCLs. We mined RNA-seq expression data of 657 different cell lines across 20 cancer types from the Cancer Cell Line Encyclopedia (CCLE) and applied CCN to them, finding a wide classification range for cell lines of each tumor type (Fig 2A, Supp Tab 1). To verify the classification results, we applied CCN to expression profiles from CCLE generated through microarray expression profiling [37]. To ensure that CCN would function on microarray data, we first tested it by applying a CCN classifier created to test microarray data to 720 expression profiles of 12 tumor types. The cross-platform CCN classifier performed well, based on the comparison to study-provided annotation, achieving a mean AUPRC of 0.91 (Supp Fig 2A). Next, we applied this cross-platform classifier to microarray expression profiles from CCLE (Supp Fig 2B). From the classification results of 571 cell lines that have both RNA-seq and microarray expression profiles, we found a strong overall positive association between the classification scores from RNA-seq and those from microarray (Supp Fig 2C). This comparison supports the notion that the classification scores for each cell line are not artifacts of profiling methodology. Moreover, this comparison shows that the scores are consistent between the times that the cell lines were first assayed by microarray expression profiling in 2012 and by
RNA-Seq in 2019. We also observed a high level of correlation between our analysis and the analysis done by Yu et al [15] (Supp Fig 2D), further validating the robustness of the CCN results.

Next, we assessed the extent to which CCN classifications agreed with their nominal tumor type of origin, which entailed translating quantitative CCN scores to classification labels. To achieve this, we selected a decision threshold that maximized the Macro F1 measure (the class-averaged F1, where F1 is the harmonic mean of precision and recall) across 50 cross validations. Then, we annotated cell lines based on their CCN score profile as follows. Cell lines with CCN scores > threshold for the tumor type of origin were annotated as 'correct'. Cell lines with CCN scores > threshold in the tumor type of origin and at least one other tumor type were annotated as 'mixed'. Cell lines with CCN scores > threshold for tumor types other than that of the cell line's origin were annotated as 'other'. Cell lines that did not receive a CCN score > threshold for any tumor type were annotated as 'none' (Fig 2B). We found that the majority of cell lines originally annotated as Breast invasive carcinoma (BRCA), Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Skin Cutaneous Melanoma (SKCM), Colorectal Cancer (COAD_READ) and Sarcoma (SARC) fell into the 'correct' category (Fig 2B). On the other hand, no Esophageal carcinoma (ESCA), Pancreatic adenocarcinoma (PAAD) or Brain Lower Grade Glioma (LGG) cell lines were classified as 'correct', demonstrating the need for more transcriptionally faithful cell lines that model those general cancer types.

There are several possible explanations for cell lines not receiving a 'correct' classification.
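The annotation scheme described above ('correct', 'mixed', 'other', 'none') and the threshold selection can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that 'correct' means the tumor type of origin is the only type above threshold, and that, for threshold selection, a sample is predicted as its top-scoring class when that score exceeds the threshold and as 'none' otherwise. The threshold 0.25 and all scores below are hypothetical.

```python
def annotate(scores, origin, threshold):
    """Label one cell line from its per-tumor-type CCN scores."""
    above = {tumor for tumor, score in scores.items() if score > threshold}
    if not above:
        return "none"
    if origin in above:
        # Assumption: 'correct' requires that no other type passes the threshold.
        return "correct" if above == {origin} else "mixed"
    return "other"


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 over the classes present in true_labels."""
    f1_values = []
    for cls in sorted(set(true_labels)):
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        f1_values.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_values) / len(f1_values)


def best_threshold(true_labels, score_rows, candidates):
    """Return the candidate threshold with the highest macro F1 (first one on ties)."""
    def score_for(threshold):
        predicted = []
        for row in score_rows:
            top = max(row, key=row.get)
            predicted.append(top if row[top] > threshold else "none")
        return macro_f1(true_labels, predicted)
    return max(candidates, key=score_for)


# Hypothetical score profiles for four cell lines annotated as SKCM.
labels = [
    annotate({"SKCM": 0.8, "SARC": 0.1}, "SKCM", 0.25),
    annotate({"SKCM": 0.5, "SARC": 0.4}, "SKCM", 0.25),
    annotate({"SKCM": 0.1, "SARC": 0.6}, "SKCM", 0.25),
    annotate({"SKCM": 0.1, "SARC": 0.2}, "SKCM", 0.25),
]
```

With these invented inputs, the four cell lines are labeled 'correct', 'mixed', 'other' and 'none', in that order.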
One possibility is that the sample was incorrectly labeled in the study from which we harvested the expression data. Consistent with this explanation, we found that the colorectal cancer line NCI-H684 [38,39], a cell line labelled as liver hepatocellular carcinoma (LIHC) by CCLE, was classified strongly as COAD_READ (Supp Tab 1). Another possible explanation for a low CCN score is that cell lines were derived from subtypes of tumors that are not well-represented in TCGA. To explore this hypothesis, we first performed tumor subtype classification on CCLs from 11 tumor types for which we had trained subtype classifiers (Supp Tab 2). We reasoned that if a cell line was a good model for a rarer subtype, then it would receive a poor general classification but a high classification for the subtype that it models well. Therefore, we counted the number of lines that fit this pattern. We found that of the 188 lines with no general classification, 25 (13%) were classified as a specific subtype, suggesting that derivation from rare subtypes is not the major contributor to the poor overall fidelity of CCLs.

Another potential contributor to low scoring cell lines is intra-tumor stromal and immune cell impurity in the training data. If impurity were a confounder of CCN scoring, then we would expect a strong positive correlation between mean purity and mean CCN classification scores of CCLs per general tumor type.
However, the Pearson correlation coefficient between the mean purity of general tumor type and mean CCN classification scores of CCLs in the corresponding general tumor type was low (0.14), suggesting that tumor purity is not a major contributor to the low CCN scores across CCLs (Supp Fig 2E).

Comparison of SKCM and GBM CCLs to scRNA-seq

To more directly assess the impact of intra-tumor heterogeneity in the training data on evaluating cell lines, we constructed a classifier using cell types found in human melanoma and glioblastoma scRNA-seq data [40,41]. Previously, we have demonstrated the feasibility of using our classification approach on scRNA-seq data [23]. Our scRNA-seq classifier achieved a high average AUPRC (0.95) when applied to held-out data and a high mean AUPRC (0.99) when applied to a few purified bulk testing samples (Supp Fig 3A-B). Comparing the CCN scores from the bulk RNA-seq general classifier and the scRNA-seq classifier, we observed a high level of correlation (Pearson correlation of 0.89) between the SKCM CCN classification scores and scRNA-seq SKCM malignant CCN classification scores for SKCM cell lines (Fig 2C, Supp Fig 3C). Of the 41 SKCM cell lines that were classified as SKCM by the bulk classifier, 37 were also classified as SKCM malignant cells by the scRNA-seq classifier. Interestingly, we also observed a high correlation between the SARC CCN classification score and scRNA-seq cancer
associated fibroblast (CAF) CCN classification scores (Pearson correlation of 0.92). Six of the seven SKCM cell lines that had been classified as exclusively SARC by CCN were classified as CAF by the scRNA-seq classifier (Fig 2D, Supp Fig 3C), which suggests the possibility that these cell lines were derived from CAF or other mesenchymal populations, or that they have acquired a mesenchymal character through their derivation. The high level of agreement between scRNA-seq and bulk RNA-seq classification results shows that heterogeneity in the training data of the general CCN classifier has little impact on the classification of SKCM cell lines.

In contrast, we observed a weaker correlation between GBM CCN classification scores and scRNA-seq GBM neoplastic CCN classification scores (Pearson correlation of 0.72) for GBM cell lines (Fig 2E, Supp Fig 3D). Of the 31 GBM lines that were not classified as GBM with CCN, 25 were classified as GBM neoplastic cells with the scRNA-seq classifier. Among the 22 GBM lines that were classified as SARC with CCN, 15 cell lines were classified as CAF (Fig 2F), 10 of which were classified as both GBM neoplastic and CAF in the scRNA-seq classifier. Similar to the situation with SKCM lines that classify as CAF, this result is consistent with the possibility that some GBM lines classified as SARC by CCN could be derived from mesenchymal subtypes exhibiting both strong mesenchymal signatures and glioblastoma signatures or that they have acquired a mesenchymal character through their derivation.
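The agreement between bulk and scRNA-seq classifiers is summarized here by Pearson correlations computed across cell lines. For readers who want the arithmetic, a minimal sketch follows; the function name and the score vectors are invented for illustration and do not reproduce the values in Fig 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(variance_x * variance_y)


# Invented scores: one value per cell line from each classifier.
bulk_scores = [0.82, 0.10, 0.55, 0.31, 0.74]
single_cell_scores = [0.90, 0.05, 0.60, 0.20, 0.65]
agreement = pearson(bulk_scores, single_cell_scores)
```

A coefficient near 1 indicates that cell lines scored highly by one classifier also tend to be scored highly by the other, whereas a lower value, as reported for the GBM lines, indicates that the two classifiers rank the lines differently.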
The lower level of agreement between scRNA-seq and bulk RNA-seq classification results for GBM models suggests that the heterogeneity of glioblastomas [42] can impact the classification of GBM cell lines, and that the use of an scRNA-seq classifier can resolve this deficiency.

Immunofluorescence confirmation of CCN predictions

To experimentally explore some of our computational analyses, we performed immunofluorescence on three cell lines that were not classified as their labelled categories: the ovarian cancer line SK-OV-3 had a high UCEC CCN score (0.246), the ovarian cancer line A2780 had a high Testicular Germ Cell Tumors (TGCT) CCN score (0.327), and the prostate cancer line PC-3 had a high bladder cancer (BLCA) score (0.307) (Supp Tab 1). We reasoned that if SK-OV-3, A2780 and PC-3 were classified most strongly as UCEC, TGCT and BLCA, respectively, then they would express proteins that are indicative of these cancer types.

First, we measured the expression of the uterine-associated transcription factor HOXB6 [43,44], and the UCEC serous ovarian tumor biomarker WT1 [45] in SK-OV-3, in the OV cell line Caov-4, and in the UCEC cell line HEC-59. We chose Caov-4 as our positive control for OV biomarker expression because it was determined by our analysis and others [11,15] to be a good model of OV. Likewise, we chose HEC-59 to be a positive control for UCEC.
We found that SK-OV-3 has a small percentage (5%) of cells that expressed the uterine marker HOXB6 and a large proportion (73%) of cells that expressed WT1 (Fig 3A). In contrast, no Caov-4 cells expressed HOXB6, whereas 85% of cells expressed WT1. This suggests that SK-OV-3 exhibits biomarkers of both ovarian tumor and uterine tissue. From our computational analysis and experimental validation, SK-OV-3 is most likely an endometrioid subtype of ovarian cancer. This result is also consistent with prior classification of SK-OV-3 [46], and the fact that SK-OV-3 lacks p53 mutations, which are prevalent in high-grade serous ovarian cancer [47], and it harbors an endometrioid-associated mutation in ARID1A [11,46,48]. Next, we measured the expression of markers of OV and germ cell cancers (LIN28A [49]) in the OV-annotated cell line A2780, which received a high TGCT CCN score. We found that 54% of A2780 cells expressed LIN28A whereas it was not detected in Caov-4 (Fig 3B). The OV marker WT1 was also expressed in fewer A2780 cells as compared to Caov-4 (48% vs 85%), which suggests that A2780 could be a germ cell derived ovarian tumor. Taken together, our results suggest that SK-OV-3 and A2780 could represent OV subtypes that are not well represented in TCGA training data, which resulted in a low OV score and higher CCN scores in other categories.

Lastly, we examined PC-3, annotated as a PRAD cell line but classified to be most similar to BLCA. We found that 30% of the PC-3 cells expressed PPARG, a contributor to urothelial differentiation [50] that is not detected in the PRAD VCaP cell line but is highly expressed
; https://doi.org/10.1101/2020.03.27.012757doi: bioRxiv preprint https://doi.org/10.1101/2020.03.27.012757 http://creativecommons.org/licenses/by-nc-nd/4.0/ 11 in the BLCA RT4 cell line (Fig 3C). PC-3 cells also expressed the PRAD biomarker FOLH151 293 suggesting that PC-3 has an PRAD origin and gained urothelial or luminal characteristics 294 through the derivation process. In short, our limited experimental data support the CCN 295 classification results. 296 297 Subtype classification of cancer cell lines 298 Next, we explored the subtype classification of CCLs from three general tumor types in 299 more depth. We focused our subtype visualization (Fig 4A-C) on CCL models with general CCN 300 score above 0.1 in their nominal cancer type as this allowed us to analyze those models that fell 301 below the general threshold but were classified as a specific sub-type (Supp Tab 1-2). 302 Focusing first on UCEC, the histologically defined subtypes of UCEC, endometrioid and serous, 303 differ in prevalence, molecular properties, prognosis, and treatment. For instance, the 304 endometrioid subtype, which accounts for approximately 80% of uterine cancers, retains 305 estrogen receptor and progesterone receptor status and is responsive towards progestin 306 therapy52,53. Serous, a more aggressive subtype, is characterized by the loss of estrogen and 307 progesterone receptor and is not responsive to progestin therapy52,53. CCN classified the 308 majority of the UCEC cell lines as serous except for JHUEM-1 which is classified as mixed, with 309 similarities to both endometrioid and serous (Fig 4A). The preponderance CCLE lines of serous 310 versus endometroid character may be due to properties of serous cancer cells that promote 311 their in vitro propagation, such as upregulation of cell adhesion transcriptional programs54. 312 Some of our subtype classification results are consistent with prior observations. 
For example, HEC-1A, HEC-1B, and KLE were previously characterized as type II endometrial cancer, which includes a serous histological subtype [55]. On the other hand, our subtype classification results contradict prior observations in at least one case. For instance, the Ishikawa cell line was derived from type I endometrial cancer (endometrioid histological subtype) [55,56]; however, CCN classified a derivative of this line, Ishikawa 02 ER-, as serous. The high serous CCN score could result from a shift in phenotype of the line concomitant with its loss of estrogen receptor (ER), as this is a distinguishing feature of type II endometrial cancer (serous histological subtype) [52]. Taken together, these results indicate a need for more endometrioid-like CCLs.

Next, we examined the subtype classification of Lung Squamous Cell Carcinoma (LUSC) and Lung adenocarcinoma (LUAD) cell lines (Fig 4B-C). All the LUSC lines with at least one subtype classification had an underlying primitive subtype classification. This is consistent either with the ease of deriving lines from tumors with a primitive character, or with a process by which cell line derivation promotes similarity to a more primitive subtype, which is marked by increased cellular proliferation [28]. Some of our results are consistent with prior reports that have investigated the resemblance of some lines to LUSC subtypes.
For example, HCC-95, previously characterized as classical [28,57], had a maximum CCN score in the classical subtype (0.429). Similarly, LUDLU-1 and EPLC-272H, previously reported as classical [57] and basal [57] respectively, had maximal tumor subtype CCN scores for these sub-types (0.323 and 0.256) (Fig 4B, Supp Tab 2) despite being classified as Unknown. Lastly, the LUAD cell lines that were classified as a subtype were classified as either proximal inflammation or proximal proliferation (Fig 4C). RERF-LC-Ad1 had the highest general classification score and the highest proximal inflammation subtype classification score. Taken together, these subtype classification results have revealed an absence of cell line models for basal and secretory LUSC, and for the Terminal respiratory unit (TRU) LUAD subtype.

Cancer cell lines’ popularity and transcriptional fidelity

Finally, we sought to measure the extent to which cell line transcriptional fidelity related to model prevalence. We used the number of papers in which a model was mentioned, normalized by the number of years since the cell line was documented, as a rough approximation of model prevalence. To explore this relationship, we plotted the normalized citation count versus general classification score, labeling the highest cited and highest classified cell lines from each general tumor type (Fig 4D).
For most of the general tumor types, the highest cited cell line is not the highest classified cell line. The exceptions are Hep G2, AGS and ML-1, representing liver hepatocellular carcinoma (LIHC), stomach adenocarcinoma (STAD), and thyroid carcinoma (THCA), respectively. On the other hand, the general scores of the highest cited cell lines representing BLCA (T24), BRCA (MDA-MB-231), and PRAD (PC-3) fall below the classification threshold of 0.25. Notably, each of these tumor types has other lines with scores exceeding 0.5, which should be considered as more faithful transcriptional models when selecting lines for a study (Supp Tab 1 and http://www.cahanlab.org/resources/cancerCellNet_results/).

Evaluation of patient derived xenografts

Next, we sought to evaluate a more recent class of cancer models: PDX. To do so, we subjected the RNA-seq expression profiles of 415 PDX models from 13 different cancer types generated previously [20] to CCN. Similar to the results of CCLs, the PDXs exhibited a wide range of classification scores (Fig 5A, Supp Tab 3). By categorizing the CCN scores of PDXs based on the proportion of samples associated with each tumor type that were correctly classified, we found that SARC, SKCM, COAD_READ and BRCA have a higher proportion of correctly classified PDXs than other cancer categories (Fig 5B). In contrast to CCLs, we found a higher proportion of correctly classified PDXs in STAD, PAAD and KIRC (Fig 5B). However, similar to CCLs, no ESCA PDXs were classified as such. This held true when we performed subtype classification on PDX samples: none of the ESCA PDXs were classified as any of the ESCA subtypes (Supp Tab 4). UCEC PDXs included endometrioid, serous, and mixed subtypes, which provided a broader representation than CCLs (Fig 5C).
Several LUSC PDXs that were classified as a subtype were also classified as Head and Neck squamous cell carcinoma (HNSC) or as a mix of HNSC and LUSC (Fig 5D). This could be due to the similarity in expression profiles of basal and classical subtypes of HNSC and LUSC [28,58], which is consistent with the observation that these PDXs were also subtyped as classical. No LUSC PDXs were classified as the secretory subtype. In contrast to LUAD CCLs, four of the five LUAD PDXs with a discernible sub-type were classified as proximal inflammatory (Fig 5E). On the other hand, similar to the CCLs, there were no TRU subtypes in the LUAD PDX cohort. In summary, we found that while individual PDXs can reach extremely high transcriptional fidelity to both general tumor types and subtypes, many PDXs were not classified as the general tumor type from which they originated.

Evaluation of GEMMs

Next, we used CCN to evaluate GEMMs of six general tumor types from nine studies for which expression data was publicly available [59–67]. As was true for CCLs and PDXs, GEMMs also had a wide range of CCN scores (Fig 6A, Supp Tab 5). We next categorized the CCN scores based on the proportion of samples associated with each tumor type that were correctly classified (Fig 6B). In contrast to LGG CCLs, LGG GEMMs, generated by Nf1 mutations expressed in different neural progenitors in combination with Pten deletion [66], were consistently classified as LGG (Fig 6A-B).
The GEMM dataset included multiple replicates per model, which allowed us to examine intra-GEMM variability. Both at the level of CCN score and at the level of categorization, GEMMs were invariant. For example, replicates of UCEC GEMMs driven by Prg(cre/+)Pten(lox/lox) received almost identical general CCN scores (Fig 6C, Supp Tab 6). GEMMs sharing genotypes across studies, such as LUAD GEMMs driven by Kras mutation and loss of p53 [59,65,67], also received similar general and subtype classification scores (Fig 6A,B,E).

Next, we explored the extent to which genotype impacted subtype classification in UCEC, LUSC, and LUAD. Prg(cre/+)Pten(lox/lox) GEMMs had a mixed subtype classification of both serous and endometrioid, consistent with the fact that Pten loss occurs in both subtypes (albeit more frequently in endometrioid). We also analyzed Prg(cre/+)Pten(lox/lox)Csf3r-/- GEMMs. Polymorphonuclear neutrophils (PMNs), which play anti-tumor roles in endometrioid cancer progression, are depleted in these animals. Interestingly, Prg(cre/+)Pten(lox/lox)Csf3r-/- GEMMs had a serous subtype classification, which could be explained by differences in PMN involvement in endometrioid versus serous uterine tumor development that are reflected in the respective transcriptomes of the TCGA UCEC training data. We note that the tumor cells were sorted prior to RNA-seq and thus the shift in subtype classification is not due to contamination of GEMMs with non-tumor components.
In short, this analysis supports the argument that tumor-cell extrinsic factors, in this case a reduction in anti-tumor PMNs, can shift the transcriptome of a GEMM so that it more closely resembles a serous rather than endometrioid subtype.

The LUSC GEMMs that we analyzed were Lkb1fl/fl and they either overexpressed Sox2 (via two distinct mechanisms) or were also Ptenfl/fl [65]. We note that the eight lenti-Sox2-Cre-infected;Lkb1fl/fl and Rosa26LSL-Sox2-IRES-GFP;Lkb1fl/fl samples that classified as 'Unknown' had LUSC CCN scores only modestly lower than the decision threshold (Fig 6D) (mean CCN score = 0.217). Thirteen of the 17 Sox2 GEMMs were classified as the secretory subtype of LUSC. The consistency is not surprising given that both models overexpress Sox2 and lose Lkb1. On the other hand, the Lkb1fl/fl;Ptenfl/fl GEMMs had substantially lower general LUSC CCN scores and our subtype classification indicated that this GEMM was mostly classified as 'Unknown', in contrast to prior reports suggesting that it is most similar to a basal subtype [68]. None of the three LUSC GEMMs have strong classical CCN scores. Most of the LUAD GEMMs, which were generated using various combinations of activating Kras mutation, loss of Trp53, and loss of Smarca4L [59,65,67], were correctly classified (Fig 6E). Those that were not classified have CCN scores modestly lower than the decision threshold (mean CCN score = 0.214). There were no substantial differences in general or subtype classification across driver genotypes. Although the sub-type of all LUAD GEMMs was 'Unknown', the subtypes tended to have a mixture of high CCN proximal proliferation, proximal inflammation and TRU scores. Taken together, this analysis suggests that there is a degree of similarity, and perhaps plasticity, between the primitive and secretory (but not basal or classical) subtypes of LUSC.
On the other hand, while the LUAD GEMMs classify strongly as LUAD, they do not have a strong classification for any particular subtype, a result that does not vary by genotype.

Evaluation of Tumoroids

Lastly, we used CCN to assess a relatively novel cancer model: tumoroids. We downloaded and assessed 131 distinct tumoroid expression profiles spanning 13 cancer categories from The NCI Patient-Derived Models Repository (PDMR) [69] and from three individual studies [70–72] (Fig 7A, Supp Tab 7). We note that several categories have three or fewer samples (BRCA, CESC, KIRP, OV, LIHC, and BLCA from PDMR). Among the cancer categories represented by more than three samples, only LUSC and PAAD have fewer than 50% classified as their annotated label (Fig 7B). In contrast to GBM CCLs, all three induced pluripotent stem cell-derived GBM tumoroids [72] were classified as GBM with high CCN scores (mean = 0.53). To further characterize the tumoroids, we performed subtype classification on them (Supp Tab 8). UCEC tumoroids from PDMR contain a wide range of subtypes, with two endometrioid, two serous and one mixed type (Fig 7C). On the other hand, LUSC tumoroids appear to be predominantly of the classical subtype, with one tumoroid classified as a mix between classical and primitive (Fig 7D). Lastly, similar to the CCL and PDX counterparts, LUAD tumoroids are classified as proximal inflammatory and proximal proliferation, with no tumoroids classified as the TRU subtype (Fig 7E).
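The per-category proportions reported throughout these evaluations can be illustrated with a minimal Python sketch. It assumes that a sample counts as classified as its annotated label when its CCN score for that label reaches the 0.25 classification threshold quoted in the text; the function name `fraction_correct`, the input layout and the cutoff of at least four samples (that is, more than three) are hypothetical choices for illustration and are not part of the cancerCellNet package.

```python
from collections import defaultdict

THRESHOLD = 0.25  # classification threshold quoted in the text


def fraction_correct(samples, threshold=THRESHOLD, min_samples=4):
    """Fraction of samples per annotated label whose score for that label
    reaches the threshold.

    samples: iterable of (annotated_label, {category: score}) pairs.
    Labels with fewer than min_samples samples are left out (assumption).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, scores in samples:
        totals[label] += 1
        # Assumed decision rule: score for the annotated label >= threshold.
        if scores.get(label, 0.0) >= threshold:
            hits[label] += 1
    return {
        label: hits[label] / totals[label]
        for label in totals
        if totals[label] >= min_samples
    }


# Toy scores, invented for illustration only.
toy = [
    ("LUSC", {"LUSC": 0.30}),
    ("LUSC", {"LUSC": 0.10}),
    ("LUSC", {"LUSC": 0.20}),
    ("LUSC", {"LUSC": 0.50}),
    ("OV", {"OV": 0.90}),  # single sample, excluded by min_samples
]
print(fraction_correct(toy))
```

With the toy input, two of the four LUSC samples reach the threshold, and the single OV sample is excluded because its category has too few samples.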
Comparison of CCLs, PDXs, GEMMs and tumoroids

Finally, we sought to estimate the comparative transcriptional fidelity of the four cancer model modalities. We compared the general CCN scores of each model on a per tumor type basis (Fig 8). In the case of GEMMs, we used the mean classification score of all samples with shared genotypes. We also used the mean classification score of technical replicates found in LIHC tumoroids [70]. We evaluated models based on both the maximum CCN score, as this represents the potential for a model class, and the median CCN score, as this indicates the current overall transcriptional fidelity of a model class. PDXs achieved the highest CCN scores in three (UCEC, PAAD, LUAD) out of the five cancer categories in which all four modalities were available (Fig 8), despite having low median CCN scores. Notably, PDXs have a median CCN score above the 0.25 threshold in PAAD while none of the other three modalities have any samples above the threshold. In LIHC, the highest CCN score for PDX (0.9) is only slightly lower than the highest CCN score for tumoroid (0.91). This suggests that certain individual PDXs most closely mimic the transcriptional state of native patient tumors despite a portion of the PDXs having low CCN scores. Similarly, while the majority of the CCLs have low CCN scores, several lines achieve high transcriptional fidelity in LUSC, LUAD and LIHC (Fig 8).
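The summary statistics used in this comparison can be sketched in a few lines of Python. The sketch averages replicate scores per group (for example, GEMM samples sharing a genotype) and then reports the maximum and median score per modality. The function names, the input layout and the toy scores are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import defaultdict
from statistics import mean, median


def average_replicates(samples):
    """Collapse (group_key, score) pairs to one mean score per group,
    e.g. GEMM samples that share a genotype."""
    groups = defaultdict(list)
    for key, score in samples:
        groups[key].append(score)
    return [mean(scores) for scores in groups.values()]


def summarize(scores_by_modality):
    """Return {modality: (max_score, median_score)} for non-empty inputs."""
    return {
        modality: (max(scores), median(scores))
        for modality, scores in scores_by_modality.items()
        if scores
    }


# Toy scores for a single tumor type, invented for illustration only.
gemm = average_replicates([("g1", 0.25), ("g1", 0.75), ("g2", 0.5)])
summary = summarize({"GEMM": gemm, "PDX": [0.125, 0.875, 0.25]})
print(summary)
```

In the toy input the PDX scores have the higher maximum but the lower median, which is the pattern described for several tumor types in the text.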
Collectively, GEMMs and tumoroids had the highest median CCN scores in four of the five cancer categories (LUSC and LUAD for GEMMs and UCEC and LIHC for tumoroids). Notably, both of the LIHC tumoroids achieved CCN scores on par with patient tumors (Fig 8). In brief, this analysis indicates that PDXs and CCLs are heterogeneous in terms of transcriptional fidelity, with a portion of the models highly mimicking native tumors and the majority of the models having low transcriptional fidelity (with the exception of PAAD for PDXs). On the other hand, GEMMs and tumoroids displayed a consistently high fidelity across different models.

Because the CCN score is based on a moderate number of gene features (i.e. 1,979 gene pairs consisting of 1,689 unique genes) relative to the total number of protein-coding genes in the genome, it is possible that a cancer model with a high CCN score might not have a high global similarity to a naturally occurring tumor. Therefore, we also calculated the GRN status, a metric of the extent to which a tumor-type specific gene regulatory network is established [21], for all models (Supp Fig 4). We observed a high level of correlation between the two similarity metrics, which suggests that although CCN classifies on a selected set of genes, its scores are highly correlated with a global assessment of transcriptional similarity.

We also sought to compare model modalities in terms of the diversity of subtypes that they represent (Supp Fig 5).
As a reference, we also included in this analysis the overall subtype incidence, as approximated by incidence in TCGA. Replicates in GEMMs and tumoroids were averaged into one classification profile. In models of UCEC, there is a notable difference between endometrioid incidence and the proportion of models classified as endometrioid, with only PDXs and tumoroids having any representatives (Supp Fig 5). All of the CCL, GEMM, and tumoroid models of PAAD have an unknown subtype classification and no correct general classification. However, the majority of PDXs are subtyped as either a mixture of basal and classical, or classical alone. LUAD has proximal inflammation and proximal proliferation subtypes modelled by CCLs and PDXs (Supp Fig 5). Likewise, LUSC has basal, classical and primitive subtypes modelled by CCLs and PDXs, and the secretory subtype modelled by GEMMs exclusively (Supp Fig 5). Taken together, these results demonstrate the need to carefully select different model systems to more suitably model certain cancer subtypes.

DISCUSSION

A major goal in the field of cancer biology is to develop models that mimic naturally occurring tumors with enough fidelity to enable therapeutic discoveries. However, methods to measure the extent to which cancer models resemble or diverge from native tumors are lacking. This is especially problematic now because there are many existing models from which to choose, and it has become easier to generate new models. Here, we present CancerCellNet (CCN), a computational tool that measures the similarity of cancer models to 22 naturally occurring tumor types and 36 subtypes. While the similarity of CCLs to patient tumors has already been explored in previous work, our tool introduces the capability to assess the transcriptional fidelity of PDXs, GEMMs, and tumoroids.
Because CCN is platform- and species-agnostic, it represents a consistent platform to compare models across modalities including CCLs, PDXs, GEMMs and tumoroids. Here, we applied CCN to 657 cancer cell lines, 415 patient derived xenografts, 26 distinct genetically engineered mouse models and 131 tumoroids. Several insights emerged from our computational analyses that have implications for the field of cancer biology.

First, PDXs have the greatest potential to achieve transcriptional fidelity in three out of five general tumor types for which data from all modalities was available, as indicated by the high scores of individual PDXs. Notably, PDXs are the only modality with samples classified as PAAD. At the same time, the median CCN scores of PDXs were lower than those of GEMMs and tumoroids in the other four tumor types. It is unclear what causes such a wide range of CCN scores within PDXs. We suspect that some PDXs might have undergone selective pressures in the host that distort the progression of genomic alterations away from what is observed in natural tumors [73]. Future work to understand this heterogeneity is important so as to yield consistently high fidelity PDXs, and to identify intrinsic and host-specific factors that so powerfully shape the PDX transcriptome.

Second, in general GEMMs and tumoroids have higher median CCN scores than those of PDXs and CCLs.
This is consistent with the fact that GEMMs are typically derived by recapitulating well-defined driver mutations of natural tumors, and thus this observation corroborates the importance of genetics in the etiology of cancer [74]. Moreover, in contrast to most PDXs, GEMMs are typically generated in immune replete hosts. Therefore, the higher overall fidelity of GEMMs may also be a result of the influence of a native immune system on GEMM tumors [75]. The high median CCN scores of tumoroids can be attributed to several factors including the increased mechanical stimuli and cell-cell interactions that come from 3D self-organizing cultures [76,77].

Third, we have found that none of the samples that we evaluated here are transcriptionally adequate models of ESCA. This may be due to an inherent lability of the ESCA transcriptome that is often preceded by a metaplasia that has obscured determining its cell type(s) of origin [78]. Therefore, this tumor type requires further attention to derive new models.

Fourth, we found that in several tumor types, GEMMs tend to reflect mixtures of subtypes rather than conforming strongly to single subtypes. The reasons for this are not clear but it is possible that in the cases that we examined the histologically defined subtypes have a degree of plasticity that is exacerbated in the murine host environment.

Lastly, we recognize that many CCLs are not classified as their annotated labels.
While we have suggested that the lack of an immune component is not a major confounder, we suspect that the CCLs could undergo genetic divergence due to a high number of passages, chemotherapy before biopsy, culture conditions and genetic instability [79–82], which could all be factors that drive CCLs away from their labelled tumors.

Currently, there are several limitations to our CCN tool, and caveats to our analyses, which indicate areas for future work and improvement. First, CCN is based on transcriptomic data but other molecular readouts of tumor state, such as profiles of the proteome [83], epigenome [84], non-coding RNA-ome [84], and genome [74] would be equally, if not more, important to mimic in a model system. Therefore, it is possible that some models reflect tumor behavior well, and because this behavior is not well predicted by transcriptome alone, these models have lower CCN scores. To both measure the extent to which such situations exist, and to correct for them, we plan in the future to incorporate other omic data into CCN so as to make more accurate and integrated model evaluation possible. As a first step in this direction, we plan to incorporate DNA methylation and genomic sequencing data as additional features for our Random forest classifier, as this data is becoming more readily available for both training and cancer models. We expect that this will allow us both to refine our tumor subtype categories and to predict more accurately how models respond to perturbations such as drug treatment.

A second limitation is that in the cross-species analysis, CCN implicitly assumes that homologs are functionally equivalent. The extent to which they are not functionally equivalent determines how confounded the CCN results will be.
This possibility seems to be of limited consequence based on the high performance of the normal tissue cross-species classifier and based on the fact that GEMMs have the highest median CCN scores (in addition to tumoroids).

A third caveat to our analysis is that there were many fewer distinct GEMMs and tumoroids than CCLs and PDXs. As more transcriptional profiles for GEMMs and tumoroids emerge, this comparative analysis should be revisited to assess the generality of our results.

Finally, the TCGA training data is made up of RNA-Seq from bulk tumor samples, which necessarily includes non-tumor cells, whereas the CCLs are by definition cell lines of tumor origin. Therefore, CCLs theoretically could have artificially low CCN scores due to the presence of non-tumor cells in the training data. This problem appears to be limited as we found no correlation between tumor purity and CCN score in the CCLE samples. However, this problem is related to the question of intra-tumor heterogeneity. We demonstrated the feasibility of using CCN and single cell RNA-seq data to refine the evaluation of cancer cell lines, contingent upon availability of scRNA-seq training data. As more single cell RNA-seq training data accrues, CCN will be able to evaluate models not only on a per cell type basis, but also based on cellular composition.
We have made the results of our analyses available online so that researchers can easily explore the performance of selected models or identify the best models for any of the 22 general tumor types and the 36 subtypes presented here. To ensure that CCN is widely available we have developed a free web application, which performs CCN analysis on user-uploaded data and allows for direct comparison of their data to the cancer models evaluated here. We have also made the CCN code freely available under an Open Source license and as an easily installed R package, and we are actively supporting its further development. Included in the web application are instructions for training CCN and reproducing our analysis. The documentation describes how to analyze models and compare the results to the panel of models that we evaluated here, thereby allowing researchers to immediately compare their models to the broader field in a comprehensive and standard fashion.

Online Methods

Training General CancerCellNet Classifier

To generate training data sets, we downloaded an RNA-seq expression count matrix of 8,991 patient tumors and the corresponding sample table across 22 different tumor types from TCGA using the TCGAWorkflowData, TCGAbiolinks [85] and SummarizedExperiment [86] packages. We used all the patient tumor samples for training the general CCN classifier.
We limited training and analysis of RNA-seq data to the 13,142 genes in common between the TCGA dataset and all the query samples (CCLs, PDXs, GEMMs, and tumoroids). To train the top pair Random forest classifier, we used a method similar to our previous method [23]. CCN first normalized the training counts matrix by down-sampling the counts to 500,000 counts per sample. To significantly reduce the execution time and memory of generating gene pairs for all possible genes, CCN then selected n up-regulated genes, n down-regulated genes and n least differentially expressed genes (CCN training parameter nTopGenes = n) for each of the 22 cancer categories using template matching [87] as the genes to generate top scoring gene pairs. In short, for each tumor type, CCN defined a template vector that labelled the training tumor samples in the cancer type of interest as 1 and all other tumor samples as 0. CCN then calculated the Pearson correlation coefficient between the template vector and gene expression for all genes. The genes with a strong match to the template as either upregulated or downregulated had a large absolute Pearson correlation coefficient. CCN chose the upregulated, downregulated and least differentially expressed genes based on the magnitude of the Pearson correlation coefficient.

After CCN selected the genes for each cancer type, CCN generated gene pairs among those genes. Gene pair transformation was a method inspired by the top-scoring pair classifier [88] to allow compatibility of the classifier with query expression profiles that were collected through different platforms (e.g. microarray query data applied to RNA-seq training data).
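The template-matching gene selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the cancerCellNet implementation: the names `pearson` and `select_genes`, the dictionary input layout and the toy expression values are hypothetical, and genes with constant expression are assigned a correlation of 0 by convention.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_genes(expr, labels, target, n):
    """Pick n up-regulated, n down-regulated and n least differentially
    expressed genes for one tumor type.

    expr:   {gene: [expression per training sample]}
    labels: tumor type of each training sample, in the same order
    """
    # Template vector: 1 for samples of the tumor type of interest, 0 otherwise.
    template = [1.0 if lab == target else 0.0 for lab in labels]
    r = {gene: pearson(template, values) for gene, values in expr.items()}
    ranked = sorted(r, key=r.get)                 # ascending correlation
    down = ranked[:n]                             # most negative correlation
    up = ranked[-n:][::-1]                        # most positive correlation
    least = sorted(r, key=lambda g: abs(r[g]))[:n]  # closest to zero
    return up, down, least


# Toy expression values, invented for illustration only.
expr = {"A": [5, 6, 1, 1], "B": [1, 1, 5, 6], "C": [3, 3, 3, 3]}
labels = ["X", "X", "Y", "Y"]
up, down, least = select_genes(expr, labels, target="X", n=1)
print(up, down, least)
```

In the toy data, gene A is high only in tumor type X, gene B is high only in type Y, and gene C does not vary, so they are returned as the up-regulated, down-regulated and least differentially expressed gene, respectively.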
In brief, the gene-pair transformation compares two genes within an expression sample and encodes the “gene1_gene2” gene pair as 1 if the first gene has higher expression than the second gene. Otherwise, the gene-pair transformation encodes the gene pair as 0. Using all the gene-pair combinations generated from the gene sets per cancer type, CCN then selected the top m discriminative gene pairs (CCN training parameter nTopGenePairs = m) for each category using the template matching described above (selecting pairs with a large absolute Pearson correlation coefficient). To prevent any single gene from dominating the gene-pair list, we allowed each gene to appear at most three times among the gene pairs selected as features per cancer type.

After the top discriminative gene pairs were selected for each cancer category, CCN grouped all the gene pairs together and gene-pair transformed the training samples into a binary matrix with all the discriminative gene pairs as row names and all the training samples as column names. Using the binary gene-pair matrix, CCN randomly shuffled the binary values across rows and then across columns to generate random profiles that should not resemble training data from any of the cancer categories. CCN then sampled 70 random profiles, annotated them as “Unknown” and used them as training data for the “Unknown” category.
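The gene-pair transformation and the template-matching selection of gene pairs described above can be illustrated with a minimal Python sketch. This is not the cancerCellNet R implementation: the function names (`gene_pair_transform`, `pearson`, `top_pairs`), the use of unordered pairs, and the greedy way the three-uses-per-gene cap is enforced are assumptions made only for illustration.

```python
from itertools import combinations
from math import sqrt


def gene_pair_transform(sample, genes):
    """Encode each pair (g1, g2) as 1 if g1 is expressed higher than g2, else 0.

    Assumption: only one orientation of each pair is kept (unordered pairs).
    """
    return {(g1, g2): int(sample[g1] > sample[g2])
            for g1, g2 in combinations(genes, 2)}


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def top_pairs(samples, labels, cancer_type, genes, m, max_per_gene=3):
    """Pick up to m gene pairs with the largest |r| against the 0/1 template.

    The template marks samples of `cancer_type` as 1 and all others as 0.
    Assumption: the per-gene cap is applied greedily in order of decreasing |r|.
    """
    template = [1 if lab == cancer_type else 0 for lab in labels]
    encoded = [gene_pair_transform(s, genes) for s in samples]
    scores = {pair: abs(pearson([e[pair] for e in encoded], template))
              for pair in encoded[0]}
    chosen, uses = [], {g: 0 for g in genes}
    for pair in sorted(scores, key=scores.get, reverse=True):
        g1, g2 = pair
        if uses[g1] < max_per_gene and uses[g2] < max_per_gene:
            chosen.append(pair)
            uses[g1] += 1
            uses[g2] += 1
        if len(chosen) == m:
            break
    return chosen
```

Because each feature depends only on the within-sample ordering of two genes, the encoding is unchanged by any monotone rescaling of a sample's expression values, which is what makes it usable across platforms.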
Using the gene-pair binary training matrix, CCN constructed a multi-class Random Forest classifier of 2000 trees and used stratified sampling with a sample size of 60 to ensure balance of training data in constructing the decision trees.

To identify the best set of genes and gene-pair parameters (n and m), we used a grid-search cross-validation⁸⁹ strategy with 5 cross-validations at each parameter set. The specific parameters for the final CCN classifier using the function “broadClass_train” in the package cancerCellNet are in Supp Tab 9. The gene pairs are in Supp Tab 10.

Validating General CancerCellNet Classifier

Two thirds of patient tumor data from each cancer type were randomly sampled as training data to construct a CCN classifier. Based on the training data, CCN selected the classification genes and gene pairs and trained a classifier. After the classifier was built, 35 held-out samples from each cancer category were sampled and 40 “Unknown” profiles were generated for validation. The process of randomly sampling a training set from 2/3 of all patient tumor data, selecting features based on the training set, training the classifier and validating it was repeated 50 times to obtain a more comprehensive assessment of the classifier trained with the optimal parameter set. To test the performance of the final CCN classifier on independent testing data, we applied it to 725 profiles from ICGC spanning 6 projects that do not overlap with TCGA (BRCA-KR, LIRI-JP, OV-AU, PACA-AU, PACA-CA, PRAD-FR).
Selecting Decision Thresholds

Our strategy for selecting a decision threshold was to find the value that maximizes the average Macro F1 measure⁹⁰ for each of the 50 cross-validations that were performed with the optimal parameter set, testing thresholds between 0 and 1 with a 0.01 increment. The F1 measure is defined as:

$$\text{Macro } F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

We selected the most commonly occurring threshold above 0.2 that maximized the average Macro F1 measure across the 50 cross-validations as the decision threshold for the final classifier (threshold = 0.25). The same approach was applied for the subtype classifiers. The thresholds and the corresponding average precision, recall and F1 measures are recorded in Supp Tab 11.

Classifying Query Data into General Cancer Categories

We downloaded the RNA-seq expression profiles and sample table of cancer cell lines from CCLE (https://portals.broadinstitute.org/ccle/data), and the microarray expression profiles and sample table of cancer cell lines from Barretina et al.³⁷. We extracted two WT control NCCIT RNA-seq expression profiles from Grow et al.⁹¹. We received PDX expression estimates and sample annotations from the authors of Gao et al.²⁰. We gathered GEMM expression profiles from nine different studies⁵⁹⁻⁶⁷. We downloaded tumoroid expression profiles from the NCI Patient-Derived Models Repository (PDMR)⁶⁹ and from three individual studies⁷⁰⁻⁷².
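As an aside on the threshold selection described under Selecting Decision Thresholds, the sweep over thresholds from 0 to 1 in steps of 0.01 can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, precision and recall are assumed to be macro-averaged over classes before being combined in the F1 formula, and a sample is assumed to be called for a class when its score is strictly above the threshold.

```python
from collections import Counter


def macro_f1(scores, labels, classes, threshold):
    """Macro F1 from class-averaged precision and recall at one threshold.

    scores: list of dicts mapping class name -> classification score.
    Assumption: a class is called for a sample when its score > threshold.
    """
    precisions, recalls = [], []
    for c in classes:
        called = [s[c] > threshold for s in scores]
        truth = [lab == c for lab in labels]
        tp = sum(a and b for a, b in zip(called, truth))
        fp = sum(a and not b for a, b in zip(called, truth))
        fn = sum(b and not a for a, b in zip(called, truth))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / len(classes)
    r = sum(recalls) / len(classes)
    return 2 * p * r / (p + r) if p + r else 0.0


def best_threshold(scores, labels, classes, step=0.01):
    """Lowest threshold on the 0..1 grid that maximizes macro F1."""
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    return max(grid, key=lambda t: macro_f1(scores, labels, classes, t))


def consensus_threshold(per_fold_best, floor=0.2):
    """Most common per-fold optimum above `floor` (assumed reading of the text)."""
    counts = Counter(t for t in per_fold_best if t > floor)
    return counts.most_common(1)[0][0]
```

In this sketch `best_threshold` would be run once per cross-validation, and `consensus_threshold` would then pick the most frequent optimum above 0.2 across the runs.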
To use the CCN classifier on GEMM data, the mouse genes from GEMM expression profiles were converted into their human homologs. The query samples were classified using the final CCN classifier. Each query classification profile was labelled as one of four classification categories: “correct”, “mixed”, “none” and “other”. If a sample had a CCN score higher than the decision threshold only in the labelled cancer category, we assigned it as “correct”. If a sample had a CCN score higher than the decision threshold in the labelled cancer category and in other cancer categories, we assigned it as “mixed”. If a sample had no CCN score higher than the decision threshold in any cancer category or had its highest CCN score in the “Unknown” category, we assigned it as “none”. If a sample had a CCN score higher than the decision threshold in a cancer category or categories not including the labelled cancer category, we assigned it as “other”. We analyzed and visualized the results using R and the R packages pheatmap⁹² and ggplot2⁹³.

Cross-Species Assessment

To assess the performance of cross-species classification, we downloaded 1003 labelled human tissue/cell type and 1993 labelled mouse tissue/cell type RNA-seq expression profiles from GitHub (https://github.com/pcahan1/CellNet). We first converted the mouse genes into human homologous genes. Then we found the intersecting genes between the mouse tissue/cell expression profiles and the human tissue/cell expression profiles. Limiting the input of human tissue RNA-seq profiles to the intersecting genes, we trained a CCN classifier with all the human tissue/cell expression profiles. The parameters used for the function “broadClass_train” in the package cancerCellNet are in Supp Tab 9.
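As a brief illustration of the four-way labelling (“correct”, “mixed”, “none”, “other”) defined under Classifying Query Data into General Cancer Categories, the rules can be written as a small Python function. The function name and the precedence of the rules (checking “none” first) are assumptions for illustration; the default threshold of 0.25 is the value reported for the general classifier.

```python
def classify_profile(scores, labelled, threshold=0.25, unknown="Unknown"):
    """Label one query sample as "correct", "mixed", "none" or "other".

    scores: dict mapping category name (including `unknown`) -> CCN score.
    labelled: the cancer category the sample is annotated with.
    """
    top = max(scores, key=scores.get)
    above = {c for c, s in scores.items() if c != unknown and s > threshold}
    # "none": nothing passes the threshold, or "Unknown" has the top score.
    if not above or top == unknown:
        return "none"
    if labelled in above:
        # "correct" if only the labelled category passes, otherwise "mixed".
        return "correct" if above == {labelled} else "mixed"
    # Some category passes, but not the labelled one.
    return "other"
```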
We randomly sampled 75 samples from each tissue category in the mouse tissue/cell data and applied the classifier to those samples to assess performance.

Cross-Technology Assessment

To assess the performance of CCN when applied to microarray data, we gathered 6,219 patient tumor microarray profiles across 12 different cancer types from more than 100 different projects (Supp Tab 12). We found the intersecting genes between the microarray profiles and the TCGA patient RNA-seq profiles. Limiting the input of RNA-seq profiles to the intersecting genes, we created a CCN classifier with all the TCGA patient profiles using the parameters for the function “broadClass_train” listed in Supp Tab 9. After the microarray-specific classifier was trained, we randomly sampled 60 microarray patient samples from each cancer category and applied the CCN classifier to them to assess cross-technology performance (Supp Fig 2A). The same CCN classifier was used to assess microarray CCL samples (Supp Fig 2B).

Training and Validating scRNA-seq Classifier

We extracted labelled human melanoma and glioblastoma scRNA-seq expression profiles⁴⁰,⁴¹ and compiled the two datasets, excluding 3 cell types (T.CD4, T.CD8 and Myeloid) because they had too few cells for training. Sixty cells from each of the 11 cell types were sampled for training a scRNA-seq classifier. The parameters for training a general scRNA-seq classifier using the function “broadClass_train” are in Supp Tab 9.
Twenty-five cells from each of the 11 cell types in the held-out data were selected to assess the single-cell classifier. By maximizing the average Macro F1 measure, we selected a decision threshold of 0.255. The gene pairs that were selected to construct the classifier are in Supp Tab 10. To assess the cross-technology capability of applying the scRNA-seq classifier to bulk RNA-seq, we downloaded 305 expression profiles spanning 4 purified cell types (B cells, endothelial cells, monocyte/macrophage, fibroblast) from https://github.com/pcahan1/CellNet.

Training Subtype CancerCellNet

We found 11 cancer types (BRCA, COAD, ESCA, HNSC, KIRC, LGG, PAAD, UCEC, STAD, LUAD, LUSC) that have meaningful subtypes based on either histology or molecular profile and have sufficient samples to train a subtype classifier with high AUPRC. We also included normal tissue samples from BRCA, COAD, HNSC, KIRC and UCEC to create a normal tissue category in the construction of their subtype classifiers. Training samples were labelled either as a cancer subtype for the cancer of interest or as “Unknown” if they belonged to other cancer types. Similar to general classifier training, CCN performed gene-pair transformation and selected the most discriminative gene pairs for each cancer subtype.
In addition to the gene pairs selected to discriminate cancer subtypes, CCN also performed general classification of all training data and appended the classification profiles of the training data to the gene-pair binary matrix as additional features. The reason for using the general classification profile as additional features is that many general cancer types may share similar subtypes, so the general classification profile could provide important features for discriminating the general cancer type of interest from other cancer types before performing finer subtype classification. The specific parameters used to train individual subtype classifiers using the “subClass_train” function of the cancerCellNet package can be found in Supp Tab 9 and the gene pairs are in Supp Tab 10.

Validating Subtype CancerCellNet

Similar to validating the general classifier, we randomly sampled 2/3 of all samples in each cancer subtype as training data and sampled an equal number across subtypes from the 1/3 held-out data for assessing subtype classifiers. We repeated the process 20 times for a more comprehensive assessment of the subtype classifiers.

Classifying Query Data into Subtypes

We assigned a subtype to a query sample if the query sample had a CCN score higher than the decision threshold. The decision thresholds for the subtype classifiers are in Supp Tab 11.
If no CCN score exceeded the decision threshold in any subtype or if the highest CCN score was in the “Unknown” category, we assigned that sample as “Unknown”. Analysis was performed in R and visualizations were generated with the ComplexHeatmap package⁹⁴.

Cell Culture, Immunohistochemistry and Histomorphometry

Caov-4 (ATCC® HTB-76™), SK-OV-3 (ATCC® HTB-77™), RT4 (ATCC® HTB-2™), and NCCIT (ATCC® CRL-2073™) cell lines were purchased from ATCC. HEC-59 (C0026001) and A2780 (93112519-1VL) were obtained from Addexbio Technologies and Sigma-Aldrich, respectively. Vcap and PC-3. SK-OV-3, Vcap, and RT4 were cultured in Dulbecco's Modified Eagle Medium (DMEM, high glucose, 11960069, Gibco) with 1% Penicillin-Streptomycin-Glutamine (10378016, Life Technologies); Caov-4, PC-3, NCCIT, and A2780 were cultured in RPMI-1640 medium (11875093, Gibco), while HEC-59 was cultured in Iscove's Modified Dulbecco's Medium (IMDM, 12440053, Gibco). Both media were supplemented with 1% Penicillin-Streptomycin (15140122, Gibco). All media included 10% Fetal Bovine Serum (FBS).

Cells cultured in 48-well plates were washed twice with PBS and fixed in 10% buffered formalin for 24 h at 4 °C. Immunostaining was performed using a standard protocol. Cells were incubated with primary antibodies to goat HOXB6 (10 µg/mL, PA5-37867, Invitrogen), mouse WT1 (10 µg/mL, MA1-46028, Invitrogen), rabbit PPARG (1:50, ABN1445, Millipore), mouse FOLH1 (10 µg/mL, UM570025, Origene), and rabbit LIN28A (1:50, #3978, Cell Signaling) in Antibody Diluent (S080981-2, DAKO) at 4 °C overnight, followed by three 5 min washes in TBST.
The slides were then incubated with fluorophore-conjugated secondary antibodies at room temperature for 1 h while protected from light, followed by three 5 min washes in TBST and nuclear staining with mounting medium containing DAPI. Images were captured with a Nikon Eclipse Ti-S, DS-U3 and DS-Qi2.

Histomorphometry was performed using ImageJ (Version 2.0.0-rc-69/1.52i). The percentage of positive cells was calculated as the number of positively stained cells divided by the number of DAPI-positive nuclei within three randomly chosen areas. The data were expressed as means ± SD.

Tumor Purity Analysis

We used the R package ESTIMATE⁹⁵ to calculate the ESTIMATE scores from the TCGA tumor expression profiles that we used as training data for the CCN classifier. To calculate tumor purity we used the equation described in Yoshihara et al., 2013⁹⁵:

$$\text{Tumour purity} = \cos(0.6049872018 + 0.0001467884 \times \text{ESTIMATE score})$$

Extracting Citation Counts

We used the R package RISmed⁹⁶ to extract the number of citations for each cell line through a query search of “cell line name[Text Word] AND cancer[Text Word]” on PubMed. The citation counts were normalized by dividing them by the number of years since the cell line was first documented.

$$\text{Normalized citation counts} = \frac{\text{citation counts}}{\text{number of years since first documented}}$$

GRN Construction and GRN Status

GRN construction was extended from our previous method²¹.
Eighty samples per cancer type were randomly sampled and normalized through down-sampling as training data for the CLR GRN construction algorithm. Cancer-type-specific GRNs were identified by determining the differentially expressed genes for each cancer type and extracting the subnetwork using those genes.

To extend the original GRN status algorithm²¹ across different platforms and species, we devised a rank-based GRN status algorithm. Like the original GRN status, the rank-based GRN status is a metric for assessing the similarity of a cancer-type-specific GRN between the training data in the cancer type of interest and the query samples. Hence, a high GRN status represents a high level of establishment or similarity of the cancer-specific GRN in the query sample compared to that of the training data. The expression profiles of the training data and query data were transformed into rank expression profiles by replacing the expression values with the rank of the expression values within a sample (the highest expressed gene has the highest rank and the lowest expressed gene has a rank of 1). The cancer-type-specific mean and standard deviation of every gene’s rank expression were learned from the training data.
The modified Z-score values for genes within the cancer-type-specific GRN were calculated for the query sample’s rank expression profile to quantify how dissimilar the expression values of genes in the query sample’s cancer-type-specific GRN are compared to those of the reference training data:

$$
\text{Zscore}(\text{gene } i)_{\text{mod}} =
\begin{cases}
0, & \text{if Zscore is positive and the gene is found to be upregulated} \\
0, & \text{if Zscore is negative and the gene is found to be downregulated} \\
\text{abs}(\text{Zscore}), & \text{otherwise}
\end{cases}
$$

If a gene in the cancer-type-specific GRN is found to be upregulated in the specific cancer type relative to other cancer types, then we consider the query sample’s gene to be similar if the ranking of the query sample’s gene is equal to or greater than the mean ranking of the gene in the training samples. Because of this similarity, we assign that gene a Z-score of 0. The same principle applies to cases where the gene is downregulated in the cancer-specific subnetwork.

The GRN status for a query sample is calculated as the weighted mean of $(1000 - \text{Zscore}(\text{gene } i)_{\text{mod}})$ across genes in the cancer-type-specific GRN. 1000 is an arbitrary large number, and a larger dissimilarity between the query’s cancer-type-specific GRN and that of the training data yields high Z-scores for the GRN genes and a low GRN status.

$$RGS = \sum_{i=1}^{n} \left(1000 - \text{Zscore}(\text{gene } i)_{\text{mod}}\right) \times \text{weight}_{\text{gene } i}$$

$$\text{GRN Status} = \frac{RGS}{\sum_{i=1}^{n} \text{weight}_{\text{gene } i}}$$

The weight of individual genes in the cancer-specific network is determined by the importance of the gene in the Random Forest classifier.
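The rank transformation, modified Z-score and weighted mean described above can be put together in a short Python sketch. It is meant only to make the arithmetic concrete and is not the authors' implementation: the function names, the `direction` and `weights` inputs (standing in for each gene's up/down status and its Random Forest importance), the use of the sample standard deviation, the handling of ties in ranking, and the treatment of a zero standard deviation are all assumptions.

```python
from statistics import mean, stdev


def rank_profile(expr):
    """Within-sample ranks: the lowest expressed gene gets rank 1.

    Assumption: ties are broken by the order in which genes are listed.
    """
    ordered = sorted(expr, key=expr.get)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def rank_stats(training_samples, grn_genes):
    """Mean and (sample) standard deviation of each GRN gene's rank."""
    ranks = [rank_profile(s) for s in training_samples]
    return {g: (mean(r[g] for r in ranks), stdev(r[g] for r in ranks))
            for g in grn_genes}


def grn_status(query, stats, direction, weights, offset=1000):
    """Weighted mean of (offset - modified Z-score) over the GRN genes.

    direction: dict gene -> "up" or "down" in the cancer type of interest.
    weights: dict gene -> weight (e.g. classifier importance; hypothetical input).
    """
    q = rank_profile(query)
    rgs = 0.0
    for g, (mu, sd) in stats.items():
        # Assumption: a gene with zero rank variance contributes a Z-score of 0.
        z = (q[g] - mu) / sd if sd > 0 else 0.0
        agrees = (z > 0 and direction[g] == "up") or (z < 0 and direction[g] == "down")
        z_mod = 0.0 if agrees else abs(z)
        rgs += (offset - z_mod) * weights[g]
    return rgs / sum(weights[g] for g in stats)
```

A query whose GRN genes deviate only in the expected direction scores exactly the offset (1000), and every deviation in the opposite direction lowers the status in proportion to the gene's weight.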
Finally, the GRN status is normalized with respect to the GRN status of the cancer type of interest and the cancer type with the lowest mean GRN status.

$$\text{Normalized GRN status} = \frac{\text{GRN status}_{\text{query}} - \text{avg}(\text{GRN status}_{\text{min cancer}})}{\text{avg}(\text{GRN status}_{\text{cancer type interest}})}$$

Here, “min cancer” represents the cancer type whose training data have the lowest mean GRN status in the cancer type of interest, and $\text{avg}(\text{GRN status}_{\text{min cancer}})$ represents the lowest average GRN status in the cancer type of interest. $\text{avg}(\text{GRN status}_{\text{cancer type interest}})$ represents the average GRN status of the cancer type of interest in the training data.

Code availability

CancerCellNet code and documentation are available on GitHub: https://github.com/pcahan1/cancerCellNet

Acknowledgements

This work was supported by the National Institutes of Health NCI Ovarian Cancer SPORE P50CA228991 via a Development Research Program award to PC. FWH was supported by a Prostate Cancer Foundation Young Investigator Award, Department of Defense W81XWH-17-PCRP-HD (F.W.H.), and the National Institutes of Health/National Cancer Institute P20 CA233255-01 (F.W.H.) and U19 CA214253 (F.W.H.). We would like to thank John Powers, Hao Zhu, Tian-Li Wang, Charles Eberhart, and Kaloyan Tsanov for comments on the manuscript and helpful discussions. Some figures were created in part with Biorender.com.

FIGURE LEGENDS

Fig. 1 CancerCellNet (CCN) workflow, training, and performance. (A) Schematic of CCN usage.
CCN was designed to assess and compare the expression profiles of cancer models such as CCLs, PDXs, GEMMs, and tumoroids with native patient tumors. To use a trained classifier, CCN takes the query samples as input (e.g. expression profiles from CCLs, PDXs, GEMMs, tumoroids) and generates a classification profile for each query sample. The column names of the classification heatmap represent sample annotations and the row names represent different cancer types. Each grid cell is colored from black to yellow, representing the lowest classification score (e.g. 0) to the highest classification score (e.g. 1). (B) Schematic of the CCN training process. CCN uses patient tumor expression profiles of 22 different cancer types from TCGA as training data. First, CCN identifies n genes that are upregulated, n that are downregulated, and n that are relatively invariant in each tumor type versus all of the others. Then, CCN performs a pair transform on these genes and subsequently selects the most discriminative set of m gene pairs for each cancer type as features (or predictors) for the Random Forest classifier. Lastly, CCN trains a multi-class Random Forest classifier using the gene-pair transformed training data. (C) Parameter optimization strategy. Five cross-validations of each parameter set, in which 2/3 of the TCGA data was used to train and 1/3 to validate, were used to search for the values of n and m that maximized performance of the classifier as measured by the area under the precision-recall curve (AUPRC). (D) Mean and standard deviation of classifier performance based on 50 cross-validations with the optimal parameter set. (E) AUPRC of the final CCN classifier when applied to independent patient tumor data from ICGC.

Fig. 2 Evaluation of cancer cell lines. (A) General classification heatmap of CCLs extracted from CCLE. Column annotations of the heatmap represent the labelled cancer category of the CCLs given by CCLE and the row names of the heatmap represent different cancer categories. CCLs’ general classification profiles are categorized into 4 categories: correct (red), correct mixed (pink), no classification (light green) and other classification (dark green) based on the decision threshold of 0.25. (B) Bar plot representing the proportion of each classification category in CCLs across cancer types, ordered from the cancer types with the highest proportion of correct and correct mixed CCLs to the lowest. (C) Comparison between SKCM general CCN scores from the bulk RNA-seq classifier and SKCM malignant CCN scores from the scRNA-seq classifier for SKCM CCLs. (D) Comparison between SARC general CCN scores from the bulk RNA-seq classifier and CAF CCN scores from the scRNA-seq classifier for SKCM CCLs. (E) Comparison between GBM general CCN scores from the bulk RNA-seq classifier and GBM neoplastic CCN scores from the scRNA-seq classifier for GBM CCLs. (F) Comparison between SARC general CCN scores and CAF CCN scores from the scRNA-seq classifier for GBM CCLs. The green lines indicate the decision thresholds for the scRNA-seq classifier and the general classifier.

Fig. 3 Immunofluorescence of selected cell lines. (A) Classification profiles (left) and IF expression (middle) of Caov-4 (OV positive control), HEC-59 (UCEC positive control) and SK-OV-3 for WT1 (OV biomarker) and HOXB6 (uterine biomarker).
The bar plots quantify the average percentage of positive cells for WT1 (top-right) and HOXB6 (bottom-right). (B) Classification profiles (left) and IF expression (middle) of Caov-4, NCCIT (germ cell tumor positive control) and A2780 for WT1 and LIN28A (germ cell tumor biomarker). Classification of NCCIT was performed using RNA-seq profiles of the WT control NCCIT duplicates from Grow et al.⁹¹. The bar plots quantify the average percentage of positive cells for WT1 (top-right) and LIN28A (bottom-right). (C) Classification profiles (left) and IF expression (middle) of Vcap (PRAD positive control), RT4 (BLCA positive control) and PC-3 for FOLH1 (prostate biomarker) and PPARG (urothelial biomarker). The bar plots quantify the average percentage of positive cells for FOLH1 (top-right) and PPARG (bottom-right).

Fig. 4 Subtype classification of CCLs and CCL prevalence. The heatmap visualizations represent subtype classification of (A) UCEC CCLs, (B) LUSC CCLs and (C) LUAD CCLs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed. (D) Comparison of normalized citation counts and general CCN classification scores of CCLs. Labelled cell lines have either the highest CCN classification score in their labelled cancer category or the highest normalized citation count. Each citation count was normalized by the number of years since the cell line was first documented on PubMed.

Fig. 5 Evaluation of patient derived xenografts. (A) General classification heatmap of PDXs.
Column annotations represent the annotated cancer type of the PDXs, and row names represent cancer categories. (B) Proportion of classification categories in PDXs across cancer types is visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified PDXs to the lowest. Subtype classification heatmaps of (C) UCEC PDXs, (D) LUSC PDXs and (E) LUAD PDXs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 6 Evaluation of genetically engineered mouse models. (A) General classification heatmap of GEMMs. Column annotations represent the annotated cancer type of the GEMMs, and row names represent cancer categories. (B) Proportion of classification categories in GEMMs across cancer types is visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified GEMMs to the lowest. Subtype classification heatmaps of (C) UCEC GEMMs, (D) LUSC GEMMs and (E) LUAD GEMMs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 7 Evaluation of tumoroid models. (A) General classification heatmap of tumoroids. Column annotations represent the annotated cancer type of the tumoroids, and row names represent cancer categories.
(B) Proportion of classification categories in tumoroids across cancer types is visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified tumoroids to the lowest. Subtype classification heatmaps of (C) UCEC tumoroids, (D) LUSC tumoroids and (E) LUAD tumoroids. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 8 Comparison of CCLs, PDXs, and GEMMs. Box-and-whiskers plot comparing general CCN scores across CCLs, GEMMs, and PDXs of five general tumor types (UCEC, PAAD, LUSC, LUAD, LIHC).

Supplementary Information

Supplementary Figure 1 Assessment of CCN general classifier and subtype classifier. (A) Mean AUPRC of repeated grid-search cross-validation for each parameter grid. (B) Mean and range of the CCN classifier’s PR curves from 50 cross-validations based on the optimal feature selection parameters n and m. (C) AUPRC of the CCN human tissue classifier when applied to mouse tissue data. (D) Schematic of training a subtype classifier in CCN. CCN uses patient tumor expression profiles from the cancer of interest as training data. CCN performs gene-pair transformation and selects the most discriminative gene pairs among the cancer subtypes from the training data as features. CCN then applies the general classification to the training data and uses the general classification profile as features in addition to gene pairs for training a Random Forest classifier. The weight of the general classification profiles as features can be tuned to improve AUPRC. (E) The mean and standard deviation of AUPRC for 11 subtype classifiers based on 20 iterations of random sampling of training and held-out data, training subtype
classifier using training data, classification of held-out data, and calculation of recall and precision.

Supplementary Figure 2 Further validation of CCN and classification results. To validate the cross-platform classification performance of CCN, a new classifier for microarray data was trained using RNA-seq data from TCGA as training data, restricted to the genes shared between the RNA-seq data and the microarray data. (A) AUPRC of the CCN classifier when applied to tumor profiles assayed on microarrays. (B) Classification heatmap of CCLs using microarray expression data. (C) Pearson correlation between CCN scores of CCLE lines generated from RNA-seq data and microarray data. (D) Comparison between CCLs' CCN scores and the similarity metric from Yu et al.15 (median correlations of transcriptional profiles between CCLs and TCGA tumors from the CCLs' labelled cancer category). (E) Comparison of mean tumor purity of training data and mean CCN scores of CCLs for each cancer category.

Supplementary Figure 3 Single-cell classification of SKCM and GBM cell lines. (A) AUPRC of the single-cell classifier when applied to scRNA-seq held-out data. (B) AUPRC of the scRNA-seq classifier when applied to purified bulk RNA samples. (C) Single-cell classification of SKCM CCLs. Red bar-plot (top) represents general CCN scores in SARC and blue bar-plot (bottom) represents general CCN scores in SKCM. (D) Single-cell classification of GBM CCLs. Red bar-plot (top) represents general CCN scores in SARC and yellow bar-plot (bottom) represents general CCN scores in GBM.
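The gene-pair transformation mentioned in the Supplementary Figure 1 legend can be illustrated with a minimal sketch. It assumes the transformation is the pairwise expression comparison of Geman et al. (reference 88), in which each gene pair becomes a binary feature; the function name, the feature naming scheme, and the input layout are hypothetical and not taken from the CCN code.

```python
from itertools import combinations


def gene_pair_transform(expression, genes):
    """Turn expression values into binary gene-pair features.

    expression: dict mapping sample name -> dict of gene -> expression value
    genes: ordered list of genes to pair up (hypothetical input layout)

    Assumption: a pair feature "A_B" is 1 when gene A is expressed more
    highly than gene B in that sample, and 0 otherwise.
    """
    pairs = list(combinations(genes, 2))
    features = {}
    for sample, profile in expression.items():
        features[sample] = {
            f"{a}_{b}": int(profile[a] > profile[b]) for a, b in pairs
        }
    return features


# Toy usage with made-up values for three genes in one sample.
toy = {"sample1": {"A": 5.0, "B": 2.0, "C": 7.0}}
print(gene_pair_transform(toy, ["A", "B", "C"]))
```

Because each feature depends only on the ordering of two genes within a sample, features of this kind are insensitive to monotonic rescaling of the expression values, which may be one reason such a transformation suits cross-platform comparisons.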
Supplementary Figure 4 Correlation between cancer type specific network GRN status and general CCN scores.

Supplementary Figure 5 Proportion of cancer subtypes in different cancer models and TCGA tumor data across 11 general cancer types.

Supplementary Table 1 General classification profiles of CCLs.
Supplementary Table 2 Subtype classification profiles of CCLs.
Supplementary Table 3 General classification profiles of PDXs.
Supplementary Table 4 Subtype classification profiles of PDXs.
Supplementary Table 5 General classification profiles of GEMMs.
Supplementary Table 6 Subtype classification profiles of GEMMs.
Supplementary Table 7 General classification profiles of tumoroids.
Supplementary Table 8 Subtype classification profiles of tumoroids.
Supplementary Table 9 Specific parameters used for training of all classifiers.
Supplementary Table 10 Gene-pairs selected for final training of CCN general, subtype classifiers and single-cell classifier.
Supplementary Table 11 Decision thresholds and the corresponding precision and recall for the general classifier and subtype classifier.
Supplementary Table 12 Accessions of tumor microarray data used in validation.

REFERENCES
1. Sharma, S. V., Haber, D. A. & Settleman, J. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer 10, 241–253 (2010).
2. Kersten, K., de Visser, K.
E., van Miltenburg, M. H. & Jonkers, J. Genetically engineered mouse models in oncology research and cancer medicine. EMBO Mol. Med. 9, 137–153 (2017).
3. Hidalgo, M. et al. Patient-derived xenograft models: an emerging platform for translational cancer research. Cancer Discov. 4, 998–1013 (2014).
4. Drost, J. & Clevers, H. Organoids in cancer research. Nat. Rev. Cancer 18, 407–418 (2018).
5. Klijn, C. et al. A comprehensive transcriptional portrait of human cancer cell lines. Nat. Biotechnol. 33, 306–312 (2015).
6. Koren, S. et al. PIK3CA(H1047R) induces multipotency and multi-lineage mammary tumours. Nature 525, 114–118 (2015).
7. DeRose, Y. S. et al. Tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. Nat. Med. 17, 1514–1520 (2011).
8. Sharpless, N. E. & DePinho, R. A. The mighty mouse: genetically engineered mouse models in cancer drug development. Nat. Rev. Drug Discov. 5, 741–754 (2006).
9. Mouradov, D. et al. Colorectal cancer cell lines are representative models of the main molecular subtypes of primary cancer. Cancer Res. 74, 3238–3247 (2014).
10. Stuckelberger, S. & Drapkin, R. Precious GEMMs: emergence of faithful models for ovarian cancer research. J. Pathol. 245, 129–131 (2018).
11. Domcke, S., Sinha, R., Levine, D. A., Sander, C. & Schultz, N. Evaluating cell lines as tumour models by comparison of genomic profiles. Nat. Commun.
4, 2126 (2013).
12. Jiang, G. et al. Comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer. BMC Genomics 17 Suppl 7, 525 (2016).
13. Chen, B., Sirota, M., Fan-Minogue, H., Hadley, D. & Butte, A. J. Relating hepatocellular carcinoma tumor samples and cell lines using gene expression data in translational research. BMC Med. Genomics 8 Suppl 2, S5 (2015).
14. Vincent, K. M., Findlay, S. D. & Postovit, L. M. Assessing breast cancer cell lines as tumour models by comparison of mRNA expression profiles. Breast Cancer Res. 17, 114 (2015).
15. Yu, K. et al. Comprehensive transcriptomic analysis of cell lines as models of primary tumors across 22 tumor types. Nat. Commun. 10, 3574 (2019).
16. Najgebauer, H. et al. CELLector: Genomics-Guided Selection of Cancer In Vitro Models. Cell Syst. 10, 424–432.e6 (2020).
17. Salvadores, M., Fuster-Tormo, F. & Supek, F. Matching cell lines with cancer type and subtype of origin via mutational, epigenomic, and transcriptomic patterns. Sci. Adv. 6, (2020).
18. Guernet, A. & Grumolato, L. CRISPR/Cas9 editing of the genome for cancer modeling. Methods 121-122, 130–137 (2017).
19. Gargiulo, G. Next-Generation in vivo Modeling of Human Cancers. Front. Oncol. 8, 429 (2018).
20. Gao, H. et al. High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nat. Med. 21, 1318–1325 (2015).
21. Cahan, P. et al. CellNet: network biology applied to stem cell engineering. Cell 158, 903–915 (2014).
22. Radley, A. H. et al. Assessment of engineered cells using CellNet and RNA-seq. Nat. Protoc. 12, 1089–1102 (2017).
23. Tan, Y. & Cahan, P. SingleCellNet: A Computational Tool to Classify Single Cell RNA-Seq Data Across Platforms and Across Species. Cell Syst. 9, 207–213.e2 (2019).
24. Cancer Genome Atlas Network.
Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
25. Zhang, J. et al. International Cancer Genome Consortium Data Portal--a one-stop shop for cancer genomics data. Database (Oxford) 2011, bar026 (2011).
26. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
27. Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27, 1160–1167 (2009).
28. Wilkerson, M. D. et al. Lung squamous cell carcinoma mRNA expression subtypes are reproducible, clinically important, and correspond to normal cell types. Clin. Cancer Res. 16, 4864–4875 (2010).
29. Cancer Genome Atlas Research Network. Integrated genomic characterization of pancreatic ductal adenocarcinoma. Cancer Cell 32, 185–203.e13 (2017).
30. Cancer Genome Atlas Research Network et al. Integrated genomic characterization of endometrial carcinoma. Nature 497, 67–73 (2013).
31. Cancer Genome Atlas Research Network et al. Integrated genomic characterization of oesophageal carcinoma. Nature 541, 169–175 (2017).
32. Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582 (2015).
33. Cancer Genome Atlas Research Network.
Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013).
34. Verhaak, R. G. W. et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98–110 (2010).
35. Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014).
36. Hu, B. et al. Gastric cancer: Classification, histology and application of molecular pathology. J. Gastrointest. Oncol. 3, 251–261 (2012).
37. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
38. Medico, E. et al. The molecular landscape of colorectal cancer cell lines unveils clinically actionable kinase targets. Nat. Commun. 6, 7002 (2015).
39. Park, J.-G. et al. Characteristics of Cell Lines Established from Human Colorectal Carcinoma. Cancer Res. (1987).
40. Jerby-Arnon, L. et al. A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell 175, 984–997.e24 (2018).
41. Darmanis, S. et al. Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma. Cell Rep. 21, 1399–1410 (2017).
42. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
43. Xu, B. et al. Regulation of endometrial receptivity by the highly expressed HOXA9, HOXA11 and HOXD10 HOX-class homeobox genes. Hum. Reprod. 29, 781–790 (2014).
44. Raines, A. M. et al. Recombineering-based dissection of flanking and paralogous Hox gene functions in mouse reproductive tracts. Development 140, 2942–2952 (2013).
45. Netinatsunthorn, W., Hanprasertpong, J., Dechsukhum, C., Leetanaporn, R.
& Geater, A. WT1 gene expression as a prognostic marker in advanced serous epithelial ovarian carcinoma: an immunohistochemical study. BMC Cancer 6, 90 (2006).
46. Kelly, Z. et al. The prognostic significance of specific HOX gene expression patterns in ovarian cancer. Int. J. Cancer 139, 1608–1617 (2016).
47. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).
48. Wiegand, K. C. et al. ARID1A mutations in endometriosis-associated ovarian carcinomas. N. Engl. J. Med. 363, 1532–1543 (2010).
49. Murray, M. J. et al. LIN28 Expression in malignant germ cell tumors downregulates let-7 and increases oncogene levels. Cancer Res. 73, 4872–4884 (2013).
50. Biton, A. et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep. 9, 1235–1245 (2014).
51. Fair, W. R., Israeli, R. S. & Heston, W. D. Prostate-specific membrane antigen. Prostate 32, 140–148 (1997).
52. Black, J. D., English, D. P., Roque, D. M. & Santin, A. D. Targeted therapy in uterine serous carcinoma: an aggressive variant of endometrial cancer. Womens Health (Lond. Engl.) 10, 45–57 (2014).
53. Yang, S., Thiel, K. W. & Leslie, K. K. Progesterone: the ultimate endometrial tumor suppressor. Trends Endocrinol. Metab. 22, 145–152 (2011).
54. Huszar, M. et al.
Up-regulation of L1CAM is linked to loss of hormone receptors and E-cadherin in aggressive subtypes of endometrial carcinomas. J. Pathol. 220, 551–561 (2010).
55. Kozak, J., Wdowiak, P., Maciejewski, R. & Torres, A. A guide for endometrial cancer cell lines functional assays using the measurements of electronic impedance. Cytotechnology 70, 339–350 (2018).
56. Korch, C. et al. DNA profiling analysis of endometrial and ovarian cell lines reveals misidentification, redundancy and contamination. Gynecol. Oncol. 127, 241–248 (2012).
57. Wu, D. et al. Gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity. Br. J. Cancer 109, 1599–1608 (2013).
58. Walter, V. et al. Molecular subtypes in head and neck cancer exhibit distinct patterns of chromosomal gain and loss of canonical cancer genes. PLoS One 8, e56823 (2013).
59. Adeegbe, D. O. et al. BET Bromodomain Inhibition Cooperates with PD-1 Blockade to Facilitate Antitumor Response in Kras-Mutant Non-Small Cell Lung Cancer. Cancer Immunol Res 6, 1234–1245 (2018).
60. Blaisdell, A. et al. Neutrophils oppose uterine epithelial carcinogenesis via debridement of hypoxic tumor cells. Cancer Cell 28, 785–799 (2015).
61. Fitamant, J. et al. YAP inhibition restores hepatocyte differentiation in advanced HCC, leading to tumor regression. Cell Rep. 10, 1692–1707 (2015).
62. Jia, D. et al. Crebbp loss drives small cell lung cancer and increases sensitivity to HDAC inhibition. Cancer Discov. 8, 1422–1437 (2018).
63. Kress, T. R. et al. Identification of MYC-Dependent Transcriptional Programs in Oncogene-Addicted Liver Tumors. Cancer Res. 76, 3463–3472 (2016).
64. Li, L. et al. GKAP acts as a genetic modulator of NMDAR signaling to govern invasive tumor growth. Cancer Cell 33, 736–751.e5 (2018).
65. Mollaoglu, G. et al.
The Lineage-Defining Transcription Factors SOX2 and NKX2-1 Determine Lung Cancer Cell Fate and Shape the Tumor Immune Microenvironment. Immunity 49, 764–779.e9 (2018).
66. Pan, Y. et al. Whole tumor RNA-sequencing and deconvolution reveal a clinically-prognostic PTEN/PI3K-regulated glioma transcriptional signature. Oncotarget 8, 52474–52487 (2017).
67. Lissanu Deribe, Y. et al. Mutations in the SWI/SNF complex induce a targetable dependence on oxidative phosphorylation in lung cancer. Nat. Med. 24, 1047–1057 (2018).
68. Xu, C. et al. Loss of Lkb1 and Pten leads to lung squamous cell carcinoma with elevated PD-L1 expression. Cancer Cell 25, 590–604 (2014).
69. NCI-Frederick, Frederick, MD. National Laboratory for Cancer Research. The NCI Patient-Derived Models Repository (PDMR). (2019). at
70. Broutier, L. et al. Human primary liver cancer-derived organoid cultures for disease modeling and drug screening. Nat. Med. 23, 1424–1435 (2017).
71. Lee, S. H. et al. Tumor Evolution and Drug Response in Patient-Derived Organoid Models of Bladder Cancer. Cell 173, 515–528.e17 (2018).
72. Ogawa, J., Pao, G. M., Shokhirev, M. N. & Verma, I. M. Glioblastoma model using human cerebral organoids. Cell Rep. 23, 1220–1229 (2018).
73. Ben-David, U. et al. Patient-derived xenografts undergo mouse-specific tumor evolution. Nat. Genet. 49, 1567–1575 (2017).
74. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome.
Nature 458, 719–724 (2009).
75. Balkwill, F. R., Capasso, M. & Hagemann, T. The tumor microenvironment at a glance. J. Cell Sci. 125, 5591–5596 (2012).
76. Lancaster, M. A. & Knoblich, J. A. Organogenesis in a dish: modeling development and disease using organoid technologies. Science 345, 1247125 (2014).
77. Bregenzer, M. E. et al. Integrated cancer tissue engineering models for precision medicine. PLoS One 14, e0216564 (2019).
78. Wang, D. H. & Souza, R. F. Biology of Barrett’s esophagus and esophageal adenocarcinoma. Gastrointest Endosc Clin N Am 21, 25–38 (2011).
79. Lee, J. et al. Tumor stem cells derived from glioblastomas cultured in bFGF and EGF more closely mirror the phenotype and genotype of primary tumors than do serum-cultured cell lines. Cancer Cell 9, 391–403 (2006).
80. Wenger, S. L. et al. Comparison of established cell lines at different passages by karyotype and comparative genomic hybridization. Biosci. Rep. 24, 631–639 (2004).
81. Ben-David, U. et al. Genetic and transcriptional evolution alters cancer cell line drug response. Nature 560, 325–330 (2018).
82. Cooke, S. L. et al. Genomic analysis of genetic heterogeneity and evolution in high-grade serous ovarian carcinoma. Oncogene 29, 4905–4913 (2010).
83. Hristova, V. A. & Chan, D. W. Cancer biomarker discovery and translation: proteomics and beyond. Expert Rev Proteomics 16, 93–103 (2019).
84. Dawson, M. A. & Kouzarides, T. Cancer epigenetics: from mechanism to therapy. Cell 150, 12–27 (2012).
85. Silva, T. C. et al. TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages. [version 2; peer review: 1 approved, 2 approved with reservations]. F1000Res. 5, 1542 (2016).
86. Morgan, M., Obenchain, V., Hester, J. & Pagès, H. SummarizedExperiment: SummarizedExperiment container. (2018).
87. Pavlidis, P. & Noble, W. S. Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol. 2, RESEARCH0042 (2001).
88. Geman, D., d'Avignon, C., Naiman, D. Q. & Winslow, R. L. Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 3, Article19 (2004).
89. Krstajic, D., Buturovic, L. J., Leahy, D. E. & Thomas, S. Cross-validation pitfalls when selecting and assessing regression and classification models. J. Cheminform. 6, 10 (2014).
90. Lipton, Z. C., Elkan, C. & Narayanaswamy, B. Optimal Thresholding of Classifiers to Maximize F1 Measure. Mach. Learn. Knowl. Discov. Databases 8725, 225–239 (2014).
91. Grow, E. J. et al. Intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells. Nature 522, 221–225 (2015).
92. Kolde, R. pheatmap: Pretty Heatmaps. (CRAN, 2019).
93. Wickham, H. ggplot2 - Elegant Graphics for Data Analysis. (Springer-Verlag New York, 2016). doi:10.1007/978-0-387-98141-3
94. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016).
95. Yoshihara, K. et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun. 4, 2612 (2013).
96. Kovalchik, S. RISmed: Download Content from NCBI Databases. (CRAN.R-project, 2017).
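The parameter search summarized in the Figure 1 workflow (set n and m, train on a random two thirds of the TCGA data, assess on the held-out third, repeat five times, and keep the parameter set with the maximum mean AUPRC) can be sketched as follows. This is only an illustration of that loop under stated assumptions: `select_parameters` and the `train_and_score` callback are hypothetical names, and the callback stands in for the actual training process and AUPRC calculation, which are not reproduced here.

```python
import random
from statistics import mean


def select_parameters(samples, param_grid, train_and_score, repeats=5, seed=0):
    """Pick the parameter set with the highest mean held-out score.

    samples: list of training samples (any objects)
    param_grid: list of hashable parameter sets, e.g. (n, m) tuples
    train_and_score: hypothetical callback (params, train, held_out) -> score,
        standing in for "run training process" and "assess performance"
    """
    rng = random.Random(seed)
    mean_scores = {}
    for params in param_grid:              # repeat for each parameter set
        scores = []
        for _ in range(repeats):           # repeat the split several times
            shuffled = samples[:]
            rng.shuffle(shuffled)
            cut = (2 * len(shuffled)) // 3
            train, held_out = shuffled[:cut], shuffled[cut:]   # 2/3 vs 1/3
            scores.append(train_and_score(params, train, held_out))
        mean_scores[params] = mean(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

After the best parameter set is chosen, the workflow in Figure 1 retrains on all TCGA data; that final step is left to the caller in this sketch.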
[Figure 1. Panels A-E. Schematic text: cancer models (cancer cell lines (CCL), patient derived xenograft (PDX), genetically engineered mouse model (GEMM), tumoroids) are scored against cancer types by CancerCellNet; classification score scale from low to high. Training process: labeled RNA-seq data (genes by samples), select n genes, gene pair transform, select m gene pairs, train Random Forest classifier. Parameter selection: (1) set parameters n, m; (2) randomly select 2/3 of TCGA data and run the training process; (3) assess performance on the 1/3 held-out data; (4) repeat steps (2-3) 5 times; (5) repeat steps (1-4) for each parameter set; select the parameter set with maximum mean AUPRC and train on all TCGA data.]

[Figure 2. Panels A-F. Scale label: CCN Score.]

[Figure 3. Panels A-C. Scale label: CCN Score.]

[Figure 4. Panels A-D. Heatmap annotations: general classification, general CCN score, sub-type classification. Sub-type categories: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]

[Figure 5. Panels A-E. Scale label: CCN Score. Heatmap annotations: general classification, general CCN score, sub-type classification. Sub-type categories: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]

[Figure 6. Panels A-E. Scale label: CCN Score. Heatmap annotations: general classification, general CCN score, sub-type classification, genotype. Sub-type categories: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]

[Figure 7. Panels A-E. Scale label: CCN Score. Heatmap annotations: general classification, general CCN score, sub-type classification. Sub-type categories: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]

[Figure 8.]

[Supplemental Figure 1. Panels A-E. Schematic text: training data (RNA-Seq, TCGA; genes by samples); training process: gene pair transform, feature selection, train Random Forest classifier; the CancerCellNet broad class classification (CCN scores) is added on to the gene pairs as additional features.]

[Supplemental Figure 2. Panels A-E.]

[Supplemental Figure 3. Panels A-D.]

[Supplemental Figure 4.]

[Supplemental Figure 5.]

10_1101-2020_10_26_351783 ---- A validated generally applicable approach using the systematic assessment of disease modules by GWAS reveals a multi-omic module strongly associated with risk factors in multiple sclerosis

Tejaswi V.S. Badam1,2†, Hendrik A.
de Weerd1,2†, David Martínez-Enguita2, Tomas Olsson3, Lars Alfredsson3,4, Ingrid Kockum3, Maja Jagodic3, Zelmina Lubovac-Pilav1*, Mika Gustafsson2*

1School of Bioscience, Systems Biology Research Center, University of Skövde, Sweden
2Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Linköping, Sweden
3Department of Clinical Neuroscience, Karolinska Institutet, Center for Molecular Medicine, Karolinska University Hospital, SE-171 76, Stockholm, Sweden
4Institute of Environmental Medicine, Karolinska Institutet, Center for Molecular Medicine, Karolinska University Hospital, SE-171 76, Stockholm, Sweden
†These authors contributed equally to the work.
*These authors share senior authorship.
Corresponding author: Mika Gustafsson (mika.gustafsson@liu.se)

Running Title: Multi-omic modules in multiple sclerosis
Keywords: Benchmark, Multi-omics, Network modules, Multiple sclerosis, Risk factors

SUMMARY: Our benchmark of multi-omic modules and validated translational systems medicine workflow for dissecting complex diseases resulted in a multi-omic module of 220 genes highly enriched for risk factors associated with multiple sclerosis.

bioRxiv preprint doi: https://doi.org/10.1101/2020.10.26.351783; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

ABSTRACT
Background: There are few (if any) practical guidelines for predictive and falsifiable multi-omics data integration that systematically integrate existing knowledge. Disease modules are popular concepts for interpreting genome-wide studies in medicine but have so far not been systematically evaluated and may lead to corroborating multi-omic modules.
Methods: We assessed eight module identification methods in 57 previously published expression and methylation studies of 19 diseases using GWAS enrichment analysis. Next, we applied the same strategy for multi-omics integration of 19 datasets of multiple sclerosis (MS), and further validated the resulting module using both GWAS and risk-factor associated genes from several independent cohorts.
Results: Our benchmark of modules showed that, in immune-associated diseases, modules inferred from clique-based methods were the most enriched for GWAS genes. The multi-omics case study using MS revealed the robust identification of a module of 220 genes. Strikingly, most genes of the module were differentially methylated upon the action of one or several environmental risk factors in MS (n = 217, P = 10^-47) and were also independently validated for association with five different risk factors of MS, which further stressed the high genetic and epigenetic relevance of the module for MS.
Conclusion: We believe our analysis provides a workflow for selecting modules and our benchmark study may help further improvement of disease module methods. Moreover, we also stress that our methodology is generally applicable for combining and assessing the performance of multi-omics approaches for complex diseases.
INTRODUCTION
Complex diseases are the result of disruptions of many interconnected multimolecular pathways, reflected in multiple omics layers of regulation of cellular function, rather than perturbations of a single gene or protein[1]. Systems and network medicine aim to translate observed omics differences in patients using networks, in order to personalize medicine[2].
Importantly, genes that are associated with diseases are more likely to interact with each other than with non-disease-associated genes, forming multi-omics network disease modules[3,4]. Owing to the incompleteness of the underlying multi-omics interactions, the networks are often modeled as effective gene-gene interactions, using, for example, the STRING database[5]. Thus, network modules might be ideal tools for multi-omics analysis. However, the evaluation of performance of different module inference methods remains a poorly understood topic, which creates the need for transparent evaluation of these methods based on objective benchmarks across various diseases and omics. Genomic concordance has been suggested as a multi-omics validation principle[4,6], i.e., modules derived from one omic, such as gene expression or DNA methylation, should be enriched for disease-associated single nucleotide polymorphisms (SNPs).
The variety of algorithms that have been proposed and applied for identification of disease modules can be categorized into two main groups. On the one hand, there are methods which rely purely on clustering of the genes in relevant disease networks[7]. On the other hand, there are algorithms which make use of disease-associated molecules or genetic loci to reveal disease modules that correlate with disease function, such as the disease module detection (DIAMOnD) algorithm[8], clique-based methods[9,10] and weighted gene co-expression network analysis (WGCNA)[11]. The data-derived information can either be differentially expressed genes or differentially correlated or co-expressed genes. Methods following the former approach were recently benchmarked by a metric utilizing genomic concordance within the DREAM consortia[12]. However, so far, algorithms from the latter group have not been benchmarked.
In this study we analyzed, assessed, and compared the performance of eight of the most popular methods for disease module analysis using the R package MODifieR[13] on 19 different diseases including 47 expression and ten methylation datasets. We assessed the performance of the methods using genome-wide association study (GWAS) enrichment analysis from the summary statistics of all assayed SNPs, similarly to DREAM[12]. The resulting workflow provided a systematic procedure for selecting the best method for each disease and set the stage for method development in the disease module area. Moreover, it allowed the predictive assessment of combining multiple datasets across several omics using GWAS, which we tested in multiple sclerosis (MS), a heterogeneous complex disease. Briefly, we derived multi-omic modules in a stepwise optimization of GWAS enrichment from transcriptomic and methylomic analyses of MS. We further evaluated the identified multi-omic MS module of 220 genes for its enrichment across DNA methylation studies of eight known lifestyle-associated risk factors of MS. Additionally, we validated the significantly enriched risk factors in an independent DNA methylation MS study, which indeed showed a very strong and significant MS enrichment for both module genes and risk factor associations. In summary, we provide a robust multi-omics strategy that can be used to disentangle networks of affected genes in complex diseases from both genetic and environmental levels.
MATERIALS AND METHODS
Benchmark data
A total of 47 publicly available datasets for the transcriptomic benchmark and ten publicly available datasets for the methylomic benchmark were used. To avoid bias due to subtypes of diseases and drug treatments, we searched for datasets that have only patient and control samples, and that are available for download from the GEO database. We categorized the datasets into seven distinct disease types based on the disease-trait type associations used in Choobdar et al.[12], i.e., autoimmune, cardiovascular, glycemic, inflammatory, neurodegenerative, and psychiatric and social disorders. A total of 19 complex diseases were used in the transcriptomic benchmark analysis, while six complex diseases were used in the methylation benchmark analysis. The methylation benchmark diseases belong to inflammatory, autoimmune, and glycemic disease types.
MS use case data
A total of 14 publicly available and one non-publicly available transcriptomic and methylomic MS-related datasets were used in the MS multi-omics integration use case. In general, every dataset in the MODifieR benchmark was also used in the MS use case, with exceptions according to certain criteria. The inclusion of transcriptomic MS datasets followed the criteria: 1) The largest dataset by sample number, per tissue, is shown in the MODifieR benchmark; 2) Replication cohorts are not included in the MS use case. Criteria for inclusion of methylomic MS datasets were the following: 1) The largest dataset by sample number, per tissue or cell type, is included in the MODifieR benchmark; 2) A single dataset for every cell-specific tissue was included in the benchmark; 3) Methylation studies that reported using whole blood as sample tissue were excluded from the MS use case, due to the high heterogeneity of this type of data.
For the additional independent validation, we utilized the methylation microarray analysis of 279 blood samples from Kular et al.[25]. For each of these MS patients (nMS = 139) and healthy controls (nHC = 140), we also collected their lifestyle-associated risk factors from questionnaires that were part of the Epidemiological Investigation of Multiple Sclerosis (EIMS) study. Those factors were smoking status, prior EBV infection, sunbathing, nightshift work, alcohol consumption, as well as phenotypic features (age, sex, BMI at age of 20).
Pre-processing and quality control of risk factor methylation data
DNA methylation datasets were downloaded from GEO as raw IDAT files, when available, or matrices of beta values. Pre-processing of the data was performed using the Chip Analysis Methylation Pipeline (ChAMP) R package[14], version 2.16.2. Default parameters were used for probe and sample filtering. Probes with a detection P-value above 0.01, probes with a fraction of failed (bead count less than 3) samples over 0.05, non-CpG probes, SNP-related probes, multi-hit probes, and probes located on chromosomes X and Y were removed. Samples with a proportion of failed (NA) probe P-values over 0.1 were also removed from the analysis. Post-filtering imputation of NA values was conducted on the beta matrices, with default parameters ("combine" method, k = 5, probe cutoff = 0.2, sample cutoff = 0.1). Filtered imputed matrices were normalized applying the Beta-Mixture Quantile dilation (BMIQ) normalization method[15], including correction of Type-I and Type-II probe effects.
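The detection thresholds described above can be illustrated with a minimal sketch. This is not ChAMP's implementation: the function name `filter_by_detection`, the nested-dictionary layout, and the order of operations (samples first, then probes) are assumptions made for illustration. Only the 0.01 and 0.1 cutoffs are taken from the text.

```python
def filter_by_detection(det_p, p_cutoff=0.01, sample_cutoff=0.1):
    """Return (kept_probes, kept_samples) for a detection P-value table.

    det_p maps probe ID -> {sample ID -> detection P-value}.
    Hypothetical helper; the data layout is an assumption.
    """
    probes = list(det_p)
    samples = list(next(iter(det_p.values())))

    # Drop samples whose proportion of failed probes exceeds sample_cutoff.
    kept_samples = [
        s for s in samples
        if sum(det_p[p][s] > p_cutoff for p in probes) / len(probes) <= sample_cutoff
    ]
    # Drop probes that fail detection in any remaining sample.
    kept_probes = [
        p for p in probes
        if all(det_p[p][s] <= p_cutoff for s in kept_samples)
    ]
    return kept_probes, kept_samples


# Toy table: 10 probes x 3 samples, all passing by default.
toy = {f"cg{i}": {"s1": 0.001, "s2": 0.001, "s3": 0.001} for i in range(10)}
toy["cg0"]["s2"] = 0.5  # s2 fails 1 of 10 probes (0.1, not over the cutoff)
toy["cg1"]["s3"] = 0.5  # s3 fails 2 of 10 probes (0.2, over the cutoff)
toy["cg2"]["s3"] = 0.5

kept_probes, kept_samples = filter_by_detection(toy)
```

In this toy table, sample s3 is removed first, so probes cg1 and cg2 survive, while cg0 is dropped because it fails in the retained sample s2.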
Data quality was assessed by producing multi-dimensional scaling (MDS) plots of the top 1,000 most variable positions per sample, density plots for the distribution of beta values, and hierarchical clustering of samples, before and after normalization. Singular value decomposition (SVD) was used to detect the most significant components of variation in the data. Unwanted sources of variation in the normalized data were corrected using ComBat batch effect correction[16].
Module Identification
The MODifieR[13] R package offers nine different methods for producing disease modules, of which we included all but Clique SuM exact, as it is highly similar to Clique SuM. The included methods produce modules based on the provided omics input and background network and do not include prioritization of pathway association. MODifieR methods used for module identification throughout this study are listed in Supplementary Table 3. For the methods that require a network, we used the human PPI network from the STRING[5] database version 11, consisting of 11,295,036 interactions among 18,746 unique genes/proteins. We filtered the network to have high confidence interactions by using the cutoff > 900 to reduce the number of false positives, resulting in a subset of 631,782 interactions between 12,123 unique genes/proteins. For co-expression methods, the network is computed within the method algorithm from the gene expression matrix. In case of the benchmark analysis, we used a stringent cutoff of score > 900, so that the runs were not computationally intensive. For the MS use case benchmark, we used the network combined score cutoff > 700.
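The score-based network filtering described above amounts to keeping only edges whose combined score exceeds a cutoff. The following is a minimal sketch under stated assumptions: the `Edge` type, the function names, and the toy edges are hypothetical, and the strict ">" comparison follows the wording "cutoff > 900" and "cutoff > 700" in the text.

```python
from typing import Iterable, List, NamedTuple, Set


class Edge(NamedTuple):
    protein_a: str
    protein_b: str
    combined_score: int  # assumed to be on a 0-1000 scale


def filter_network(edges: Iterable[Edge], cutoff: int) -> List[Edge]:
    """Keep edges whose combined score is strictly above the cutoff."""
    return [e for e in edges if e.combined_score > cutoff]


def unique_nodes(edges: Iterable[Edge]) -> Set[str]:
    """Collect the distinct genes/proteins that remain in the network."""
    return {node for e in edges for node in (e.protein_a, e.protein_b)}


toy_edges = [
    Edge("A", "B", 950),
    Edge("B", "C", 900),  # exactly at the cutoff: dropped by the strict ">"
    Edge("C", "D", 720),
    Edge("D", "E", 400),
]

benchmark_net = filter_network(toy_edges, 900)  # stringent benchmark setting
use_case_net = filter_network(toy_edges, 700)   # MS use case setting
```

A stricter cutoff trades coverage for confidence: in the toy example only one edge survives at 900, while three survive at 700.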
The processed matrix for each dataset and their respective phenotypic information were downloaded from GEO. The input object is prepared using the create_input_microarray function from the MODifieR package, which is then used for creating the modules. The input function fits a linear model using limma to compare patients versus controls and obtain the differentially methylated or expressed genes. A dynamic cutoff of 5% in the differentially methylated or expressed genes is applied for input seed genes for the methods that require seed genes.
Differential methylation analysis of risk factor data
Differentially methylated probes (DMPs) were found by fitting a linear model to the data using the limma R package[17], version 3.42.2, implemented in the ChAMP function champ.DMP. P-values were adjusted for multiple testing using Benjamini-Hochberg False Discovery Rate (FDR) correction. Differentially methylated genes (DMGs) were obtained and annotated using the org.Hs.eg.db R package, version 3.10.0. DMG lists were cross-checked against the STRING database version 11 PPI network used for module identification in the MS multi-omics approach (high confidence interactions, combined score > 700). DMGs that were not present in the PPI network were removed. In case of the additional MS validation dataset, a linear mixed effect model with risk factors (age, sex, BMI at age of 20, smoking, alcohol consumption, sun exposure, night shift work, contact with organic solvents) as categorical covariates was implemented to find the differentially methylated genes after the preprocessing step, as described in the preprocessing section of the methods. Since all the patients were EBV positive, we did not include EBV status in the linear mixed effect model.
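The Benjamini-Hochberg adjustment applied to the differential methylation P-values can be sketched as follows. This is a generic illustration of the step-up procedure, not the limma or ChAMP code path; the function name `benjamini_hochberg` and the toy P-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P-values in the original order.

    Each P-value is scaled by m / rank, and monotonicity is enforced by
    taking a running minimum from the largest P-value downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.5]
toy_adj = benjamini_hochberg(toy_p)
significant = [p <= 0.05 for p in toy_adj]  # illustrative 5% FDR threshold
```

The running minimum is what distinguishes the adjusted values from a plain rescaling: in the toy input, the second and third tests end up with the same adjusted value.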
Validation of modules
The final modules produced from each single algorithm and the consensus were evaluated using Pascal[18] (Pathway scoring algorithm). Pascal implements a fast and rigorous gene scoring and pathway enrichment pipeline that can be run on a local machine. The SNP values are converted to gene scores by computing pairwise SNP-by-SNP correlations and obtaining Z-scores from their distribution. These obtained gene scores are fused with the pathway enrichment analysis to recompute a chi-square P-value for the given set of module genes. Thus, the obtained chi-square P-value serves as the significance of the module in its enrichment of the disease-associated pathway gene loci. A combined P-value was computed using Fisher's method[19] for each of the methods, diseases, and datasets, to rank the performance of the modules in each criterion.
Integration of MS single-omic modules
Clique SuM was ranked as the best performing method on average for both transcriptomic and methylomic data, according to the MS GWAS enrichment of the modules calculated by Pascal. Therefore, significant Clique SuM modules (P < 0.05) were selected for further analysis (nine transcriptomic and four methylomic modules). Consensus modules were generated across each omic by applying a module count-based method, where the criterion for gene inclusion in the consensus is its presence in a certain number of single-method modules. To balance the weight of each omic in the multi-omics integration, the top four significant modules per omic were used to create each consensus (Fig. 4a, b). Single-omic Clique SuM consensus modules were ranked again by GWAS enrichment, and the best performing consensus per omic was selected for integration into the multi-omics module.
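The count-based consensus described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' code: the function name `consensus_module`, the toy gene sets, and the thresholds used here are assumptions, and the final intersection stands in for the integration of the two single-omic consensus modules.

```python
from collections import Counter


def consensus_module(modules, min_count):
    """Genes present in at least `min_count` of the given modules."""
    counts = Counter(gene for module in modules for gene in set(module))
    return {gene for gene, n in counts.items() if n >= min_count}


# Toy stand-ins for the top four modules per omic (hypothetical gene names).
transcriptomic_modules = [
    {"G1", "G2", "G3"},
    {"G2", "G3", "G4"},
    {"G3", "G4", "G5"},
    {"G2", "G3", "G5"},
]
methylomic_modules = [
    {"G3", "G6"},
    {"G3", "G7"},
    {"G2", "G8"},
    {"G9"},
]

transcriptomic_consensus = consensus_module(transcriptomic_modules, min_count=3)
methylomic_consensus = consensus_module(methylomic_modules, min_count=2)

# Multi-omics module as the intersection of the two single-omic consensuses.
multi_omic = transcriptomic_consensus & methylomic_consensus
```

In practice the count threshold would be chosen per omic by comparing the GWAS enrichment of each candidate consensus, as the text describes; the values 3 and 2 here are only placeholders.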
Enrichment analyses of the MS multi-omics module
Disease enrichment analysis of the multi-omics module was performed by Fisher's exact test, with a significance threshold of P < 0.05. MS-associated genes were obtained from the gene-disease association summary provided by DisGeNET database 6.0[20]. All genes with a known association to the disease "multiple sclerosis" (Unified Medical Language System unique identifier C0026769) were considered MS-associated genes (n = 1,105). Pathway enrichment analysis was carried out using the function enrichKEGG from the clusterProfiler R package[21], version 3.14.3. P-values were adjusted for multiple testing using Benjamini-Hochberg FDR correction, with a significance threshold of adj. P < 0.05. Enrichment of the multi-omics module in MS risk-factor-associated genes was performed by Fisher's exact test, with a significance threshold of P < 0.05. To provide a uniform comparison of MS risk factor-associated genes across datasets, the module was tested for enrichment in the top 1,000 DMGs (with at least P < 0.05) obtained from the differential methylation analysis with ChAMP for each risk factor dataset.
Representation of the MS multi-omics module
Experimentally validated interactions for the multi-omics module genes were obtained from STRING database version 11 (experimental score > 700) and imported into Cytoscape[22] version 3.7.2. To determine representative functional clusters of module genes, overrepresented Gene Ontology (GO) Biological Process (BP) terms in the module were found using BiNGO[23] version 3.0.4, with Benjamini-Hochberg FDR for multiple testing correction, and a significance threshold of adj. P < 0.05.
Then, enriched GO terms with adj. P < 1 x 10^-10 were summarized using the REVIGO[24] server tool (medium allowed similarity = 0.7) and categories of interest were selected by uniqueness (>= 80 %), dispensability (>= 50 %), and frequency (<= 10 %) criteria. Further manual assessment was performed to group similar terms with an adequate number of genes in the network.
RESULTS
A benchmark comparing 337 transcriptionally derived disease modules from 19 different diseases.
We compiled a benchmark source of disease modules and summary statistics of GWAS datasets from 19 well-powered case-control studies (Supplementary Table 1), some of which were previously used in the DREAM topological disease module challenge[12]. For these datasets we assessed modules using the same metric as in the recent DREAM study[12], based on the pathway scoring algorithm (Pascal)[18]. For each disease we compiled one to five publicly available transcriptomic datasets considering both easily accessible tissues (e.g. blood) and target tissues, thereby covering 47 transcriptomic datasets in total (Fig. 1a). Modules were created using eight different methods from MODifieR[13]. In addition, we also tested if genes detected by several methods, hereafter called consensus module genes, had higher enrichment scores than single-method module genes. Enrichment scores for the non-empty modules (n = 337) from this analysis were summarized for each method and dataset (Fig. 2a). In total, we found significantly GWAS-enriched modules in 17.8% (60/337) of the single-method modules and 25.5% (12/47) of the non-empty consensus modules that combined at least three methods as a criterion.
These numbers seemed higher than expected, which might have been a consequence of the same GWAS being used to evaluate multiple transcriptomic datasets of the same disease. Hence, we aggregated scores of the same disease and method as meta P-values (see Methods). Out of the 152 possible disease-method combinations, 18% of the pairs showed a significant GWAS Pascal enrichment, which is more than expected by chance (n = 27, P = 1.0 x 10^-8). The most enriched method was Clique SuM, which showed significant enrichment in seven out of 19 diseases (binomial test P = 2.3 x 10^-5). Many methods exhibited strong enrichments in coronary artery disease (CAD), type 2 diabetes, multiple sclerosis (MS), rheumatoid arthritis (RA), and the inflammatory bowel diseases (IBD), ulcerative colitis (UC) and Crohn's disease (CD), while no significant enrichments were found for asthma, hepatitis C, type 1 diabetes, narcolepsy, Parkinson's disease, or for any psychiatric and social diseases. If we instead ranked methods based on their respective module GWAS enrichment, Clique SuM showed significant association in 34% (16/47) of the modules corresponding to seven different diseases, followed by consensus modules identified by two out of three methods. Lastly, DIAMOnD and co-expression-based methods all achieved significant results, although worse than Clique SuM.
Next, we tested the impact of network centrality and module size as potential confounding factors of the applied performance metric. We found a significant but very modest correlation for module size (Fig. 2c, Spearman rho = 0.165, P = 2.3 x 10^-3), and a non-significant correlation for interactome centrality (Fig.
2b, rho = 0.068, P = 0.21). Thus, it is meaningful to compare results across modules that differ in those properties. In summary, we found that the Clique SuM method resulted in the highest disease enrichment for most diseases, while not producing significant modules for others, such as type 2 diabetes, where co-expression-based methods and DIAMOnD scored best. In general, we observed stronger enrichments for inflammatory diseases and weaker results for psychiatric and social diseases. Considering that the transcriptomic modules showed that Clique SuM was the best performing method and that the cardiovascular and inflammatory diseases were the most enriched within the Clique SuM modules, we wanted to test whether this was true for methylomic data as well.
A benchmark comparing 72 methylation-based disease modules from six different diseases using GWAS.
Following the same logic as the transcriptomic benchmark, we performed a similar benchmark study for methylation modules. We collected ten datasets from three different disease categories, including six complex diseases, and ran the eight MODifieR methods on them (Fig. 1a). In addition, we constructed consensus modules for each of the datasets. Modules were then tested for GWAS enrichment using Pascal. Inspecting the overall performance, we found nine single-method modules with a significant GWAS enrichment (9/72, 11.8%). Though this might be due to disease and cell type heterogeneity, the enrichment is more than expected by chance (P = 9.6 x 10^-3). Interestingly, the inflammatory diseases such as MS and UC showed a more significant GWAS enrichment. Considering that the evaluation of module performance by GWAS enrichment may be biased due to differences in module sizes and interactome centrality, we again assessed the correlation between these values. We found a significant correlation between GWAS enrichment and module size (Fig.
3c, rho = 0.235, P = 0.046) and a non-significant correlation between GWAS enrichment and interactome centrality (Fig. 3b, rho = 0.190, P = 0.109). We found that 12.5% of the disease-method combinations yielded significant GWAS enrichment, which is more than expected from an independent random selection of modules (Fisher's exact test P = 0.031, n = 6). The highly enriched disease modules belong to MS, UC and CD. Two out of the six diseases showed significant GWAS enrichment by using the Clique SuM modules (P = 0.032). In summary, the Clique SuM method resulted in a more significant GWAS enrichment for most diseases also for the methylomic benchmark.
Multi-omics approach revealed a module enriched for MS-associated genes.
Considering genomic concordance as the guiding principle for the modules that show enrichment for GWAS SNPs, differentially methylated genes and differentially expressed genes, we further wanted to evaluate multiple datasets of one specific disease, i.e., MS. We compiled 11 MS transcriptomic datasets and nine methylation comparisons (Supplementary Table 2) from GEO which satisfy the pre-defined dataset criteria (see Methods). For each dataset we implemented the pipeline for module identification and scoring shown in Fig. 1b. We evaluated each module using MS SNP enrichment analysis and selected the most enriched modules per omic from this metric. This analysis again showed that Clique SuM yielded by far the highest average enrichment score (meta P = 3.2 x 10^-12) and was significantly enriched (P < 0.05) in 9/11 transcriptomic datasets (Fig. 4a) and 4/9 of the methylation datasets (Fig. 4b).
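The meta P-values quoted here aggregate per-dataset P-values with Fisher's method (see Methods). A minimal sketch of that combination is given below, assuming independent tests; the function name `fisher_combined_p` and the example inputs are illustrative and are not values from this study. The chi-square tail probability uses the closed form that holds for an even number of degrees of freedom, so no external statistics library is needed.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one P-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")

    half_stat = -sum(log(p) for p in p_values)  # equals statistic / 2
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return exp(-half_stat) * series


single = fisher_combined_p([0.05])          # one test: returned unchanged
pair = fisher_combined_p([0.05, 0.05])      # two borderline tests combined
mixed = fisher_combined_p([0.01, 0.2, 0.6])
```

With a single input the method returns the input itself, which is a convenient sanity check; two borderline results of 0.05 combine to a value below 0.02.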
From the significant modules generated by Clique SuM, we chose the top four modules from each of the gene transcription and methylation sets, and prioritized genes detected in modules from multiple datasets in each omic. This analysis showed that the strongest MS SNP enrichment was found for genes in at least three out of four transcriptomic modules (n = 1,552; P = 6.0 x 10^-7) and two out of four methylomic modules (n = 324, P = 1.5 x 10^-6). Next, we used the same principle to combine these two and found that the intersection between the gene transcription and methylation consensus resulted in a module (n = 220 genes, Fig. 4) enriched for MS-associated genes (75/220, P < 2.2 x 10^-16, OR = 7.8) and with the highest GWAS enrichment (P = 8.8 x 10^-9), which we hereafter refer to as the multi-omics MS module.
The multi-omics MS module was enriched in genes associated with major MS pathways.
As we used GWAS enrichment as a selection criterion, the high GWAS enrichment of the final module was partly expected, which led us to analyze its biological functions and their potential epigenetic associations to MS. First, pathway enrichment analysis showed that the multi-omics module genes
are significantly involved in several inter-linked immune-related pathways, most of which have been previously associated with MS, including the T cell receptor[25] (adjusted P = 3.6 x 10^-47), PI3K/Akt[26] (P = 4.6 x 10^-35), ErbB[27] (P = 7.7 x 10^-32), Fc epsilon RI[28] (P = 8.3 x 10^-30), chemokine[29,30] (P = 2.6 x 10^-28), MAPK[31,32] (P = 2.0 x 10^-25), and B cell receptor[32] (P = 3.9 x 10^-19) signaling pathways; Th17 (P = 9.6 x 10^-29), and Th1 and Th2 (P = 6.9 x 10^-19) cell differentiation[33]; natural killer cell mediated cytotoxicity (P = 1.6 x 10^-27); and leukocyte transendothelial migration (P = 3.9 x 10^-20), which indeed supports their relevance in MS. Interestingly, the module was also highly enriched in morphogenetic and neurogenetic signaling pathways, such as the neurotrophin (adjusted P = 1.3 x 10^-36), Ras (P = 1.4 x 10^-36), Rap1 (P = 2.2 x 10^-35), vascular endothelial growth factor (VEGF, P = 1.7 x 10^-27), FoxO (P = 3.6 x 10^-27), and mTOR (P = 4.1 x 10^-14) signaling pathways; and in growth hormone synthesis, secretion and action (P = 6.6 x 10^-31).
The multi-omics MS module was enriched in genes associated with five known environmental MS risk factors validated in an independent cohort.
Second, from a literature study[34,35] we found nine environmental MS risk factors of varying evidence for which we could identify methylation studies in healthy controls. For each of these risk factors we derived the top 1,000 differentially methylated genes (DMGs) and tested their enrichment with the module. Intriguingly, the module was significantly enriched for genes associated with five risk factors (Fig.
5b), which included the top associated risk factors, i.e., Epstein-Barr virus (EBV) infection (Fisher's exact test P = 1.5 x 10^-3, OR = 2.1) and smoking (P = 1.2 x 10^-4, OR = 2.3), as well as low sun exposure (P = 1.2 x 10^-4, OR = 2.3), high BMI (P = 0.023, OR = 1.7) and alcohol consumption (P = 2.9 x 10^-4, OR = 2.2). Then, we asked whether these putative gene-risk factor associations could be validated using an independent omics dataset with paired risk factor associations. For this purpose, we utilized methylation arrays of peripheral blood from 139 MS patients and 140 controls, which have been described previously[36]. In this analysis we also considered risk factor associations for each individual, including age, sex, BMI at age of 20, smoking, alcohol consumption, sun exposure, night shift work, and contact with organic solvents. This enabled analysis of DMGs for the MS and risk factor status as covariates in linear mixed effect analysis. Indeed, the module genes were highly significantly enriched for MS (n = 217; permutation test P = 1.2 x 10^-47), but also for all the tested risk factors (EBV was not included; see Methods) and non-significantly associated to age and sex, having 104-135 of the genes in each factor (3.9 x 10^-8 < P < 0.013; Fig. 5b). Combining all these results, we found 90 of the 220 module genes to be associated with a risk factor in both of the risk factor studies; 25 genes were associated with two risk factors, and seven genes were associated with three risk factors (CSK, PRKCA, PRKCZ, RUNX1, RUNX3, STAT5A, and SYNJ2) (Fig. 5c).
These associations suggest that the multi-omics module is capturing a key disease network with both genetically and epigenetically driven alterations, thereby providing the possibility to use it to identify potential novel biomarkers or therapeutic targets for MS.
DISCUSSION
The analysis of case-control data in the context of networks has gained increased interest to detect consistent robust gene signatures of individual diseases. The application of disease modules might vary for different researchers, but here we systematically aimed at the detection of disease genes supported by genetic association. For this purpose, our study of the transcriptome and methylome profiles of 19 diseases showed significant GWAS enrichments for several inflammatory and heart diseases, while psychiatric disorders showed no enrichments and might not be suitable for GWAS validation of modules, potentially due to differences in affected tissue types and sampling points. However, analysis of the significant results showed that methods based on differentially expressed cliques in the protein-protein interaction network demonstrated the strongest enrichments (highest scoring for Clique SuM), while those based primarily on correlations, like WGCNA, showed weak enrichments. A potential reason for this could be that GWAS has been shown to be mostly associated with the central genes of the protein-protein interaction (PPI) network, but our analysis demonstrated that the correlation between GWAS enrichment and centrality was non-significant.
We also tested whether there was an improvement using consensus approaches that counted how often a gene was reported across multiple methods, but found that this did not increase performance. Moreover, we tested the same strategy on a set of inflammatory, glycemic, and autoimmune methylation datasets and found similar results. We would like to emphasize that, rather than scoring a single best working method, our result is a pipeline for evaluating modules using independent high-throughput enrichments. The work on transcription and methylation datasets suggested that MS is a disease highly enriched for GWAS, and we therefore tested whether increased enrichments could be derived by their integration. We found 20 publicly available datasets and ran the assessment for both omics independently, which again showed Clique SuM to score highest. We then tested whether improved results could be obtained by combining modules from multiple datasets of these two omics, using consensus modules from Clique SuM. This resulted in a module of 220 genes highly enriched for GWAS (P = 8.8 x 10^-9). The multi-omic module was highly enriched in immune-associated pathways, such as T cell and B cell receptor signaling, Th1/Th2 differentiation, or leukocyte transendothelial migration. These results conform with the current hypothesis that MS is mediated by an autoreactive response of CD4+ T cells against myelin surrounding neuronal axons, preceded by their migration across the blood-brain barrier (BBB)[37]. This autoproliferation of brain-targeting Th1 cells has been shown to be driven by memory B cells, in a process mediated by HLA-DR15[38]. In addition, another enriched pathway was VEGF signaling.
MS patients present high serum VEGF levels, which are related to pro-inflammatory functions and can alter the permeability of the BBB[39]. As GWAS was used for method prioritization, we asked whether modules could instead be validated using epigenetic and lifestyle risk factor genes that we identified as associated with MS. With this aim, we compiled a set of publicly available data from omics studies of these risk factors in healthy individuals. This analysis demonstrated that five out of eight risk factors were enriched in our module. To validate this environmental assessment based on public-domain risk factor associations, we used an independent methylome study of MS comprising environmental data for each MS patient and healthy individual. This analysis showed a remarkable enrichment: 217 of the 220 module genes were differentially methylated genes for MS (P = 1.2 x 10^-47), and a majority were associated with the tested risk factors. In contrast to previously known community challenges, in our study we not only used the topological properties of the network, but also combined the methods with an omics-based input to uncover the disease modules that might be dysregulated at each omics level, contributing to the diverse causative mechanisms behind complex diseases. Although using the PPI network as background may lead to a certain knowledge bias, this kind of benchmark allowed us to look at the relevant risk factors. In our assessment of the disease modules, methods such as Clique SuM and DIAMOnD performed better than the community-based consensus predictions. In summary, our study provides a practical integrative workflow that enables system-level analysis of heterogeneous diseases in terms of multi-omics disease modules, as well as their validation using both disease-specific GWAS and risk factor enrichment. We believe that this analysis validates our integrated use of datasets and suggests a pipeline that could readily be tested at least in other autoimmune and cardiovascular diseases. Lastly, our study did not aim to optimize hyperparameters for individual disease modules; instead, we used default values when possible and relied on the implementations of the methods in the MODifieR R package[13]. However, such optimization might be an important task for specific diseases, and our code and processed datasets are available at GitLab (https://gitlab.com/Gustafsson-lab/modifier-benchmark). In future work, this approach can be expanded to include diverse and context-specific networks to determine whether our multi-omics modules are able to capture various other levels of granularity.

DECLARATIONS
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Not applicable
AVAILABILITY OF DATA AND MATERIALS
The data used for the transcriptomic and methylation benchmarks were downloaded from GEO. The disease-specific GWAS files were downloaded from the latest Pascal version. The processed data for analysis are available at https://gitlab.com/Gustafsson-lab/modifier-benchmark. The risk factor (EIMS) data will be made available on request.
The R package MODifieR is available on GitLab: https://gitlab.com/Gustafsson-lab/MODifieR; the code used for the benchmark analysis and risk factor analysis is available on GitLab: https://gitlab.com/Gustafsson-lab/modifier-benchmark; the latest Pascal version: https://www2.unil.ch/cbg/index.php?title=Pascal.
COMPETING INTERESTS
The authors declare no competing interests.
FUNDING
This work was supported by the Swedish Research Council (grant 2015-03807 (M.G.), grant 2018-02638 (M.J.)), the Swedish Foundation for Strategic Research (grant SB16-0095 (M.G.)), the Center for Industrial IT (CENIIT) (M.G.), European Union Horizon 2020/European Research Council Consolidator grant (Epi4MS, grant 818170 (M.J.)), the Knut and Alice Wallenberg Foundation (grant 2019.0089 (M.J.)) and the Knowledge Foundation (grant 20170298 (Z.L.)). Computational resources were granted by the Swedish National Infrastructure for Computing (SNIC; SNIC 2020/5-177, LiU-2018-12 and LiU-2019-25).
AUTHOR CONTRIBUTIONS
T.V.S.B. compiled the necessary data for the benchmark analysis. H.A.W. performed the transcriptomic benchmark analysis. T.V.S.B. performed the methylation benchmark analysis. D.M.E. and H.A.W. performed the MS use case analysis. D.M.E. performed the risk factor analysis. M.J., I.K., T.O., and L.A. provided the raw data and collected the associated risk factor data for the independent methylation dataset. T.V.S.B. performed the independent validation dataset analysis. T.V.S.B. and D.M.E. collectively made the plots and figures for the manuscript. M.G. and Z.L. designed the study. T.V.S.B. and D.M.E. prepared the manuscript.
All authors discussed the results and commented on the manuscript at all stages.

REFERENCES
1. Naylor S, Chen JY. NIH Public Access. Natl Institutes Heal. 2011;7:275–89.
2. Santiago JA, Bottero V, Potashkin JA. Dissecting the Molecular Mechanisms of Neurodegenerative Diseases through Network Biology. Front Aging Neurosci [Internet]. 2017;9:1–13. Available from: http://journal.frontiersin.org/article/10.3389/fnagi.2017.00166/full
3. Barabási AL, Gulbahce N, Loscalzo J. Network medicine: A network-based approach to human disease. Nat Rev Genet [Internet]. Nature Publishing Group; 2011;12:56–68. Available from: http://dx.doi.org/10.1038/nrg2918
4. Gustafsson M, Nestor CE, Zhang H, Barabási A-L, Baranzini S, Brunak S, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med [Internet]. 2014;6:82. Available from: http://genomemedicine.biomedcentral.com/articles/10.1186/s13073-014-0082-6
5. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Oxford University Press; 2019;47:607–13.
6. Lamparter D, Lin J, Kutalik Z, Choobdar S, Hescott B, Tomasoni M, et al. Open Community Challenge Reveals Molecular Network Modules with Key Roles in Diseases. SSRN Electron J. 2018;1–63.
7. Schadt EE. Molecular networks as sensors and drivers of common human diseases. Nature [Internet]. 2009;461:218–23. Available from: http://www.nature.com/doifinder/10.1038/nature08454
8. Ghiassian SD, Menche J, Barabási AL. A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome. Rzhetsky A, editor. PLoS Comput Biol [Internet]. 2015;11:e1004120. Available from: https://dx.plos.org/10.1371/journal.pcbi.1004120
9. Hellberg S, Eklund D, Gawel DR, Köpsén M, Zhang H, Nestor CE, et al. Dynamic Response Genes in CD4+ T Cells Reveal a Network of Interactive Proteins that Classifies Disease Activity in Multiple Sclerosis. Cell Rep. 2016;16:2928–39.
10. Wang H, Rogers G, Benson M, Jarvelin M-R, Chavali S, Ramasamy A, et al. Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. Genome Biol. 2012;13:R46.
11. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9.
12. Choobdar S, Ahsen ME, Crawford J, Tomasoni M, Fang T, Lamparter D, et al. Assessment of network module identification across complex diseases. Nat Methods. 2019;16:843–52.
13. de Weerd HA, Badam TVS, Martínez-Enguita D, Åkesson J, Muthas D, Gustafsson M, et al. MODifieR: an Ensemble R Package for Inference of Disease Modules from Transcriptomics Networks. Bioinformatics. 2020;1–2.
14. Tian Y, Morris TJ, Webster AP, Yang Z, Beck S, Feber A, et al. ChAMP: updated methylation analysis pipeline for Illumina BeadChips. 2017;33:3982–4.
15. Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, et al. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. 2013;29:189–96.
16. Johnson WE, Li C. Adjusting batch effects in microarray expression data using empirical Bayes methods. 2007;118–27.
17. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. 2015;43.
18. Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput Biol. 2016;12:1–20.
19. Mosteller F, Fisher RA. Questions and Answers #14. Taylor & Francis, Ltd. on behalf of the American Statistical Association; 1948;2:30–1. Available from: http://www.jstor.org/stable/2681650
20. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2020;48:D845–55.
21. Yu G, Wang LG, Han Y, He QY. ClusterProfiler: An R package for comparing biological themes among gene clusters. Omi A J Integr Biol. 2012;16:284–7.
22. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: A Software Environment for Integrated Models. Genome Res [Internet]. 1971;13:426. Available from: http://ci.nii.ac.jp/naid/110001910481/
23. Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks. 2005;21:3448–9.
24. Supek F, Bošnjak M, Škunca N, Šmuc T. Revigo summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011;6.
25. Carbone F, De Rosa V, Carrieri PB, Montella S, Bruzzese D, Porcellini A, et al. Regulatory T cell proliferative potential is impaired in human autoimmune disease. Nat Med. 2014;20:69–74.
26. Mammana S, Bramanti P, Mazzon E, Cavalli E, Basile MS, Fagone P, et al. Preclinical evaluation of the PI3K/Akt/mTOR pathway in animal models of multiple sclerosis. Oncotarget. 2018;9:8263–77.
27. Holley JE, Gveric D, Newcombe J, Cuzner ML, Gutowski NJ. Astrocyte characterization in the multiple sclerosis glial scar. Neuropathol Appl Neurobiol. 2003;29:434–44.
28. Pedotti R, DeVoss JJ, Youssef S, Mitchell D, Wedemeyer J, Madanat R, et al. Multiple elements of the allergic arm of the immune response modulate autoimmune demyelination. Proc Natl Acad Sci U S A. 2003;100:1867–72.
29. Cui LY, Chu SF, Chen NH. The role of chemokines and chemokine receptors in multiple sclerosis. Int Immunopharmacol [Internet]. Elsevier; 2020;83:106314. Available from: https://doi.org/10.1016/j.intimp.2020.106314
30. Krumbholz M, Theil D, Cepok S, Hemmer B, Kivisäkk P, Ransohoff RM, et al. Chemokines in multiple sclerosis: CXCL12 and CXCL13 up-regulation is differentially linked to CNS immune cell recruitment. Brain. 2006;129:200–11.
31. Krementsov DN, Thornton TM, Teuscher C, Rincon M. The Emerging Role of p38 Mitogen-Activated Protein Kinase in Multiple Sclerosis and Its Models. Mol Cell Biol. 2013;33:3728–34.
32. Kotelnikova E, Kiani NA, Messinis D, Pertsovskaya I, Pliaka V, Bernardo-Faura M, et al. MAPK pathway and B cells overactivation in multiple sclerosis revealed by phosphoproteomics and genomic analysis. Proc Natl Acad Sci U S A. 2019;116:9671–6.
33. Kunkl M, Frascolla S, Amormino C, Volpe E, Tuosto L. T Helper Cells: The Modulators of Inflammation in Multiple Sclerosis. Cells. 2020;9:482.
34. Waubant E, Lucas R, Mowry E, Graves J, Olsson T, Alfredsson L, et al. Environmental and genetic risk factors for MS: an integrated review. Ann Clin Transl Neurol. 2019;6:1905–22.
35. Olsson T, Barcellos LF, Alfredsson L. Interactions between genetic, lifestyle and environmental risk factors for multiple sclerosis. Nat Rev Neurol. Nature Publishing Group; 2016;13:26–36.
36. Kular L, Liu Y, Ruhrmann S, Zheleznyakova G, Marabita F, Gomez-Cabrero D, et al. DNA methylation as a mediator of HLA-DRB1*15:01 and a protective variant in multiple sclerosis. Nat Commun. 2018;9.
37. Compston A, Coles A. Multiple sclerosis. Lancet [Internet]. Elsevier Ltd; 2008;372:1502–17. Available from: http://dx.doi.org/10.1016/S0140-6736(08)61620-7
38. Jelcic I, Al Nimer F, Wang J, Lentsch V, Planas R, Jelcic I, et al. Memory B Cells Activate Brain-Homing, Autoreactive CD4+ T Cells in Multiple Sclerosis. Cell. 2018;175:85-100.e23.
39. Lange C, Storkebaum E, De Almodóvar CR, Dewerchin M, Carmeliet P. Vascular endothelial growth factor: A neurovascular target in neurological diseases. Nat Rev Neurol [Internet]. Nature Publishing Group; 2016;12:439–54. Available from: http://dx.doi.org/10.1038/nrneurol.2016.88

Figures:
Figure 1. Overview of the benchmark assessment of disease modules and the integration workflow for MS. (a) Transcriptomic and methylomic datasets from 19 different diseases were used as inputs for eight MODifieR module identification methods. The resulting single-omic disease modules (n = 456) were independently assessed by GWAS enrichment analysis of the same disease using Pascal module scoring. MODifieR methods were evaluated by the combined enrichment score of their respective disease modules. (b) Multi-omic integrative workflow for multiple sclerosis (MS)-associated modules. Data from 20 case-control comparisons were used as input for module detection with MODifieR methods. Clique SuM modules presented the highest GWAS enrichment score and were therefore used to generate single-omic consensus modules. The intersection of the best transcriptomic and methylomic consensus modules resulted in an MS multi-omic module (n = 220 genes) with the highest GWAS enrichment, which was independently found to be enriched for genes associated with five known lifestyle MS risk factors using public omics data from healthy individuals.
Figure 2. Genomic concordance of MODifieR modules on transcriptomic datasets. (a) Heatmap of Pascal P-values for eight single-method and eight consensus MODifieR modules, identified for 47 publicly available transcriptomic datasets. Module performance P-values are shown in a white to blue scale, where any shade of blue represents a significant module (P < 0.05; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero. Datasets are classified into six disease types: cardiovascular (red), glycemic (golden), inflammatory (green), neurodegenerative (fuchsia), psychiatric and social (pink), autoimmune (dark purple), and others (light purple); and two cell types: blood (maroon), and others (light yellow). Datasets are ranked by meta P-values using Fisher's method of the single-method module P-values across and within their disease types (dataset score, bottom boxplot). MODifieR methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red), plus the consensus modules (blue). Single methods and consensus modules are scored by meta P-values across datasets (method score, right boxplot). Consensus x/8 indicates that the module genes are found in at least x methods out of eight. (b) Scatter plot showing Spearman correlation between module score and betweenness centrality. Modules are represented with a different shape depending on their method and colored based on the disease type. (c) Scatter plot showing Spearman correlation between module score and module size. Modules are represented with a different shape depending on their method and colored based on the disease type.
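The dataset and method scores in these figures are described as meta P-values obtained with Fisher's method. As a minimal sketch, assuming the combined P-values are independent, the computation can be written without external libraries; the function name and the toy inputs are hypothetical and do not come from the benchmark code.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method.

    The statistic X^2 = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # equals X^2 / 2
    # For an even number of degrees of freedom (2k) the chi-square survival
    # function is a finite sum: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


# Toy example: two modules that are each only borderline significant.
print(fisher_combined_p([0.05, 0.05]))
```

With a single input the function returns that P-value unchanged, which is a quick sanity check on the closed-form survival function.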
Figure 3. Genomic concordance of MODifieR modules on methylomic datasets. (a) Heatmap of Pascal P-values for eight single-method and eight consensus MODifieR modules, identified for ten publicly available methylomic datasets. Module performance P-values are shown in a white to blue scale, where any shade of blue represents a significant module (P < 0.05; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero. Datasets are classified into two disease types: glycemic (golden), and inflammatory (green); and two cell types: blood (maroon), and others (light yellow). Datasets are ranked by Fisher's combined P of the single-method module P-values across and within their disease types (dataset score, bottom boxplot). MODifieR methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red), plus the consensus modules (blue). Single methods and consensus modules are scored by meta P-values across datasets (method score, right boxplot). Consensus x/8 indicates that the module genes are found in at least x methods out of eight. (b) Scatter plot showing Spearman correlation between module score and betweenness centrality. Modules are represented with a different shape depending on their method and colored based on the disease type. (c) Scatter plot showing Spearman correlation between module score and module size. Modules are represented with a different shape depending on their method and colored based on the disease type.
Figure 4. Genomic concordance of MODifieR modules on MS use case data. (a) Heatmap of Pascal P-values for eight single-method MODifieR modules, identified for ten MS-related transcriptomic datasets. Module performance P-values are shown in a white to blue scale, where any shade of blue represents a significant module (P < 0.05), white represents a non-significant module, and grey represents a module of size zero. Datasets are classified into the reported MS type: MS (blue), RRMS (red), PPMS (green), SPMS (orange), and CIS (yellow); and four cell types: whole blood (maroon), PBMCs (light brown), white matter (light yellow), and CD4+ T cells (purple). Datasets are ranked by meta P-values of the single-method enrichments (dataset score, bottom boxplot). MODifieR methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red). Single methods are scored by the P-values of the significant modules across datasets (method score, right boxplot). (b) Heatmap of Pascal P-values for four single-method MODifieR modules, identified for nine MS-related methylomic datasets. (c-d) Bar plots of Pascal P-values for the MS consensus modules generated with Clique SuM from transcriptomic (c) and methylomic (d) datasets. (e) Union and intersection of the top performing modules, shown as a Venn diagram.
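The "consensus x/8" and "x/4" modules in these captions keep genes reported by at least x module sets, and the multi-omic module is the intersection of the best transcriptomic and methylomic consensus modules. The sketch below illustrates that counting rule under the assumption that each module is available as a plain set of gene symbols; the function name, the set labels and the gene names are placeholders, not outputs of MODifieR.

```python
from collections import Counter


def consensus_module(modules, min_support):
    """Return genes present in at least `min_support` of the given modules.

    modules     -- mapping from a label (method or dataset) to a set of genes
    min_support -- minimum number of modules that must contain a gene
    """
    counts = Counter(gene for genes in modules.values() for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_support}


# Hypothetical modules for two omics layers (placeholder gene symbols).
transcriptomic = {
    "T1": {"geneA", "geneB", "geneC"},
    "T2": {"geneB", "geneC", "geneD"},
    "T3": {"geneC", "geneD", "geneE"},
}
methylomic = {
    "M1": {"geneC", "geneD", "geneF"},
    "M2": {"geneD", "geneF", "geneG"},
}

best_transcriptomic = consensus_module(transcriptomic, min_support=2)
best_methylomic = consensus_module(methylomic, min_support=1)

multi_omic = best_transcriptomic & best_methylomic   # intersection
combined = best_transcriptomic | best_methylomic     # union
print(sorted(multi_omic), len(combined))
```

Raising `min_support` trades module size for agreement between the contributing modules, which is the trade-off the consensus bar plots explore.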
[Figure 4 graphic. Recoverable labels: annotations for disease type (MS, RRMS, PPMS, SPMS, CIS), cell type (WB, PBMCs, WM, CD4+ T cells, CD14+ monocytes, CD19+ B cells, CD8+ T cells) and module performance (P, α = 0.05); methods Mod. Discov., MCODE, Correl. Clique, Clique SuM, WGCNA, MODA, DiffCoEx and DIAMOnD; datasets T1-T11 and M1-M9; bar plots of -log10 P for the transcriptomic and methylomic Clique SuM consensus modules (1/4 to 4/4); Venn diagram of the best transcriptomic consensus, the best methylomic consensus, their intersection and their union, with gene counts (as extracted) 1041332, 220 and 1656 and P-values 4.82 x 10^-8, 3.74 x 10^-8, 1.95 x 10^-8 and 8.76 x 10^-9.]
Figure 5. Risk factor enrichment and network visualization of the MS multi-omic module. (a) Evidence levels and effect on MS of the risk factor. (b) Enrichment overlap of multi-omic MS module genes in the top 1,000 DMGs in risk factor datasets and independent risk factor methylation dataset (see Methods) shown as Fisher exact test P-values (threshold α = 0.05). (c) Visualization of the module. Nodes (module genes) are arranged in functional clusters according to their overrepresented GO terms. Genes with a known association to MS are marked with a blue circle. Node colors display the associations to an MS risk factor for which the module is significantly enriched (red, alcohol use; green, high BMI; yellow, smoking; purple, low sun exposure; light blue, EBV infection; grey, no association). Edges were extracted from the STRINGdb v11 human PPI network of experimentally validated interactions (confidence score > 700).
[Figure 5 graphic. Recoverable labels: (a) risk factors EBV infection, smoking, low sun exposure, adolescent obesity, high BMI, night shift work, organic solvent exposure, alcohol consumption and oral tobacco, with evidence graded from + to +++ (the direction-of-effect symbols are not recoverable); (b) module enrichments as -log10 P (α = 0.05) for the risk factor datasets and the validation dataset; (c) functional clusters: cell death and apoptosis, morphogenesis and neurogenesis, cell cycle and proliferation, chemotaxis and cell migration, response to hormone stimulus, leukocyte activation and differentiation.]
SUPPLEMENTARY MATERIALS
Supplementary Table 1: All case-control comparisons used in the Transcriptomic and Methylomic benchmarks.
Supplementary Table 2: All case-control comparisons used in the MS use case benchmark.
Supplementary Table 3: All Methods implemented in the benchmark.
10_1101-2020_12_24_424332 ---- Genetic epidemiology of variants associated with immune escape from global SARS-CoV-2 genomes
Bani Jolly 1,2,$, Mercy Rophina 1,2,$, Afra Shamnath 1, Mohamed Imran 1,2, Rahul C. Bhoyar 1, Mohit Kumar Divakar 1,2, Pallavali Roja Rani 3, Gyan Ranjan 1,2, Paras Sehgal 1,2, Pulala Chandrasekhar 3, S. Afsar 3, J. Vijaya Lakshmi 3, A. Surekha 3, Sridhar Sivasubbu 1,2, Vinod Scaria 1,2,*
1 CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), New Delhi, India
2 Academy of Scientific and Innovative Research (AcSIR), CSIR-HRDC Ghaziabad, Uttar Pradesh, India
3 Kurnool Medical College, Kurnool, Andhra Pradesh, India
$ Authors contributed equally and would like to be known as joint first authors
* Address for correspondence: Vinod Scaria, vinods@igib.in
Abstract
Many antibody and immune escape variants in SARS-CoV-2 are now documented in literature. The availability of SARS-CoV-2 genome sequences enabled us to investigate the occurrence and genetic epidemiology of the variants globally. Our analysis suggests that a number of genetic variants associated with immune escape have emerged in global populations.
Keywords: COVID-19, SARS-CoV-2, Antibody, Mutations, Epidemiology
Text
Antibodies are one of the emerging therapeutic approaches being explored in COVID-19. These antibodies typically target the receptor-binding motif or structural domains of the Spike protein of SARS-CoV-2, in an attempt to inhibit binding of the Spike protein to the host receptors.
Cocktails of antibodies that target distinct structural and functional domains of the Spike protein are also being developed, providing redundant mechanisms of targeting the virus and thereby minimising escape. Genomic documentation of the spread of SARS-CoV-2 across the globe has provided unique insights into its genetic variability and into variants of functional consequence. In-depth studies in recent months have unravelled a wealth of information on the immune response in COVID-19 and offered insights into the development of therapeutics.
Recent investigations suggest that a number of genetic variants in SARS-CoV-2 are associated with immune escape and/or resistance to antibodies. Their structural and functional features and mechanisms of immune evasion are also being extensively studied (1). The natural occurrence and genetic epidemiology of these variants across global populations are poorly understood. We were motivated by the wide availability of SARS-CoV-2 genomes from across the world and the increasing number of genetic variants suggested to contribute to escape from antibody inhibition.
We analysed a comprehensive compendium of genetic variants associated with immune escape, curated by our group from literature and preprint servers (2). This compendium included 120 unique variants reported in literature.
To understand the genetic epidemiology of these variants in the global compendium of genomes, we compiled a dataset of 265,079 SARS-CoV-2 genomes from GISAID (as of 17 December 2020) (3), in addition to 1,154 genomes sequenced in-house (BioProject ID: PRJNA655577). Genome sequences with more than 5% Ns, more than 10 ambiguous nucleotides, higher than expected divergence and mutation clusters were excluded from the analysis. After quality control, the final dataset encompassed 240,133 genomes from 133 countries. Only countries with at least 100 good-quality genome submissions were considered for the analysis.

bioRxiv preprint doi: https://doi.org/10.1101/2020.12.24.424332; this version posted January 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Of the 120 genetic variants associated with immune escape, 86 were found in a total of 26,917 genomes from 63 countries (Figure 1A), of which 9 variants had >1% frequency in the respective countries. Phylogenetic analysis was performed following the Nextstrain protocol for a total of 3,679 genomes, including 1,501 randomly selected genomes having these variants (Figure 1B) (4). Homoplasies were identified in the phylogeny using HomoplasyFinder (5). Of the 86 variants, 43 variant sites were found to be homoplasic, suggesting that they could emerge independently in different genetic lineages; 9 of these were found at >1% frequency in at least one of the countries analysed.
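The sequence-level part of the quality control described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `passes_qc`, the reading of "ambiguous nucleotides" as IUPAC codes other than A, C, G, T and N, and the toy sequences are assumptions, and the divergence and mutation-cluster filters are not modelled.

```python
# Hypothetical sketch of the per-genome QC filters (all names are illustrative).
AMBIGUOUS = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N (assumption)


def passes_qc(seq: str, max_n_frac: float = 0.05, max_ambiguous: int = 10) -> bool:
    """Return True if a genome has at most 5% Ns and at most 10 ambiguous bases."""
    seq = seq.upper()
    if not seq:
        return False
    n_frac = seq.count("N") / len(seq)
    n_ambiguous = sum(1 for base in seq if base in AMBIGUOUS)
    return n_frac <= max_n_frac and n_ambiguous <= max_ambiguous


# Toy sequences, far shorter than a real genome.
genomes = {
    "clean": "ACGT" * 25,                      # 100 bases, no N
    "too_many_n": "ACGT" * 20 + "N" * 20,      # 20% N
    "too_ambiguous": "ACGT" * 25 + "R" * 11,   # 11 ambiguous bases
}
kept = [name for name, seq in genomes.items() if passes_qc(seq)]
```

Here only the `clean` entry is retained; in practice the thresholds would be applied to full-length genomes before any frequency calculation.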
Out of 14,222 genomes analysed from Australia, 24 immune escape associated variants mapped to 9,895 genomes (70%). The S:S477N variant was notably frequent, being present in 9,541 genomes (67%) from Australia. A high frequency of this variant was also found in a number of other countries, particularly in Europe. S:N439K was also found at high frequencies in genomes from a number of countries in Europe (6).

S:N501Y, one of the variants in the recently reported emergent SARS-CoV-2 lineage from the United Kingdom, was present in a total of 290 genomes, including genomes from the United Kingdom, Australia, South Africa, USA, Denmark and Brazil (7,8). All 7 genomes from South Africa having S:N501Y also had the S:E484K variant, and S:K417N was present in 2 of these genomes (9).

The ORF3a:G251V variant was also found to be prevalent across global genomes, with the highest frequencies in Hong Kong and South Korea. This variant is also one of the defining variants for the Nextstrain clade A1a (GISAID Clade V) (Figure 1B).

Of the 86 genetic variants, 19 were found in genomes from India (Supplementary Figure). The S:N440K variant was found to have a frequency of 2.1% in India and a high prevalence in the state of Andhra Pradesh (33.8% of 272 genomes). The variant site was homoplasic and the variant was found in genomes belonging to different clades and haplotypes. Time-scale analysis suggested the variant emerged in recent months (Figure 1C). The S:N440K variant was also reported in a case of COVID-19 reinfection from North India (10).
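The country-level frequencies reported above are simple proportions of genomes carrying a variant. A minimal sketch using the Australian counts quoted in the text; the helper name `variant_frequency` and the dictionary layout are illustrative assumptions, and the 1% threshold mirrors the one used for highlighting variants in Figure 1A.

```python
def variant_frequency(n_with_variant: int, n_genomes: int) -> float:
    """Percentage of genomes from one country that carry a variant."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return 100.0 * n_with_variant / n_genomes


# Counts quoted in the text for Australia (14,222 genomes analysed).
australia_total = 14_222
counts = {"any immune escape variant": 9_895, "S:S477N": 9_541}

frequencies = {name: variant_frequency(n, australia_total) for name, n in counts.items()}
above_threshold = [name for name, pct in frequencies.items() if pct > 1.0]

for name, pct in frequencies.items():
    print(f"{name}: {pct:.0f}%")
```

The rounded values reproduce the 70% and 67% given in the text.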
Put together, our analysis suggests that a number of genetic variants associated with immune escape have emerged in global populations; some of them are polymorphic in many global datasets, and a subset has become highly frequent in some countries. Homoplasy of the variant sites suggests that there could be a potential selective advantage to these variants. Further data and analysis would be needed to investigate the potential impact of such variants on the efficacy of different vaccines in these regions.

Acknowledgements
Authors acknowledge Disha Sharma and Abhinav Jain for the analysis of in-house genomes and the researchers, originating and submitting laboratories of the sequences retrieved from GISAID (https://doi.org/10.6084/m9.figshare.13365503.v2). BJ and MKD acknowledge a research fellowship from the Council of Scientific and Industrial Research (CSIR India). The funders had no role in the study design or the decision to publish.

References
1. Weisblum Y, Schmidt F, Zhang F, DaSilva J, Poston D, Lorenzi JCC, et al. Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants. 2020 Oct 28 [cited 2020 Dec 22]; https://elifesciences.org/articles/613121
2. Rophina M, Pandhare K, Mangla M, Shamnath A, Jolly B, Sethi M, et al. FaviCoV - a comprehensive manually curated resource for functional genetic variants in SARS-CoV-2. 2020 Nov 17; https://doi.org/10.31219/osf.io/wp5tx
3. Yuelong Shu JM. GISAID: Global initiative on sharing all influenza data – from vision to reality. Eurosurveillance [Internet]. 2017 Mar 30 [cited 2020 Dec 9];22(13). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5388101/
4. Nextstrain [Internet]. [cited 2020 Dec 9]. https://nextstrain.org/sars-cov-2/
5. Crispell J, Balaz D, Gordon SV. HomoplasyFinder: a simple tool to identify homoplasies on a phylogeny. Microbial Genomics [Internet]. 2019 Jan [cited 2020 Dec 9];5(1). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6412054/
6. Hodcroft EB, Zuber M, Nadeau S, Crawford KHD, Bloom JD, Veesler D, et al. Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020. medRxiv : the preprint server for health sciences [Internet]. 2020 Nov 27 [cited 2020 Dec 9]; https://pubmed.ncbi.nlm.nih.gov/33269368/
7. Rambaut A, Loman N, Pybus O, Barclay W, Barrett J, Carabelli A, et al. Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations [Internet]. 2020 [cited 2020 Dec 22]. https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563
8. Shang E, Axelsen PH. The Potential for SARS-CoV-2 to Evade Both Natural and Vaccine-induced Immunity [Internet]. Cold Spring Harbor Laboratory. 2020 [cited 2020 Dec 24]. p. 2020.12.13.422567. https://www.biorxiv.org/content/10.1101/2020.12.13.422567v1.abstract
9. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa [Internet]. [cited 2020 Dec 22]. https://www.krisp.org.za/publications.php?pubid=315
10. Gupta V, Bhoyar RC, Jain A, Srivastava S, Upadhayay R, Imran M, et al. Asymptomatic reinfection in two healthcare workers from India with genetically distinct SARS-CoV-2. Clin Infect Dis [Internet]. [cited 2020 Dec 9]; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7543380/

Figure 1. (A) Variant frequencies of the immune escape variants in genomes of SARS-CoV-2. The total number of genomes analyzed from each country is specified. Variants with frequency >1% in the respective countries are highlighted in red. (B) Global phylogenetic context of the variants. The vertical bar indicates the clade assigned according to the Nextstrain nomenclature. (C) Time-series data on prevalence for the genetic variants, showing the region-wise proportion of genomes per month for the variants.

Supplementary Figure. Variant frequencies of the immune escape variants in genomes isolated from different states in India.

10_1101-2020_05_22_110247 ---- Integrated cross-study datasets of genetic dependencies in cancer

Clare Pacini 1,2, Joshua M. Dempster 3, Isabella Boyle 3, Emanuel Gonçalves 1, Hanna Najgebauer 1,2,4, Emre Karakoc 1,2, Dieudonne van der Meer 1, Andrew Barthorpe 1, Howard Lightfoot 1, Patricia Jaaks 1, James M. McFarland 3, Mathew J.
Garnett​1,2​, Aviad Tsherniak​3​, Francesco Iorio ​1,2,5,* 1 ​Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK 2 ​Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK 3 ​Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA 4 ​European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK 5 ​Human Technopole, Via Cristina Belgioioso 147, 20157 Milano - Italy * Corresponding author: ​francesco.iorio@sanger.ac.uk Abstract CRISPR-Cas9 viability screens are increasingly performed at a genome-wide scale across large panels of cell lines to identify new therapeutic targets for precision cancer therapy. Integrating the datasets resulting from these studies is necessary to adequately represent the heterogeneity of human cancers and to assemble a comprehensive map of cancer genetic vulnerabilities. Here, we integrated the two largest public independent CRISPR-Cas9 screens performed to date (at the Broad and Sanger institutes) by assessing, comparing, and selecting methods for correcting biases due to heterogeneous single guide RNA efficiency, gene-independent responses to CRISPR-Cas9 targeting originated from copy number alterations, and experimental batch effects. Our integrated datasets recapitulate findings from the individual datasets, provide greater statistical power to cancer- and subtype-specific analyses, unveil additional biomarkers of gene dependency, and improve the detection of common essential genes. We provide the largest integrated resources of CRISPR-Cas9 screens to date and the basis for harmonizing existing and future functional genetics datasets. .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. 
; https://doi.org/10.1101/2020.05.22.110247doi: bioRxiv preprint mailto:francesco.iorio@sanger.ac.uk https://doi.org/10.1101/2020.05.22.110247 http://creativecommons.org/licenses/by-nc-nd/4.0/ Cancer is a complex disease that can arise from multiple different genetic alterations. The alternative mechanisms by which cancer can evolve result in considerable heterogeneity between patients, with the vast majority of them not benefiting from approved targeted therapies​1​. In order to identify and prioritize new potential therapeutic targets for precision cancer therapy, analyses of cancer vulnerabilities are increasingly performed at a genome-wide scale and across large panels of ​in vitro​ cancer models​2–11​. This has been facilitated by recent advances in genome editing technologies allowing unprecedented precision and scale via CRISPR-Cas9 screens. Of particular note are two large pan-cancer CRISPR-Cas9 screens that have been independently performed by the Broad and Sanger institutes​2,12​. The two institutes have also joined forces with the aim of assembling a joint comprehensive map of all the intracellular genetic dependencies and vulnerabilities of cancer: the ​Cancer Dependency Map (DepMap)​13,14​. The two generated datasets collectively contain data from over 1,000 screens of more than 900 cell lines. However, it has been estimated that the analysis of thousands of cancer models will be required to detect cancer dependencies across all cancer types​3​. Consequently, the integration of these two datasets will be key for the DepMap and other projects aiming at systematically probing cancer dependencies. These integrated datasets will provide a more comprehensive representation of heterogeneous cancer types and form the basis for the development of effective new therapies with associated biomarkers for patient stratification ​15​. 
Further, designing robust standards and computational protocols for the integration of these types of datasets will mean that future releases of data from CRISPR-Cas9 screens can be integrated and analyzed together, paving the way to even larger cancer dependency resources.

We have previously shown that the pan-cancer CRISPR-Cas9 datasets independently generated at the Broad and Sanger institutes are consistent on the domain of 147 commonly screened cell lines¹⁶. The reproducibility of these CRISPR screens holds despite extensive differences in the experimental pipelines underlying the two datasets, including distinct CRISPR-Cas9 sgRNA libraries. Here we investigate the integrability of the full Broad/Sanger gene dependency datasets, yielding the most comprehensive cancer dependency resource to date, encompassing dependency profiles of 17,486 genes across 908 different cell lines that span 26 tissues and 42 different cancer types. We compare different state-of-the-art data processing methods to account for heterogeneous single-guide RNA (sgRNA) on-target efficiency, and to correct for gene-independent responses to CRISPR-Cas9 targeting¹²,¹⁷,¹⁸, evaluating their performance on common use cases for CRISPR-Cas9 screens (Figure 1a, 1b and 1c).

Figure 1: Schematic of the integration strategy. a. Broad and Sanger gene dependency datasets (raw count data of single-guide RNAs) are downloaded from respective web-portals. b. The datasets from each institute are pre-processed with three different methods, accounting for gene-independent responses to CRISPR-Cas9 targeting (arising from copy number amplifications) and heterogeneous sgRNA efficiency, providing gene-level corrected depletion fold changes. Then, four different batch-correction pipelines are applied to the gene-level fold changes across the two institute datasets for each of the pre-processing methods. c. Twelve different integrated datasets resulting from applying three different pre-processing methods (as indicated by the border colors) and four different batch-correction pipelines (as indicated by the fill colors) are benchmarked. d. Advantages provided by the final integrated datasets and conservation of analytical outcomes from the individual ones are investigated.

We show that our integration strategy accounts and corrects for technical biases whilst preserving gene dependency heterogeneity and recapitulates established associations between molecular features and gene dependencies.
We highlight the benefits of the integrated dataset over the two individual ones in terms of improved coverage of the genomic heterogeneity across different cancer types, identification of new biomarker/dependency associations, and increased reliability of human core-fitness/common-essential genes (Figure 1d). Finally, we estimate the minimal size (in terms of the number of screened cell lines) required in order to effectively correct batch effects when integrating a new dataset. Collectively, this study presents a robustly benchmarked framework to integrate independently generated CRISPR-Cas9 datasets that provide the most comprehensive resource for the exploration of cancer dependencies and the identification of new oncology therapeutic targets.

Results

Overview of the integrated CRISPR-Cas9 screens

The Sanger’s Project Score CRISPR-Cas9 dataset (part of the Sanger DepMap)¹⁹ and the Broad’s 20Q2 DepMap dataset²⁰,²¹ contain data for 317 and 759 cell lines, respectively. Overall, these represent screens for 908 unique cell lines (Figure 2a, Supplementary Table 1). Together these cell lines spanned 26 different tissues (Figure 2b) and for 16 of these the number of cell lines covered increased when considering both datasets together. Similarly, the integrated dataset provided richer coverage of specific cancer types and clinically relevant subtypes (Figure 2c).
These preliminary observations highlight the first benefit of combining these resources: increased statistical power for tissue-specific as well as pooled pan-cancer analyses. Between the two datasets, there was an overlap of 168 cell lines screened by both institutes, encompassing 16 different tissue types (median = 8, min 1 for Soft Tissue, Biliary Tract and Kidney, max 28 for Lung, Figure 2a and 2b). The set of overlapping cell lines enabled the estimation of batch effects due to differences in the experimental protocols underlying the two datasets¹⁶, without biasing the correction toward specific cell line lineages.

Figure 2. Overview of CRISPR-Cas9 screened cancer cell lines. a. Number of cell lines screened by the Broad and the Sanger institutes and their overlap. b. Overview of the number of cell lines screened for each tissue type across the two datasets. c. Number of screened Lung cancer and Breast cancer cell lines split according to cancer types and PAM50 subtypes, respectively, across the two datasets.

Data Pre-processing

Known biases in CRISPR screens arise due to nonspecific cutting toxicity that increases with copy number amplifications (CNAs)²²,²³ and heterogeneous levels of on-target efficiency across sgRNAs targeting the same gene²⁴. Multiple methods exist to correct for these biases.
Here, we evaluate three: CRISPRcleanR, an unsupervised nonparametric CNA effect correction method for individual genome-wide screens¹⁷; CCR-JACKS, a method that combines CRISPRcleanR with JACKS, a Bayesian method that accounts for differences in guide on-target efficacy through joint analysis of multiple screens¹⁸; and CERES, a method that simultaneously corrects for CNA effects and accounts for differences in guide efficacy¹², also analyzing screens jointly.

Batch effect correction

Technical differences in screening protocols, reagents and experimental settings can cause batch effects between datasets. These batch effects can arise from factors that vary within institute screens (for example, differences in control batches and Cas9 activity levels) as well as between institutes (such as differences in assay lengths and employed sgRNA libraries). When focusing on the set of cell lines screened at both institutes, a Principal Component Analysis (PCA) of the cell line dependency profiles across genes (DPGs) highlighted a clear batch effect determined by the origin of the screen, irrespective of the pre-processing method, consistent with previous results (Figure 3a)¹⁶. We quantile-normalized each cell line DPG and adjusted for differences in screen quality in the individual Broad/Sanger datasets.
The combined Broad/Sanger dataset was then batch corrected using ComBat²⁵ (Methods). Following ComBat correction, the combined datasets on the overlapping cell lines showed reduced yet persistent residual batch effects, clearly visible along the first two principal components (Supplementary Figure 1). Analysis of the first two principal components (using MSigDB gene signatures²⁶ and all cell lines, Methods) showed enrichment for metabolic processes (phosphorus metabolic process q-value = 1.06e-08, protein metabolic process q-value = 8.70e-07, hypergeometric test) in the first principal component. The enrichment of metabolic processes is consistent with differences identified across these datasets due to different media conditions employed in the underlying experimental pipelines²⁷,²⁸. The second principal component contained significant enrichments for protein complex organisation and assembly (q-value = 1.57e-16 and 5.28e-11 respectively, hypergeometric test) (Supplementary Table 2), which have no obvious associations with technical biases found in CRISPR-Cas9 screens.

Based on these results, we considered four different batch correction pipelines and evaluated their use in our integrative strategy. In the first pipeline, we processed the combined Broad/Sanger DPG dataset using ComBat alone (ComBat). In the second, we applied a second round of quantile normalization following ComBat correction (ComBat+QN) to account for different phenotype intensities across experiments, resulting in different ranges of gene dependency effects. In the third and fourth pipelines we also removed the first one or two principal components, respectively (ComBat+QN+PC1 and ComBat+QN+PC1-2). The final 12 datasets contained data from unique screens of 908 cell lines using each of the three pre-processing methods and four different batch correction pipelines as outlined in the previous section.
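Two of the post-ComBat steps above, quantile normalization and removal of the leading principal components, can be illustrated with NumPy. This is a sketch under stated assumptions rather than the authors' implementation: the matrix is laid out as genes x cell lines, ties in quantile normalization are broken by sort order, principal components are obtained by SVD after centering each gene across cell lines, and the function names and simulated data are hypothetical. ComBat itself is not reproduced here.

```python
import numpy as np


def quantile_normalize(x: np.ndarray) -> np.ndarray:
    """Give every column (cell line DPG) the same empirical distribution.

    The reference distribution is the mean of the sorted columns.
    Ties are broken by sort order (an assumption of this sketch).
    """
    order = np.argsort(x, axis=0)
    reference = np.sort(x, axis=0).mean(axis=1)
    out = np.empty_like(x, dtype=float)
    for j in range(x.shape[1]):
        out[order[:, j], j] = reference
    return out


def remove_top_pcs(x: np.ndarray, n_pcs: int = 1) -> np.ndarray:
    """Subtract the first `n_pcs` principal components from a genes x cell lines matrix."""
    gene_means = x.mean(axis=1, keepdims=True)
    centered = x - gene_means
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    top = (u[:, :n_pcs] * s[:n_pcs]) @ vt[:n_pcs, :]
    return centered - top + gene_means


# Simulated toy data: 50 genes, 8 screens, with a gene-specific shift in one batch.
rng = np.random.default_rng(0)
genes, lines = 50, 8
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
data = rng.normal(size=(genes, lines))
data += np.outer(rng.normal(size=genes), batch * 3.0)

qn = quantile_normalize(data)                 # the "+QN" step
corrected = remove_top_pcs(qn, n_pcs=1)       # the "+PC1" step
```

Setting `n_pcs=2` would correspond to the PC1-2 variant. Removing components this way preserves each gene's mean across cell lines while discarding the dominant axis of variation, which is only appropriate when that axis is believed to be technical.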
To assess the performance of different batch correction pipelines we estimated, using the overlapping cell lines, the extent to which each cell line DPG from one study matched that of its counterpart (derived from the same cell line) from the other study following batch correction. To quantify the agreement, we calculated for each DPG its similarity to all other screen DPGs using a weighted Pearson’s (wPearson) correlation (Methods). We then calculated the proximity of a cell line to its counterpart compared to all other cell lines using the wPearson as a metric (Recall of cell line identity) (Figure 3b). The best performances were obtained when removing either the first or the first two principal components following ComBat and quantile normalization, i.e. ComBat+QN+PC1 or ComBat+QN+PC1-2. Across pre-processing methods, CERES performed best with 302 (90%) of the cell line DPGs being closest to their counterpart from the other study (k = 1), followed by CRISPRcleanR with 272 (81%) and CCR-JACKS with 215 (64%). The Recall of cell line identity was high for each integration pipeline, with normalized area under the curve (nAUC) values of 0.98 for CCR-JACKS and 0.99 for CRISPRcleanR and CERES when considering the best performing ComBat+QN+PC1-2 batch correction method.
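The Recall of cell line identity described above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the toy profiles and per-gene weights are made up (the paper derives weights from gene selectivity), and the nAUC summary is not computed.

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-gene weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)


def counterpart_ranks(profiles, weights):
    """Rank of each screen's same-cell-line counterpart among all other screens.

    `profiles` maps (cell_line, institute) -> list of gene scores.
    Rank 1 means the counterpart is the most correlated other screen.
    """
    ranks = {}
    for key, dpg in profiles.items():
        others = [k for k in profiles if k != key]
        others.sort(key=lambda k: weighted_pearson(dpg, profiles[k], weights),
                    reverse=True)
        ranks[key] = next(i for i, k in enumerate(others, start=1) if k[0] == key[0])
    return ranks


def recall_at_k(ranks, k):
    """Fraction of screens whose counterpart lies within their k-neighborhood."""
    return sum(r <= k for r in ranks.values()) / len(ranks)


# Toy data: three cell lines screened at both institutes, four genes.
profiles = {
    ("A", "Broad"): [-2.0, 0.1, 0.0, -0.1],
    ("A", "Sanger"): [-1.8, 0.0, 0.1, -0.2],
    ("B", "Broad"): [0.1, -2.0, 0.0, 0.2],
    ("B", "Sanger"): [0.0, -1.7, -0.1, 0.1],
    ("C", "Broad"): [0.0, 0.1, -2.0, -0.1],
    ("C", "Sanger"): [0.2, 0.0, -1.9, 0.0],
}
weights = [1.0, 1.0, 2.0, 0.5]  # placeholder selectivity weights

ranks = counterpart_ranks(profiles, weights)
recall_k1 = recall_at_k(ranks, 1)
```

Evaluating `recall_at_k` over a range of k gives the curves of the kind shown in Figure 3b; the value at k = 1 corresponds to the percentages quoted in the text.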
Figure 3: Batch effect assessment and correction. a. Principal component plots of the dependency profiles across genes (DPGs) for cell lines screened in both the Broad and Sanger studies, across pre-processing methods. Screens are colored by the institute of origin. b. Percentages of cell line DPGs that have the corresponding (same cell line) DPG screened at the other institute among their k most correlated DPGs (the k-neighborhood). Results are shown across different pre-processing methods (in different plots) and different batch correction pipelines (as indicated by the different colors). Correlations between DPGs are computed using a weighted Pearson correlation metric. Genes with higher selectivity have a larger weight in the correlation calculation. As a measure of selectivity we used the average (across the two individual datasets) skewness of a gene’s dependency profile across cell lines. The proportion of cell lines closest to their counterpart from the other study (k = 1) is shown and the normalised areas under the curves (nAUC) are shown in brackets. The x-axis values are restricted to between 1 and 100 to highlight the range over which performance differences are visible between datasets.

Performance of the integration pipelines

We evaluated the performance of each of the 12 integrated datasets, containing 908 cell lines, under four use-cases: the identification of i) essential and non-essential genes, ii) lineage subtypes, iii) biomarkers of selective dependencies and iv) functional relationships.

Identification of essential and non-essential genes

A cell line DPG with a large separation of dependency scores (DS) of common essential and non-essential genes should yield lower misclassification rates when identifying dependencies specific to that cell line. For each cell line we measured the separation of DS between known common essential and non-essential genes¹¹ across all integrated datasets. As a measure of separation we used the null-normalized mean difference (NNMD)²⁹, defined as the difference between the mean DS of the common essential genes and non-essential genes divided by the standard deviation of the DSs of the non-essential genes. By analysing multiple screens jointly, CERES and JACKS borrow essentiality signal information across screens. As a consequence, these methods better identify consistent signals across cell line DPGs (i.e. for common essential and non-essential genes), especially for DPGs derived from lower quality experiments, or reporting weaker depletion phenotypes¹⁸,²³. Consistently, CERES (median NNMD range [-5.78, -5.88]) showed better NNMD values than CRISPRcleanR (median NNMD range [-5.02, -5.12], Wilcoxon test (WT) p-value < 2.2e-16) and CCR-JACKS (median NNMD range [-5.14, -5.23], WT p-value < 2.2e-16), and similarly CCR-JACKS had better NNMD values than CRISPRcleanR (largest WT p-value < 0.0005) (Figure 4a).
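The NNMD defined above is straightforward to compute for a single cell line DPG. A minimal sketch, assuming the sample standard deviation is used (the text does not say which estimator) and using placeholder gene names and scores rather than a curated reference set; the function name `nnmd` is illustrative.

```python
from statistics import mean, stdev


def nnmd(scores, essential, nonessential):
    """Null-normalized mean difference for one cell line DPG.

    (mean DS of essential genes - mean DS of non-essential genes)
    divided by the standard deviation of the non-essential DS.
    """
    ess = [scores[g] for g in essential if g in scores]
    non = [scores[g] for g in nonessential if g in scores]
    return (mean(ess) - mean(non)) / stdev(non)


# Toy dependency scores; gene names are placeholders.
dpg = {"E1": -2.1, "E2": -1.9, "N1": -0.1, "N2": 0.0, "N3": 0.1, "X": -0.7}
value = nnmd(dpg, essential=["E1", "E2"], nonessential=["N1", "N2", "N3"])
```

More negative values indicate that essential genes sit further below the non-essential distribution, which is why lower NNMD is read as better separation.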
Comparing the batch correction methods, ComBat+QN+PC1-2 had marginally better performance across all pre-processing methods.

Next, we evaluated the gene dependency false-positive rates across all integrated datasets. For each cell line DPG, we defined a set of putative negative controls composed of genes not expressed at the basal level in that cell line (Methods). The false-positive rate was calculated as the number of negative controls identified as significant dependencies (in the top 15% most depleted genes), normalized by the total number of negative controls across the DPG. There was little difference in false-positive rates across the four different batch correction pipelines, with a slight improvement when two principal components were removed (Figure 4b). CERES outperformed CCR-JACKS significantly for all batch correction methods (largest χ² contingency table p-value 1.87 x 10^-11, N = 1.43 x 10^7) and CCR-JACKS outperformed CRISPRcleanR (p-value below machine precision). Comparing the correction methods, the differences between ComBat and ComBat+QN and between ComBat+QN+PC1 and ComBat+QN+PC1-2 were generally not significant across pre-processing methods, while the differences between either ComBat or ComBat+QN and either ComBat+QN+PC1 or ComBat+QN+PC1-2 were generally significant (largest p-value 1.42 x 10^-5).
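The false-positive calculation described above can be sketched for one DPG as follows. The function name, the rounding used to turn 15% into a gene count, and the toy profile are assumptions made for illustration; the paper's Methods may define the cut-off differently.

```python
def false_positive_rate(scores, unexpressed, top_fraction=0.15):
    """Fraction of unexpressed (negative-control) genes among the most depleted genes.

    `scores` maps gene -> dependency score (more negative = more depleted).
    """
    controls = [g for g in unexpressed if g in scores]
    if not controls:
        raise ValueError("no negative controls present in this profile")
    n_top = round(len(scores) * top_fraction)     # size of the top 15% (assumed rounding)
    ranked = sorted(scores, key=scores.get)       # most depleted first
    significant = set(ranked[:n_top])
    return sum(g in significant for g in controls) / len(controls)


# Toy profile with 20 genes: g0 is the most depleted, g19 the least.
dpg = {f"g{i}": -2.0 + 0.1 * i for i in range(20)}
unexpressed = ["g2", "g10", "g15", "g19"]         # placeholder negative controls
fpr = false_positive_rate(dpg, unexpressed)
```

In this toy profile the top 15% contains three genes, one of which (`g2`) is a negative control, so one of the four controls is called a dependency.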
As a final test of control separation, we used the unexpressed genes as an empirical null distribution for each DPG to estimate p-values for all DS and thus false discovery rates (FDRs) within each DPG. We calculated the recall of a reference set of common essential genes [11] at 10% FDR (Figure 4c). Again CERES outperformed CCR-JACKS, which outperformed CRISPRcleanR, and increasing the number of steps in the batch correction pipeline monotonically improved essential gene recall for all preprocessing methods. All differences between preprocessing methods and batch correction methods were significant, with the largest observed t-test (related) p-value 1.96 x 10^-3 (N = 830).

Figure 4: Use case recall of essential genes and lineage identification. a. Null-normalized mean difference (NNMD, a measure of separation between dependency scores of prior-known essential and non-essential genes), defined as the difference in means between dependency scores of essential and non-essential genes divided by the standard deviation of dependency scores of the non-essential genes. Lower values of NNMD indicate better separation of essential genes and non-essential genes. b. False positive rates across all pre-processing methods and batch-correction pipelines. In the gene dependency profile of a given cell line, a significant dependency gene was called a false positive if that gene was not expressed in that cell line. c.
Recall of known essential genes across all pre-processing methods and batch-correction pipelines at 10% FDR. d. Agreement between cell line clusters based on DPG correlation and tissue lineage labels of the corresponding cell lines, across pre-processing methods and batch-correction pipelines. e. Agreement of Lung CRISPR-Cas9 fitness profiles with the Lung cancer subtypes. For each query Lung cancer cell line in turn we computed correlation scores to all other Lung cancer cell lines (responses). We then ranked the response cell lines according to these correlations. For each query cell line, the rank position k of the most correlated response cell line from the same cancer subtype (matching response) was identified. A rank of k = 1 indicates that the query cell line was closest to another cell line from the same cancer subtype. The curves show the proportion of query cell lines with a matching response within a given rank position. The proportion of query cell lines with a matching response at k = 1 is also shown as a percentage for each dataset. The normalised area under the curve (nAUC) for each dataset is shown in brackets. The x-axis is zoomed in to between 0 and 60.

Identification of lineage subtypes

Many dependencies are context specific, reducing cellular fitness only in a subset of lineages; these can be used to elucidate gene function and identify cancer-type-specific vulnerabilities.
To evaluate the ability of the integrated datasets to recapitulate tissue lineages and clinical subtypes we first estimated the extent of conserved similarity between screens of cell lines derived from the same tissue lineage. We evaluated the tendency of screens of cell lines from the same lineage to yield similar results by comparing unsupervised clusterings of the batch-corrected cell line DPGs to the lineage labels of the cell lines. To this end, we performed one hundred k-means clusterings of each of the 12 datasets, with k equal to the number of tissue lineages screened in at least one study. We then calculated the adjusted mutual information (AMI, Methods) between each DPG clustering and the partition of the cell lines induced by their lineage labels. We observed higher than chance AMI between the obtained k clusters and the tissue lineages of the cell line DPGs, regardless of the starting batch corrected dataset (largest single-sample t-test p-value of 3.59 x 10^-135, N = 100, Figure 4d). Under each pre-processing method the removal of one or two principal components resulted in an increased AMI between cell line DPG clusters and tissue lineages. We next measured the ability of each of the integrated datasets to separate cell lines according to lineage subtypes. The integrated datasets contain over 100 Lung cell lines. These cell lines can further be stratified into subtypes such as Small cell lung carcinoma and Mesothelioma, whilst clinical subtypes such as PAM50 classifications are available for the Breast cancer cell lines (Figure 2c). To quantify the clustering of cell lines by subtype we calculated the correlation between all cell line DPGs and, for a given query cell line, the rank of the most correlated cell line from the same subtype as the query (k-rank).
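The k-rank described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a plain (unweighted) Pearson correlation, and the function names, cell line identifiers, subtype labels and scores are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length score vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def k_rank(query, profiles, subtypes):
    """Rank (1 = closest) of the most correlated other cell line that shares
    the query's subtype; None if no other cell line has that subtype."""
    others = [c for c in profiles if c != query]
    others.sort(key=lambda c: pearson(profiles[query], profiles[c]), reverse=True)
    for rank, cell_line in enumerate(others, start=1):
        if subtypes[cell_line] == subtypes[query]:
            return rank
    return None


# Hypothetical dependency profiles over four genes.
profiles = {
    "A": [-1.0, 0.0, -0.5, 0.2],
    "B": [-0.9, 0.1, -0.6, 0.1],
    "C": [0.2, -1.0, 0.1, -0.8],
    "D": [0.1, -0.9, 0.3, -0.7],
    "E": [0.15, -0.95, 0.2, -0.75],
}
subtypes = {"A": "s1", "B": "s1", "C": "s2", "D": "s2", "E": "s1"}

for cell_line in profiles:
    print(cell_line, k_rank(cell_line, profiles, subtypes))
```

In this toy data, cell line E is labelled with subtype s1 but its profile resembles the s2 lines, so its nearest same-subtype neighbour only appears further down the ranking.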
For the Lung cancer cell lines, the percentage of cell lines whose closest neighbour was from the same subtype (k = 1) was greatest for CERES (64-65% across batch correction methods), followed by CRISPRcleanR (61-64%) and CCR-JACKS (50-57%), with a slight improvement upon removal of one or two principal components (Figure 4e). The normalised area under the curve (nAUC) values showed little variation across batch correction methods and were broadly similar between the pre-processing methods: CERES (Lung = 0.96, Breast = 0.91-0.92), CCR-JACKS (Lung = 0.95-0.96, Breast = 0.84-0.85) and CRISPRcleanR (Lung = 0.96-0.97, Breast = 0.89-0.90) (Supplementary Figure 2).

Identification of biomarkers

Genes that show a pattern of selective dependency, i.e. exerting a strong reduction of viability upon CRISPR-Cas9 targeting in a subset of cell lines, are interesting potential novel therapeutic targets. Furthermore, these selective dependencies are often associated with molecular features that may explain their dependency profiles (biomarkers). We investigated each of the integrated datasets' ability to reveal tissue-specific biomarkers of dependencies. As potential biomarkers we used a set of 676 clinically relevant cancer functional events (CFEs [30]) across 17 different tissue types. The CFEs encompass mutations in cancer driver genes, amplifications/deletions of chromosomal segments recurrently altered in cancer, hypermethylated gene promoters and microsatellite instability status.
For each CFE and tissue type, we performed a Student's t-test for each selective gene dependency (SGD, Methods), contrasting two groups of cell lines based on the status of the CFE under consideration (present/absent), for a total of 2,142,162 biomarker/dependency pairs tested. The total number of significant biomarker/dependency associations showed little variation across batch-correction methods at 5% FDR. However, a significantly larger number of biomarker/dependency associations were identified when using CRISPRcleanR compared to CCR-JACKS (largest p-value 1.0e-14, proportion test) or CERES (largest p-value 3.60e-10, proportion test), whilst there was little difference between CCR-JACKS and CERES (smallest p-value 0.038, proportion test) (Figure 5a, Supplementary Table 3). Similar results were seen when the CFEs were split according to whether the biomarker was a mutation, recurrent copy number alteration or hypermethylated region (Supplementary Figure 3). We next examined the ability of each dataset to recover known selective dependencies in individual cell lines. We downloaded a set of oncogenic gene alterations from OncoKB [31,32].
After filtering for genes that tend to be common essentials (mean dependency score lower than -0.5 in the CRISPRcleanR-ComBat dataset, where -1 is the median of scores of known common essentials), we considered the oncogenes as positive controls in cell lines where they had indicated oncogenic or likely-oncogenic gain-of-function alterations, and as negative controls in all others. For each oncogene, we measured the NNMD between positive and negative cell lines (Figure 5b). We found little difference in median performance by either preprocessing method or batch correction method. We then collected the dependency scores of all oncogenes in cell lines with a corresponding oncogenic alteration and measured the receiver operating characteristic (ROC) AUC between them and the dependency scores of the same genes in cell lines without oncogenic alterations (Figure 5c). By this measure, CRISPRcleanR outperformed CERES by 2.2% and CCR-JACKS by 4.0%, with minimal variation across batch correction methods.

Recovery of functional relationships

We tested the ability of each dataset to identify expected dependency relations between paralogs, gene pairs coding for interacting proteins, or members of the same complex, using gene pair annotations from publicly available databases [33-35] (Methods). For each pair of genes known to have a functional relationship, we selected a random pair of genes with similar mean dependency scores across cell lines to serve as null examples. We calculated the false discovery rate for the known pairs using the absolute Pearson correlation of their dependency profiles versus those of the null examples. Recovery of known relationships was unsurprisingly low, since many genes with known functional relationships do not exhibit selective viability phenotypes. ComBat+QN+PC1 or PC1-2 recovered the greatest number of expected gene dependency relations at 10% FDR (Figure 5d).
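The ROC AUC comparison of oncogene dependency scores described above (Figure 5c) can be illustrated with a small pairwise computation. This is a sketch under the assumption that a more negative score indicates a stronger dependency, so a positive control "wins" when its score is lower than that of a negative control; the function name and the toy scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """ROC AUC as the probability that a randomly chosen positive-control score
    is more negative (more depleted) than a randomly chosen negative-control
    score, counting ties as one half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical oncogene scores in cell lines with and without the alteration.
altered = [-1.2, -0.8, -0.1]
unaltered = [-0.3, 0.0, 0.1, -0.9]

print(roc_auc(altered, unaltered))  # 9 of the 12 pairs are ordered correctly
```

The pairwise form is quadratic in the number of scores but makes the interpretation explicit; a rank-based formulation gives the same value for larger inputs.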
Figure 5: Use case Biomarkers and functional relationships. a. For each tissue, pairs of Cancer Functional Events (CFEs) and dependencies were tested for significant associations between the gene dependency and the absence/presence of a biomarker (CFE). The bar chart shows the total number of significant associations at 5% FDR across tissue types for each of the integrated datasets. b. The per-oncogene NNMD between cell lines with and without an indicated oncogenic gain-of-function alteration (more negative is better). c. For all identified oncogenes collectively, the receiver operating characteristic (ROC) AUC between oncogene scores in cell lines where they have an indicated gain-of-function mutation and cell lines where they do not. d. For each dataset, the number of known gene-gene relationships recovered at 10% FDR.

Final selection of pre-processing methods and batch-correction pipelines

Comparing the performance of batch correction methods across the use-cases, we found that ComBat+QN outperformed ComBat alone, and that removing one or two principal components yielded similar or noticeably improved performance compared to ComBat+QN.
The principal component analysis indicated that ComBat+QN+PC1 corrected for linear and non-linear effects of technical confounders including assay length, guide library and media conditions. Removing the first two principal components offered little improvement over removing the first principal component alone, and we found no attributable technical bias in the gene sets enriched in the second principal component. Overall, we selected ComBat+QN+PC1 as the batch correction pipeline as it had good performance over all metrics and a reduced impact on the data with respect to ComBat+QN+PC1-2, whilst still correcting for multiple technical biases. Comparing the pre-processing methods, we found that CERES outperformed the other methods in identifying essential genes and lineage subtypes, that CRISPRcleanR showed higher performance in the biomarker association use case, and that these two methods performed comparably and better than CCR-JACKS in identifying known gene-gene relationships. In conclusion, we selected both CERES and CRISPRcleanR as processing methods and considered the two corresponding integrated datasets as the final results of our pipeline.

Advantages of the integrated datasets over the individual ones

In line with the results from all the use-cases, we estimated the benefits of the integrated datasets with respect to the individual ones, in terms of increased capacity to unveil reliable sets of common essential genes (using CERES), as well as increased diversity of genetic dependencies and biomarker associations (using CRISPRcleanR). To evaluate the increased coverage of molecular diversity and genetic dependencies in the integrated dataset, we first estimated the increase in the number of detected gene dependencies with respect to the two individual datasets.
To this end, using the CRISPRcleanR processed dataset we quantified the number of genes significantly depleted in n cell lines (at 5% FDR, Methods) for a fixed number of cell lines n (with n = 1, 3, 5 or n ≥ 10) of the integrated dataset, as well as in the individual Broad and Sanger datasets. The integrated dataset identified more dependencies, indicating greater coverage of molecular features and dependencies than in the individual datasets (Supplementary Figure 4a). We then evaluated the ability of the CERES processed integrated dataset to predict common essential genes and its performance when compared to the individual datasets and two existing sets of common essential genes from recent publications: Behan [2] and Hart [36]. We predicted common essential genes using two methods: the 90th-percentile method [16] and the Adaptive Daisy Model (ADaM) [2]. The majority of genes called common essential by one of the ADaM or 90th-percentile methods were also identified by the other (1,482 out of 2,103, Supplementary Figure 4b). We assigned to each of the 2,103 common essential genes a tier based on the amount of evidence supporting its common essentiality. Tier 1, the highest confidence set, comprised the 1,482 genes found by both methods. Tier 2 comprised the 621 genes found by only one method (Supplementary Table 4).
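The tier assignment described above amounts to simple set operations, as in the following minimal sketch. The function name and the placeholder gene identifiers are hypothetical and do not correspond to genes in the actual tiers.

```python
def assign_tiers(adam_genes, percentile_genes):
    """Tier 1: called common essential by both methods.
    Tier 2: called by exactly one of the two methods."""
    both = adam_genes & percentile_genes
    either = adam_genes | percentile_genes
    return {gene: (1 if gene in both else 2) for gene in either}


# Hypothetical calls from the two methods (placeholder gene identifiers).
adam_calls = {"GENE_A", "GENE_B", "GENE_C"}
percentile_calls = {"GENE_B", "GENE_C", "GENE_D"}

tiers = assign_tiers(adam_calls, percentile_calls)
print(sorted(tiers.items()))
```

With the counts reported in the text, the union of the two call sets has 2,103 genes, of which 1,482 fall in Tier 1 and 2,103 - 1,482 = 621 in Tier 2.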
For each predicted set of common essential genes, we calculated the recall of known essential gene sets obtained from KEGG [37] and Reactome [38] pathways. These pathways included Ribosomal protein genes, genes involved in DNA replication and components of the Spliceosome (Methods). The integrated set of common essentials (Tier 1 and 2) showed greater recall of known essential genes compared to Behan and Hart, and increased recall over the individual datasets for 5 out of the 6 gene sets (Figure 6a). We next generated a set of 647 genes that were never expressed across the panel of cell lines, to serve as high confidence negative controls (Methods). We calculated the proportion of negative controls in each set of common essential genes. The best performance was for the Hart gene set (0%), followed by the integrated dataset (0.33%) (Figure 6b). As the positive and negative controls did not cover all genes, we further investigated the genes predicted to be common essentials. The integrated dataset predicted the largest number of common essentials, with 233 genes found in the integrated dataset alone. The 233 genes were enriched for Cell cycle genes (FDR 3.06e-9) and mitochondrial gene expression (FDR 3.66e-7), indicative of essential cellular processes. Similar results were observed for the 1,159 genes in the integrated set of common essentials but in neither of the existing datasets (Behan and Hart) (Supplementary Table 5). We next asked whether the CRISPRcleanR processed integrated dataset was able to unveil additional significant gene dependencies and CFE/gene-dependency statistical interactions compared to either one of the Broad or Sanger (individual) datasets.
Performing systematic biomarker analysis using CFEs on cell lines from individual tissue lineages unveiled 52 additional significant associations in the integrated dataset (when considering only CFE/gene-dependency pairs testable in the individual datasets at 1% FDR) with respect to those using the Sanger dataset alone, and 68 with respect to the Broad dataset (Supplementary Table 6). Examples included decreased dependency on MDM2 in TP53 mutant Lung cell lines for the Sanger dataset, and increased dependency on STAG1 in STAG2 mutated Central Nervous System cancer cell lines for the Broad dataset (Figure 6c). Furthermore, 19 tissue-specific significant associations identified in the integrated dataset were tested but not found significant in either the Broad or the Sanger dataset (Figure 6d).

Figure 6: Advantages of an integrated dataset. a.
Recall of essential gene sets for the integrated dataset, across different tiers, compared to two previously published gene sets (Behan and Hart). b. Proportion of genes in the common essential gene sets that are constitutively not expressed across the panel of cell lines and therefore likely to be false positive results. c. Examples of significant associations between genes and features, found in the integrated dataset compared to the individual dataset. d. Examples of significant associations found in the integrated dataset that were not significant in either of the individual datasets. e. The boxplots contain 50 random samples of between 5% and 90% of the 168 overlapping cell lines (number of cell lines in each sample indicated on the x-axis). For each sample the Pearson correlation of the DPGs following ComBat correction compared to the integrated dataset was calculated for each pre-processing method. f. The average silhouette width (ASW) for each downsampled dataset was calculated using the institute of origin as the cluster label. An ASW close to zero indicates near-random clustering, meaning the samples do not cluster by the origin of the screen and batch effects have been removed.

Sample size requirements for efficient data integration

To further increase the coverage of a cancer dependency map, new CRISPR-Cas9 screens should be integrated into the existing datasets as they are generated. To aid in this integration we estimated the minimum number of overlapping cell lines that should be screened to efficiently calculate and correct batch effects. We performed a downsampling analysis on the 168 cell lines screened at both Sanger and Broad, ranging from 5% to 90%, and used the obtained subset of cell lines to estimate and correct batch effects using ComBat. Following this, for each cell line DPG generated at either institute, we computed the Pearson correlation with the corresponding DPG obtained after batch correction using all 168 overlapping cell lines (Figure 6e). We found a high degree of correlation between datasets at all levels of downsampling, with the minimum of 8 samples still reducing batch effects when compared to no batch correction (N = 0) (Supplementary Figure 4c). We next evaluated the batch correction using the average silhouette width (ASW) of the clustering induced by the institute of origin of the cell lines as a measure of the extent to which cell lines from the same institute clustered together. As expected, as the number of samples used to estimate and correct the batch effect decreases, the DPGs increasingly cluster by the batch of origin (Figure 6f). The ASW and Pearson correlation metrics both showed clear convergence with increasing sample size, and at the same rate. Given the convergence of these metrics, the results showed that the 168 overlapping cell lines used were sufficient to maximise the batch correction using ComBat. Further, the downsampling analysis showed that convergence was reached at 90 cell lines and that between 30 and 40 cell lines would be sufficient to provide a batch corrected dataset that is highly correlated (over 0.995) with that obtained when estimating and correcting batch effects using more than 90 cell lines. The 168 overlapping cell lines contained cell lines from 16 different lineages. To investigate the impact of lineage composition of the cell lines on the batch correction we also used a single lineage to estimate the batch effects. In the overlapping cell lines the Lung lineage had the most cell lines (28 in total). We subsampled the Lung cell lines to include 8, 17 or 25 cell lines (Supplementary Figure 4d,e) and found little difference in performance between using a single lineage and a mixture of lineages, indicating that lineage composition is not a major factor for estimating batch effects.

Discussion

The integration of data from different high-throughput functional genomics screens is becoming increasingly important in oncology research to adequately represent the diversity of human cancers. Integrating CRISPR-Cas9 screens performed independently and/or using distinct experimental protocols requires correction and benchmarking strategies to account for technical biases, batch effects and differences in data-processing methods. Here, we proposed a strategy for the integration of CRISPR-Cas9 screens and evaluated methods accounting for biases within and between two dependency datasets generated at the Broad and Sanger institutes. Our results show that established batch correction methods can be used to adjust for linear and non-linear study-specific biases. Our analyses and assessment yielded two final integrated datasets of cancer dependencies across 908 cell lines. In contrast to existing databases of CRISPR-Cas9 screens [39,40], our integrated datasets are corrected for batch effects, allowing for their joint analysis.
Following integration, dependency profiles of cell lines from the same tissue lineage and cancer specific subtypes show good concordance. Our integrated datasets cover a greater number of genetic dependencies, and the increased diversity of screened models allows additional associations between biomarkers and dependencies to be identified. The integrated datasets were the output of two orthogonal pre-processing methods, CRISPRcleanR and CERES. The use-case analysis showed that CERES (which borrows information across screens) yields a final dataset better able to identify previously known essential and non-essential genes and to cluster cell lines by lineage. In contrast, CRISPRcleanR (a per-sample method) was better able to detect associations between selective dependencies and potential biomarkers, and had better recall of known oncogenic addictions. Therefore, results from both processing methods provide the best overall data-driven functional Cancer Dependency Map. The data integration strategies and sample size guidelines outlined here can be used with future and additional CRISPR-Cas9 datasets to increase coverage of cancer dependencies. This will be important for oncological functional genomics, for the identification of novel cancer therapeutic targets, and for the definition of a global cancer dependency map. Further, as library design improves [24,41,42] we would expect the coverage and accuracy of the integrated datasets to also improve.
Data availability

The final integrated datasets are available for download at https://figshare.com/projects/Integrated_CRISPR/78252. The data will also be made accessible through the DepMap (https://depmap.org) and Score (https://score.depmap.sanger.ac.uk) web portals in early 2021.

Code availability

Scripts and software packages implementing the integration pipeline described in this manuscript and needed to reproduce results and figures are available on GitHub at https://github.com/DepMap-Analytics/IntegratedCRISPR with data sources available on Figshare: https://figshare.com/projects/Integrated_CRISPR/78252.

Acknowledgments

This work was partially funded by Open Targets [project OTAR0255] and by the Wellcome Trust [grant 206194]. We thank Leo Parts for a number of insightful discussions.

Author Contributions

CP conceived the study, designed, implemented and performed analyses, assembled figures, curated data, wrote the manuscript. JMD conceived the study, designed, implemented and performed analyses, assembled figures, and contributed to manuscript writing. IB contributed to pipeline implementation. EG performed analyses, assembled figures, revised the manuscript. HN assembled figures, revised the manuscript. EK, DvdM, AB, HL, PJ contributed to data curation. JMM, MJG, and AT revised the manuscript and contributed to study supervision. FI conceived the study, designed analyses, contributed to figure production, wrote the manuscript, acquired funds and supervised the study.
Competing interests

MJG and FI receive funding from Open Targets, a public-private initiative involving academia and industry. MJG receives funding from AstraZeneca and performs consultancy for Sanofi. FI performs consultancy for the joint CRUK - AstraZeneca Functional Genomics Centre. AT is a consultant for Tango Therapeutics and Cedilla Therapeutics. JMD, JM and AT receive funding from the Cancer Dependency Map Consortium, but no consortium member was involved in or influenced this study. All the other authors declare no competing interests.

Methods

Preprocessing data

Sanger data processed with CRISPRcleanR were obtained from the Score website (https://score.depmap.sanger.ac.uk/). The CRISPRcleanR corrected counts were used as input into JACKS, for the CCR-JACKS processing method. The downloaded raw counts and copy number profiles for the Sanger dataset were processed with CERES [20]. The Broad CERES scores (unscaled gene effect, version 20Q2) were downloaded from the Broad DepMap portal [20]. The raw counts for the Broad 20Q2 data were processed with CRISPRcleanR and the CRISPRcleanR corrected counts processed with JACKS. Gene names were matched across the Broad and Sanger datasets by updating both to the current version of HUGO gene symbols from the HGNC website.
Missing entries were mean imputed for the principal component removal and then re-assigned as NA in the final matrix. Cell lines processed by both CERES and CRISPRcleanR were used for analysis. Tissue annotations for each cell line were obtained from the Cell Model Passports (https://cellmodelpassports.sanger.ac.uk/) [43].

Batch correction pipelines

The dependency profiles across genes (DPGs) for overlapping cell lines from each institute were first quantile normalized using the preprocessCore package in R [44]. Screen quality adjustments were made by fitting a spline to the average gene fold change across cell line DPGs. Each DPG was then adjusted to remove the difference between the fitted spline and the diagonal. The overlapping cell lines were then batch corrected using three different methods. A standard least squares model was fitted in R. The ComBat correction was performed using the sva package in R [45].

Batch correction pipelines' assessment and weighted Pearson correlation metric

Cell lines' rank neighborhoods were based on a weighted Pearson correlation (wPearson) metric. The weight of each gene was defined as the absolute mean (over the Broad and Sanger datasets) of the skewness of its dependency signal across the 168 overlapping cell lines. Using skewness upweights genes with a variable and sufficiently selective fitness profile, whilst downweighting those that show weak or no signal, or unselective dependencies. Then, for each query DPG, we ranked all the other DPGs in decreasing order of their similarity to the query, according to the wPearson scores. For each position k in the resulting rank we then defined a k-neighborhood of the query DPG, composed of all the other DPGs whose rank position was ≤ k. Finally we determined the number of cell line DPGs that
had the DPG derived from screening the same cell line in the other dataset (a matching DPG) in its k-neighborhood. The final rank for each cell line was defined as the minimum of two ranks: the rank obtained when comparing the DPG for that cell line from the Broad dataset to all DPGs, and the rank obtained when comparing the DPG for that cell line from the Sanger dataset to all DPGs.

Analysis of Principal Components

The first two principal components (PCs) were extracted from ComBat corrected CRISPRcleanR data using the prcomp function in R. The top 500 genes (according to the absolute value of their PC loadings) were selected for enrichment analysis. The gene lists were used as input into the GSEA website (https://www.gsea-msigdb.org/) and were tested against the Gene Ontology Biological Processes, Hallmark and Canonical Pathway databases. The top 10 significantly enriched (q-value < 0.05) gene sets were downloaded from the website.

Batch correction extended to 908 cell lines

The ComBat estimates (pooled mean, variance, and empirical Bayes adjustments for mean and standard deviation) were computed for each batch based on the analysis of the 168 cell lines common to both initial datasets.
The ComBat correction using these estimates was then applied to all screens, i.e. the union of the two initial datasets. In particular, each individual cell line DPG was shifted and scaled gene-wise using the batch correction vectors output by ComBat. Further adjustments were then applied to all screens, including quantile normalization and the removal of either the first principal component of the joint datasets or the first two. Finally, DPGs for overlapping cell lines passing a similarity threshold (detailed below) were averaged.

Across the three pre-processing methods, the proportion of cell lines that matched their counterparts exactly after ComBat correction ranged from 51% to 86% (Figure 3b), suggesting that under all pre-processing methods there remained cell lines whose DPGs diverged between studies. For each of the cell lines that matched their counterpart as the first neighbor, we considered their distances (1 - wPearson) as a measure of the variability in distance profiles between DPGs of the same cell line across institutes. We called divergent DPGs those with a distance greater than the 95th percentile of distances from matching cell lines. For 16 cell lines with divergent DPGs across all three processing methods, we selected the DPG from the screen with the highest quality to be included in the integrated datasets. As a quality metric we used the Null-normalized mean difference (NNMD, defined in the
main text) and took its consensual value across the three datasets (resulting from applying CERES, CCR-JACKS and CRISPRcleanR).

Agreement between dependency profile clusterings and cell line tissue labels

We selected the 500 genes with the highest variance in the CERES ComBat integrated dataset and, for each pre-processing and batch-correction method, performed 100 repeated k-means clusterings of the cell lines using these high-variance genes. For each clustering, we calculated the adjusted mutual information between the obtained clusters and the cell line tissue labels, as specified in the annotation provided by the sample_info file of the DepMap_public_20Q2 dataset [20], using the scikit-learn Python function adjusted_mutual_info_score (https://scikit-learn.org/stable/).

Recall of known gene relationships

We assembled a set of functionally related gene pairs using paralogs identified by EnsemblCompara [33], protein-protein interactions identified by Li et al. [34], and CORUM complex co-memberships [35]. For a given dataset, for each pair of related genes, we calculated the Pearson correlation coefficient between those genes' dependency scores across cell lines. We then binned each gene that appeared in the list of known gene relationships according to its mean gene score, using 20 equally spaced bins. For each pair in the set of related gene pairs, we chose one gene as the query gene and replaced its related partner with another randomly selected gene of similar mean score, i.e. belonging to the same bin, excluding genes known to be related to the query gene. We calculated Pearson's correlation coefficients between these randomly selected gene pairs to generate a null distribution, from which we calculated empirical p-values and Benjamini-Hochberg FDRs for known related gene pairs.
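The bin-matched null described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (pearson, assign_bins, null_correlations, empirical_p), the dictionary layout of the scores, and the +1 pseudocount in the empirical p-value are all assumptions made for illustration, and the one-sided test direction is likewise assumed.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def assign_bins(scores, n_bins=20):
    """Map each gene to one of n_bins equally spaced bins of its mean score."""
    means = {g: mean(v) for g, v in scores.items()}
    lo, hi = min(means.values()), max(means.values())
    width = (hi - lo) / n_bins or 1.0  # guard against identical means
    return {g: min(int((m - lo) / width), n_bins - 1) for g, m in means.items()}

def null_correlations(scores, related_pairs, bins, rng):
    """For each related pair, swap the partner for a random same-bin gene
    that is not known to be related to the query gene."""
    related = {}
    for a, b in related_pairs:
        related.setdefault(a, set()).add(b)
        related.setdefault(b, set()).add(a)
    null = []
    for query, partner in related_pairs:
        candidates = [g for g in scores
                      if bins[g] == bins[partner]
                      and g != query and g not in related[query]]
        if candidates:
            null.append(pearson(scores[query], scores[rng.choice(candidates)]))
    return null

def empirical_p(observed, null):
    """One-sided empirical p-value; the +1 pseudocount is an assumption."""
    return (1 + sum(n >= observed for n in null)) / (1 + len(null))
```

In use, `null_correlations` would be called with `random.Random(seed)` for reproducibility, and the resulting p-values would then be adjusted with the Benjamini-Hochberg procedure.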
Ensuring that the pairs of genes used in the null distribution have similar distributions of mean gene effect as the pairs of known related genes is necessary because variable screen quality can produce a high but artifactual correlation between any pair of common essential genes, and CORUM is highly biased towards common essentials. This is discussed further in the comparisons of batch corrections in Dempster et al. [29].

Unexpressed false positives

We defined a gene as unexpressed in a cell line if the log2(transcripts per million + 1) of its DepMap expression was less than 0.01 [46]. Any score of an unexpressed gene in a cell line was called a false positive if it fell in the bottom 15% of gene scores for that cell line.

Identifying selective dependencies

NormLRT and the likelihood of the normal distribution were calculated in R using the MASS package [47]. For the skew t-distribution, the st.mple function from the sn package was used to calculate the likelihood. If the fitting procedure failed, different degrees of freedom were used iteratively until a solution was found. The degrees of freedom used, in order, were 2, 5, 10, 25, 50 and 100.
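The fallback over degrees of freedom described above amounts to a simple retry loop, sketched here in Python under stated assumptions. The names fit_with_fallback, toy_fit and norm_lrt are hypothetical; the fitting in the text is done in R with st.mple from the sn package, which this sketch does not reproduce. The likelihood-ratio form used for NormLRT (twice the gain in log-likelihood over the normal fit) is also an assumption, since the statistic is not defined in this section.

```python
def fit_with_fallback(fit, data, dfs=(2, 5, 10, 25, 50, 100)):
    """Try each degrees-of-freedom value in order and return (df, result)
    for the first fit that does not raise an error."""
    for df in dfs:
        try:
            return df, fit(data, df)
        except (ValueError, ArithmeticError, RuntimeError):
            continue  # this setting failed; move on to the next one
    raise RuntimeError("no degrees-of-freedom setting produced a valid fit")

def norm_lrt(loglik_skew_t, loglik_normal):
    """Assumed form: twice the log-likelihood gain of the skew-t fit
    over the normal fit."""
    return 2.0 * (loglik_skew_t - loglik_normal)

def toy_fit(data, df):
    """Hypothetical stand-in for a skew-t fit: fails for small df and
    otherwise returns a made-up log-likelihood."""
    if df < 10:
        raise ValueError("fit did not converge")
    return -12.5

df_used, loglik = fit_with_fallback(toy_fit, [0.1, -0.4, 0.3])
```

Passing the fitting routine in as an argument keeps the retry logic independent of whichever skew-t implementation is available.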

Systematic association test between molecular features and gene dependencies

We performed a systematic two-sample unpaired Student's t-test (with the assumption of equal variance between compared populations) to assess the differential essentiality of each gene across a dichotomy of cell lines defined by the status (present/absent) of each CFE in turn. We tested genes whose NormLRT values were greater than 200 in any integrated dataset. From these tests, we obtained p-values against the null hypothesis that the two compared populations had an equal mean, with the alternative hypothesis indicating an association between the tested CFE/gene-dependency pair. P-values were corrected for multiple hypothesis testing using Benjamini-Hochberg (method 'fdr' using the p.adjust function in R). We also estimated the effect size of each tested association using Cohen's Delta (ΔFC), i.e. the difference in population means divided by their pooled standard deviation.

Evaluating known selective dependencies

A table of all annotated oncogene variants was downloaded from OncoKB [32]. The table was filtered first for genes that were (likely) oncogenic and alterations that were (likely) gain-of-function or switch-of-function. For each alteration, the DepMap public 20Q2 [20] mutation and fusion calls were used to identify which cell lines had the alteration. These cell lines were treated as positive controls for the gene in question, with all other cell lines treated as negative controls. Only oncogenes with at least one positive cell line were retained. For each integrated dataset, we calculated the ROC AUC between all positive oncogene-cell line pairs and negative pairs. Then, for each oncogene with at least two positive cell lines, we calculated the NNMD between its positive and negative cell lines.
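The statistics used in the systematic association test above can be sketched in plain Python. This is an illustrative sketch, not the authors' R code: the function names are hypothetical, the t statistic is computed without its p-value (which would come from the t distribution with n_x + n_y - 2 degrees of freedom in a statistics library), and benjamini_hochberg is a hand-written version of the standard step-up adjustment.

```python
from statistics import mean, variance

def pooled_sd(x, y):
    """Pooled standard deviation of two samples (equal-variance assumption)."""
    nx, ny = len(x), len(y)
    return (((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)) ** 0.5

def student_t(x, y):
    """Two-sample unpaired Student's t statistic with pooled variance."""
    return (mean(x) - mean(y)) / (pooled_sd(x, y) * (1 / len(x) + 1 / len(y)) ** 0.5)

def cohens_d(x, y):
    """Effect size: difference in means divided by the pooled standard deviation."""
    return (mean(x) - mean(y)) / pooled_sd(x, y)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Here `x` and `y` would hold the dependency scores of one gene in cell lines with and without a given CFE, and the adjustment would be applied to the p-values of all tested CFE/gene pairs together.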

Identification of common essential genes via the 90th Percentile method

The 90th percentile method [27] finds for each gene the cell line on the boundary of its 90th percentile least dependent cell lines. It then calculates the rank of that gene in that cell line, by sorting all the genes based on their dependency score in increasing order. A mixture of two normal distributions is then fitted to the rank positions of all genes. Those genes with ranks below the crossover point of these two distributions are labeled as common essentials.

ADaM method

Binary depletion matrices for the integrated datasets were calculated as outlined in the next section and used with the ADaM method as described in Behan et al. [2]. The ADaM method determines the number of cell lines dependent on a gene required to call that gene a common essential. The number of cell lines is calculated by maximizing the tradeoff between true positive rate (using a set of known prior essential genes) and the deviance from the null expected rate (calculated using random permutations of the binary depletion matrix). Common essential genes were identified for each tissue separately (according to the cell line annotation from the Cell Model Passports [43]) and were then used as input into ADaM to determine pan-cancer common essential genes.
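The ranking step of the 90th percentile method described above can be sketched as follows. This is a minimal illustration under assumptions, not the ADaM2 implementation: the function name, the nested-dictionary input, the requirement of a complete score matrix, and the index convention used to pick the boundary cell line are all hypothetical choices. The subsequent fit of a two-component normal mixture is not shown.

```python
def ninetieth_percentile_ranks(scores, quantile=0.9):
    """scores: dict gene -> dict cell_line -> dependency score
    (lower = more dependent); assumes no missing values.

    Returns dict gene -> rank of that gene within its boundary cell line
    (0 = most depleted gene in that cell line)."""
    genes = list(scores)
    cell_lines = list(next(iter(scores.values())))
    # Rank of every gene within each cell line, in increasing score order.
    rank_in_line = {}
    for cl in cell_lines:
        ordered = sorted(genes, key=lambda g: scores[g][cl])
        rank_in_line[cl] = {g: r for r, g in enumerate(ordered)}
    result = {}
    for g in genes:
        # Cell lines ordered from most to least dependent on this gene.
        by_dependency = sorted(cell_lines, key=lambda cl: scores[g][cl])
        # Boundary cell line; this index convention is an assumption.
        boundary = by_dependency[int(quantile * (len(cell_lines) - 1))]
        result[g] = rank_in_line[boundary][g]
    return result

toy_scores = {
    "A": {"c1": -2.0, "c2": -1.8, "c3": -1.9},  # depleted everywhere
    "B": {"c1": -2.5, "c2": 0.1, "c3": 0.0},    # depleted in one line only
    "C": {"c1": 0.2, "c2": 0.3, "c3": 0.1},     # never depleted
}
toy_ranks = ninetieth_percentile_ranks(toy_scores)
```

A gene that stays near the top of the ranking even in its boundary cell line, like gene A in the toy data, is the kind of gene the method is designed to label as a common essential.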

Binary depletion calls

Binary depletion calls were computed by considering each cell line DPG as a rank-based classifier of essential/non-essential genes [11] (with gene rank positions determined by their fitness effect, i.e. the average depletion fold-change of the abundance of targeting single guide RNAs at the end of the assay with respect to plasmid counts). The fitness effect threshold was then fixed as that corresponding to the largest rank position r guaranteeing a false discovery rate (FDR) < 5%, when the predicted essential genes are those with a rank position ≤ r. This allowed us to assign to each gene in each cell line, in each of the two datasets, a binary dependency score.

To identify significantly depleted genes for a given cell line at a 5% FDR, we ranked all the genes in the cell line DPG in increasing order based on their depletion log fold-changes. We used the ranked list to calculate the precision curve using a set of prior known essential (E) and non-essential (N) genes, respectively, derived from Hart et al. [11]. To estimate the rank position corresponding to the 5% FDR threshold we calculated, for each rank position k, a set of predicted essential genes P(k) = {s ∈ E ∪ N : r(s) ≤ k}, with r(s) indicating the rank position of s, and the corresponding positive predictive value (or precision) PPV(k) as:

PPV(k) = |P(k) ∩ E| / |P(k)|

We then determined the largest rank position k* with PPV(k*) ≥ 0.95 (equivalent to an FDR ≤ 0.05). The 5% FDR logFC threshold F* was defined as the logFC of the gene s such that r(s) = k*. We called all genes with a logFC < F* as significantly depleted at 5% FDR. Binary dependency matrices were defined as gene by cell line matrices with non-null entries corresponding to significant dependency genes at 5% FDR, for each cell line, i.e. column.

Positive controls for common essentials

To generate sets of prior known common essential genes we downloaded gene sets from MSigDB (v7.2) using the R package qusage. The gene sets used from KEGG were KEGG_SPLICEOSOME, KEGG_RIBOSOME, KEGG_PROTEASOME, KEGG_RNA_POLYMERASE and KEGG_DNA_REPLICATION. For the histones gene set we combined two Reactome gene sets, REACTOME_HATS_ACETYLATE_HISTONES and REACTOME_HDACS_DEACETYLATE_HISTONES, as well as the curated histones gene set from [2].

Negative controls for common essentials

We compiled a set of negative controls for the common essential genes as those genes that were not expressed across all cell lines. We defined a gene as unexpressed across the panel of cell lines using the log2(transcripts per million + 1) of its CCLE expression [20] and the 90th percentile method. The input into the ADaM2 package (available at https://github.com/DepMap-Analytics/ADAM2) performing the 90th percentile method was -1*log2(TPM+1), to ensure correct ranking. A gene defined as constitutively unexpressed was therefore one that was still lowly expressed in its highly ranked (90th percentile) most expressed cell line.
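The 5% FDR thresholding used for the binary depletion calls above can be sketched in Python. This is a minimal sketch under stated assumptions rather than the authors' code: the function names (fdr_threshold, binary_calls), the dictionary input, and the toy gene names are hypothetical. It follows the definitions given in the text: P(k) contains only reference genes (E or N) with rank ≤ k, k* is the largest rank with PPV(k*) ≥ 0.95, and genes with a logFC strictly below F* are called depleted.

```python
def fdr_threshold(logfc, essential, nonessential, min_ppv=0.95):
    """Return (k_star, f_star): the largest rank position whose PPV is at
    least min_ppv, and the log fold-change of the gene at that rank.

    logfc: dict gene -> depletion log fold-change for one cell line."""
    ranked = sorted(logfc, key=logfc.get)  # most depleted gene first
    n_essential = n_predicted = 0
    k_star = f_star = None
    for k, gene in enumerate(ranked, start=1):
        if gene in essential:
            n_essential += 1
            n_predicted += 1
        elif gene in nonessential:
            n_predicted += 1
        # PPV(k) = |P(k) ∩ E| / |P(k)|, defined once P(k) is non-empty.
        if n_predicted and n_essential / n_predicted >= min_ppv:
            k_star, f_star = k, logfc[gene]
    return k_star, f_star

def binary_calls(logfc, f_star):
    """1 for genes with logFC strictly below the threshold, else 0."""
    if f_star is None:
        return {g: 0 for g in logfc}
    return {g: int(v < f_star) for g, v in logfc.items()}

toy_logfc = {"e1": -3, "e2": -2.5, "x": -2, "n1": -1.5, "e3": -1, "n2": 0.2}
toy_threshold = fdr_threshold(toy_logfc, {"e1", "e2", "e3"}, {"n1", "n2"})
toy_calls = binary_calls(toy_logfc, toy_threshold[1])
```

Because the comparison with F* is strict, the gene sitting exactly at rank k* is not itself called depleted in this sketch, which mirrors the wording of the text.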

Downsampling for batch correction sample sizes

We downsampled the overlapping cell lines 50 times at different levels between 5% and 90%. Random samples were generated using probabilities of selecting a cell line based on the relative proportions of each cell line lineage in the overlapping dataset. Using the downsampled set of overlapping cell lines, ComBat was used to calculate the batch adjustment vectors. The batch adjustment vectors were then applied to all 1,074 cell lines. The correlation between a cell line's fold changes batch corrected using the downsampled datasets and those batch corrected using the full 168 overlapping cell lines was calculated and compared to the correlation with no batch correction. To evaluate the batch correction we also used the average silhouette width as a measure of clustering. We calculated the average silhouette width for each batch corrected dataset (using samples of the overlapping cell lines) using the institute of origin as the cluster label. The average silhouette width is 1 for perfect clustering (or complete separation of cell lines by the institute of origin), with 0 indicating random performance of the clusters.

References

1. Prasad, V. Perspective: The precision-oncology illusion. Nature 537, S63 (2016).
2. Behan, F. M. et al. Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature 568, 511–516 (2019).
3. Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564–576.e16 (2017).
4. McDonald, E. R., 3rd et al. Project DRIVE: A Compendium of Cancer Dependencies and Synthetic Lethal Relationships Uncovered by Large-Scale, Deep RNAi Screening. Cell 170, 577–592.e10 (2017).
5. Shalem, O. et al. Genome-scale CRISPR-Cas9 knockout screening in human cells. Science 343, 84–87 (2014).
6. Koike-Yusa, H., Li, Y., Tan, E.-P., Velasco-Herrera, M. D. C. & Yusa, K. Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. Nat. Biotechnol. 32, 267–273 (2014).
7. Wang, T., Wei, J. J., Sabatini, D. M. & Lander, E. S. Genetic screens in human cells using the CRISPR-Cas9 system. Science 343, 80–84 (2014).
8. Steinhart, Z. et al. Genome-wide CRISPR screens reveal a Wnt-FZD5 signaling circuit as a druggable vulnerability of RNF43-mutant pancreatic tumors. Nat. Med. 23, 60–68 (2017).
9. Shi, J. et al. Discovery of cancer drug targets by CRISPR-Cas9 screening of protein domains. Nat. Biotechnol. 33, 661–667 (2015).
10. Tzelepis, K. et al. A CRISPR Dropout Screen Identifies Genetic Vulnerabilities and Therapeutic Targets in Acute Myeloid Leukemia. Cell Rep. 17, 1193–1205 (2016).
11. Hart, T. et al. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell 163, 1515–1526 (2015).
12. Meyers, R. M., Bryan, J. G., McFarland, J. M. & Weir, B. A. Computational correction of
copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nature (2017).
13. Wellcome Sanger Institute. Cancer Dependency Map. https://depmap.sanger.ac.uk/.
14. Broad Institute of Harvard and MIT. Cancer Dependency Map. https://depmap.org/.
15. Feng, F. Y. & Gilbert, L. A. Lethal clues to cancer-cell vulnerability. Nature 568, 463–464 (2019).
16. Dempster, J. et al. Agreement between two large pan-cancer genome-scale CRISPR knock-out datasets. Nat. Commun. In press.
17. Iorio, F. et al. Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. BMC Genomics 19, 604 (2018).
18. Allen, F. et al. JACKS: joint analysis of CRISPR/Cas9 knockout screens. Genome Res. 29, 464–471 (2019).
19. Project Score. https://score.depmap.sanger.ac.uk/.
20. DepMap, B. DepMap 20Q2 Public. (2020) doi:10.6084/M9.FIGSHARE.12280541.V4.
21. Project Achilles. https://figshare.com/articles/DepMap_19Q3_Public/9201770.
22. Aguirre, A. J. et al. Genomic Copy Number Dictates a Gene-Independent Cell Response to CRISPR/Cas9 Targeting. Cancer Discov. 6, 914–929 (2016).
23. Gonçalves, E. et al. Structural rearrangements generate cell-specific, gene-independent CRISPR-Cas9 loss of fitness effects. Genome Biol. 20, 27 (2019).
24. Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol. 32, 1262–1267 (2014).
25. Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
26. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
27. Dempster, J. M. et al. Agreement between two large pan-cancer CRISPR-Cas9 gene dependency data sets. Nat. Commun. 10, 5817 (2019).
28. Lagziel, S., Lee, W. D. & Shlomi, T. Inferring cancer dependencies on metabolic genes from large-scale genetic screens. BMC Biol. 17, 37 (2019).
29. Dempster, J. M., Rossen, J., Kazachkova, M. & Pan, J. Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. bioRxiv (2019).
30. Iorio, F. et al. A Landscape of Pharmacogenomic Interactions in Cancer. Cell 166, 740–754 (2016).
31. Chakravarty, D. et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precis. Oncol. 2017, (2017).
32. OncoKB. All Annotated Variants. OncoKB.org http://oncokb.org/api/v1/utils/allAnnotatedVariants (2020).
33. Aken, B. L. et al. Ensembl 2017. Nucleic Acids Res. 45, D635–D642 (2017).
34. Li, T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods 14, 61–64 (2017).
35. Ruepp, A. et al. CORUM: the comprehensive resource of mammalian protein complexes--2009. Nucleic Acids Res. 38, D497–501 (2010).
36. Hart, T. et al. Evaluation and Design of Genome-Wide CRISPR/SpCas9 Knockout Screens. G3 7, 2719–2727 (2017).
37. Kanehisa, M. et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36, D480–4 (2008).
38. Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).
39. Lenoir, W. F., Lim, T. L. & Hart, T. PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res. 46, D776–D780 (2018).
40. Rauscher, B., Heigwer, F., Breinig, M., Winter, J. & Boutros, M. GenomeCRISPR - a database for high-throughput CRISPR/Cas9 screens. Nucleic Acids Res. 45, D679–D686 (2017).
41. Gonçalves, E., Thomas, M., Behan, F. M., Picco, G. & Pacini, C. Minimal genome-wide human CRISPR-Cas9 library. bioRxiv (2019).
42. Elmentaite, R., Noell, G., Turner, G., Iyer, V. & Parts, L. Minimized double guide RNA libraries enable scale-limited CRISPR/Cas9 screens. bioRxiv (2019).
43. van der Meer, D. et al. Cell Model Passports - a hub for clinical, genetic and functional datasets of preclinical cancer models. Nucleic Acids Res. 47, D923–D929 (2019).
44. Bolstad, B. M. preprocessCore: A collection of pre-processing functions. R package version 1 (2016).
45. Leek, J. T. et al. sva: Surrogate Variable Analysis. R package version 30 (2017).
46. DepMap, B. DepMap 19Q4 Public. (2020) doi:10.6084/m9.figshare.11384241.v2.
47. Ripley, B. et al. Package ‘mass’. Cran R 538, (2013).

bioRxiv preprint doi: https://doi.org/10.1101/2020.05.22.110247; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
10_1101-2020_08_13_249839 ---- Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs

Tsung-Yu Lu 1, The Human Genome Structural Variation Consortium, Mark Chaisson 1*
* Corresponding author, mchaisso@usc.edu
1 Department of Quantitative and Computational Biology, University of Southern California, California, USA

Abstract
Variable-number tandem repeats (VNTRs) are composed of consecutive repeats of short segments of DNA with hypervariable repeat count and composition. They occur within protein-coding sequences and are associated with clinical disorders. It has been difficult to incorporate VNTR analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. We solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We developed software to build an RPGG and to use the RPGG to estimate VNTR composition with short reads. We used this to discover VNTRs with length stratified by continental population, and novel expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease.

Introduction
The human genome is composed of roughly 3% simple sequence repeats (SSRs) (I. H. G. S.
Consortium and International Human Genome Sequencing Consortium 2001), loci composed of short, tandemly repeated motifs. These sequences are classified by motif length into short tandem repeats (STRs), with a motif length of six nucleotides or fewer, and variable-number tandem repeats (VNTRs), for repeats of longer motifs. SSRs are prone to hyper-mutability through motif copy number changes due to polymerase slippage during DNA replication (Viguera, Canceill, and Ehrlich 2001). Variation in SSRs is associated with tandem repeat disorders (TRDs) including amyotrophic lateral sclerosis and Huntington’s disease (Gatchel and Zoghbi 2005), and VNTRs are associated with a wide spectrum of complex traits and diseases including attention-deficit disorder, Type 1 Diabetes and schizophrenia (Hannan 2018). While STR variation has been profiled in human populations (Mallick et al. 2016) and used to find expression quantitative trait loci (eQTL) (Fotsing et al. 2019; Gymrek et al. 2016), and variation at VNTR sequences may be detected for targeted loci (Bakhtiari et al. 2018; Dolzhenko et al. 2019), the landscape of VNTR variation in populations and its effects on human phenotypes have not yet been examined genome-wide.

bioRxiv preprint doi: https://doi.org/10.1101/2020.08.13.249839; this version posted January 4, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Large scale sequencing studies including the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015), TOPMed (Taliun et al.
2019) and DNA sequencing by the Genotype-Tissue Expression (GTEx) project (G. Consortium and GTEx Consortium 2017) rely on high-throughput short-read sequencing (SRS), characterized by reads of up to 150 bases. Alignment and standard approaches for detecting single-nucleotide variant (SNV) and indel variation (insertions and deletions of less than 50 bases) using SRS are unreliable in SSR loci (Li et al., n.d.), and the majority of VNTR structural variants (SVs) are missed by SV detection algorithms that use SRS (Chaisson et al. 2019). The full extent to which VNTR loci differ has been made clearer by single-molecule long-read sequencing (LRS) and assembly. LRS assemblies have megabase-scale contiguity and accurate consensus sequences (Koren et al. 2017; Chin et al. 2016) that may be used to detect VNTR variation. Nearly 70% of insertions and deletions greater than 50 bases discovered by LRS assemblies are in STR and VNTR loci (Chaisson et al. 2019), accounting for up to 4 Mbp per genome. Furthermore, LRS assemblies reveal how VNTR sequences differ by kilobases in length and by motif composition (Song, Lowe, and Kingsley 2018).

Here we propose using a limited number of human LRS genomes sequenced for population references and diversity panels (Chaisson et al. 2019; Audano et al. 2019; Seo et al. 2016; Shi et al. 2016) to improve how VNTR variation is detected using SRS. It has been previously demonstrated that VNTR variation discovered by LRS assemblies may be genotyped using SRS (Hickey et al. 2020; Audano et al. 2019). However, the genotyping accuracy for VNTR SVs is considerably lower than the accuracy for genotyping other SVs, owing to the complexity of representing
VNTR variation and mapping reads to SV loci. Most existing tools support only a limited description of the complexity of tandem repeats, using a single motif, as in GangSTR (Mousavi et al. 2019) and adVNTR (Bakhtiari et al. 2018). While ExpansionHunter (Dolzhenko et al. 2019) allows the repeat structure to be defined by a regular expression, it is mostly restricted to STR genotyping and has not been extended to VNTRs. Additionally, GangSTR and adVNTR are designed to estimate the copy number of a repeat unit, which leaves the variation in motif sequences unexplored. Furthermore, traditional genotyping (Chen et al. 2019) tests for the presence of a known variant and does not reveal the spectrum of copy number variation that exists in tandem repeat sequences. Repeat length estimation in tools specialized for tandem repeat genotyping allows more biologically meaningful analyses (Gymrek et al. 2016; Saini et al. 2018; Gymrek et al. 2017).
An alternative approach to the VNTR genotyping problem is to use LRS assemblies as population-specific references that improve SRS read mapping by adding sequences missing from the reference (Du et al. 2019; Shi et al. 2016). Because missing sequences are enriched for VNTRs (Audano et al. 2019), haplotype-resolved LRS genomes may help improve alignment to VNTR regions, and may also serve as a ground truth for developing a model to discover VNTR variation. The hypervariability of VNTRs prevents a single assembly from serving as an optimal reference. Instead, to improve both alignment and genotyping, multiple assemblies may be combined into a pangenome graph (PGG) (Hickey et al. 2020; Eggertsson et al. 2019; Garrison et al. 2018; Chen et al. 2019) composed of sequence-labeled vertices connected by edges such that haplotypes correspond to paths in the graph. Sequences shared between haplotypes are stored in the same vertex, and genetic variation is represented by the structure of the graph. A conceptually similar construct is the repeat graph (Pevzner, Tang, and Tesler 2004), in which sequences repeated multiple times in a genome are represented by the same vertex. Such graphs have been used to encode the elementary duplication structure of a genome (Jiang et al. 2007) and for multiple sequence alignment of repetitive sequences with shuffled domains (Raphael et al. 2004), making them well suited to represent VNTRs that differ in both repeat count and composition. Here we propose the representation of
human VNTRs as a repeat-pangenome graph (RPGG) that encodes both the repeat structure and sequence diversity of VNTR loci (Figure 1c). The most straightforward approach that combines a pangenome graph and a repeat graph is a de Bruijn graph, which was the basis of one of the earliest representations of a pangenome by the Cortex method (Iqbal, Turner, and McVean 2013; Iqbal et al. 2012). The de Bruijn graph has a vertex for every distinct sequence of length k in a genome (k-mer), and an edge connecting every two consecutive k-mers; thus k-mers occurring in multiple genomes, or multiple times in the same genome, are stored in the same vertex. While the Cortex method stores entire genomes in a de Bruijn graph, we construct a separate locus-RPGG for each VNTR and store a genome as the collection of locus-RPGGs. This deviates from the definition of a de Bruijn graph because the same k-mer may be stored in multiple vertices.

We developed a toolkit, Tandem Repeat Genotyping based on Haplotype-derived Pangenome Graphs (danbing-tk), to identify VNTR boundaries in assemblies, construct RPGGs, align SRS reads to the RPGG, and infer VNTR motif composition and length in SRS samples. This enables the alignment of SRS datasets to an RPGG to study the population genetics of VNTR loci, and to associate expression with VNTR variation.
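The de Bruijn graph definition above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not danbing-tk's implementation: the function name `build_locus_graph`, the toy haplotypes, and the small k are all hypothetical, and the sketch is single-stranded rather than bi-directional.

```python
from collections import defaultdict


def build_locus_graph(haplotypes, k):
    """Map each distinct k-mer (vertex) to the set of k-mers that follow it."""
    graph = defaultdict(set)
    for seq in haplotypes:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer]  # create the vertex even if it has no successor
            if i + k < len(seq):
                graph[kmer].add(seq[i + 1:i + k + 1])
    return dict(graph)


# Toy haplotypes (hypothetical): the same flanks around 2 or 4 copies of the
# motif ACG, plus one haplotype carrying a variant motif (ATG).
hap_short = "TT" + "ACG" * 2 + "GG"
hap_long = "TT" + "ACG" * 4 + "GG"
hap_variant = "TT" + "ACG" + "ATG" + "ACG" + "GG"

graph_copies = build_locus_graph([hap_short, hap_long], k=3)
graph_all = build_locus_graph([hap_short, hap_long, hap_variant], k=3)

print(len(graph_copies), len(graph_all))
```

In this toy case the extra motif copies in `hap_long` add no vertices, because they traverse the same cycle again, whereas the variant motif adds three new vertices. This is the sense in which a single graph can hold both repeat structure and sequence diversity.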
Results

Repeat pan-genome graph construction
Our approach to building RPGGs is to de novo assemble LRS genomes and build de Bruijn graphs on the assembled sequences at VNTR loci, using SRS genomes to ensure graph quality. We used public LRS data for 19 individuals with diverse genetic backgrounds, including genomes from individual genome projects (Seo et al. 2016; Zook et al. 2020), structural variation studies (Chaisson et al. 2019), and diversity panel sequencing (Audano et al. 2019) (Figure 1a, Supplementary Table 1). Each genome was sequenced by either PacBio single long read (SLR) or high-fidelity (HiFi) sequencing at between 16- and 76-fold coverage, along with matched 22- to 82-fold Illumina sequencing (Table 1). These data reflect a wide range of technology revisions, sequencing depths, and data types; however, subsequent steps were taken to ensure the accuracy of the RPGG through locus redundancy and SRS alignments. We developed a pipeline that partitions LRS reads by haplotype based on phased heterozygous SNVs and assembles haplotypes separately by chromosome. When available, we used existing telomere-to-telomere SNV and phase data provided by Strand-Seq and/or 10X Genomics (Porubsky et al. 2017; Chaisson et al. 2019), with phase-block N50 size between 13.4 and 18.8 Mb.
For other datasets, long-read data were used to phase SNVs. While these data have lower phase-block N50 (<0.5 - 6 Mb), the individual locus-RPGGs do not use long-range haplotype information and are not affected by phasing switch errors. Reads from each chromosome and haplotype were independently assembled using the Flye assembler (Kolmogorov et al. 2019), giving diploid assembly N50 values of 0.88-14.5 Mb, with the range of assembly contiguity reflecting the diversity of input data. In this study, the number of resolved VNTR loci is a more accurate measurement of useful assembly contiguity than N50 because a disjoint RPGG is generated for each VNTR locus.

An initial set of 84,411 VNTR intervals with motif size >6 bp and length >150 bp and <10 kbp (mean length=420 bp in GRCh38, Methods, Supplementary Table 2) was annotated by Tandem Repeats Finder (TRF) (Benson 1999), and then mapped onto contig coordinates using pairwise contig alignments. Long VNTR loci tended to have fragmented TRF annotations, which can cause erroneous length estimates in downstream analysis and prevent repeat structures from being interpreted as a whole, as in adVNTR-NN (Supplementary Fig. 1). During locus assignment, danbing-tk expands boundaries and merges loci to ensure that the boundaries of all VNTRs are well-defined and harmonized across genomes (Methods) (Figure 1b). In practice, we found that 43,869/84,411 (52%) of the VNTR loci are subject to boundary expansion, with an average expansion size of 539 bp. The number of VNTRs that can be properly annotated ranges from 19,800 to 73,212 depending on the assembly quality, with a final set of 73,582 loci (mean length=652 bp) across 19 genomes (Supplementary Fig. 2). The RPGGs are constructed as disjoint bi-directional de Bruijn graphs of each VNTR locus and 700 flanking bases from the haplotype-resolved assemblies.
In a bi-directional de Bruijn graph, each distinct sequence of length k (k-mer) and its reverse complement map to a vertex, and each sequence of length k+1 connects the vertices to which its two constituent k-mers map. There was little effect on downstream analysis for values of k between 17 and 25, and so k=21 was used for all applications. To remove spurious vertices and edges arising from assembly consensus errors, SRS reads from genomes matching the LRS samples were mapped to the RPGG, and k-mers not mapped by SRS were removed from the graph (average of 264 per locus). Using the number of vertices as a proxy for sampled genetic diversity, we find that 27% (2,102,270 new nodes) of the sequences novel with respect to GRCh38 (7,672,357 nodes) are discovered after the inclusion of 19 genomes, with diversity increasing linearly per genome after the first four genomes are added to the RPGG (8,958,361 nodes, Figure 1c).

The alignment of a read to an RPGG may be defined as the path in the RPGG whose sequence label has the minimum edit distance to the read among all possible paths. We used error-free 150 bp paired-end reads simulated from six genomes (HG00512, HG00513, HG00731, HG00732, NA19238 and NA19239) to evaluate how reads are aligned to the RPGG. While several methods exist to find alignments that do not reuse cycles (Garrison et al.
2018; Rakocevic et al. 2019), alignment with cycles is a more challenging problem, recently solved by the GraphAligner method to map long reads to pangenome graphs (Rautiainen, Mäkinen, and Marschall 2019). With GraphAligner, although >99.99% of the reads simulated from VNTR loci were aligned, 6.03% of reads matched with less than 90% identity, indicating misalignment. We developed an alternative approach tuned for RPGG alignments in danbing-tk (Figure 1d) that realigns all SRS reads within a bam/fastq file to the RPGG in two passes: first by finding locus-RPGGs with a high number (>45 in each end) of k-mers shared with the reads, and next by threading the paired-end reads through the locus-RPGG, allowing for up to two edits (mismatch, insertion, or deletion) and requiring at least 50 matched k-mers per read against the threaded path (Methods). Using danbing-tk, 99.997% of VNTR-simulated reads were aligned with >90% identity. When reads from the entire genome are considered, danbing-tk can map >90% of the reads back to their original VNTR regions for 96.6% of the loci. Misaligned reads from either other VNTR loci or untracked regions target relatively few loci; 3.6% (2,635/73,582) of loci have at least one read misaligned from outside the locus. The graph pruning step is the primary cause of missed alignments, and affects on average 2,772 loci per assembly.
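The first alignment pass described above, selecting locus-RPGGs that share many k-mers with both ends of a read pair, can be sketched as follows. This is an illustrative sketch only: the function names, the toy sequences, and the small values of `K` and `min_shared` are assumptions, and the second pass (threading reads through the graph with up to two edits) is not implemented here. Each k-mer and its reverse complement are collapsed to one canonical string, which mirrors how a bi-directional graph treats the two strands.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by one string."""
    return min(kmer, revcomp(kmer))


def kmer_set(seq, k):
    """Set of canonical k-mers in a sequence."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def shared_kmers(read, locus_kmers, k):
    """Count k-mer positions in the read that are present in the locus."""
    return sum(
        canonical(read[i:i + k]) in locus_kmers
        for i in range(len(read) - k + 1)
    )


def candidate_loci(read1, read2, loci, k, min_shared):
    """Return loci sharing more than min_shared k-mers with BOTH read ends."""
    return [
        name
        for name, kmers in loci.items()
        if shared_kmers(read1, kmers, k) > min_shared
        and shared_kmers(read2, kmers, k) > min_shared
    ]


K = 5  # toy value; the text uses k = 21 and a threshold of >45 per end
locus_a = "ACGTTGCAAGGCTTAGCCATG"  # hypothetical locus sequence
loci = {"A": kmer_set(locus_a, K), "B": kmer_set("T" * 21, K)}

read1 = locus_a[:12]          # end sequenced from the forward strand
read2 = revcomp(locus_a[9:])  # end sequenced from the reverse strand
hits = candidate_loci(read1, read2, loci, K, min_shared=5)
```

Requiring both ends to pass the threshold keeps a pair from being recruited to a locus when only one end resembles it, which is one plausible reading of the ">45 in each end" criterion.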
On real data, danbing-tk required 18.5 GB of memory to map 150-base paired-end reads at 10.1 Mb/sec on 16 cores.

Read-to-graph alignment in VNTR regions
Alignment of SRS reads to the RPGG enables estimation of VNTR length and motif composition. The counts of k-mers in SRS reads mapped to the RPGG are reported by danbing-tk for each locus. For $M$ samples and $L$ VNTR loci, the result of an alignment is $L$ count matrices of dimension $M \times N_i$, where $N_i$ is the number of vertices in the de Bruijn graph on locus $i$, excluding flanking sequences. If SRS reads from a genome were sequenced without bias, sampled uniformly, and mapped without error to the RPGG, the count of a k-mer in a locus mapped by an SRS sample should equal the read depth multiplied by the summed count of that k-mer at the locus in both assembled haplotypes of the same genome. The quality of alignment (aln-$r^2$) and sequencing bias were measured by comparing the k-mer counts from the 19 matched Illumina and LRS genomes (Figure 2a). In total, 44% (32,138/73,582) of loci had a mean aln-$r^2$ ≥ 0.96 between SRS and assembly k-mer counts, and were marked as “valid” loci to carry forward for downstream diversity and expression analysis (Figure 2b). Valid loci had an average length of 341 bp, compared to 657 bp in the entire database (Figure 2c). VNTR loci that did not align well (invalid) were enriched for sequences that map within Alu (21,820), SVA (1,762), and 26,752 other mobile elements (Supplementary Fig. 3); loci with false mapping in the simulation experiment are also enriched in the invalid set (Supplementary Table 3). Specifically, 71.6% (4,297/5,999) of loci with FP mapping and 84.7% (8,065/9,525) of loci with FN mapping are not marked as valid. Loci with false mappings that were retained in the final set have lower, though still reasonable, length-prediction accuracy (0.79 versus 0.82).
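The aln-$r^2$ comparison above can be illustrated with a small sketch. It assumes that aln-$r^2$ is the coefficient of determination of a simple linear fit between assembly and SRS k-mer counts, which equals the squared Pearson correlation; the exact fitting procedure is in the Methods and is not reproduced here. The function name and toy counts are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation, i.e. r^2 of a simple linear fit y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so the fit is undefined; treat as no signal
    return sxy * sxy / (sxx * syy)


# Toy k-mer counts for one locus (hypothetical numbers).
assembly_counts = [1, 2, 4, 8]                      # summed over both haplotypes
ideal_srs = [30 * c for c in assembly_counts]       # unbiased reads at 30x depth
noisy_srs = [2, 1, 4, 3]                            # poorly aligned locus

VALID_THRESHOLD = 0.96  # threshold for "valid" loci quoted in the text
ideal_is_valid = r_squared(assembly_counts, ideal_srs) >= VALID_THRESHOLD
noisy_r2 = r_squared([1, 2, 3, 4], noisy_srs)
```

Under the unbiased-sampling idealization the SRS counts are an exact multiple of the assembly counts and $r^2$ is 1; departures from that line, whether from misaligned reads or sequencing bias, lower the value.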
The complete RPGG on valid loci contains 8,958,361 vertices, in contrast to the corresponding RPGG built only on GRCh38 (repeat-GRCh38), which has 7,672,357 vertices. We validated that the additional vertices in the RPGG are indeed important for accurately recruiting reads pertinent to a VNTR locus, using the CACNA1C VNTR as an example (Figure 2d). It is known that the reference sequence at this locus is truncated compared to the majority of the population (319 bp in GRCh38 versus 5,669 bp averaged across 19 genomes). The limited sequence diversity provided by repeat-GRCh38 at this locus failed to recruit reads that map to paths existing in the RPGG but missing or only partially represented in repeat-GRCh38. A linear fit between the k-mers from mapped reads and the ground truth assemblies shows that there is a 13-fold gain in slope, or measured read depth, when using the RPGG compared to repeat-GRCh38 (Figure 2e). The k-mer counts in the RPGG also correlate better with the assembly k-mer counts than those in repeat-GRCh38 (aln-$r^2$ = 0.992 versus 0.858).
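To make the "gain in slope" comparison concrete, the sketch below fits an ordinary least-squares line of mapped k-mer counts against assembly k-mer counts for two graphs and compares the slopes. The numbers are invented toy values, not data from the CACNA1C locus, and the use of a fit with an intercept is an assumption.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


# Hypothetical counts for four k-mers at one locus.
assembly = [1, 2, 3, 4]              # ground-truth counts from the assembly
full_graph = [30, 60, 90, 120]       # reads recruited by the full graph
reference_only = [3, 6, 9, 12]       # reads recruited by a reference-only graph

slope_full = least_squares_slope(assembly, full_graph)
slope_ref = least_squares_slope(assembly, reference_only)
gain = slope_full / slope_ref
```

Because the slope is the number of mapped k-mer observations per true k-mer copy, it behaves like a measured read depth, so a graph that recruits fewer of the relevant reads shows a proportionally smaller slope.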
New genomes with arbitrary combinations of motifs and copy numbers in VNTRs should still align to an RPGG as long as the motifs are represented in the graph. We used leave-one-out analysis to evaluate alignment of novel genomes to RPGGs and estimation of VNTR length. In each experiment, an RPGG was constructed with one LRS genome missing. SRS reads from the missing genome were mapped to the RPGG, and the estimated locus lengths were compared to the average diploid lengths of corresponding loci in the missing LRS assembly. The locus length is estimated as the adjusted sum of k-mer counts $kms$ mapped from SRS sample $s$: $kms/(cov_s \times \hat{b})$, where $cov_s$ is the sequencing depth of $s$ and $\hat{b}$ is a correction for locus-specific sampling bias (LSB). Because the SRS datasets used in this study during pangenome construction were collected from a wide variety of studies with different biases, there was no consistent LSB in either repetitive or nonrepetitive regions for samples from different sequencing runs (Supplementary Fig. 4-5). However, principal component analysis (PCA) of repetitive and nonrepetitive regions showed highly similar projection patterns (Supplementary Fig. 6), which enabled using LSB in nonrepetitive regions as a proxy for finding the nearest neighbor of LSB in VNTR regions (Supplementary Fig. 7). Leveraging this finding, a set of 397 nonrepetitive control regions was used to estimate the LSB of an unseen SRS sample (Methods), giving a median length-prediction accuracy of 0.82 for 16 unrelated genomes (Figure 3a left, Supplementary Fig. 8). The read depth of a repetitive region correlates with the locus length when aligning short reads to a linear reference
genome. However, estimation of VNTR length from read depth has an accuracy of 0.72 (Figure 3a left). We also compared the performance of length prediction using the RPGG versus repeat-GRCh38, and observed a 58% improvement in accuracy (0.82 versus 0.52, Figure 3a left, Supplementary Fig. 9). The overall error rate of all loci (n=32,138), measured with mean absolute percentage error (MAPE), is also significantly lower when using RPGGs (MAPE=0.19, Figure 3a right) compared with repeat-GRCh38 (0.23, paired t-test $P = 1.7 \times 10^{-31}$) or reference-aligned read depth (0.21, paired t-test $P = 2.9 \times 10^{-31}$). Furthermore, a 61% reduction in error size is observed for the 6,238 loci poorly genotyped (MAPE > 0.4) using repeat-GRCh38 (Figure 3b, MAPE=0.233 versus 0.603).

Profiling VNTR length and motif diversity
To explore the global diversity of VNTR sequences and their potential functional impact, we aligned reads from 2,504 individuals from diverse populations sequenced at 30-fold coverage by the 1000 Genomes Project (1KGP) (Fairley et al. 2020; 1000 Genomes Project Consortium et al. 2015), and from 879 GTEx genomes (G. Consortium and GTEx Consortium 2017), to the RPGG. The fraction of reads from these datasets that align to the RPGG ranges from 1.11% to 1.37%, similar to the matched LRS/SRS data (1.23%).
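The length estimate and error measure used in the leave-one-out analysis can be written out directly. The sketch assumes the estimator is the summed k-mer count divided by the product of sequencing depth and a locus-specific bias term, and that MAPE is the mean of |predicted - true| / true; the function names, the default bias of 1.0, and all numbers are hypothetical.

```python
def estimate_length(kmer_counts, coverage, bias=1.0):
    """Adjusted sum of k-mer counts: kms / (cov_s * b_hat)."""
    return sum(kmer_counts) / (coverage * bias)


def mape(predicted, truth):
    """Mean absolute percentage error, expressed as a fraction."""
    return sum(abs(p - t) / t for p, t in zip(predicted, truth)) / len(truth)


# Toy example: three k-mers at one locus, a 30x sample.
counts = [30, 60, 30]
unbiased = estimate_length(counts, coverage=30)               # no bias correction
corrected = estimate_length(counts, coverage=30, bias=1.25)   # locus over-sampled

# Toy predicted versus assembly-derived lengths for two genomes.
error = mape([110.0, 90.0], [100.0, 100.0])
```

A bias term above 1 means the locus is over-sampled relative to the genome-wide depth, so the corrected estimate is smaller than the uncorrected one.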
PCA on the LSB of both datasets showed the 1KGP and GTEx genomes as separate clusters in both repetitive and nonrepetitive regions (Supplementary Fig. 5), indicating experiment-specific bias that prevents cross-dataset comparisons. Consistent with the finding in the leave-one-out analysis, genomes from the same study cluster together in the PCA plot of LSB, and so within each dataset and locus, k-mer counts from SRS reads normalized by sequencing depth were used to compare VNTR content across genomes.

The k-mer dosage, $kms/cov$, was used as a proxy for locus length to compare tandem repeat variation across populations in the 1KGP genomes. The 1KGP samples contain individuals from African (26.5%), East Asian (20.1%), European (19.9%), Admixed American (13.9%), and South Asian (19.5%) populations. When comparing the average population length to the global average length, 60.8% (19,530/32,138) of loci have differential length between populations (FDR=0.5 on ANOVA P values), with similar distributions of differential length when loci are stratified by the accuracy of length prediction (Figure 4a). Population stratification was calculated using the $V_{ST}$ statistic (Redon et al. 2006) on VNTR length (Figure 4b). Previous studies have used >3 standard deviations above the mean to define highly stratified copy number variants (Sudmant et al. 2015).
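As a reminder of what the $V_{ST}$ statistic measures, the sketch below computes it as $(V_T - V_S)/V_T$, where $V_T$ is the variance over all samples and $V_S$ is the within-population variance averaged with weights proportional to population size. The use of population (rather than sample) variance and the toy dosage values are assumptions of this sketch; the computation actually used is described in the Supplementary Methods.

```python
from statistics import pvariance


def v_st(groups):
    """V_ST = (V_T - V_S) / V_T for a dict of population -> list of values."""
    all_values = [v for values in groups.values() for v in values]
    v_total = pvariance(all_values)
    if v_total == 0:
        return 0.0  # no variation at all, hence no stratification
    v_within = sum(
        len(values) * pvariance(values) for values in groups.values()
    ) / len(all_values)
    return (v_total - v_within) / v_total


# Hypothetical k-mer dosages at one locus in two populations.
stratified = v_st({"POP1": [1.0, 3.0], "POP2": [5.0, 7.0]})
unstratified = v_st({"POP1": [1.0, 3.0], "POP2": [1.0, 3.0]})
```

Since $V_S$ sits in the numerator with a negative sign, noisier length estimates within each population pull $V_{ST}$ toward zero even when the population means differ.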
Under this measure, 785 variants are highly stratified, including 266 that overlap genes; however, this overlap is not significantly enriched (p=0.079, one-sided permutation test). Two of the top five loci ranked by $V_{ST}$ are intronic: a 72 base VNTR in PLCL1 ($V_{ST}$=0.37), and a 148 base locus in SPATA18 ($V_{ST}$=0.35) (Figure 4c,d). These values for $V_{ST}$ are lower than those observed for large copy number variants (Redon et al. 2006) and may be the result of neutral variation. However, this may be affected by the high variance of the length estimate, as $V_{ST}$ decreases as the variance of the copy number/dosage values increases (Supplementary Methods).

VNTR loci that are unstable may undergo hyper-expansion and are implicated as a mechanism of multiple diseases (Hannan 2018). To discover new potentially unstable loci, we searched the 1KGP genomes for evidence of rare VNTR hyper-expansion. Loci were screened for individuals with extreme (>6 standard deviations) variation, and then filtered for deletions or unreliable samples (Methods) to characterize 477 loci as potentially unstable. These loci are inside 115 genes, significantly fewer than the number expected by chance ($p<1\times10^{-5}$, one-sided permutation test; n=10,000). Of these loci, 64 have an individual with > 10 standard deviations above the mean, of which two overlap genes, KCNA2 and GRM4 (Supplementary Fig. 10).

Alignment to an RPGG provides information about motif usage in addition to estimates of VNTR length, because genomes with different motif composition will align to different vertices in the graph. To detect differential motif usage, we searched for loci with a k-mer that was more frequent in one population than another and not simply explained by a difference in locus length, comparing African (AFR) and East Asian (EAS) populations for maximal genetic diversity.
Lasso regression against locus length was used to find the k-mer with the most variance explained (VEX) in EAS genomes, denoted as the most informative k-mer (mi-kmer). Two statistics are of interest when comparing the two populations: the difference in the count of mi-kmers ($kmc_d$) and the difference between proportion of VEX ($r^2_d$) by mi-kmers. $kmc_d$ describes the usage of an mi-kmer in one population relative to another, while $r^2_d$ indicates the degree to which the mi-kmer is involved in repeat contraction or expansion in one population relative to another. We observe that 8,216 loci have significant differences in the usage of mi-kmers between the two populations (two-sided P < 0.01, bootstrap, Supplementary Fig. 11). Among these, the mi-kmers of 1,913 loci are crucial to length variation in the EAS but not in the AFR population (two-sided P < 0.01, bootstrap) (Figure 4e, Supplementary Fig. 11). A top example of these loci, with $r^2$ of at least 0.9 in the EAS population, was visualized with a heatmap of relative k-mer count from both populations, and clearly showed differential usage of cycles in the RPGG (Figure 4f).

Association of VNTR with nearby gene expression

Because the danbing-tk length estimates showed population genetic patterns expected for human diversity, we assessed whether danbing-tk alignments could detect VNTR variation with functional impact.
Genomes from the GTEx project were mapped to the RPGG to discover loci that have an effect on nearby gene expression in a length-dependent manner. A total of 812/838 genomes with matching expression data passed quality filtering (Methods). Similar to the population analysis, the k-mer dosage was used as a proxy for locus length. Methods previously used to discover eQTL using STR genotyping (Fotsing et al. 2019) were applied to the danbing-tk alignments. In sum, 30,362 VNTRs within 100 kb of 45,720 GTEx gene annotations (including genes, lncRNA, and other transcripts) were tested for association, with a total of 149,057 tests and approximately 3.3 VNTRs tested per gene. Using a gene-level FDR cutoff of 5%, we find 346 eQTL (eVNTRs) (Figure 5a), among which 344 (99.4%) discoveries are novel (Supplementary Table 5), indicating that the spectrum of association between tandem repeat variation and expression extends beyond the lengths and the types of SSR considered in previous STR (Mousavi et al. 2019) and VNTR (Bakhtiari et al. 2020) studies. Both positive and negative effects were observed among eVNTRs (Figure 5b).
More eVNTRs were found with a positive effect size than with a negative effect size (200 versus 146, binomial test P = 0.0043), with an average effect of +0.261 (from +0.139 to +0.720) versus −0.247 (from −0.524 to −0.159), respectively. eVNTRs tend to be closer to telomeres relative to all VNTRs (Mann–Whitney U test $P = 5.2\times10^{-5}$, Supplementary Fig. 12). Because many exons contain VNTR sequences, expression measured by read depth should increase with length of the VNTR, and there is a 2.5-fold enrichment of eVNTRs in coding regions, as expected.

The eVNTRs have the potential to yield insight into disease. In one example, an intronic eVNTR at chr5:96,896,863-96,896,963 flanks exon 9 of ERAP2 (Figure 5d, Supplementary Fig. 13). The eVNTR has a -0.52 effect size and was reported across 27 tissues. It colocalizes with a regulatory hotspot with peaks of histone markers, DNase and 40 different ChIP signals. The protein product of ERAP2, or endoplasmic reticulum aminopeptidase 2, is a zinc metalloaminopeptidase involved in Class I MHC mediated antigen presentation and the innate immune response. It has been reported to be associated with several diseases, including ankylosing spondylitis (Wellcome Trust Case Control Consortium et al. 2007) and Crohn’s disease (Franke et al. 2010). Abnormal expansion of the VNTR might increase autoimmune disease risk by reducing ERAP2 expression, leaving longer and more antigenic peptides, yet potentially giving higher fitness against virus infection (Ye et al. 2018). This VNTR is a unique sequence in GRCh38 that is a 101 bp tandem duplication in 17/38 of the haplotypes. Another example is an intergenic VNTR at chr17:46,265,245-46,265,480 that associates with the expression of KANSL1 ~40 kb upstream (Figure 5c, Supplementary Fig. 13). The eVNTR has a maximal effect size of +0.45 and is significant across 40 tissues.
The protein product of KANSL1, or KAT8 regulatory NSL complex subunit 1, is a part of the histone acetylation machinery. Deletion of this gene is linked to Koolen-de Vries syndrome (Koolen et al. 2008), and the locus is associated with Parkinson disease (Witoelar et al. 2017). The eVNTR colocalizes with strong ChIP signals, and the association of this VNTR with the epigenetic landscape warrants further investigation.

Discussion.

Previous commentaries have proposed that variation in VNTR loci may represent a component of undiagnosed disease and missing heritability (Hannan 2010), which has remained difficult to profile even with whole genome sequencing (Mousavi et al. 2019). To address this, we have proposed an approach that combines multiple genomes into a pangenome graph that represents the repeat structure of a population. This is supported by the software danbing-tk and the associated RPGG. We used danbing-tk to generate a pangenome from 19 haplotype-resolved assemblies, and applied it to detect VNTR variation across populations and to discover eQTL. The structure of the RPGG can help to organize the diversity of assembled VNTR sequences with respect to the standard reference.
In particular, 27% of the graph structure is novel after the addition of 19 genomes to the RPGG relative to repeat-GRCh38. Combined with the observation that using the 19-genome RPGG gives a 63% decrease in length prediction error, this indicates that the pan-genomes add detail for the missing variation. With the availability of additional genomes sequenced through the Pangenome Reference Consortium (https://humanpangenome.org/) and the HGSVC (https://www.internationalgenome.org), combined with advanced haplotype-resolved assembly methods (Porubsky et al. 2020), the spectrum of this variation will be revealed in the near future. While we anticipate that eventually the full spectrum of VNTR diversity will be revealed through LRS of the entire 1KGP, the RPGG analysis will help organize analysis by characterizing repeat domains. For example, with our approach, we are able to detect 1,913 loci with differential motif usage between populations, which could be difficult to characterize using an approach such as multiple-sequence alignment of VNTR sequences from assembled genomes.

There are several caveats to our approach. In contrast to other pangenome approaches (Garrison et al. 2018; Rakocevic et al. 2019), danbing-tk does not keep track of a reference (e.g. GRCh38) coordinate system. Furthermore, because it is often not possible to reconstruct a unique path in an RPGG, only counts of mapped reads are reported rather than the order of traversal of the RPGG. An additional caveat of our approach is that
genotype is calculated as a continuum of k-mer dosage rather than discrete units, prohibiting direct calculation of linkage disequilibrium for fine-scale mapping (LaPierre et al., n.d.). Finally, this approach only profiles loci where k-mer counts from reads and assemblies are correlated; loci for which every k-mer appears the same number of times are excluded from analysis (on average 8,058/73,582 per genome).

The rich data provided by danbing-tk and pangenome analysis provide the basis for additional association studies. While most analysis in this study focused on the diversity of VNTR length or association of length and expression, it is possible to query differential motif usage using the RPGG. The ability to detect motifs that have differential usage between populations brings the possibility of detecting differential motif usage between cases and controls in association studies. This can help distinguish stabilizing versus fragile motifs (Braida et al. 2010), or resolve some of the problem of missing heritability by discovering new associations between motif and disease (Song, Lowe, and Kingsley 2018).

Finally, this work is a part of ongoing pangenome graph analysis (Paten et al. 2017; Li, Feng, and Chu 2020), and represents an approach to generating pangenome graphs in loci that have difficult multiple sequence alignments or degenerate graph topologies. Additional methods may be developed to harmonize danbing-tk RPGGs with genome-wide pangenome graphs constructed from other methods.
Figure 1. Sequence diversity of VNTRs in human populations. a, Global diversity of SMS assemblies. b, Dot-plot analysis of the VNTR locus chr1:2280569-2282538 (SKI intron 1 VNTR) in genomes that demonstrate varying motif usage and length. c, Diversity of RPGG as genomes are incorporated, measured by the number of k-mers in the 32,138 VNTR graphs. Total graph size built from GRCh38 and an average genome are also shown. d, danbing-tk workflow analysis. (top) VNTR loci defined from the reference are used to map haplotype loci. Each locus is converted to a de Bruijn graph, from which the collection of graphs is the RPGG. The de Bruijn graphs shown illustrate sequences missing from the RPGG built only on GRCh38. The alignments may be either used to select which loci may be accurately mapped in the RPGG using SRS that match the assemblies (red), or may be used to estimate lengths on sample datasets (blue).

Genome    Continental population  Study  Cov  Assembly N50 (Mb)  Fraction of VNTR annotated  Ancestry
AK1       EAS                     KG     54   0.88               0.840                       Korean
HG00268   EUR                     DP     67   3.51               0.967                       Finnish
HG00512   EAS                     HGSVG  28   8.83               0.995                       Han Chinese
HG00513   EAS                     HGSVG  30   1.57               0.993                       Han Chinese
HG00514   EAS                     HGSVG  31   1.32               0.948                       Han Chinese
HG00731   AMR                     HGSVG  31   2.18               0.995                       Puerto Rican
HG00732   AMR                     HGSVG  16   1.3                0.992                       Puerto Rican
HG00733   AMR                     HGSVG  46   6.88               0.992                       Puerto Rican
HG01352   AMR                     DP     68   5.97               0.992                       Colombian
HG02059   EAS                     DP     76   19.5               0.992                       Vietnamese
HG02106   AMR                     DP     57   0.88               0.640                       Peruvian
HG02818   AFR                     DP     56   0.66               0.802                       Gambian
HG04217   SAS                     DP     60   0.86               0.269                       Telugu
NA12878   EUR                     DP     54   4.67               0.971                       Central European
NA19238   AFR                     HGSVG  23   2.64               0.991                       Yoruba
NA19239   AFR                     HGSVG  35   4.87               0.994                       Yoruba
NA19240   AFR                     HGSVG  49   3.4                0.989                       Yoruba
NA19434   AFR                     DP     62   11                 0.980                       Luhya
NA24385   EUR                     GIAB   54   1.32               0.981                       Ashkenazim

Table 1. Source genomes for RPGG. Continental populations represented are East Asian (EAS), European (EUR), Admixed Amerindian (AMR), South Asian (SAS), and African (AFR). Coverage is estimated diploid coverage based on alignment to GRCh38. Assembly N50 is of haplotype-resolved assemblies. The fraction of VNTR annotated counts all VNTRs with at least 700 flanking bases assembled.

Figure 2. Mapping short reads to repeat-pangenome graphs. a, An example of evaluating the alignment quality of a locus mapped by SRS reads. The alignment quality is measured by the $r^2$ of a linear fit between the k-mer counts from the ground truth assemblies and from the mapped reads (Methods). b, Distribution of the alignment quality scores of 73,582 loci. Loci with alignment quality less than 0.96 when averaged across samples are removed from downstream analysis (Methods). c, Distribution of VNTR lengths in GRCh38 removed or retained for downstream analysis. d-e, Comparing the read mapping results of the CACNA1C VNTR using RPGG or repeat-GRCh38. The k-mer counts in each graph and the differences are visualized with edge width and color saturation (d). The k-mer counts from the ground truth assemblies are regressed against the counts from reads mapped to the RPGG (red) and repeat-GRCh38 (blue), respectively (e).

Figure 3. VNTR length prediction. a, Accuracies of VNTR length prediction measured for each genome (left) and each locus (right). Mean absolute percentage error (MAPE) in VNTR length is averaged across loci and genomes, respectively. Lengths were predicted based on repeat-pangenome graphs (RPGG), repeat-GRCh38 (RHG) or naive read depth method (RD), respectively. Boxes span from the lower quartile to the upper quartile, with horizontal lines indicating the median. Whiskers extend to points that are within 1.5 interquartile range (IQR) from the upper or the lower quartiles. b, Relative performance of RPGG versus repeat-GRCh38. Loci are ordered along the x-axis by genotyping accuracy in repeat-GRCh38. The y-axis shows the decrease in MAPE using RPGG versus repeat-GRCh38. The subplot shows loci poorly genotyped (MAPE>0.4) in repeat-GRCh38.
The red dotted line indicates the baseline without any improvement.

Figure 4. Population properties of VNTR loci. a, Ratios of median length between populations for loci with significant differences in average length. Loci are stratified by prediction accuracy into low (<0.8), medium (0.8-0.9), and high (0.9+). b, Manhattan plot of $V_{ST}$ values. c-d, The distribution of estimated length via k-mer dosage in continental populations for PLCL1 and SPATA18 VNTR loci, selected to visualize the distribution of dosage in different populations. Each point is an individual. e, Differential usage and expansion of motifs between the EAS and AFR populations. For each locus, the proportion of variance explained by the most informative k-mer in the EAS is shown for the EAS and AFR populations on the x and y axes, respectively. Points are colored by the difference in normalized k-mer counts, with red and blue indicating k-mers more abundant in EAS and AFR populations, respectively. f, An example VNTR with differential motif usage. Edges are colored if the k-mer count is biased toward a certain population. The black arrow indicates the location of the k-mer that explains the most variance of VNTR length in the EAS population.
Figure 5. cis-eQTL mapping of VNTRs. a, eVNTR discoveries in 20 human tissues. The quantile-quantile plot shows the observed P value of each association test versus the P value drawn from the expected uniform distribution. Black dots indicate the permutation results from the top 5% associated (gene, VNTR) pairs in each tissue. The regression plots for ERAP2 and KANSL1 are shown in c and d. b, Effect size distribution of significant associations from all tissues. c-d, Genomic view of disease-related (eGene, eVNTR) pairs (ERAP2, chr5:96896863-96896963) (c) and (KANSL1, chr17:46265245-46265480) (d) are shown. Red boxes indicate the location of eGenes and eVNTRs.

Materials and methods

Pangenome construction.

Initial discovery of tandem repeats: TRF v4.09 (option: 2 7 7 80 10 50 500 -f -d -h) (Benson 1999) was used to roughly annotate the SSR regions of five PacBio assemblies (AK1, HG00514, HG00733, NA19240, NA24385). This work focuses on VNTRs that cannot be resolved by typical short read sequencing methods. We selected the set of SSR loci with a motif size greater than 6 bp and a total length greater than 150 bp and less than 10 kbp. For each haplotype, the selected VNTR loci were mapped to the GRCh38 reference genome to identify homologous VNTR loci. To maintain data quality, VNTR loci that could not be assigned homology were removed from datasets.
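The size-based selection of loci described above can be sketched in a few lines. This is only an illustration under assumptions: the `TandemRepeat` record and the function name `select_vntr_loci` are hypothetical and are not part of danbing-tk or TRF; only the three thresholds (motif size > 6 bp, 150 bp < total length < 10 kbp) come from the text.

```python
from typing import List, NamedTuple


class TandemRepeat(NamedTuple):
    """One tandem-repeat annotation (hypothetical record layout)."""
    chrom: str
    start: int       # 0-based start coordinate
    end: int         # end coordinate, exclusive
    motif_size: int  # repeat unit length in bp


def select_vntr_loci(repeats: List[TandemRepeat],
                     min_motif: int = 6,
                     min_len: int = 150,
                     max_len: int = 10_000) -> List[TandemRepeat]:
    """Keep repeats with motif > min_motif bp and min_len < length < max_len."""
    return [r for r in repeats
            if r.motif_size > min_motif
            and min_len < (r.end - r.start) < max_len]


annotations = [
    TandemRepeat("chr1", 0, 200, 7),         # kept: motif 7 bp, length 200 bp
    TandemRepeat("chr1", 500, 600, 30),      # dropped: only 100 bp long
    TandemRepeat("chr2", 0, 400, 3),         # dropped: motif too short (STR-like)
    TandemRepeat("chr2", 1000, 12000, 20),   # dropped: longer than 10 kbp
]
selected = select_vntr_loci(annotations)
```

All bounds are treated as strict inequalities, which matches the wording "greater than" and "less than" in the text.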
Boundary expansion of VNTRs: The biological boundaries of a VNTR are ill-defined; VNTRs with sparse recurring motifs, transitions between different motifs, or a nested motif structure often fail to be fully annotated by TRF. A misannotation of VNTR boundaries can cause erroneous length estimates. To avoid the propagation of this error to downstream analysis, we developed a multiple boundary expansion algorithm to recover the proper boundary for each VNTR across all haplotypes, including the 14 remaining genomes (HG00268, HG00512, HG00513, HG00731, HG00732, HG01352, HG02059, HG02106, HG02818, HG04217, NA12878, NA19238, NA19239 and NA19434). The algorithm maintains an invariant: the flanking sequence in any of the haplotypes does not share k-mers with the VNTR regions from all haplotypes. VNTR boundaries in each haplotype are iteratively expanded until the invariant holds or the expansion exceeds 10 kbp in either the 5’ or 3’ direction. The size of the flanking regions is chosen to be 700 bp, which is approximately the upper bound of the insert size of typical SRS reads. The following QC step removes a haplotype if its VNTR annotation is within 700 bp of breakpoints or if the orthology mapping location to GRCh38 is different from the majority of haplotypes. A VNTR locus with the number of supporting haplotypes less than 90% of the total number of haplotypes is also removed. Adjacent VNTR loci within 700 bp of each other in any of the haplotypes will induce a merging step over all haplotypes.
Haplotypes with distance between adjacent loci inconsistent with the majority of haplotypes are removed. Finally, VNTR loci with the number of supporting haplotypes less than 80% of the total number of haplotypes are removed, leaving 73,582 of the initial 84,411 loci.

Read-to-graph alignment: For the two haplotypes of an individual, three data structures are used to encode the information of all VNTR loci, including VNTRs and their 700 bp flanking sequences. The first data structure allows fast locus lookup for each k-mer (k=21) by hashing each canonical k-mer in the VNTRs and the flanking sequences to the index of the original locus. The second data structure enables graph threading by storing a bi-directional de Bruijn graph for each locus. The third data structure is used for counting k-mers originating from VNTRs. The read mapping algorithm maps each pair of Illumina paired-end reads to a unique VNTR locus in three phases: (1) In the k-mer set mapping phase, the read pair is converted to a pair of canonical k-mer multisets. The VNTR locus with the highest count of intersected k-mers is detected with the first data structure. (2) In the threading phase, the algorithm tries to map the k-mers in the read pair to the bi-directional de Bruijn graph such that the mapping forms a continuous path/cycle. To account for sequencing and assembly errors, the algorithm is allowed to edit a limited number of nucleotides in a read if no matching k-mer is found in the graph. The read pair is considered feasible to map to a VNTR locus if the number of mapped k-mers is above an empirical threshold. (3) In the k-mer counting phase, canonical k-mers of the feasible read pair are counted if they exist in the VNTR locus. Finally, the read mapping algorithm returns the k-mer counts for all loci as mapped by SRS reads. Alignment timing was conducted on an Intel Xeon E5-2650v2 2.60GHz node.
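As a rough illustration of the first data structure and of phase (1), the sketch below builds a canonical k-mer to locus lookup and picks the locus sharing the most k-mers with a read pair. It is a simplified stand-in, not the danbing-tk implementation: the function names are hypothetical, a k-mer shared by several loci is simply credited to each of them (the text does not say how such k-mers are handled), the threading phase with error correction is omitted, and a toy k=5 is used instead of k=21.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(seq: str, k: int) -> List[str]:
    """All canonical k-mers of a sequence, with repeats (a multiset as a list)."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(loci: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each canonical k-mer to the loci (VNTR plus flanks) that contain it."""
    index: Dict[str, Set[str]] = {}
    for locus_id, seq in loci.items():
        for kmer in set(canonical_kmers(seq, k)):
            index.setdefault(kmer, set()).add(locus_id)
    return index


def best_locus(read1: str, read2: str,
               index: Dict[str, Set[str]], k: int) -> Tuple[Optional[str], int]:
    """Return the locus with the most intersecting k-mers and that count."""
    hits: Counter = Counter()
    for kmer in canonical_kmers(read1, k) + canonical_kmers(read2, k):
        for locus_id in index.get(kmer, ()):
            hits[locus_id] += 1
    if not hits:
        return None, 0
    return hits.most_common(1)[0]


# Toy loci; read2 comes from the reverse strand of locus A.
loci = {"A": "ACGTACGTTTGACCA", "B": "GGGCCCTTTAAAGGG"}
index = build_index(loci, k=5)
locus, shared = best_locus("ACGTACGTT", "TGGTCAA", index, k=5)
```

Using canonical k-mers means a read maps to the same keys regardless of which strand it was sequenced from, which is why the reverse-strand mate still supports locus A here.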
Graph pruning and merging: Pan-genome representation provides a more thorough description of VNTR diversity and reduces reference allele bias, which effectively improves the quality of read mapping and downstream analysis. Because haplotypes assembled from long read datasets are error prone in VNTR regions, it is necessary to prune the graphs/k-mers before merging them as a pan-genome. We ran the read mapping algorithm with error correction disabled so as to detect k-mers unsupported by SRS reads. The three data structures were updated by deleting all unsupported k-mers for each locus. By pooling and merging the reference regions corresponding to the VNTR regions in all individuals, we obtained a set of “pan-reference” regions, each indicating a location in GRCh38 that is likely to map to a VNTR region in any other unseen haplotype. By referencing the mapping relation of VNTR loci across individuals, we encoded the variability of each VNTR locus by merging the three data structures across individuals.

Alignment quality analysis: To evaluate the quality of the haplotype assemblies and the performance of the read mapping algorithm, VNTR k-mer counts in the original assemblies were regressed against those mapped from SRS reads. The $r^2$ of the linear fit was used as the alignment quality score (referred to as aln-$r^2$). To measure alignment quality in the pan-genome setting, only the k-mer set derived from the genotyped individual was retained as the input for regression.
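The alignment quality score can be illustrated with a small pure-Python sketch. For a simple linear fit with an intercept, $r^2$ equals the squared Pearson correlation, which is what the code computes; whether danbing-tk fits an intercept is not stated in the text, so treat this as an assumption. The names `r_squared` and `passes_filter` are hypothetical, and the 0.96 default is the cutoff the paper applies to the mean score across individuals.

```python
from typing import Optional, Sequence


def r_squared(assembly_counts: Sequence[float],
              read_counts: Sequence[float]) -> Optional[float]:
    """r^2 of a simple linear fit between two k-mer count vectors.

    Returns None when either vector has zero variance, in which case
    the fit is undefined (e.g. every k-mer appears the same number of times).
    """
    if len(assembly_counts) != len(read_counts) or len(assembly_counts) < 2:
        raise ValueError("need two equally long vectors with at least 2 entries")
    n = len(assembly_counts)
    mean_x = sum(assembly_counts) / n
    mean_y = sum(read_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in assembly_counts)
    syy = sum((y - mean_y) ** 2 for y in read_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(assembly_counts, read_counts))
    if sxx == 0 or syy == 0:
        return None
    return (sxy * sxy) / (sxx * syy)


def passes_filter(scores: Sequence[float], threshold: float = 0.96) -> bool:
    """Keep a locus if its mean score across individuals reaches the threshold."""
    return sum(scores) / len(scores) >= threshold


# Read counts that scale perfectly with assembly counts give r^2 = 1.
score = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
```

Returning `None` for constant count vectors mirrors the caveat that loci whose k-mers all occur equally often cannot be scored this way.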
Data filtering: A final set of 32,138 VNTR regions was called by filtering based on aln-$r^2$. The quality of a locus was measured by the mean aln-$r^2$ across individuals. Loci with mean aln-$r^2$ below 0.96 were removed from the final call set. The final pan-genome graphs were used to genotype large Illumina datasets, measure length prediction accuracy, analyze population structures and map eQTL.

Predicting VNTR lengths: Read depths at VNTR regions usually vary considerably from locus to locus. Furthermore, the sampling bias also differs between sequencing runs, which limits our ability to genotype the accurate length of VNTRs. To account for this, we compute locus-specific biases (LSBs) $b_s$ for each sample $s$, a tuple of (genome $g$, sequencing run), as follows:
$$b_s = \dfrac{kms_s}{cov_s \times L_g},$$

where $L_g$ is the ground truth VNTR lengths of the 32,138 loci in genome $g$; $kms_s$ is the sum of k-mer counts in each locus mapped by sample $s$; and $cov_s$ is the global read depth of sample $s$, estimated by averaging the read depths of 397 unique regions without any types of repeats or duplications. The ground truth VNTR length of a locus $l$ in genome $g$ is averaged across haplotypes:

$$L_{g,l} = \dfrac{1}{H}\sum_{h=1}^{H} L_{g,h,l},$$

where $H$ is the number of haplotype(s) in genome $g$, i.e. 2 for normal individuals and 1 for complete hydatidiform mole (CHM) samples. With the above bias terms, the VNTR length of locus $l$ in sample $s$ can be computed by:

$$L_{s,l} = \dfrac{kms_{s,l}}{cov_s \times b_{\hat{s}}},$$

where $cov_s$ is the same as described above; $b_{\hat{s}}$ is the estimated LSBs computed from sample $\hat{s}$ with ground truth VNTR lengths; and $kms_{s,l}$ is the sum of k-mer counts of locus $l$ mapped by sample $s$. We assume the LSBs that best approximate $b_s$ come from samples within the same sequencing run. Without prior knowledge of the ground truth VNTR lengths of $s$, and therefore of $b_s$, we determine the “closest” sample $\hat{s}$ w.r.t. $s$ based on the $r^2$ between the read depths, $RD$, of the 397 unique regions as follows:

$$\hat{s} = \operatorname*{argmax}_{s',\, s' \in GT,\, s' \neq s} r^2(RD_{s'}, RD_s),$$

where $GT$ is the set of samples with ground truths and within the same sequencing run as $s$.
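To make the length-prediction procedure concrete, here is a minimal numeric sketch of the three steps: estimating locus-specific biases from a sample with known lengths, choosing the reference sample whose control-region read depths correlate best with the target, and converting k-mer sums to lengths. Function and variable names are hypothetical, all numbers are invented toy values, and counts, coverages and lengths are assumed to be strictly positive.

```python
from typing import Dict, List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else (sxy * sxy) / (sxx * syy)


def locus_biases(kms: Sequence[float], cov: float,
                 true_len: Sequence[float]) -> List[float]:
    """Per-locus bias: k-mer sum divided by (global depth * true length)."""
    return [k / (cov * length) for k, length in zip(kms, true_len)]


def closest_sample(rd_target: Sequence[float],
                   rd_references: Dict[str, Sequence[float]]) -> str:
    """Reference sample whose control-region read depths best match the target."""
    return max(rd_references,
               key=lambda name: r_squared(rd_references[name], rd_target))


def predict_lengths(kms: Sequence[float], cov: float,
                    biases: Sequence[float]) -> List[float]:
    """Per-locus length: k-mer sum divided by (global depth * borrowed bias)."""
    return [k / (cov * b) for k, b in zip(kms, biases)]


# Toy example with two loci and one usable reference sample.
ref_bias = locus_biases(kms=[300, 600], cov=30, true_len=[100, 400])
chosen = closest_sample([10, 20, 30], {"ref1": [1, 2, 3], "ref2": [3, 1, 2]})
lengths = predict_lengths(kms=[450, 300], cov=30, biases=ref_bias)
```

In this toy case the reference biases are about 0.1 and 0.05, so the target k-mer sums of 450 and 300 at 30-fold depth translate to predicted lengths of roughly 150 and 200 bases.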
We cross-validate our approach by leaving one sample out of the pan-genome database and evaluating the prediction accuracy on the excluded sample. For comparison, VNTR lengths were also estimated by a read depth method. For each VNTR region, the read depth, computed with samtools bedcov -j, was divided by the global read depth, computed from the 397 nonrepetitive regions, to give the length estimate.
Comparing with GraphAligner: The compact de Bruijn graph of each VNTR locus was generated with bcalm v2.2.3 (option: -kmer-size 21 -abundance-min 1) using the VNTR sequences from all assemblies as input. GFA files were then reindexed and concatenated to generate the RPGGs for 32,138 loci. Error-free paired-end reads were simulated from all VNTR regions at 2x coverage with 150 bp read length and 600 bp insert size (300 bp gap between each end). Reads were aligned to the RPGG using GraphAligner v1.0.11 with option -x dbg --seeds-minimizer-length 21. Reads with alignment identity > 90% were counted from the output gam file. To compare in a similar setting, danbing-tk was run with option -gc -thcth 117 -k 21 -cth 45 -rth 0.5 to assert >90% identity for all reads aligned, given that $(read\_length - kmer\_size + 1) \times 0.9 = 117$.
$V_{ST}$ calculation: $V_{ST}$ was calculated according to (Redon et al. 2006): $V_{ST}[i] = \max\left(0, \dfrac{var_{all} - \frac{1}{n}\sum_{p \in P} var_p \times n_p}{var_{all}}\right)$. Top $V_{ST}$ loci were considered as the sites with $V_{ST}$ at least three standard deviations above the mean.
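As an illustration of the $V_{ST}$ formula, the sketch below reads $var_{all}$ as the variance over all individuals at a locus, $var_p$ and $n_p$ as the variance and size of population $p$, and $n$ as the total number of individuals. That reading, the use of population (not sample) variance, and the helper names `vst` and `top_loci` are assumptions made for this example.

```python
from statistics import mean, pstdev, pvariance


def vst(values, populations):
    """V_ST for one locus: max(0, (var_all - weighted within-pop variance) / var_all)."""
    var_all = pvariance(values)
    if var_all == 0:
        return 0.0  # no variation at this locus; avoid dividing by zero
    groups = {}
    for value, pop in zip(values, populations):
        groups.setdefault(pop, []).append(value)
    n = len(values)
    # Within-population variance, weighted by population size.
    within = sum(pvariance(g) * len(g) for g in groups.values()) / n
    return max(0.0, (var_all - within) / var_all)


def top_loci(vst_by_locus, n_sd=3.0):
    """Indices of loci whose V_ST is at least n_sd standard deviations above the mean."""
    threshold = mean(vst_by_locus) + n_sd * pstdev(vst_by_locus)
    return [i for i, v in enumerate(vst_by_locus) if v >= threshold]
```

With fully separated populations, for example dosages `[1, 1, 1, 5, 5, 5]` labelled `AAABBB`, all variance lies between populations and the statistic reaches its maximum of 1; identical distributions in both populations give 0.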
Identifying unstable loci: A locus was annotated as a candidate for being unstable if at least one individual had outlying k-mer dosage ≥ six standard deviations above the mean, using population and locus specific summary statistics on data discarding individuals with zero no individuals had dosage less than 10 or a bimodal distribution was not detected (diptest v0.75-7, p > 0.9). Among this set, the number of times each genome appeared as an outlier was used to select a set of genomes with an overabundant contribution to fragile loci. Any candidate locus with an individual that was an outlier in at least four other loci was removed from the candidate list. The loci were compared to GENCODE v34, excluding readthrough, pseudogenes, noncoding RNA, and nonsense transcripts.
Identifying differential motif usage and expansion: Sample outliers in the 1000 Genomes were detected from the read sampling biases over 397 control regions and the TR dosages over 32,138 loci using DBSCAN. A total of 119/2,504 samples were removed from downstream analysis. We use the EAS population as the reference for measuring differential motif usage and expansion.
Initially, a lasso fit using the statsmodels.api.OLS function in Python statsmodels v0.10.1 (Seabold and Perktold 2010) was performed for each locus to identify the k-mer with the most variance explained (VEX) in VNTR lengths using the following formula: $y = Xb + \epsilon$, where $y \in \mathbb{R}^N$ is the VNTR length of $N$ individuals in the EAS population; $X \in \mathbb{R}^{N \times M}$ is the k-mer dosage matrix for $N$ individuals with $M$ k-mers; $b \in \mathbb{R}^M$ is the model coefficient, and $\epsilon \sim N(0, \sigma^2)$ is the error term. The lasso penalty weight $\alpha$ was scanned starting at 0.9 with a step size of −0.1 until at least one covariate has a positive weight or $\alpha$ is below 0.1. The k-mer with the highest weight is denoted as the most informative k-mer (mi-kmer) for the locus. To identify loci with differential motif usage between populations, we subtracted the median count of the mi-kmer of the AFR from the EAS population for each locus, denoted as $kmc_d$. The null distribution of $kmc_d$ was estimated by bootstrap. Specifically, EAS individuals were sampled with replacement $N_{EAS} + N_{AFR}$ times, matching the sample sizes of the EAS and AFR populations, respectively. The bootstrap statistics, $kmc_d^*$, were computed by subtracting the median count of the mi-kmer of the last $N_{AFR}$ from the first $N_{EAS}$ bootstrap samples for each locus. The estimated null distribution is then used to determine the threshold for calling a locus having significant differential motif usage between populations (two-sided P < 0.01). To identify loci with differential motif expansion between populations, we subtracted the proportion of VEX by mi-kmer in the AFR from the EAS population, denoted as $r^2_d$. The null distribution of $r^2_d$ was estimated by bootstrap in a similar sampling procedure as $kmc_d$, except for subtracting the proportion of VEX by the mi-kmer in the last $N_{AFR}$ from the first $N_{EAS}$ bootstrap samples for each locus. The estimated null distribution is used to determine the threshold for calling a locus having significant differential motif expansion between populations (two-sided P < 0.01).
eQTL mapping
Retrieving datasets: WGS datasets of 879 individuals, normalized gene expression matrices and covariates of all tissues are accessed from the GTEx Analysis V8 (dbGaP Accession phs000424.v8.p2).
Genotype data preprocessing: VNTR lengths are genotyped using danbing-tk with options: -gc -thcth 50 -cth 45 -rth 0.5.
All the k-mer counts of a locus are summed and adjusted by global read depth and ploidy to represent the approximate length of a locus. Sample outliers were detected from the read sampling biases over 397 control regions and the TR dosages over 32,138 loci using DBSCAN. A total of 26/838 samples were removed from downstream analysis. Adjusted values are then z-score normalized as input for eQTL mapping.
Expression data preprocessing: The downloaded expression matrices are already preprocessed such that outliers are rejected and expression counts are quantile normalized to a standard normal distribution. Confounding factors such as sex, sequencing platform, amplification method, technical variations and population structure are removed prior to eQTL mapping to avoid spurious associations. Technical variations are corrected with the covariates, including PEER factors, provided by the GTEx Consortium. Population structures are corrected with the top 10 principal components (PCs) from the SNP matrix of all samples. Specifically, principal component analysis (PCA) was performed jointly on the intersection of the SNP sets from GTEx samples and 1KGP Omni 2.5 SNP genotyping arrays (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz). This is done by first using CrossMap v0.4.0 to liftover the SNP sites from Omni 2.5 arrays to GRCh38, followed by extracting the intersection of the two SNP sets using vcftools isec.
The SNP set is further reduced by LD-pruning with plink v1.90b6.12 using the options: --indep 50 5 2, leaving a total of 757,000 sites. Finally, PCA on the joint SNP matrix was done by smartpca v13050. The normalized expression matrix is residualized with the above covariates using the following formula: $Y = (I - H)Y'$, $H = C(C^T C)^{-1} C^T$, where $Y$ is the residualized expression matrix; $Y'$ is the normalized expression matrix; $H$ is the projection matrix; $I$ is the identity matrix; $C$ is the covariate matrix where each column corresponds to a covariate mentioned above. The residualized expression values are z-score normalized as the input of eQTL mapping.
Association test: VNTRs within 100 kb of a gene are included for eQTL mapping. Linear regression was done using the statsmodels.api.OLS function in Python statsmodels v0.10.1 (Seabold and Perktold 2010) with expression values as the dependent variable and genotype values as the independent variable. Nominal P values are computed by performing t tests on the slope. Adjusted P values are computed by Bonferroni correction on nominal P values. Under the assumption of at most one causal VNTR per gene, we control the gene-level false discovery rate at 5%. Specifically, the adjusted P values of the lead VNTR for each gene are taken as input for the Benjamini-Hochberg procedure using statsmodels.stats.multitest.fdrcorrection v0.10.1. Lead VNTRs that passed the procedure are identified as eVNTRs.
Data availability
The overall analysis pipeline is delivered in a software package at https://github.com/ChaissonLab/danbing-tk.
1000 Genomes Acknowledgement:
The following cell lines/DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: [NA06984, NA06985, NA06986, NA06989, NA06994, NA07000, NA07037, NA07048, NA07051, NA07056, NA07347, NA07357, NA10847, NA10851, NA11829, NA11830, NA11831, NA11832, NA11840, NA11843, NA11881, NA11892, NA11893, NA11894, NA11918, NA11919, NA11920, NA11930, NA11931, NA11932, NA11933, NA11992, NA11994, NA11995, NA12003, NA12004, NA12005, NA12006, NA12043, NA12044, NA12045, NA12046, NA12058, NA12144, NA12154, NA12155, NA12156, NA12234, NA12249, NA12272, NA12273, NA12275, NA12282, NA12283, NA12286, NA12287, NA12340, NA12341, NA12342, NA12347, NA12348, NA12383, NA12399, NA12400, NA12413, NA12414, NA12489, NA12546, NA12716, NA12717, NA12718, NA12748, NA12749, NA12750, NA12751, NA12760, NA12761, NA12762, NA12763, NA12775, NA12776, NA12777, NA12778, NA12812, NA12813, NA12814, NA12815, NA12827, NA12828, NA12829, NA12830, NA12842, NA12843, NA12872, NA12873, NA12874, NA12878, NA12889, NA12890]. These data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1. Data accession IDs are given in supplementary table S4.
References
1000 Genomes Project Consortium, Adam Auton, Lisa D. Brooks, Richard M. Durbin, Erik P. Garrison, Hyun Min Kang, Jan O. Korbel, et al. 2015. “A Global Reference for Human Genetic Variation.” Nature 526 (7571): 68–74.
Audano, Peter A., Arvis Sulovari, Tina A. Graves-Lindsay, Stuart Cantsilieris, Melanie Sorensen, Annemarie E. Welch, Max L. Dougherty, et al. 2019. “Characterizing the Major Structural Variant Alleles of the Human Genome.” Cell 176 (3): 663–75.e19.
Bakhtiari, Mehrdad, Jonghun Park, Yuan-Chun Ding, Sharona Shleizer-Burko, Susan L. Neuhausen, Bjarni V. Halldórsson, Kári Stefánsson, Melissa Gymrek, and Vineet Bafna. 2020. “Variable Number Tandem Repeats Mediate the Expression of Proximal Genes.” bioRxiv. https://doi.org/10.1101/2020.05.25.114082.
Bakhtiari, Mehrdad, Sharona Shleizer-Burko, Melissa Gymrek, Vikas Bansal, and Vineet Bafna. 2018. “Targeted Genotyping of Variable Number Tandem Repeats with adVNTR.” Genome Research 28 (11): 1709–19.
Benson, G. 1999. “Tandem Repeats Finder: A Program to Analyze DNA Sequences.” Nucleic Acids Research. https://doi.org/10.1093/nar/27.2.573.
Braida, Claudia, Rhoda K. A. Stefanatos, Berit Adam, Navdeep Mahajan, Hubert J. M. Smeets, Florence Niel, Cyril Goizet, et al. 2010. “Variant CCG and GGC Repeats within the CTG Expansion Dramatically Modify Mutational Dynamics and Likely Contribute toward Unusual Symptoms in Some Myotonic Dystrophy Type 1 Patients.” Human Molecular Genetics 19 (8): 1399–1412.
Chaisson, Mark J. P., Ashley D. Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J. Gardner, et al. 2019. “Multi-Platform Discovery of Haplotype-Resolved Structural Variation in Human Genomes.” Nature Communications 10 (1): 1784.
Chen, Sai, Peter Krusche, Egor Dolzhenko, Rachel M. Sherman, Roman Petrovski, Felix Schlesinger, Melanie Kirsche, et al. 2019. “Paragraph: A Graph-Based Structural Variant Genotyper for Short-Read Sequence Data.” Genome Biology 20 (1): 291.
Chin, Chen-Shan, Paul Peluso, Fritz J. Sedlazeck, Maria Nattestad, Gregory T. Concepcion, Alicia Clum, Christopher Dunn, et al. 2016. “Phased Diploid Genome Assembly with Single-Molecule Real-Time Sequencing.” Nature Methods 13 (12): 1050–54.
GTEx Consortium. 2017. “Genetic Effects on Gene Expression across Human Tissues.” Nature. https://doi.org/10.1038/nature24277.
International Human Genome Sequencing Consortium. 2001. “Initial Sequencing and Analysis of the Human Genome.” Nature. https://doi.org/10.1038/35057062.
Dolzhenko, Egor, Viraj Deshpande, Felix Schlesinger, Peter Krusche, Roman Petrovski, Sai Chen, Dorothea Emig-Agius, et al. 2019. “ExpansionHunter: A Sequence-Graph-Based Tool to Analyze Variation in Short Tandem Repeat Regions.” Bioinformatics 35 (22): 4754–56.
Du, Zhenglin, Liang Ma, Hongzhu Qu, Wei Chen, Bing Zhang, Xi Lu, Weibo Zhai, et al. 2019. “Whole Genome Analyses of Chinese Population and De Novo Assembly of A Northern Han Genome.” Genomics, Proteomics & Bioinformatics 17 (3): 229–47.
Eggertsson, Hannes P., Snaedis Kristmundsdottir, Doruk Beyter, Hakon Jonsson, Astros Skuladottir, Marteinn T. Hardarson, Daniel F. Gudbjartsson, Kari Stefansson, Bjarni V. Halldorsson, and Pall Melsted. 2019. “GraphTyper2 Enables Population-Scale Genotyping of Structural Variation Using Pangenome Graphs.” Nature Communications. https://doi.org/10.1038/s41467-019-13341-9.
Fairley, Susan, Ernesto Lowy-Gallego, Emily Perry, and Paul Flicek. 2020. “The International Genome Sample Resource (IGSR) Collection of Open Human Genomic Variation Resources.” Nucleic Acids Research 48 (D1): D941–47.
Fotsing, Stephanie Feupe, Jonathan Margoliash, Catherine Wang, Shubham Saini, Richard Yanicky, Sharona Shleizer-Burko, Alon Goren, and Melissa Gymrek. 2019. “The Impact of Short Tandem Repeat Variation on Gene Expression.” Nature Genetics 51 (11): 1652–59.
Franke, Andre, Dermot P. B. McGovern, Jeffrey C. Barrett, Kai Wang, Graham L. Radford-Smith, Tariq Ahmad, Charlie W. Lees, et al. 2010. “Genome-Wide Meta-Analysis Increases to 71 the Number of Confirmed Crohn’s Disease Susceptibility Loci.” Nature Genetics 42 (12): 1118–25.
Garrison, Erik, Jouni Sirén, Adam M. Novak, Glenn Hickey, Jordan M. Eizenga, Eric T. Dawson, William Jones, et al. 2018. “Variation Graph Toolkit Improves Read Mapping by Representing Genetic Variation in the Reference.” Nature Biotechnology 36 (9): 875–79.
Gatchel, Jennifer R., and Huda Y. Zoghbi. 2005. “Diseases of Unstable Repeat Expansion: Mechanisms and Common Principles.” Nature Reviews. Genetics 6 (10): 743–55.
Gymrek, Melissa, Thomas Willems, Audrey Guilmatre, Haoyang Zeng, Barak Markus, Stoyan Georgiev, Mark J. Daly, et al. 2016. “Abundant Contribution of Short Tandem Repeats to Gene Expression Variation in Humans.” Nature Genetics 48 (1): 22–29.
Gymrek, Melissa, Thomas Willems, David Reich, and Yaniv Erlich. 2017. “Interpreting Short Tandem Repeat Variations in Humans Using Mutational Constraint.” Nature Genetics. https://doi.org/10.1038/ng.3952.
Hannan, Anthony J. 2010. “Tandem Repeat Polymorphisms: Modulators of Disease Susceptibility and Candidates for ‘missing Heritability.’” Trends in Genetics. https://doi.org/10.1016/j.tig.2009.11.008.
Hannan, Anthony J. 2018. “Tandem Repeats Mediating Genetic Plasticity in Health and Disease.” Nature Reviews. Genetics 19 (5): 286–98.
Hickey, Glenn, David Heller, Jean Monlong, Jonas A. Sibbesen, Jouni Sirén, Jordan Eizenga, Eric T. Dawson, Erik Garrison, Adam M. Novak, and Benedict Paten. 2020. “Genotyping Structural Variants in Pangenome Graphs Using the vg Toolkit.” Genome Biology 21 (1): 35.
Iqbal, Zamin, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. 2012. “De Novo Assembly and Genotyping of Variants Using Colored de Bruijn Graphs.” Nature Genetics 44 (2): 226–32.
Iqbal, Zamin, Isaac Turner, and Gil McVean. 2013. “High-Throughput Microbial Population Genomics Using the Cortex Variation Assembler.” Bioinformatics. https://doi.org/10.1093/bioinformatics/bts673.
Jiang, Zhaoshi, Haixu Tang, Mario Ventura, Maria Francesca Cardone, Tomas Marques-Bonet, Xinwei She, Pavel A. Pevzner, and Evan E. Eichler. 2007. “Ancestral Reconstruction of Segmental Duplications Reveals Punctuated Cores of Human Genome Evolution.” Nature Genetics 39 (11): 1361–68.
Kolmogorov, Mikhail, Jeffrey Yuan, Yu Lin, and Pavel A. Pevzner. 2019. “Assembly of Long, Error-Prone Reads Using Repeat Graphs.” Nature Biotechnology 37 (5): 540–46.
Koolen, D. A., A. J. Sharp, J. A. Hurst, H. V. Firth, S. J. L. Knight, A. Goldenberg, P. Saugier-Veber, et al. 2008. “Clinical and Molecular Delineation of the 17q21.31 Microdeletion Syndrome.” Journal of Medical Genetics 45 (11): 710–20.
Koren, Sergey, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. 2017. “Canu: Scalable and Accurate Long-Read Assembly via Adaptive K-Mer Weighting and Repeat Separation.” Genome Research 27 (5): 722–36.
LaPierre, Nathan, Kodi Taraszka, Helen Huang, Rosemary He, Farhad Hormozdiari, and Eleazar Eskin. n.d. “Identifying Causal Variants by Fine Mapping Across Multiple Studies.” https://doi.org/10.1101/2020.01.15.908517.
Li, Heng, Jonathan M. Bloom, Yossi Farjoun, Mark Fleharty, Laura Gauthier, Benjamin Neale, and Daniel MacArthur. n.d. “New Synthetic-Diploid Benchmark for Accurate Variant Calling Evaluation.” https://doi.org/10.1101/223297.
Li, Heng, Xiaowen Feng, and Chong Chu. 2020. “The Design and Construction of Reference Pangenome Graphs with Minigraph.” Genome Biology 21 (1): 265.
Mallick, Swapan, Heng Li, Mark Lipson, Iain Mathieson, Melissa Gymrek, Fernando Racimo, Mengyao Zhao, et al. 2016. “The Simons Genome Diversity Project: 300 Genomes from 142 Diverse Populations.” Nature 538 (7624): 201–6.
Mousavi, Nima, Sharona Shleizer-Burko, Richard Yanicky, and Melissa Gymrek. 2019. “Profiling the Genome-Wide Landscape of Tandem Repeat Expansions.” Nucleic Acids Research 47 (15): e90.
Paten, Benedict, Adam M. Novak, Jordan M. Eizenga, and Erik Garrison. 2017. “Genome Graphs and the Evolution of Genome Inference.” Genome Research 27 (5): 665–76.
Pevzner, Pavel A., Haixu Tang, and Glenn Tesler. 2004. “De Novo Repeat Classification and Fragment Assembly.” Genome Research 14 (9): 1786–96.
Porubsky, David, Shilpa Garg, Ashley D. Sanders, Jan O. Korbel, Victor Guryev, Peter M. Lansdorp, and Tobias Marschall. 2017. “Dense and Accurate Whole-Chromosome Haplotyping of Individual Genomes.” Nature Communications 8 (1): 1293.
Porubsky, David, Human Genome Structural Variation Consortium, Peter Ebert, Peter A. Audano, Mitchell R. Vollger, William T. Harvey, Pierre Marijon, et al. 2020. “Fully Phased Human Genome Assembly without Parental Data Using Single-Cell Strand Sequencing and Long Reads.” Nature Biotechnology. https://doi.org/10.1038/s41587-020-0719-5.
Rakocevic, Goran, Vladimir Semenyuk, Wan-Ping Lee, James Spencer, John Browning, Ivan J. Johnson, Vladan Arsenijevic, et al. 2019. “Fast and Accurate Genomic Analyses Using Genome Graphs.” Nature Genetics. https://doi.org/10.1038/s41588-018-0316-4.
Raphael, Benjamin, Degui Zhi, Haixu Tang, and Pavel Pevzner. 2004. “A Novel Method for Multiple Alignment of Sequences with Repeated and Shuffled Elements.” Genome Research 14 (11): 2336–46.
Rautiainen, Mikko, Veli Mäkinen, and Tobias Marschall. 2019. “Bit-Parallel Sequence-to-Graph Alignment.” Bioinformatics 35 (19): 3599–3607.
; https://doi.org/10.1101/2020.08.13.249839doi: bioRxiv preprint http://paperpile.com/b/H8ctd0/JzYin http://paperpile.com/b/H8ctd0/JzYin http://paperpile.com/b/H8ctd0/JzYin http://paperpile.com/b/H8ctd0/JzYin http://paperpile.com/b/H8ctd0/CjaUX http://paperpile.com/b/H8ctd0/CjaUX http://paperpile.com/b/H8ctd0/CjaUX http://paperpile.com/b/H8ctd0/CjaUX http://paperpile.com/b/H8ctd0/JmTrf http://paperpile.com/b/H8ctd0/JmTrf http://paperpile.com/b/H8ctd0/JmTrf http://paperpile.com/b/H8ctd0/JmTrf http://dx.doi.org/10.1093/bioinformatics/bts673 http://paperpile.com/b/H8ctd0/JmTrf http://paperpile.com/b/H8ctd0/wQpB6 http://paperpile.com/b/H8ctd0/wQpB6 http://paperpile.com/b/H8ctd0/wQpB6 http://paperpile.com/b/H8ctd0/wQpB6 http://paperpile.com/b/H8ctd0/wQpB6 http://paperpile.com/b/H8ctd0/R2BY9 http://paperpile.com/b/H8ctd0/R2BY9 http://paperpile.com/b/H8ctd0/R2BY9 http://paperpile.com/b/H8ctd0/R2BY9 http://paperpile.com/b/H8ctd0/cQ1B http://paperpile.com/b/H8ctd0/cQ1B http://paperpile.com/b/H8ctd0/cQ1B http://paperpile.com/b/H8ctd0/cQ1B http://paperpile.com/b/H8ctd0/cQ1B http://paperpile.com/b/H8ctd0/pJ5xM http://paperpile.com/b/H8ctd0/pJ5xM http://paperpile.com/b/H8ctd0/pJ5xM http://paperpile.com/b/H8ctd0/pJ5xM http://paperpile.com/b/H8ctd0/pJ5xM http://paperpile.com/b/H8ctd0/LGTuZ http://paperpile.com/b/H8ctd0/LGTuZ http://paperpile.com/b/H8ctd0/LGTuZ http://dx.doi.org/10.1101/2020.01.15.908517 http://paperpile.com/b/H8ctd0/LGTuZ http://paperpile.com/b/H8ctd0/yMN6Z http://paperpile.com/b/H8ctd0/yMN6Z http://paperpile.com/b/H8ctd0/yMN6Z http://dx.doi.org/10.1101/223297 http://paperpile.com/b/H8ctd0/yMN6Z http://paperpile.com/b/H8ctd0/n2Qw http://paperpile.com/b/H8ctd0/n2Qw http://paperpile.com/b/H8ctd0/n2Qw http://paperpile.com/b/H8ctd0/n2Qw http://paperpile.com/b/H8ctd0/t54PI http://paperpile.com/b/H8ctd0/t54PI http://paperpile.com/b/H8ctd0/t54PI http://paperpile.com/b/H8ctd0/t54PI http://paperpile.com/b/H8ctd0/t54PI http://paperpile.com/b/H8ctd0/aKIi 
http://paperpile.com/b/H8ctd0/aKIi http://paperpile.com/b/H8ctd0/aKIi http://paperpile.com/b/H8ctd0/aKIi http://paperpile.com/b/H8ctd0/gDid http://paperpile.com/b/H8ctd0/gDid http://paperpile.com/b/H8ctd0/gDid http://paperpile.com/b/H8ctd0/gDid http://paperpile.com/b/H8ctd0/tdFtW http://paperpile.com/b/H8ctd0/tdFtW http://paperpile.com/b/H8ctd0/tdFtW http://paperpile.com/b/H8ctd0/tdFtW http://paperpile.com/b/H8ctd0/haw16 http://paperpile.com/b/H8ctd0/haw16 http://paperpile.com/b/H8ctd0/haw16 http://paperpile.com/b/H8ctd0/haw16 http://paperpile.com/b/H8ctd0/haw16 http://paperpile.com/b/H8ctd0/JLne http://paperpile.com/b/H8ctd0/JLne http://paperpile.com/b/H8ctd0/JLne http://paperpile.com/b/H8ctd0/JLne http://paperpile.com/b/H8ctd0/JLne http://paperpile.com/b/H8ctd0/JLne http://dx.doi.org/10.1038/s41587-020-0719-5 http://paperpile.com/b/H8ctd0/JLne http://paperpile.com/b/H8ctd0/jQzSb http://paperpile.com/b/H8ctd0/jQzSb http://paperpile.com/b/H8ctd0/jQzSb http://paperpile.com/b/H8ctd0/jQzSb http://paperpile.com/b/H8ctd0/jQzSb http://dx.doi.org/10.1038/s41588-018-0316-4 http://paperpile.com/b/H8ctd0/jQzSb http://paperpile.com/b/H8ctd0/xHKpD http://paperpile.com/b/H8ctd0/xHKpD http://paperpile.com/b/H8ctd0/xHKpD http://paperpile.com/b/H8ctd0/xHKpD http://paperpile.com/b/H8ctd0/Uke8R http://paperpile.com/b/H8ctd0/Uke8R http://paperpile.com/b/H8ctd0/Uke8R https://doi.org/10.1101/2020.08.13.249839 http://creativecommons.org/licenses/by-nc-nd/4.0/ Redon, Richard, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, T. Daniel Andrews, Heike Fiegler, et al. 2006. “Global Variation in Copy Number in the Human Genome.” Nature 444 (7118): 444–54. Saini, Shubham, Ileena Mitra, Nima Mousavi, Stephanie Feupe Fotsing, and Melissa Gymrek. 2018. “A Reference Haplotype Panel for Genome-Wide Imputation of Short Tandem Repeats.” Nature Communications 9 (1): 4397. Seo, Jeong-Sun, Arang Rhie, Junsoo Kim, Sangjin Lee, Min-Hwan Sohn, Chang-Uk Kim, Alex Hastie, et al. 2016. 
“De Novo Assembly and Phasing of a Korean Human Genome.” Nature 538 (7624): 243–47.
Shi, Lingling, Yunfei Guo, Chengliang Dong, John Huddleston, Hui Yang, Xiaolu Han, Aisi Fu, et al. 2016. “Long-Read Sequencing and de Novo Assembly of a Chinese Genome.” Nature Communications 7 (June): 12065.
Song, Janet H. T., Craig B. Lowe, and David M. Kingsley. 2018. “Characterization of a Human-Specific Tandem Repeat Associated with Bipolar Disorder and Schizophrenia.” American Journal of Human Genetics 103 (3): 421–30.
Sudmant, Peter H., Swapan Mallick, Bradley J. Nelson, Fereydoun Hormozdiari, Niklas Krumm, John Huddleston, Bradley P. Coe, et al. 2015. “Global Diversity, Population Stratification, and Selection of Human Copy-Number Variation.” Science 349 (6253): aab3761.
Taliun, Daniel, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, Zachary A. Szpiech, Raul Torres, Sarah A. Gagliano Taliun, et al. 2019. “Sequencing of 53,831 Diverse Genomes from the NHLBI TOPMed Program.” bioRxiv. https://doi.org/10.1101/563866.
Viguera, E., D. Canceill, and S. D. Ehrlich. 2001. “Replication Slippage Involves DNA Polymerase Pausing and Dissociation.” The EMBO Journal 20 (10): 2587–95.
Wellcome Trust Case Control Consortium, Australo-Anglo-American Spondylitis Consortium (TASC), Paul R. Burton, David G. Clayton, Lon R. Cardon, Nick Craddock, Panos Deloukas, et al. 2007. “Association Scan of 14,500 Nonsynonymous SNPs in Four Diseases Identifies Autoimmunity Variants.” Nature Genetics 39 (11): 1329–37.
Witoelar, Aree, Iris E. Jansen, Yunpeng Wang, Rahul S. Desikan, J. Raphael Gibbs, Cornelis Blauwendraat, Wesley K. Thompson, et al. 2017. “Genome-Wide Pleiotropy Between Parkinson Disease and Autoimmune Diseases.” JAMA Neurology 74 (7): 780–92.
Ye, Chun Jimmie, Jenny Chen, Alexandra-Chloé Villani, Rachel E. Gate, Meena Subramaniam, Tushar Bhangale, Mark N. Lee, et al. 2018.
“Genetic Analysis of Isoform Usage in the Human Anti-Viral Response Reveals Influenza-Specific Regulation of Transcripts under Balancing Selection.” Genome Research 28 (12): 1812–25.
Zook, Justin M., Nancy F. Hansen, Nathan D. Olson, Lesley Chapman, James C. Mullikin, Chunlin Xiao, Stephen Sherry, et al. 2020. “A Robust Benchmark for Detection of Germline Large Deletions and Insertions.” Nature Biotechnology, June. https://doi.org/10.1038/s41587-020-0538-8.
Author contributions. T.Y.L. and M.J.P.C. performed data analysis and wrote the manuscript. M.J.P.C. supervised the work. HGSVC generated sequencing data.
bioRxiv preprint, this version posted January 4, 2021. https://doi.org/10.1101/2020.08.13.249839. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
10_1101-2020_09_09_289074 ---- Structural Genetics of circulating variants affecting the SARS-CoV-2 Spike / human ACE2 complex
Francesco Ortuso1,2, Daniele Mercatelli3, Pietro Hiram Guzzi4, Federico Manuel Giorgi3,*
1 Department of Health Sciences, University “Magna Græcia” of Catanzaro, Catanzaro, Italy
2 Net4Science srl, c/o University “Magna Græcia” of Catanzaro, Catanzaro, Italy
3 Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
4 Department of Surgical and Medical Sciences, University “Magna Græcia” of Catanzaro,
Catanzaro, Italy
* Corresponding author. E-mail: federico.giorgi@unibo.it (FMG)
ORCIDs
Francesco Ortuso: 0000-0001-6235-8161
Daniele Mercatelli: 0000-0003-3228-0580
Pietro Hiram Guzzi: 0000-0001-5542-2997
Federico Manuel Giorgi: 0000-0002-7325-9908
Classification
Biophysics and Computational Biology
Keywords
SARS-CoV-2, COVID-19, mutations, Spike, ACE2
Author Contributions
FMG, PHG and FO designed the study. FO designed and performed the structural analysis. FMG designed the genetics analysis. FMG and DM performed the genetics analysis. FMG financially supported the study. PHG drafted the manuscript and performed literature search. All authors contributed to the writing of the final version of the manuscript.
bioRxiv preprint, this version posted January 5, 2021. https://doi.org/10.1101/2020.09.09.289074. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
SARS-CoV-2 entry into human cells is mediated by the interaction between the viral Spike protein and the human ACE2 receptor. This mechanism evolved from the ancestor bat coronavirus and is currently one of the main targets for antiviral strategies. However, several Spike protein variants currently exist in the SARS-CoV-2 population as the result of mutations, and it is unclear whether these variants exert a specific effect on the affinity with ACE2 which, in turn, is also characterized by multiple alleles in the human population. In the current study, the GBPM analysis, originally developed for highlighting host-guest interaction features, has been applied to define the key amino acids responsible for the Spike/ACE2 molecular recognition, using four different crystallographic structures.
Then, we intersected these structural results with the current mutational status of the SARS-CoV-2 population, based on more than 295,000 sequenced cases. We identified several Spike mutations that affect ACE2-interacting positions and occur in at least 20 distinct patients: S477N, N439K, N501Y, Y453F, E484K, K417N, S477I and G476S. Among these, mutation N501Y in particular is one of the events characterizing SARS-CoV-2 lineage B.1.1.7, which has recently risen in frequency in Europe. We also identified five rare ACE2 variants that may affect interaction with Spike and susceptibility to infection: S19P, E37K, M82I, E329G and G352V.
Significance Statement
We developed a method to identify key amino acids responsible for the initial interaction between SARS-CoV-2 (the COVID-19 virus) and human cells, through the analysis of Spike/ACE2 complexes. We further identified which of these amino acids show variants in the viral and human populations. Our results will help scientists and clinicians alike to identify the possible role of present and future Spike and ACE2 sequence variants in cell entry and general susceptibility to infection.
Abbreviations
AA: amino acid
ACE2: Angiotensin-Converting Enzyme 2
COVID-19: Coronavirus Disease 2019
GBPM: Grid Based Pharmacophore Model
IEP: Interaction Energy point
MIFs: Molecular Interaction Fields
ORF: Open Reading Frame
PDB: Protein Data Bank
RBD: Spike Receptor Binding Domain with ACE2
RMSd: Root Mean Square deviation
SARS-CoV-2: Severe Acute Respiratory Syndrome Coronavirus 2
Main Text
Introduction
The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) emerged in late 2019 (1) as the etiological cause of a pandemic of severe proportions dubbed Coronavirus Disease 2019 (COVID-19).
The disease has reached virtually every country in the globe (2), with more than 40,000,000 confirmed cases and more than 1,100,000 deaths (source: World Health Organization). SARS-CoV-2 is characterized by a 29,903-nucleotide-long single-stranded RNA genome, densely packed in 11 Open Reading Frames (ORFs); ORF1 encodes a polyprotein that is further split into 16 proteins, for a total of 26 proteins (3). The second ORF encodes the Spike (S) protein, which is the key protagonist in viral entry into host cells, through its interaction with the human epithelial cell receptors Angiotensin Converting Enzyme 2 (ACE2) (4), Transmembrane Serine Protease 2 (TMPRSS2) (5), Furin (6) and CD147 (7). Investigators have focused their attention on the Spike/ACE2 interaction, trying to disrupt it as a potential anti-COVID-19 therapy, using small drugs (8) or Spike fragments (9). Using X-ray crystallography, several models of the Spike/ACE2 complex have been generated (10–12), providing a structural instrument for the analysis of this key interaction. These models determined that the Receptor Binding Domain (RBD) of Spike, which directly interacts with ACE2, is a compact structure of ~200 amino acids (AAs) out of a total of 1273 AAs in the full-length Spike. The SARS-CoV-2 Spike protein adapted through successive mutations from that of a wild bat beta-coronavirus (13), in order to exploit the conformation of the N-terminal ACE2 peptidase domain.
As a result, SARS-CoV-2 Spike can establish a strong interaction with the human cell surface, allowing the virus to fuse its membrane with that of the host cell, releasing its proteins and genetic material and starting its replication cycle (5). While SARS-CoV-2 shows low mutability (14), with fewer than 25 predicted events per year (15), the virus is in continuous evolution from the original Wuhan reference sequence (NC_045512.2) (16), and there are currently at least 6 major variants circulating in the population (3, 17). Some of these strains are characterized by a mutation in Spike at AA 614, where an Aspartic Acid (D) is substituted by a Glycine (G) (18). In fact, the Spike D614G mutation gives its name to the most frequent viral clade (G), which was first detected in Europe at the end of January 2020 and is currently present on all continents, with increasing frequency over time (3). D614G does not fall within the putative RBD (AA ~330-530), but some studies suggest it may have a clinically relevant role: D614G is positively correlated with increased case fatality rate (19), and it shows increased transmissibility and infectivity compared to the reference genome (20). In vitro studies show that viruses carrying the D614G Spike mutation have an increased viral load and cytopathic effect in cultured Vero cells (16). Despite these preliminary observations, the molecular effects of the D614G variant remain uncertain (21). Other recurring Spike mutations have been observed in the population worldwide, although at frequencies of 1% or below (3); some of these mutations fall within the RBD and therefore may have a direct role in ACE2 interaction.
On the other hand, genetic variants of ACE2 in the human population may influence susceptibility or resistance to SARS-CoV-2 infection, possibly contributing to the difference in clinical features observed in COVID-19 patients (22).
The ACE2 gene is located on chromosome Xp22.2 and consists of 18 exons, coding for an 805 AA-long protein exposed on the cell surface of a variety of human organs, including kidneys, heart, brain, gastrointestinal tract, and lungs (23). It is unclear whether tissue-expression patterns of ACE2 are linked to the severity of symptoms or outcomes of SARS-CoV-2 infections; however, ACE2 levels in lungs were found to be increased in patients with comorbidities associated with severe COVID-19 clinical manifestations (24), whereas polymorphisms of ACE2 have already been described to play a role in hypertension and cardiovascular diseases (25), particularly in association with type 2 diabetes (23), all conditions predisposing to an increased risk of dying from COVID-19 (26). Despite early studies, the presence of Spike mutations potentially altering the binding with ACE2 is still largely under-investigated, as is the role of ACE2 variants in the human population in determining patient-specific molecular interactions between these two proteins.
In the present study, we aim to detect which Spike and ACE2 AAs are the most important in determining the SARS-CoV-2 entry interaction and to analyze which ones have already mutated in the population. The task is clinically relevant, providing a functional characterization of present and future mutations targeting the ACE2/Spike binding and detected by sequencing SARS-CoV-2 on a patient-specific basis.
The variability of both proteins must be taken into consideration when developing anti-COVID-19 strategies, such as the Spike-based vaccine currently deployed by the National Institute of Allergy and Infectious Diseases and Moderna (27).
Results
We set out to analyze the key AAs involved in the Spike/ACE2 interaction, in order to highlight which ones may alter the binding affinity and therefore the etiological and clinical properties of different SARS-CoV-2 variants in different patients. Following that, we determined which Spike and ACE2 AA variations relevant for this interaction have been observed in the SARS-CoV-2 and human populations, respectively.
Structural analysis of Spike/ACE2 interaction
We obtained structural models of the SARS-CoV-2 Spike interacting with the human ACE2 from three recent X-ray structures deposited in the Protein Data Bank: 6LZG (10), 6M0J (11) and 6VW1 (12). For 6VW1, two Spike/ACE2 complexes were available, so we report results for both separately, as 6VW1-A and 6VW1-B. All models show the core domains of interaction, located in the region of AA 330-530 for Spike and in the region of AA 15-615 for ACE2. The full-length proteins would be 1273 AAs (the only known Spike isoform, from the reference SARS-CoV-2 genome NC_045512.2) and 805 AAs (ACE2 isoform 1, UniProt id Q9BYF1-1). The selected PDB entries are wild type, and their primary sequences and higher-order structures were identical. Residues 517-519 were missing in 6VW1-B. To investigate conformational variability, the PDB complexes were aligned by backbone and the Root Mean Square deviation (RMSd) was computed on all equivalent non-hydrogen atoms. The RMSd data showed some conformational flexibility, which supported our decision to take all PDB structures into account in the subsequent investigation (Fig 1).
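The RMSd comparison described above can be illustrated with a minimal sketch. It assumes the two structures have already been superposed on their backbones (the superposition step itself is not shown) and that the equivalent non-hydrogen atoms are listed in the same order; the coordinates below are hypothetical toy values, not taken from any PDB entry.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)
    points, assumed to be already superposed and listed in matching order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical toy coordinates (in Å): every atom of model_b is shifted by 1 Å along z.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
model_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd(model_a, model_b))  # 1.0
```

A pairwise table over the four models (6LZG, 6M0J, 6VW1-A, 6VW1-B) would follow by calling such a function on each pair of coordinate sets.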
The GBPM method was originally developed for identifying and scoring key pharmacophore and protein-protein interaction features by combining GRID molecular interaction fields (MIFs) according to the GRAB tool algorithm (28). In the present study, GBPM was applied to all selected complex models, considering Spike and ACE2 in turn as host or guest. The DRY, N1 and O GRID probes were used to describe hydrophobic, hydrogen bond donor and hydrogen bond acceptor interactions, respectively. For each probe a cut-off, required for highlighting the most relevant MIF points, was fixed above the 30% from the corresponding global minimum interaction energy value. Unlike the established GBPM application, where pharmacophore features are used for virtual screening purposes, here these data guided us in identifying the AAs that stabilize the complex. Specifically, Spike or ACE2 residues within 3 Å of GBPM points were marked as relevant to host-guest recognition and were qualitatively scored by assigning them the corresponding GBPM energy. If a residue was indicated by more than one GBPM point, its score was computed as the sum of the energies of the related GBPM points (Fig. 2). Finally, for each selected residue, the score averaged over the four models was used to estimate its role in complex stabilization.
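The residue scoring rule described above (residues within 3 Å of a GBPM point, energies summed per residue, then averaged over models) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data layout, the function names and the toy coordinates and energies are hypothetical, and dividing by the total number of models (so that a residue absent from a model contributes zero) is an assumption the text does not confirm.

```python
import math
from collections import defaultdict

CUTOFF = 3.0  # Å, the distance threshold stated in the text

def score_residues(atoms, gbpm_points, cutoff=CUTOFF):
    """atoms: list of (residue_id, (x, y, z)); gbpm_points: list of ((x, y, z), energy).
    Each residue receives the summed energy of all GBPM points lying within
    `cutoff` of at least one of its atoms (each point counted once per residue)."""
    scores = defaultdict(float)
    for point_xyz, energy in gbpm_points:
        near = {res for res, xyz in atoms if math.dist(xyz, point_xyz) <= cutoff}
        for res in near:
            scores[res] += energy
    return dict(scores)

def average_over_models(per_model_scores):
    """Average each residue's score over all models.
    Assumption: a residue not selected in a model contributes 0 for that model."""
    n_models = len(per_model_scores)
    totals = defaultdict(float)
    for scores in per_model_scores:
        for res, value in scores.items():
            totals[res] += value
    return {res: total / n_models for res, total in totals.items()}

# Hypothetical toy data: two atoms of one residue, one atom of another, four GBPM points.
atoms = [("N501", (0.0, 0.0, 0.0)), ("N501", (1.5, 0.0, 0.0)), ("Q493", (10.0, 0.0, 0.0))]
points = [((0.5, 0.0, 0.0), -4.0), ((2.0, 0.0, 0.0), -2.0),
          ((10.0, 2.0, 0.0), -3.0), ((20.0, 20.0, 20.0), -9.0)]

model_1 = score_residues(atoms, points)
print(model_1)
print(average_over_models([model_1, {"N501": -4.0}]))
```

In this toy case the fourth point is farther than 3 Å from every atom and therefore contributes to no residue.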
Taking into account their average scores, Spike and ACE2 AAs were divided into quartiles to facilitate the interpretation of the results: quartile 1 (Q1) includes the strongest contributors to complex stabilization; quartile 2 (Q2) contains residues less important than those in Q1 but more relevant than those in quartile 3 (Q3); quartile 4 (Q4) contains the weakest predicted interacting AAs. This extension of the original approach allowed us to highlight known relevant interaction residues of both Spike (Table 1) and ACE2 (Table 2). Nearly the same number of AAs was highlighted for Spike (26 AAs) and ACE2 (25 AAs), and the average scores were in the same range. Spike had a larger Q1 population than ACE2: 12 and 7 AAs, respectively. The opposite was observed in Q2, which accounted for 7 residues for Spike and 11 for ACE2. No remarkable difference between Spike and ACE2 emerged in Q3 and Q4. We reasoned that mutations and variants in Q1 residues could have a more relevant impact on complex stability. The analysis of all GBPM models suggested that Spike-ACE2 molecular recognition is largely sustained by polar interactions, such as hydrogen bonds, and by very few putative hydrophobic contributions (Table 3).
Mutational analysis of SARS-CoV-2 Spike
We analyzed 295,507 publicly available SARS-CoV-2 full-length genome sequences collected worldwide and deposited on the GISAID database on December 30, 2020 (29). From these, we obtained 257,434 samples containing at least one AA-changing mutation in the Spike protein. A total of 3,314 different AA-changing mutations were detected in the 1,273 AA-long Spike sequence. However, many of these are unique events (or possibly even sequencing errors), as only 2,023 mutations were found in more than one sample, 788 were found in more than ten samples, and 196 in more than one hundred samples (Supplementary File 1).
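The recurrence counts quoted above (mutations seen in more than one, ten or one hundred samples) amount to a simple tally. The sketch below shows one way to compute it; the input layout (one set of mutation labels per sample) and the toy samples are hypothetical and stand in for the per-genome Spike annotations.

```python
from collections import Counter

def recurrence_summary(sample_mutations, thresholds=(1, 10, 100)):
    """sample_mutations: iterable of collections of mutation labels, one per sample.
    Returns (number of distinct mutations,
             {t: number of mutations observed in more than t samples})."""
    counts = Counter()
    for mutations in sample_mutations:
        counts.update(set(mutations))  # count each mutation at most once per sample
    above = {t: sum(1 for c in counts.values() if c > t) for t in thresholds}
    return len(counts), above

# Hypothetical toy input: five samples, one of them without Spike mutations.
samples = [{"D614G", "N501Y"}, {"D614G"}, {"D614G", "S477N"}, {"N501Y"}, set()]
print(recurrence_summary(samples, thresholds=(1, 2)))  # (3, {1: 2, 2: 1})
```

Mutations that appear in a single sample, which the text flags as possible sequencing errors, are the ones excluded by the first threshold.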
We then focused on mutations located in the Spike RBD (AA 330-530) with a predicted interaction contribution, as assessed by our GBPM method. The majority of mutations here are found in only a handful of samples (Table 4 and Fig 4 A), with a few notable exceptions. The mutations S477N and N439K are the most frequent in the current population and were identified in 16,547 patients (5.60%) and 5,587 patients (1.89%), respectively. These two variants are also amongst the top 20 most frequent in the population and involve two positions that contribute productively to the interaction between Spike and ACE2, according to GBPM (see Table 1 and Fig 3 for locations 439 and 477).
Graphical inspection of the PDB structures revealed that Spike Asparagine (N) 439, ranked in GBPM Q2, is mainly involved in intra-protein interactions. In fact, by means of its backbone sp2 oxygen atom, N439 accepts one hydrogen bond from the Spike Serine 443 sidechain and, by its sidechain amide group, donates one hydrogen bond to the Spike Proline 499 backbone: all these AAs are located in a random coil loop of Spike, so N439K could minimally modify Spike-ACE2 recognition. On the other hand, after the theoretical mutation of Asparagine 439 to a Lysine, it is possible to predict a productive electrostatic interaction between the new net positively charged residue and ACE2 Glutamate 329. Such a long-distance interaction could improve the stabilization of the complex with respect to the Spike wild type (Figure S1).
A similar effect could be attributed to the mutation at position 477. Serine (S) 477 is a weak contributor to the complex interaction. In all PDB entries we selected, Serine 477 is located in a solvent-exposed random coil loop, and no interaction with ACE2 or Spike residues can be observed. Indeed, the GBPM analysis placed this residue in Q2. Conversely, in our in silico model its mutation to Asparagine (S477N) revealed the possibility of establishing a hydrogen bond with ACE2 Serine 19, which could stabilize the complex (Figure S2). Moreover, position 477 is also affected by three other events with lower occurrence: S477I, S477R and S477G, with 6, 2 and 2 observations, respectively (Table 4). Among these, S477R could be the most interesting one: a net positively charged residue, such as Arginine (R), can establish a weak electrostatic interaction with ACE2 Glutamate 87, as suggested by a theoretical model we built. S477I and S477G could modify the conformation of a random coil segment, so they do not appear very relevant. Conversely, S477N and S477R could productively contribute to the stabilization of the Spike/ACE2 complex. Of course, deeper theoretical and experimental investigations should be carried out to confirm this hypothesis. Unfortunately, full-scale simulations cannot be rigorously performed today because the available 3D structural models report only fragments of the complex between Spike and ACE2.
The third most common mutation, N501Y (Fig 3), targets an AA predicted to have a strong role in the interaction in all four models, sitting in GBPM Q1. N501Y was detected in 4,921 patients (1.67% of the dataset), the majority of whom were located in the United Kingdom (29).
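As a quick arithmetic check, the sample fractions quoted above for S477N, N439K and N501Y follow from the reported counts and the 295,507 analyzed genomes. The short sketch below reproduces them and also shows how a mutation label can be split into reference residue, position and substituted residue to test whether it falls in the RBD range used in the text (AA 330-530); the helper names are hypothetical.

```python
import re

TOTAL_GENOMES = 295_507   # genomes analyzed, as reported in the text
RBD_RANGE = (330, 530)    # Spike RBD boundaries used in the text
COUNTS = {"S477N": 16_547, "N439K": 5_587, "N501Y": 4_921}  # counts from the text

_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z*])$")  # '*' denotes a stop codon

def parse_mutation(label):
    """Split a label such as 'N501Y' into (reference AA, position, new AA)."""
    match = _MUTATION.match(label)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt

def in_rbd(label):
    """True if the mutated position lies within the RBD range."""
    _, pos, _ = parse_mutation(label)
    return RBD_RANGE[0] <= pos <= RBD_RANGE[1]

def percent(count, total=TOTAL_GENOMES):
    """Share of samples carrying a mutation, in percent, rounded to two decimals."""
    return round(100.0 * count / total, 2)

for label, count in COUNTS.items():
    print(label, in_rbd(label), percent(count))
```

The three percentages come out at 5.6, 1.89 and 1.67, matching the figures in the text, while a label such as D614G is reported as lying outside the RBD.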
From a structural point of view, we predict that the substitution, at position 501, of an Asparagine (N) with a Tyrosine (Y) may have an effect: their Total Polar Surface Areas (TPSA), equal to 101.29 and 78.43 Å², respectively, are different; however, both sidechains can donate or accept a hydrogen bond. Therefore, their contributions to complex stabilization may be slightly different, also taking into account the chemical environment. In fact, the wild type Asparagine 501 donates one hydrogen bond to ACE2 Tyrosine 41: such an interaction could also be possible for the N501Y mutant or, as we observed in our theoretical model, it could be replaced by pi-pi stacking (Figure S3). A rapid increase in the frequency of mutation N501Y has recently been observed in the United Kingdom and other countries, as it is one of the variants characterizing lineage B.1.1.7 (30). The Asparagine/Tyrosine substitution at Spike position 501 could contribute to an evolutionary advantage for this lineage, based on differential affinity for the human receptor ACE2 (31, 32).
A less frequent mutation amongst those predicted to contribute to the ACE2/Spike interaction is G476S, detected in 43 samples (0.02%) and supported by three out of four structural models (Table 1, Fig 4 B). Glycine (G) 476 was included by the GBPM analysis in Q2: its contribution to complex stabilization is weak. Unlike the other mutations described here, the replacement of Glycine 476 with a Serine (S) could have more evident effects on Spike/ACE2 molecular recognition. In fact, in all PDB entries, the alpha carbon of this Glycine is very close, about 4 Å, to the sidechain amide group of ACE2 Glutamine 24.
Between these two AAs no productive interaction can be established, but the substitution of the Spike Glycine with a Serine could allow one inter-protein hydrogen bond to ACE2 Glutamine 24. Moreover, G476S could establish the same interaction with Spike Glutamine 478, which could stabilize the conformation of a random coil segment of the viral protein, resulting in a better pre-organization for ACE2 recognition (Figure S4).
Another Spike residue predicted by our analysis to play a relevant role in ACE2 recognition is Glutamine 493 (Table 1). The GISAID data revealed that this amino acid is rarely replaced by a Leucine (Q493L) or by an Arginine (Q493R). These mutations could affect the recognition of ACE2 in opposite ways. Spike Glutamine 493 is involved in a hydrogen bond with ACE2 Glutamate 35. The Q493L mutant cannot establish such a productive contribution and could only interact hydrophobically with Spike Leucine 455. Conversely, Q493R could place its net positively charged sidechain in an ACE2 pocket delimited by Aspartate 30, Histidine 34 and Glutamate 35. Such a positioning could produce a remarkable electrostatic stabilization of the complex (Figure S5).
In general, we observed that the AAs with the strongest evidence for interaction contribution in the Spike/ACE2 interface tend not to diverge from the reference (Fig 4 B), which may indicate a solid evolutionary constraint to maintain the interface residues unchanged.
For example, one of the most relevant 1st quartile AAs in the ACE2/Spike interaction, Glutamine (Q) 493, is rarely mutated, with 12 cases of Q493L, 4 of Q493* (the substitution of Q493 with a stop codon), 3 of Q493K, and 1 each of Q493R and Q493H. One possible exception is the aforementioned Spike mutation N501Y, located in the strongest 1st quartile GBPM-predicted AA for ACE2 binding, which was found in the considerable number of 4,921 different patients.
Mutational analysis of human ACE2
We also investigated the variants of human ACE2, since these could constitute the basis for patient-specific COVID-19 susceptibility and severity. The ACE2 protein sequence is highly conserved across vertebrates (33) and also within the human species (34), with the most frequent missense mutation (rs41303171, N720D) present in 1.5% of the world population (Supplementary File 2). Our analysis shows that only 5 variants of ACE2 detected in the human population are also located in the ACE2/Spike direct binding interface (Table 5 and Fig 5). Of these, rs73635825 (causing an S19P AA variant) is both the most frequent in the population (0.06%) and the most relevant in the interaction with the viral protein, with a GBPM score of -47.6175 (Q1) and support from all 4 models (Table 2). The rs73635825 SNP frequency is higher in the population of African descent (0.2%). The second SNP, rs143936283 (E329G, Table 5), is a very rare allele (0.0066%) in the European (non-Finnish) Asian population. The rs766996587 (M82I) SNP is also a very rare allele (0.0066%), found in the African population. E37K (rs146676783) is more frequent in the Finnish population (0.03%) and G352V (rs370610075) in the European non-Finnish population (0.007%). None of these five SNPs has a reported clinical significance, according to dbSNP and literature search (35). It must be mentioned that M82I, together with S19P, has been predicted to adversely affect ACE2 stability (36).
M82I, together with E329G, has been simulated to increase binding affinity with Spike when compared to wild type ACE2, suggesting greater susceptibility to SARS-CoV-2 for patients carrying these variants (37). Instead, E37K (37) and G352V (38) were predicted to possess a lower affinity for Spike, suggesting lower susceptibility to the infection. However, while describing potential explanations for the existence of a possible predisposing genetic background to infection, all these studies remain inconclusive in linking allele variants to COVID-19 susceptibility.
Structurally, the S19P variant may greatly differ from the reference sequence in the interaction with Spike: Serine (S) is a polar residue, able to accept and donate a hydrogen bond by means of its side chain alcoholic group. Proline (P), on the other hand, cannot be involved in hydrogen bonding, and therefore should establish a weaker interaction with Spike. In fact, the ACE2 Serine 19 sidechain donates a hydrogen bond to the Spike Alanine 475 backbone (Figure S6) and could potentially establish the same interaction with Spike Glycine (G) 476, which could also be mutated (Table 4). Both Methionine (M) 82 and Glutamate (E) 329 are in Q3, minimally contributing to Spike/ACE2 recognition (Figures S7 and S8). They are located within two alpha helices, so their mutation could modify the secondary structure of ACE2, resulting in a different affinity for Spike.
Such a possibility should be more evident in the case of E329G, because the Glutamate 329 sidechain is involved in a hydrogen bond with ACE2 Glutamine 325.
Discussion
SARS-CoV-2 Spike evolved through a series of adaptive mutations that increased its affinity for the human ACE2 receptor (39). There is no reason to believe that the evolution and adaptation of the virus will stop, making continuous sequencing and mutational tracking studies of paramount importance to strategically contain COVID-19 (40). In our study, we highlighted which specific locations of Spike can influence the ACE2 molecular recognition required for viral entry into the host cell (5). We further showed that some mutations already present in the SARS-CoV-2 population may weakly affect the interaction with the human receptor, specifically Spike N439K, S477N and N501Y. These mutations are rising in the viral population (>1%) and, in particular, N501Y is one of the key mutations characterizing lineage B.1.1.7 (32), which has seen a recent dramatic increase in frequency in the United Kingdom (30). Having identified this mutation shows that our combination of targeted mutation frequency and GBPM is a useful pipeline to monitor events in the key region used by SARS-CoV-2 to recognize and enter human bronchial cells. The same approach can be used to monitor, in the future, whether any of these events will increase in frequency, suggesting an adaptation to the human host leveraging a higher affinity with ACE2.
On the other hand, we studied the variants in the human ACE2 population, identifying 5 loci that can affect the binding with SARS-CoV-2 Spike. They are all rare variants, with the most frequent, S19P, present in 0.06% of the population, and with no known clinical significance.
However, other in silico studies have predicted their role in decreasing ACE2 stability (S19P and M82I) (36) and in altering the affinity for Spike (increasing it: M82I and E329G (37); decreasing it: E37K (37) and G352V (38)). The most common ACE2 variant, rs41303171 (N720D), is not located in the binding region, and so far its predicted effects on the etiopathology of COVID-19 are still largely conjectural and associated with neurological complications via mechanisms that are probably independent of direct interaction with Spike (41). It remains to be seen whether, in the future, the combination of Spike and ACE2 sequences will produce novel and unexpected COVID-19 specificities that will require granular efforts in developing wider-spectrum anti-SARS-CoV-2 strategies, such as vaccines or antiviral drugs. So far, our analysis has shown a location on the Spike/ACE2 complex where both proteins vary in the viral/human population, specifically at ACE2 S19 and Spike A475/G476. While, as described in our Results, these mutations on Spike are not likely to strongly affect the interaction surface, future combinations of ACE2/Spike variants may have peculiar effects that will require constant mutation monitoring. Identifying single or multiple AAs involved in this viral entry interaction will allow for personalized diagnosis and clinical prediction based on the specific combination of SARS-CoV-2 strain and ACE2 variant.
Personalized COVID-19 treatment will require targeted sequencing of the patient's ACE2 and of Spike to identify the combination causing the specific case. This technical obstacle can be further complicated by the intra-host genetic variability of SARS-CoV-2, which has recently been reported in RNA-Sequencing studies (42). Structural investigation will benefit, in the near future, from the availability of experimental structural models covering the complete sequence of both Spike and ACE2, or at least of Spike. This will allow more rigorous computational analyses (e.g. molecular dynamics simulation, free energy perturbation) of the effect of mutations on Spike/ACE2 recognition. Beyond the complex investigated in this manuscript, our approach can be extended to any other partners in the SARS-CoV-2/human interactome, for example the recently discovered interaction between the viral protease NSP5 (43) and the human histone deacetylase HDAC2 (44), which is indirectly responsible for the transcriptional activation of pro-inflammatory genes. Our approach can also be extended to other viruses that exploit human receptors as an entry mechanism, such as CD4 for the Human Immunodeficiency Virus (HIV) or TIM-1 for the Ebola virus (45).

Materials and Methods

Structural analysis

The PDB (46) was searched for high-resolution Spike/ACE2 complexes. PDB entries 6LZG (10), 6M0J (11) and 6VW1 (12), which report the Spike RBD interacting with ACE2, were retrieved and used for our GBPM analysis (28). This computational approach compares GRID (47) molecular interaction fields (MIFs) computed on a generic complex (A) and, separately, on its host (B) and guest (C) components. MIFs describe the interaction between a given probe and a given target.
If the target is a complex, then, depending on the selected area, the MIF energies can refer to the interaction between the probe and one of the complex subunits or, at the host/guest interface, with both of them. The GBPM analysis objectively highlights the latter. Five steps are required:
(1) the complex A is disassembled into its subunits B and C;
(2) MIFs are computed on A, B and C using the most appropriate GRID probes. A hydrogen bond acceptor/donor and a generic hydrophobic probe can describe the basic interaction. Because GRID MIFs are stored as a 3D matrix of interaction energy points (IEP), the same box dimensions are adopted in all calculations;
(3) each IEP of B is compared with the equivalent point of A, generating a new MIF named D. The following algorithm, available in the GRAB tool, is applied:
    if IEP(A) > 0 and IEP(B) > 0 then IEP(D) = 0;
    if IEP(A) > 0 and IEP(B) < 0 then IEP(D) = IEP(B);
    if IEP(A) < 0 and IEP(B) > 0 then IEP(D) = -IEP(A);
    if IEP(A) < 0 and IEP(B) < 0 then IEP(D) = IEP(A)-IEP(B).
The resulting MIF D reports as negative energy values the productive interaction between the GRID probe and B and the interface of A and B;
(4) in order to obscure the interaction between the probe and B, MIFs D and C are compared using the GRAB approach, producing a new MIF E;
(5) finally, the most relevant interaction points (GBPM features) of the MIF E are selected using an energy cutoff 15% above the global minimum.
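The point-wise comparison in step (3) and the cutoff in step (5) can be illustrated with a minimal Python sketch. This is not the GRAB tool itself: the function names are hypothetical, MIFs are represented as flat lists of energies rather than 3D matrices, and the third rule is transcribed exactly as printed in the text. Two further assumptions are labeled in the comments: the text does not say what happens when an energy is exactly zero, and "15% above the global minimum" is read here as 85% of the (negative) minimum energy. The order of the arguments in step (4) is not specified in the text, so the sketch does not chain the steps.

```python
from typing import List, Sequence


def grab_combine(iep_a: float, iep_b: float) -> float:
    """Combine one interaction energy point of MIF A with the equivalent
    point of MIF B, following the four rules quoted in step (3)."""
    if iep_a > 0 and iep_b > 0:
        return 0.0
    if iep_a > 0 and iep_b < 0:
        return iep_b
    if iep_a < 0 and iep_b > 0:
        return -iep_a  # transcribed as printed in the text
    if iep_a < 0 and iep_b < 0:
        return iep_a - iep_b
    # Assumption: points where either energy is exactly zero are not
    # covered by the published rules; this sketch maps them to 0.
    return 0.0


def grab_compare(mif_a: Sequence[float], mif_b: Sequence[float]) -> List[float]:
    """Apply grab_combine to every pair of equivalent points.
    Both MIFs must share the same box, hence the same number of points."""
    if len(mif_a) != len(mif_b):
        raise ValueError("MIFs must be computed on the same box")
    return [grab_combine(a, b) for a, b in zip(mif_a, mif_b)]


def select_features(mif_e: Sequence[float], fraction: float = 0.15) -> List[int]:
    """Return the indices of points whose energy lies within `fraction`
    of the global minimum (assumed reading of the 15% cutoff)."""
    e_min = min(mif_e)
    if e_min >= 0:
        return []  # no favorable interaction points at all
    cutoff = e_min * (1.0 - fraction)
    return [i for i, energy in enumerate(mif_e) if energy <= cutoff]
```

Keeping the rule as a scalar function makes each of the four cases easy to check in isolation before it is applied to a full grid.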
Supplementary figures focusing on the most relevant mutations are available in Supplementary File 3. Before starting the GBPM analysis, co-crystallized water molecules were removed from the PDB structures. In 6VW1, which shows two Spike-ACE2 complexes (chains A-E and B-F), both structures were investigated and are reported as model A and model B, respectively. All selected complexes were conformationally compared with one another by alignment and by computing the RMSd on the Cartesian coordinates of equivalent non-hydrogen atoms. The original GRID probes DRY, N1 and O were used to highlight hydrophobic, hydrogen bond donor and hydrogen bond acceptor areas, respectively. In order to identify the most relevant residues of both Spike and ACE2, we conceptually and technically extended the GBPM algorithm, originally designed for drug/target interactions (28). In the GBPM analysis presented here, each of the two interacting proteins was considered in turn as host and as guest unit, and relevant AAs were selected if their distance from GBPM features was less than or equal to 3 Å. For each PDB model, the selected residues were scored as the sum of the interaction energies of the corresponding GBPM features. In order to prevent unrealistic distortion of the Spike-ACE2 complex, due to the use of structures that do not cover the full length of the interacting proteins, the effect of mutations was estimated qualitatively by means of the mutagenesis tool implemented in the PyMol software (48). Wild-type residues were replaced by the mutated ones, and the new side-chain conformations were optimized taking into account the neighboring AAs. The graphical analysis was carried out on the predicted most populated rotamers. On the basis of its better X-ray resolution, the 6M0J PDB structure was selected for the investigation reported above.

Genetic analysis

SARS-CoV-2 genome sequences from human hosts, accounting for a total of 145,201 submissions, were obtained from the GISAID database on 15 October 2020 (29).
Low-quality (more than 5% uncharacterized nucleotides) and incomplete (<29,000 nucleotides, based on a total reference length of 29,903) sequences were removed. The resulting 135,591 genome sequences were aligned to the reference SARS-CoV-2 Wuhan genome (NCBI entry NC_045512.2) using the NUCMER algorithm (49). Position-specific nucleotide differences were merged for neighboring events and converted into protein mutations using the coronapp annotator (17). The results were further filtered for AA-changing mutations targeting the Spike protein. ACE2 variants in the human population were extracted from the gnomAD database, v3, 18 July 2020 (50). We considered only missense variants affecting specific AAs in the protein sequence, for a total of 155 entries (Supplementary File 2). Graph generation was performed with the R statistical software and the corto package v1.1.2 (51).

Acknowledgments

We thank the Italian Ministry of Education and Research for their financial support under the Montalcini initiative. We thank Prof. Giovanni Perini for his continued support and scientific enthusiasm, Prof. Massimo Battistini for his lessons on logic and writing, Prof. Elena Bacchelli for her suggestions on the use of gnomAD, and Prof. Stefano Alcaro, who provided the computational resources required by the GBPM analysis. Finally, we thank Mr. George Wolf for the final proofreading of the manuscript.
References

Figures and Tables

Figure 1. Conformational comparison of Spike-ACE2 PDB complexes: (A) alignment of PDB entries, with Spike and ACE2 surrounded by cyan and orange fog, respectively, and (B) bar graph showing RMSd (in Å) computed on structures aligned without hydrogen atoms.

Pairwise RMSd values (Å) recovered from panel B:

          6LZG    6M0J    6VW1-A  6VW1-B
6LZG      0.00    0.34    1.43    2.19
6M0J      0.34    0.00    1.40    2.18
6VW1-A    1.43    1.40    0.00    1.48
6VW1-B    2.19    2.18    1.48    0.00

Figure 2. Summary of the pipeline adopted by GBPM to identify key residues contributing to the SARS-CoV-2 Spike / Human ACE2 interface.
Spike is depicted in cyan, and ACE2 in orange, based on the 6LZG PDB model (10). Residues highlighted by GBPM are then tested for mutation frequency in the worldwide SARS-CoV-2 population.

Figure 3. 3D ribbon representation of the interaction domains of SARS-CoV-2 Spike (left, orange) and human ACE2 (right, green), based on the crystal structure 6LZG deposited in the Protein Data Bank (10). The positions of the three most frequent Spike mutations in the interacting region (AA 350-550) with a non-zero GBPM score are indicated: N439K, N501Y and S477N.

Figure 4. (A) Occurrence of AA-changing variants on the SARS-CoV-2 Spike protein. The x-axis indicates the position of the affected AA. The y-axis indicates the log10 of the number of occurrences of the variant in the SARS-CoV-2 dataset. Labels indicate variants affecting ACE2/Spike binding and detected in at least 5 SARS-CoV-2 sequences. Vertical dashed lines indicate the crystallized region analyzed (aa 330-530). The D614G variant, located outside the RBD, is also indicated.
(B) Scatter plot indicating the occurrence of the variant in the population (x-axis) and the GBPM score of the reference AA in the model (y-axis). Mutations with non-zero GBPM score are indicated. CC indicates the Pearson correlation coefficient and p indicates the p-value of the CC.

Figure 5. Frequency of mutations on ACE2. The x-axis indicates the AA position in isoform 1 (UniProt Q9BYF1-1). The y-axis indicates the allele frequency in the global population according to the gnomAD v3 database. Labels indicate AA changes observed in the human population with non-zero GBPM average score in the ACE2/Spike interaction models. Vertical dashed lines indicate the crystallized region analyzed in this study (aa 15-615).

Table 1. GBPM scores, average values, and quartile distribution of Spike relevant AAs in three PDB models. GBPM scores and average values are reported in kcal/mol.
                PDB entries                           GBPM
Residue  #      6LZG     6M0J     6VW1-A   6VW1-B     Average score  Quartile
LYS      417    -43.58   -12.12   0.00     0.00       -13.93         Q2
ASN      439    0.00     0.00     -12.30   -34.94     -11.81         Q2
GLY      446    -22.52   -5.75    0.00     -10.32     -9.65          Q3
GLY      447    -5.63    0.00     0.00     0.00       -1.41          Q3
TYR      449    -25.72   -6.38    -20.37   -24.76     -19.31         Q1
TYR      453    0.00     0.00     -1.77    -1.76      -0.88          Q4
LEU      455    -11.59   -16.82   -21.78   -7.04      -14.31         Q2
PHE      456    -34.20   -30.16   -39.72   -20.76     -31.21         Q1
ALA      475    -52.35   -49.72   -38.73   -77.00     -54.45         Q1
GLY      476    -21.72   0.00     -17.16   -34.59     -18.37         Q2
SER      477    -22.32   0.00     -11.44   -40.68     -18.61         Q2
GLU      484    -8.52    -13.23   0.00     0.00       -5.44          Q3
PHE      486    -28.99   -53.63   -32.56   -53.43     -42.15         Q1
ASN      487    -31.67   -59.57   -33.98   -52.21     -44.36         Q1
TYR      489    -62.10   -27.67   -45.92   -69.38     -51.27         Q1
PHE      490    -4.58    -4.48    -22.90   -40.32     -18.07         Q2
GLN      493    -37.20   -56.08   -79.60   -70.51     -60.85         Q1
GLY      496    -15.54   -8.74    -18.72   -16.80     -14.95         Q2
PHE      497    -8.86    0.00     -4.68    -29.10     -10.66         Q3
GLN      498    -77.24   -80.38   -42.34   0.00       -49.99         Q1
PRO      499    0.00     0.00     0.00     -11.64     -2.91          Q3
THR      500    0.00     -66.00   -92.90   -122.50    -70.35         Q1
ASN      501    -60.14   -61.04   -61.82   -70.59     -63.40         Q1
GLY      502    -24.84   -35.42   -39.45   -40.92     -35.16         Q1
VAL      503    0.00     -5.37    -5.45    -5.54      -4.09          Q3
TYR      505    -30.60   -23.22   -20.90   -40.62     -28.84         Q1

Table 2. GBPM scores, average values, and quartile distribution of ACE2 relevant AAs in three PDB models. GBPM scores and average values are reported in kcal/mol.
                PDB entries                           GBPM
Residue  #      6LZG     6M0J     6VW1-A   6VW1-B     Average score  Quartile
SER      19     -31.45   -26.08   -53.61   -79.33     -47.62         Q1
GLN      24     -31.15   -23.62   -34.15   -85.23     -43.54         Q1
THR      27     -16.93   -32.58   -38.70   -16.65     -26.22         Q2
PHE      28     -20.68   -25.02   -14.10   -27.48     -21.82         Q2
ASP      30     0.00     -17.01   0.00     0.00       -4.25          Q3
LYS      31     -84.06   -43.67   -32.98   -46.60     -51.83         Q1
HIS      34     0.00     -30.42   -27.78   -67.56     -31.44         Q2
GLU      35     -11.73   0.00     0.00     -19.40     -7.78          Q2
GLU      37     -11.58   -20.36   -11.83   -20.52     -16.07         Q2
ASP      38     -41.09   -40.52   -25.75   -34.16     -35.38         Q2
TYR      41     -52.50   -75.07   -62.35   -76.07     -66.50         Q1
GLN      42     -36.78   -37.15   -28.53   -63.49     -41.49         Q2
LEU      45     -12.80   -16.43   0.00     -16.20     -11.36         Q2
LEU      79     0.00     0.00     0.00     -5.99      -1.50          Q3
MET      82     0.00     0.00     -6.36    -6.00      -3.09          Q3
TYR      83     -40.50   -66.29   -57.86   -60.81     -56.37         Q1
GLU      329    0.00     0.00     0.00     -17.25     -4.31          Q3
ASN      330    -11.84   -5.92    -11.82   -6.04      -8.91          Q2
GLY      352    -1.97    -8.36    -8.86    -14.66     -8.46          Q2
LYS      353    -79.38   -70.11   -120.73  -46.03     -79.06         Q1
GLY      354    -21.87   -31.15   -12.74   -15.25     -20.25         Q2
ASP      355    -68.95   -81.24   -57.99   -89.12     -74.33         Q1
ARG      357    0.00     -4.99    0.00     0.00       -1.25          Q3
ALA      386    0.00     0.00     -4.85    0.00       -1.21          Q4
ARG      393    0.00     0.00     -4.85    0.00       -1.21          Q4

Table 3. Composition of the GBPM models designed. HBD = Hydrogen Bond Donor; HBA = Hydrogen Bond Acceptor; # = number of features; AIE = Average Interaction Energy (in kcal/mol).
GBPM Feature   6LZG          6M0J          6VW1-A        6VW1-B        Host/Guest
               #    AIE      #    AIE      #    AIE      #    AIE
Hydrophobic    4    -2.07    4    -1.82    5    -2.05    3    -2.12    Spike/ACE2
HBD            18   -6.48    15   -6.47    17   -6.22    19   -6.31    Spike/ACE2
HBA            4    -6.61    13   -5.25    12   -5.47    14   -5.48    Spike/ACE2
Hydrophobic    1    -1.49    3    -1.16    2    -1.49    1    -1.76    ACE2/Spike
HBD            18   -6.26    18   -6.32    24   -5.63    28   -5.94    ACE2/Spike
HBA            7    -4.84    10   -4.53    9    -4.98    12   -4.60    ACE2/Spike

Table 4. Spike mutations located within the RBD (AA 330-530) with at least two cases in the population and non-zero GBPM average score in the ACE2/Spike interaction models. The asterisk (*) indicates a stop codon. A lower GBPM score indicates a stronger effect in the ACE2/Spike interaction.
Mutation  Position  Abundance  Frequency  GBPM Average Score  Quartile
S477N     477       16547      0.055995   -18.61              Q2
N439K     439       5587       0.018906   -11.81              Q2
N501Y     501       4921       0.016653   -63.3975            Q1
Y453F     453       917        0.003103   -0.8825             Q4
E484K     484       352        0.001191   -5.4375             Q3
K417N     417       260        0.00088    -13.925             Q2
S477I     477       157        0.000531   -18.61              Q2
G446V     446       58         0.000196   -9.6475             Q3
F490S     490       53         0.000179   -18.07              Q2
S477R     477       49         0.000166   -18.61              Q2
N501T     501       47         0.000159   -63.3975            Q1
L455F     455       44         0.000149   -14.3075            Q2
G476S     476       43         0.000146   -18.3675            Q2
E484Q     484       43         0.000146   -5.4375             Q3
A475V     475       35         0.000118   -54.45              Q1
F486L     486       34         0.000115   -42.1525            Q1
F490L     490       18         6.09E-05   -18.07              Q2
YQ505WK   505       14         4.74E-05   -28.835             Q1
Q493L     493       12         4.06E-05   -60.8475            Q1
V503F     503       9          3.05E-05   -4.09               Q3
E484A     484       8          2.71E-05   -5.4375             Q3
G446S     446       7          2.37E-05   -9.6475             Q3
E484D     484       4          1.35E-05   -5.4375             Q3
Q493*     493       4          1.35E-05   -60.8475            Q1
Y505W     505       4          1.35E-05   -28.835             Q1
G476A     476       3          1.02E-05   -18.3675            Q2
S477G     477       3          1.02E-05   -18.61              Q2
F456L     456       2          6.77E-06   -31.21              Q1
V503I     503       2          6.77E-06   -4.09               Q3
Y449F     449       2          6.77E-06   -19.3075            Q1

Table 5. ACE2 variants with non-zero GBPM score in the Spike interaction model.

Variant  rsID         Allele Frequency  GBPM Average Score  Quartile
S19P     rs73635825   0.000655          -47.62              Q1
E329G    rs143936283  6.63E-05          -4.31               Q3
M82I     rs766996587  6.62E-05          -3.09               Q3
E37K     rs146676783  5.68E-05          -16.07              Q2
G352V    rs370610075  3.8E-05           -8.46               Q2

Supplementary Files Description

Supplementary File 1: table of SARS-CoV-2 Spike mutations (source: GISAID database, 29 December 2020), indicating position, frequency in the sequenced SARS-CoV-2 genomes and GBPM score (lower: predicted stronger effect in the Spike/ACE2 interaction).

Supplementary File 2: table of human ACE2 variants (source: gnomAD database, v3, 18 July 2020), indicating position, allele frequency in the human population and GBPM score (lower: predicted stronger effect in the Spike/ACE2 interaction).

Supplementary File 3: supplementary figures focusing on the most relevant mutations described in this study, with structural, chemical and positional considerations.

10_1101-2020_11_13_381475 ---- Triplex and other DNA motifs show motif-specific associations with mitochondrial DNA deletions and species lifespan

Authors
Kamil Pabis 1
1. Georg August University of Göttingen, Göttingen, Germany.
Mail: Kamil.pabis@gmail.com

ABSTRACT

The “theory of resistant biomolecules” posits that long-lived species show resistance to molecular damage at the level of their biomolecules.
Here, we test this hypothesis in the context of mitochondrial DNA (mtDNA) as it implies that predicted mutagenic DNA motifs should be inversely correlated with species maximum lifespan (MLS).

First, we confirmed that guanine-quadruplex and direct repeat (DR) motifs are mutagenic, as they associate with mtDNA deletions in the human major arc of mtDNA, while also adding mirror repeat (MR) and intramolecular triplex motifs to a growing list of potentially mutagenic features. What is more, triplex motifs showed disease-specific associations with deletions and an apparent interaction with guanine-quadruplex motifs.

Surprisingly, even though DR, MR and guanine-quadruplex motifs were associated with mtDNA deletions, their correlation with MLS was explained by the biased base composition of mtDNA. Only triplex motifs negatively correlated with MLS even after adjusting for body mass, phylogeny, mtDNA base composition and effective number of codons.

Taken together, our work highlights the importance of base composition for the comparative biogerontology of mtDNA and suggests that future research on mitochondrial triplex motifs is warranted.

ABBREVIATIONS

BPs, mtDNA deletion break points
DR, direct repeats
ER, everted repeats
GQ, guanine-quadruplexes
IR, inverted repeats
MLS, species maximum lifespan
MR, mirror repeats
nBMST, non-B DNA motif search tool
Nc, number of effective codons
PGLS, phylogenetic generalized least squares
SD, standard deviation
Trip, Triplex forming motif
XR, any repeat half-site or motif
mtDNA, mitochondrial DNA

bioRxiv preprint, this version posted January 5, 2021; doi: https://doi.org/10.1101/2020.11.13.381475. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

INTRODUCTION

Macromolecular damage to lipids, proteins and DNA accumulates with aging (Richardson and Schadt 2014, Gladyshev 2013), whereas cells isolated from long-lived species are resistant to genotoxic and cytotoxic drugs, giving rise to the multistress resistance theory of aging (Miller 2009, Hamilton and Miller 2016). By extension of this idea, the “theory of resistant biomolecules” posits that lipids, proteins and DNA itself should be resilient in long-lived species (Pamplona and Barja 2007). In support of this theory, it was shown that long-lived species possess membranes that contain fewer lipids with reactive double bonds (Valencak and Ruf 2007) and perhaps a lower content of oxidation-prone cysteine and methionine in mitochondrially encoded proteins (see Aledo et al. 2012 for a discussion).

Mitochondrial DNA (mtDNA) mutations constitute one type of macromolecular damage that accumulates over time. Point mutations accumulate in proliferative tissues like the colon and in some progeroid mice (Kauppila et al. 2017), while the accumulation of mtDNA deletions in postmitotic tissues may underpin certain age-related diseases like Parkinson’s and sarcopenia (Lawless et al. 2020, Bender et al. 2006).
If the theory of resistant biomolecules can be generalized, the mtDNA of long-lived species should resist both point mutation and deletion formation. However, we will focus on deletions because they are more pathogenic than point mutations at the same level of heteroplasmy (Gamamge et al. 2014) and human tissues do not accumulate the high levels of point mutations observed in progeroid mouse models (Khrapko et al. 2006).

Since deletion formation depends on the primary sequence of the mtDNA (sequence motifs), it is amenable to bioinformatic methods. Ever since a link between direct repeat (DR) motifs and deletion formation became known, variations of the theory of resistant biomolecules have been tested, although not necessarily under this name. It was reasoned that long-lived species evolved to resist deletion formation and mtDNA instability by reducing the number of mutagenic motifs in their mtDNA (Khaidakov et al. 2006, Yang et al. 2013).

We aim to extend these findings by re-evaluating known candidate motifs and establishing new ones, which we then correlate with species maximum lifespan (MLS). Studying multiple motif classes at once also allows us to reveal relationships between potentially overlapping mtDNA motifs that may affect the data. We define candidate motifs as those that are associated with deletion formation inside the major arc of human mtDNA, because during asynchronous replication the major arc is single stranded for extended periods of time (Persson et al. 2019), which should favor the formation of secondary structures. Finally, we test if these motifs correlate with the MLS of mammals, birds and ray-finned fishes after correcting for potential biases, especially global mtDNA base composition, which is an important confounder (Aledo et al. 2012) yet is neglected in some studies (Yang et al. 2013).
The choice of motifs to study is based on biological plausibility and on published literature, which is briefly reviewed below. Mutagenic motifs include repeats as well as guanine-quadruplex (GQ)- and triplex-forming motifs. DR motifs can lead to DNA instability through strand slippage if two DR motifs mispair during replication (Persson et al. 2019). In contrast, inverted repeat (IR), G-quadruplex and triplex motifs destabilize progression of the replication fork through the formation of stable secondary structures. The structures formed include hairpins for IR motifs (Tremblay-Belzile et al. 2015), triple-stranded DNA for triplex motifs and bulky stacks of guanines for G-quadruplex motifs (Bacolla et al. 2016; Fig. 1). Mirror repeat (MR) and everted repeat (ER) motifs, in contrast, do not allow stable Watson-Crick base pairing and are thus less likely to be mutagenic, although a subset of MR motifs may form triplex structures (Kamat et al. 2016).

Thus, many motifs can be mutagenic in principle, but what is the evidence that these motifs are related to mtDNA instability, particularly deletions, and to MLS?

Paradoxically, although DRs are the motif most consistently associated with mtDNA deletion breakpoints (BPs), recent studies saw no correlation with species MLS (Lakshmanan et al. 2015), despite preliminary reports (Khaidakov et al. 2006, Lakshmanan et al. 2012, Yang et al. 2013). In contrast, with the exception of one preprint (Mikhailova et al. 2020), IRs are not known to be associated with mtDNA deletions (Dong et al. 2014), although they do show a negative relationship with species MLS (Yang et al.
2013) and may contribute to inversions (Tremblay-Belzile et al. 2015). Whether age-related mtDNA inversions underlie any pathology, however, requires further study. Finally, G-quadruplex motifs are associated with both deletions (Dong et al. 2014) and point mutations (Butler et al. 2020), but no study has tested whether they correlate with MLS. Triplex motifs are poorly studied, with one report finding no association between these motifs and deletions (Oliveira et al. 2013).

Based on these studies, we decided to test the theory of resistant biomolecules by quantifying DR, MR, IR, ER, G-quadruplex- and triplex-forming motifs. We stipulate that if a motif class played a causal role in aging, it should be involved in deletion formation and its abundance should be negatively correlated with species MLS.

Figure 1
A. Direct repeat, both half-sites have the same orientation.
B. Inverted repeat, the half-sites are complementary and have mirror symmetry.
C. Everted repeat, the half-sites are complementary.
D. Mirror repeat, the half-sites have mirror symmetry.
E. Triplex motifs can form a triple helical DNA structure also called H-DNA.
F. In a G-quadruplex, multiple G-quartets (depicted as blue rectangles) stack on top of each other.
Adapted from Gurusaran et al. (2013) and Khristich and Mirkin (2020) with permission. Half-sites shown in red.

METHODS

Detection of DNA motifs

Repeats were detected by a script written in R (v3.6.3). Briefly, to find all repeats with N basepairs (bps), the mtDNA light strand is truncated by 0 to N bps and each of the N truncated mtDNAs is then split every N bps.
This generates every possible substring (and thus repeat) of length N. In the next step, duplicate strings are removed. Afterwards we can find DR (a substring with at least two matches in the mtDNA), MR (at least one match in the mtDNA and on its reverse), IR (at least one match in the mtDNA and on its reverse-complement) and ER motifs (at least one match in the mtDNA and on its complement). Overlapping and duplicate repeats were not counted for the correlation between repeats and MLS. The code for the analyses performed in this paper can be found on GitHub (pabisk/aging_triplex2).

Unless stated otherwise, all analyses were performed in R. G-quadruplex motifs were detected by the pqsfinder package (v2.2.0, Hon et al. 2017). Intramolecular triplex-forming motifs were detected by the triplex package (v1.26.0, Hon et al. 2013) and duplicates were removed. We also compared the data with two other publicly available tools, Triplexator (Buske et al. 2013) and the non-B DNA motif search tool (nBMST; Cer et al. 2011). Triplexator was run on a virtual machine in Oracle VM VirtualBox (v6.1) in -ss mode on the human mitochondrial genome and its reverse complement; the results were combined and overlapping motifs from the output were removed. We used the web interface of nBMST (v1.0) to detect mirror repeats/triplexes.

Association between motifs and major arc deletions

The major arc was defined as the region between position 5747 and 16500 of the human mtDNA (NC_012920.1). The following deletions, whose breakpoints were located in this region, were included: 1066 deletions from the MitoBreak database (Damas et al. 2014, mtDNA Breakpoints.xlsx), 1114 from Persson et al. (2019) and 1894 from Hjelm et al. (2019).

Each deletion is defined by two breakpoints.
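The repeat definitions given under "Detection of DNA motifs" can be illustrated with a short Python sketch. This is not the authors' R script: the function names are hypothetical, substrings are enumerated with a sliding window rather than by truncating and splitting, and the removal of overlapping half-sites described in the text is not implemented.

```python
from collections import Counter

# Complement table for an unambiguous DNA alphabet (assumption: only A, C, G, T occur).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq, n):
    """Return every substring of length n, one per start position."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def classify_repeats(seq, n):
    """Group the length-n substrings of seq into DR, MR, IR and ER candidates.

    DR: occurs at least twice in seq.
    MR: also occurs in the reversed sequence.
    IR: also occurs in the reverse-complemented sequence.
    ER: also occurs in the complemented sequence.
    Overlapping half-sites are NOT filtered out in this sketch.
    """
    counts = Counter(kmers(seq, n))
    present = set(counts)
    comp = seq.translate(COMPLEMENT)
    return {
        "DR": {k for k, c in counts.items() if c >= 2},
        "MR": present & set(kmers(seq[::-1], n)),
        "IR": present & set(kmers(comp[::-1], n)),
        "ER": present & set(kmers(comp, n)),
    }


if __name__ == "__main__":
    # Toy sequence, not mtDNA.
    for motif_class, found in classify_repeats("ACGTTACG", 3).items():
        print(motif_class, sorted(found))
```

Because the sets are built from all start positions, a substring that is its own mirror image or reverse complement is reported as a match with itself; a faithful reimplementation would need the overlap filter mentioned in the text.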
A breakpoint pair was considered to associate with a motif if the motif fell within a defined window around one or both breakpoints, depending on the analysis. The window size was chosen in relation to the length of the studied motifs (30 bp for repeats and 50 bp for other motifs).

Three different motif orientations relative to the breakpoints were considered. For motifs with half-sites (i.e. repeats) there were two orientations: either both half-sites at any one breakpoint of a deletion, or one half-site per breakpoint of a deletion. Motifs with overlapping half-sites were not counted. In the third case, distinct G-quadruplex and triplex motifs could associate with one or both breakpoints of a deletion, but were counted at most once, since the latter case is sufficiently rare.

In order to exclude overlapping “hybrid” motifs, MR and DR motifs with the same sequence were removed, whereas triplex and G-quadruplex motifs were removed if they were in proximity.

To generate controls, the mtDNA deletions as a whole were randomly redistributed inside the major arc which, because of the fixed deletion size, allowed us to approximate the original distribution of breakpoints (as suggested by Oliveira et al. 2013). Significance was determined via one-sample t-test in Prism (v7.04) by comparing actual breakpoints to 20 such randomized controls. Alternative controls were generated by shifting each breakpoint by 200 bp towards the midpoint of the major arc or as in Fig. S9.
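The association test and the reshuffled controls described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline (which used R and Prism): each motif is reduced to a single coordinate, the window is read as plus or minus `window` bp around a breakpoint, and all function names are hypothetical. Only the t statistic is computed; a p-value would additionally require the t distribution with n-1 degrees of freedom.

```python
import random
import statistics

MAJOR_ARC = (5747, 16500)  # major arc coordinates as given in the text


def count_associated(deletions, motif_positions, window):
    """Count deletions with a motif within `window` bp of at least one breakpoint."""
    hits = 0
    for start, end in deletions:
        if any(abs(m - start) <= window or abs(m - end) <= window
               for m in motif_positions):
            hits += 1
    return hits


def reshuffle(deletions, rng, arc=MAJOR_ARC):
    """Move each deletion to a random place inside the arc, keeping its size fixed."""
    shuffled = []
    for start, end in deletions:
        size = end - start
        new_start = rng.randint(arc[0], arc[1] - size)
        shuffled.append((new_start, new_start + size))
    return shuffled


def one_sample_t(control_counts, observed):
    """t statistic for H0: the mean of the control counts equals the observed count.

    Raises ZeroDivisionError if all control counts are identical.
    """
    mean = statistics.mean(control_counts)
    sd = statistics.stdev(control_counts)
    return (mean - observed) / (sd / len(control_counts) ** 0.5)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy data: deletions as (5' breakpoint, 3' breakpoint), motifs as midpoints.
    deletions = [(6000, 7000), (8000, 13000), (9000, 15000)]
    motifs = [6020, 12990, 16000]
    observed = count_associated(deletions, motifs, window=50)
    controls = [count_associated(reshuffle(deletions, rng), motifs, window=50)
                for _ in range(20)]  # 20 randomized controls, as in the text
    print(observed, controls)
```

A strongly negative statistic (controls well below the observed count) corresponds to enrichment of the motif around real breakpoints.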
Cancer-associated breakpoints

We obtained all autosomal breakpoints available from the Catalogue Of Somatic Mutations In Cancer (COSMIC; release v92, 27th August 2020), which includes deletions, inversions, duplications and other abnormalities (n=587515 in total). After removing breakpoints whose sequences could not be retrieved (<1.7%), we quantified the number of predicted G-quadruplex and triplex motifs in a 500 bp window centered on the breakpoints using default settings for the detection of these motifs. Sequences of breakpoint regions were obtained from the GRCh38 build of the human genome using the BSgenome package (v1.3.1). Each breakpoint shifted by +3000 bps served as its own control.

Lifespan, base composition and life history traits

We included three phylogenetic classes in our analysis for which we had sufficient data (n>100): mammals, birds and ray-finned fishes (Actinopterygii). MLS and body mass were determined from the AnAge database (Tacutu et al. 2018) and, for mammals, supplemented with data from Pacifici et al. (2013). The mtDNA accessions were obtained from an updated version of MitoAge (unpublished; Toren et al. 2016). Species were excluded if body mass data was unavailable, if the sequence could not be obtained using the genbankr package (v1.14.0), or if the extracted cytochrome B DNA sequence did not allow for an alignment, precluding phylogenetic correction. The species data can be found in the supplementary material (Species Data.xlsx).

We analyzed the full mtDNA sequence, heuristically defined as the mtDNA sequence between the first and last encoded tRNA, excluding the D-loop, which is rarely involved in repeat-mediated deletion formation (Yang et al. 2013). The effective number of codons was calculated using Wright’s Nc (Smith et al. 2019). Base composition was calculated for the light-strand.
GC skew was calculated as the fraction (G − C)/(G + C) and AT skew as (A − T)/(A + T). All correlations are Pearson’s R. Partial correlations were performed using the ppcor package (v1.1).

Phylogenetic generalised least squares and phylogenetic correction

Observed correlations between traits and lifespan can be spurious due to shared species ancestry (Speakman 2005). To correct for this, we use phylogenetic generalised least squares (PGLS) implemented in the caper package (v1.0.1). Species phylogenetic trees were constructed via neighbor joining based on cytochrome B DNA sequences aligned using Clustal Omega from the msa package (v1.18.0). In the resulting mammalian and bird tree, four branch edge lengths were equal to zero; these were set to the lowest non-zero value in the dataset.

RESULTS

Direct repeats and mirror repeats are over-represented at mtDNA deletion breakpoints

In order to define candidate mtDNA motifs that could be linked with lifespan, we started by reanalyzing motifs that associate with mtDNA deletion breakpoints reported in the MitoBreak database (Damas et al. 2014; Fig. S1; mtDNA Breakpoints.xlsx). In the analysis below, we consider DR and IR motifs, thought to be mutagenic, as well as MR and ER motifs, so far not known to be mutagenic, and we pool all 6 to 15 bp long repeats, since the data is similar between different repeat lengths (Fig. S2).

As shown by others, we found that DR motifs often flank mtDNA deletions (Fig. 2A).
In contrast, no strong association was seen for ER and IR motifs, even considering a larger window around the breakpoint to allow for the fact that IRs could bridge and destabilize mtDNA over long distances (Persson et al. 2019; Fig. S3).

Surprisingly, we also found MR motifs flanking deletion breakpoints more often than expected by chance (Fig. 2A). However, DR and MR motifs are known to correlate with each other (Shamanskiy et al. 2019; Fig. 5B) and indeed we noticed a large sequence overlap between MR and DR motifs (Fig. 2B), which could explain an apparent over-representation of MRs at breakpoints. Removal of overlapping MR-DR hybrid motifs confirmed this suspicion. After this correction, the degree of enrichment was strongly attenuated (Fig. 2C) and the total number of breakpoints flanked by MR motifs was reduced by >80%. Nevertheless, long MR motifs remained particularly over-represented around deletions (Fig. S4).

Since the prior analysis only considered motifs that flank both breakpoints, we next tested the idea that IR and other motifs could be mutagenic if both half-sites are found at any of the breakpoints. However, in this analysis no motif class showed enrichment around breakpoints (Fig. 2D).

Figure 2
Direct repeat (DR) and mirror repeat (MR) motifs are significantly enriched around actual deletion breakpoints (BPs) compared to reshuffled BPs, but the same is not true for inverted repeat (IR) and everted repeat (ER) motifs (A, D).
The surprising correlation between MR motifs and deletion BPs is attenuated when MRs that have the same sequence as DR motifs are removed (B, C). Controls were generated by reshuffling the deletion BPs while maintaining their distribution (n=20, mean ±SD shown). The schematic drawings above (A, D) depict the orientation of the repeat (XR) half-sites in relation to the BPs. *** p < 0.001; ** p < 0.01 by one-sample t-test.
A) The number of deletions associated with DR, MR, IR or ER motifs at both BPs compared with reshuffled controls.
B) Venn diagram showing the number of MR, DR and hybrid MR-DR motifs that were identified within the major arc.
C) The number of deletions associated with MR motifs, before (MR) and after removal of hybrid MR-DR motifs (MRDR-), compared with reshuffled controls.
D) The number of deletions associated with DR, MR, IR or ER motifs at either BP compared with reshuffled controls.

Predicted triplex-forming motifs are over-represented at mtDNA breakpoints

Given the association between MR motifs and breakpoints we decided to analyze triplex motifs, a special case of homopurine and homopyrimidine mirror repeats (Khristich and Mirkin 2020, Bissler 2007), and their association with deletion breakpoints in the MitoBreak database.

Here, we use the triplex package to predict intramolecular triplex motifs because it has several advantages compared to other software (Hon et al. 2013). For example, using the nBMST tool, as in a previous study of mtDNA instability (Oliveira et al.
2013), we only identified two potential triplex motifs within the major arc that did not overlap with the six motifs identified by the triplex package (Table S1). In contrast, using Triplexator (Buske et al. 2013) we were able to detect four of the six triplex motifs and the motifs detected by Triplexator were also enriched at breakpoints (Table S2).

We noticed that predicted triplexes are G-rich and thus could be related to G-quadruplex motifs (Doluca et al. 2013). In a comparison of the two motif types, however, we found several differences (Table S1, S3). Triplex motifs were shorter and less abundant than predicted G-quadruplexes, associated with fewer breakpoints altogether (Fig. 3) and, in contrast to G-quadruplexes, which were almost exclusive to the G-rich mtDNA heavy-strand, triplex motifs were also common on the light-strand.

The six triplex motifs detected by the triplex package were significantly enriched around deletion breakpoints and when we excluded triplex-G-quadruplex hybrid motifs the result was attenuated but remained significant (Fig. 3A). Given the higher risk of spurious findings with only six motifs, we repeated the analysis using a relaxed definition of triplex and the results were fundamentally unchanged (Fig. 3B). Furthermore, our results were not sensitive to reasonable changes in the size of the search window around breakpoints (Fig. S5A, B), motif quality scores (Fig. S5C, D) or inclusion of overlapping motifs (Fig. S5E-G).

Analogous to the situation with MR motifs we tested if overlapping triplex-DR hybrid motifs could bias our results. Given the rarity of triplex motifs and the many DRs in the mitochondrial genome we chose an alternative approach rather than excluding triplex motifs that overlapped any DR half-site.
We compared the fraction of triplex and G-quadruplex positive deletions associated with DRs (GQ+, DR+ and Trip+, DR+) and not associated with DRs (GQ+, DR- and Trip+, DR-). We considered a deletion to be DR+ if both breakpoints were flanked by the same DR sequence. In this case, only 44% of Trip+ deletions associated with DRs whereas 66% of GQ+ deletions did (Table S4).

Figure 3
Triplex motifs are significantly enriched around actual breakpoints (BPs) compared to reshuffled BPs (A, B) even after removal of G-quadruplex (GQ)-triplex hybrid motifs (TripGQ-). The number of unique triplex motifs, GQ motifs and hybrid triplex-GQ motifs within the mtDNA major arc is shown in the Venn diagrams above (A, B). Enrichment of GQ motifs around BPs is shown for comparison in (C). Controls were generated by reshuffling the deletion BPs while maintaining their distribution (n=20, mean ±SD shown). The schematic drawing above (C) depicts the orientation of the GQ and triplex motifs (XR) in relation to the BPs. *** p < 0.0001 by one-sample t-test.
A) The number of deletion BPs associated with triplex motifs compared with reshuffled controls. Analysis including (left side) or excluding triplex-GQ hybrid motifs (right side).
B) Same as (A) but with relaxed criteria for the detection of triplex motifs (min score=12) and GQ motifs (min score=26).
C) The number of deletion BPs associated with GQ motifs compared with reshuffled controls. Relaxed settings (left side, min score=26) and default settings (right side, min score=47).
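The DR+ criterion used in the stratification reported in Table S4 (both breakpoints flanked by the same DR sequence) can be sketched in Python. The helper names, the repeat length `k` and the symmetric `window` are illustrative assumptions and are not taken from the authors' code; the sketch also assumes that the two windows of a deletion do not overlap.

```python
def kmers_near(seq, breakpoint, k, window):
    """Set of k-mers lying entirely within `window` bp on either side of a breakpoint."""
    lo = max(0, breakpoint - window)
    hi = min(len(seq), breakpoint + window)
    region = seq[lo:hi]
    return {region[i:i + k] for i in range(len(region) - k + 1)}


def is_dr_positive(seq, deletion, k, window):
    """True if some k-mer occurs near both breakpoints of the deletion."""
    bp1, bp2 = deletion
    return bool(kmers_near(seq, bp1, k, window) & kmers_near(seq, bp2, k, window))


def fraction_dr_positive(seq, deletions, k, window):
    """Share of the given (e.g. Trip+ or GQ+) deletions that are DR+."""
    if not deletions:
        return 0.0
    positive = sum(is_dr_positive(seq, d, k, window) for d in deletions)
    return positive / len(deletions)


if __name__ == "__main__":
    # Toy sequence with the same 6-mer (CATGCA) at two distant sites.
    seq = "T" * 10 + "CATGCA" + "T" * 10 + "G" * 10 + "CATGCA" + "G" * 10
    motif_positive_deletions = [(13, 39), (13, 30)]
    print(fraction_dr_positive(seq, motif_positive_deletions, k=6, window=6))
```

Applying `fraction_dr_positive` separately to the Trip+ and the GQ+ deletions would give the two percentages that the text compares.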
Triplex-forming motifs may be associated with mitochondrial disease breakpoints

Next, we sought to validate our findings on two recently published next-generation sequencing datasets (Hjelm et al. 2019, Persson et al. 2019; mtDNA Breakpoints.xlsx; Table S5). We were able to confirm the enrichment of DR (Fig. S6A, S7A), MR (Fig. S6A, S7A) and G-quadruplex motifs (Fig. 4A, B; S6C, D) around deletion breakpoints. Additionally, we confirmed that hybrid MR-DR motifs are responsible in large part for the enrichment of MR motifs around breakpoints (Fig. S6B, S7B).

In contrast, we found that triplex motifs were not consistently enriched around breakpoints in the dataset of Hjelm et al. (Fig. S6C, D), which is based on post-mortem brain samples from patients without overt mitochondrial disease, whereas we saw enrichment in the dataset by Persson et al. (Fig. 4A, B), which is based on muscle biopsies from patients with mitochondrial disease. This unexpected discrepancy prompted us to take a second look at the MitoBreak data. In this dataset triplex motifs were significantly more enriched at breakpoints in the mtDNA single deletion subgroup compared to the healthy tissues subgroup (Fig. S8). In addition, we found more broadly that mitochondrial disease status might explain the heterogeneous results we have seen across datasets (Fig. 4C).

Further strengthening our findings, triplex motifs were enriched in the MitoBreak and Persson et al. datasets regardless of the breakpoint shuffling method chosen and of our statistical assumptions (Fig. S9). What is more, triplex motifs were also enriched at breakpoints when we pooled all three datasets (Fig. 4D), although to a lesser extent.

Finally, G-quadruplex motifs close to triplex motifs were more strongly enriched at deletion breakpoints than solitary G-quadruplex motifs (Fig. 4E; Fig.
S10), suggesting that triplex formation may further contribute to DNA instability.

Figure 4
In the Persson et al. (2019) dataset, triplex and G-quadruplex (GQ) motifs are enriched around deletion breakpoints (BPs), using either default (A) or relaxed scoring criteria (B). Although triplex motifs predominate in mitochondrial disease datasets (C), we also find that triplex motifs are significantly enriched around BPs (D) after pooling the data from MitoBreak, Persson et al. (2019) and Hjelm et al. (2019). Finally, GQ and triplex motifs together show stronger enrichment around BPs than either of them in isolation (E). Controls were generated by reshuffling the deletion BPs while maintaining their distribution (n=20, mean ±SD shown). The schematic drawing above (D) depicts the orientation of the motifs (XR) in relation to the BPs. *** p < 0.0001, ** p < 0.001 by one-sample t-test.
A) The number of deletion BPs associated with GQ and triplex motifs compared with reshuffled controls (min score = default).
B) The number of deletion BPs associated with GQ and triplex motifs compared with reshuffled controls (min score = relaxed).
C) The number of deletion BPs associated with triplex motifs (relaxed settings, min score=12) stratified by mitochondrial disease status. MitoBreak data includes single and multiple mitochondrial deletion syndromes.
D) The number of deletion BPs associated with triplex motifs, or with triplex motifs excluding triplex-GQ hybrid motifs (TripGQ-), compared with reshuffled controls. Default settings (left side, min score=15) and relaxed settings (right side, min score=12).
E) The fold-enrichment of GQ and triplex motifs around deletion BPs is shown. Motifs were considered overlapping if their midpoints were within 50 bp.

Repeats and lifespan: no support for the theory of resistant biomolecules

For our analysis, we focus on 11 bp long repeat motifs because short repeats are less likely to allow stable base pairing, longer repeats are rare (Fig. S11), and results for repeat motifs of different lengths usually agree with each other (Table S6; Yang et al. 2013). To allow comparability with other studies (Lakshmanan et al. 2015) we analyzed non-D-loop motifs, but results for major arc motifs are numerically similar (Table S7).

First, consistent with Yang et al. (2013) we found that IR motifs show a negative correlation with the MLS of mammals in the unadjusted model. In addition, we identified ER motifs, a class of symmetrically related repeats, that show an even stronger inverse relationship with longevity (Fig. 5A; Table 1). However, these inverse correlations vanished after taking into account body mass, base composition and phylogeny in a PGLS model (Table 1). Second, in agreement with Lakshmanan et al. (2015) we found that DR motifs do not correlate with the MLS of mammals. The same was true for the symmetrically related MR motifs. Just as with IR motifs, modest inverse correlations vanished in the fully adjusted model (Table 1). We also found the same null results in two other vertebrate classes, birds and ray-finned fishes (Table S6).
To gain hints as to causality, we finally tested if longer repeats, allowing more stable base pairing, show stronger correlations with MLS, but to our surprise we noticed the opposite (Fig. S12A-D).

Considering all four types of repeats together, we noticed that repeats with both half-sites on the same strand (DR and MR) or half-sites on opposite strands (IR and ER) were correlated with each other (Fig. 5B) and with the same mtDNA compositional biases (Fig. 5C). Thus, for DR and MR motifs, an apparent relationship with MLS may be explained by their inverse relationship with GC content and for IR and ER motifs by an inverse relationship with GC content and a positive relationship with GC skew.

Figure 5
The number of everted repeat (ER) motifs is negatively correlated with species MLS in an unadjusted analysis (A). Repeats with a similar orientation correlate with each other (B). Direct repeat (DR) and mirror repeat (MR) motifs have a similar orientation since both half-sites are found on the same strand and in the case of ER and inverted repeat (IR) motifs the half-sites are on opposite strands. Finally, we show the major mtDNA compositional biases that co-vary with the four repeat classes (C) and may explain an apparent correlation with MLS. Data is for 11 bp long repeats and Pearson’s R is shown in (A-C).

Table 1.
Correlation between potentially mutagenic motifs and species lifespan

Motif     Type      Raw       Adjusted
DR11      11bp      -0.113    0.055
MR11      11bp      -0.155    -0.002
IR11      11bp      -0.336    0.105
ER11      11bp      -0.356    -0.047
triplex   default   -0.296    -0.211**
triplex   relaxed   -0.190    -0.127^
GQ        default   0.264     0.068
GQ        relaxed   0.283     -0.097**

The adjusted model takes into account body mass, GC content, GC skew, AT skew and number of effective codons. Significant correlations in the raw or adjusted model are bolded/underlined (p < 0.05). The PGLS model additionally considers phylogeny. ^ denotes p-values of 0.05

0.998) when we evaluated the R-square statistic (for more details, see method S6 and table S2). The similarity of the four bias rate functions indicated that the selection of the single gene mRNA transcript datasets had little impact on modeling non-uniform read distribution along mRNA transcripts, implying a common non-uniform read distribution across different mRNA transcripts of E. coli. Specifically, we used the average of these four coefficients as the final coefficients of the exponential function, which was $f(x) = a e^{bx}$ with $a = 0.256$ and $b = 0.00128$.

Please place Fig. 2 here.

ATUs predicted by SeqATU reach precision and recall over 0.64

The performance evaluation was conducted by comparing the predicted ATUs with the ATUs in SMRT_M9Enrich and SMRT_RiEnrich, which were generated by third-generation sequencing and are not sensitive to transcripts with low expression levels.
For a more accurate and fair evaluation, maximal ATU clusters after pre-selection were retained in the subsequent evaluations (more details about the pre-selection of maximal ATU clusters can be found in method S7 and fig. S3).

The precision and recall of the predicted ATUs were calculated for each maximal ATU cluster. Considering only perfect matching, the average precision and recall were 0.67 and 0.67 for M9Enrich_Seq and 0.64 and 0.68 for RiEnrich_Seq, respectively. When using relaxed matching, the average precision and recall increased to 0.77 and 0.75 for M9Enrich_Seq and 0.74 and 0.76 for RiEnrich_Seq, respectively. The statistics for precision and recall on maximal ATU clusters of different sizes are shown in Fig. 3A and fig. S4A. These results showed that the average precision and recall decreased with increasing size of the maximal ATU clusters (apart from several large sizes, for which only a small number of clusters were available). The results also indicated that the evaluation results based on relaxed matching were significantly higher than those based on perfect matching across different sizes. This result implied that the ATUs incorrectly predicted by SeqATU under perfect matching tended to have strong similarities with the ATUs in the evaluation data. In addition, we found that at least a quarter of the ATUs incorrectly predicted by SeqATU under perfect matching (25%/29% for M9Enrich_Seq/RiEnrich_Seq) matched the transcription units in RegulonDB (19).
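The per-cluster evaluation under perfect matching can be sketched in Python as follows. This is an illustrative reading of the text, not the SeqATU implementation: an ATU is represented as a tuple of gene names (an assumption), the function names are hypothetical, and relaxed matching is left out because its definition is not given in this section.

```python
def precision_recall(predicted, reference):
    """Precision and recall for one maximal ATU cluster under perfect matching.

    An ATU counts as correct only if an identical gene tuple is in the reference.
    """
    predicted, reference = set(predicted), set(reference)
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


def average_over_clusters(clusters):
    """Mean precision and mean recall over (predicted, reference) pairs."""
    scores = [precision_recall(p, r) for p, r in clusters]
    n = len(scores)
    return (sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n)


if __name__ == "__main__":
    # Toy clusters with made-up gene names.
    cluster_1 = ([("a", "b"), ("b", "c"), ("a", "b", "c")], [("a", "b"), ("c",)])
    cluster_2 = ([("d",)], [("d",)])
    print(precision_recall(*cluster_1))
    print(average_over_clusters([cluster_1, cluster_2]))
```

Averaging per cluster, rather than pooling all ATUs, gives every maximal ATU cluster the same weight regardless of its size, which is consistent with the size-stratified results reported in Fig. 3A.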
The two evaluation datasets (SMRT_M9Enrich and SMRT_RiEnrich) were both from SMRT-Cappable-seq, and one of the processing steps of this technique filters out RNA reads shorter than 1,000 bp (6), which indicated that the ATUs in these two evaluation datasets were not comprehensive. To address this issue, we enriched the evaluation data by adding the ATUs defined by SEnd-seq (7), as SEnd-seq did not introduce any filtering based on RNA size. When we used the new evaluation data, the average precision of the ATUs predicted by SeqATU improved by 15% (to 0.77) and 19% (to 0.76) based on perfect matching for M9Enrich_Seq and RiEnrich_Seq, respectively, and by 9% (to 0.84) and 12% (to 0.83) based on relaxed matching. The statistics for precision across different sizes of the maximal ATU clusters are shown in Fig. 3B and fig. S4B, showing that the values of precision based on perfect matching were significantly improved across different sizes of maximal ATU clusters by using the evaluation ATUs from both SMRT-Cappable-seq and SEnd-seq. This result suggested that the absence of some of our predicted ATUs from SMRT_M9Enrich and SMRT_RiEnrich may be due to the RNA length selection of SMRT-Cappable-seq. We also enriched the evaluation data by adding the ATUs in RegulonDB (19) and again found improved precision across different sizes of maximal ATU clusters for M9Enrich_Seq and RiEnrich_Seq (fig. S4C).

Furthermore, to facilitate the understanding of the performance of SeqATU and to measure the influence of the maximal ATU clusters from rSeqTU on our ATU prediction method, SMRT maximal ATU clusters collected from SMRT_M9Enrich and SMRT_RiEnrich (for more details, see method S8)
were applied for the CQP in two conditions (M9 minimal medium and Rich medium). We found that precision and recall increased to 0.73 and 0.77 for M9Enrich_Seq and to 0.69 and 0.80 for RiEnrich_Seq, respectively, based on perfect matching (fig. S4D). Additionally, when using relaxed matching, precision and recall significantly increased to 0.82 and 0.84 for M9Enrich_Seq and to 0.79 and 0.86 for RiEnrich_Seq, respectively (fig. S4D). The significantly improved results verified the ability of SeqATU to accurately predict ATUs when given more accurate maximal ATU clusters. In addition, we found that the numbers of predicted ATUs and of evaluated ATUs under maximal ATU clusters of the same size were similar except for the maximal size (Fig. 3C), and both were far smaller than the theoretical number, which indicated that SeqATU can effectively exclude most of the incorrect ATUs.

Please place Fig. 3 here.

The bias rate constraints efficiently improve the ability of SeqATU to predict ATUs

We tried to use SeqATU without bias rate constraints to predict the ATUs of E. coli and found that its performance significantly decreased compared with SeqATU (Fig. 4 and fig. S5). Specifically, the F-score of SeqATU without bias rate constraints was 0.69/0.68 based on perfect matching for M9Enrich_Seq/RiEnrich_Seq, compared with 0.75/0.74 for SeqATU. When using relaxed matching, the F-score of SeqATU without bias rate constraints was 0.79/0.78 for M9Enrich_Seq/RiEnrich_Seq, compared with 0.83/0.83 for SeqATU.
This result suggested that the bias rate constraints of SeqATU could capture useful information about the non-uniform distribution of the RNA-Seq reads along the mRNA transcripts (32-35) and thereby efficiently improve the ability of the model to predict complex ATUs.

Please place Fig. 4 here.

ATUs predicted by SeqATU display a dynamic composition and overlapping nature

A total of 2,973 distinct ATUs were identified in M9 minimal medium, and 2,767 were identified in Rich medium. Among them, there were 1,423/1,550 distinct ATUs on the forward strand and 1,323/1,444 on the reverse strand for M9Enrich_Seq/RiEnrich_Seq. The predicted ATUs comprised an average of 2.59 genes, with the largest ATU containing 28 genes across the two conditions. The distribution of the size of the predicted ATUs is shown in Fig. 5A, from which we can see that the majority of ATUs (more than 87%) contained fewer than five genes in M9 minimal medium and Rich medium. Approximately 41% of the genes in E. coli were contained in more than one ATU for M9Enrich_Seq, compared to 43% of genes for RiEnrich_Seq, suggesting that the ATUs in a maximal ATU cluster generally overlapped with each other (Fig. 5B). In addition, there were 1,576 maximal ATU clusters for M9Enrich_Seq and 1,512 maximal ATU clusters for RiEnrich_Seq. SeqATU identified a total of 1,977 identical ATUs under the two conditions, whereas there were 1,786 distinct ATUs.
Among the distinct ATUs across the two conditions, 394 ATUs were from the same maximal ATU clusters in the two maximal ATU cluster datasets, and the rest were from different maximal ATU clusters. The existence of distinct ATUs under the two conditions suggests that ATUs respond dynamically to different conditions or environmental stimuli (for more examples of ATUs under different conditions, see fig. S6).

The dynamic composition of the ATUs predicted by SeqATU is of great significance for understanding the interactions inside polymicrobial communities. For example, chronic airway infection by Pseudomonas aeruginosa contributes considerably to lung tissue destruction and impairment of pulmonary function in cystic fibrosis (CF) patients (39). Marie et al. found that the presence of E. coli complemented the growth defect of a P. aeruginosa bioA-disrupted mutant that is unable to grow on rich medium, and that E. coli can be beneficial to P. aeruginosa when the biotin supply is limited (39). SeqATU identified a highly expressed ATU containing the uvrB gene in Rich medium, whereas this ATU does not exist in M9 minimal medium (Fig. 6). We predicted the uvrB gene to be involved in the biotin metabolism pathway, because the bioB, bioF, bioC, and bioD genes, which are contained in the same ATU, are known members of the biotin metabolism KEGG pathway. Therefore, the observation by Marie et al. may be explained if the ATUs containing the uvrB gene of E.
coli can provide the biotin supply for P. aeruginosa in rich medium. This result showed that SeqATU could increase our understanding of interspecies competition and cooperation, which play an important role in shaping the composition and structure of polymicrobial bacterial populations.

Please place Fig. 5 here.

Please place Fig. 6 here.

Predicted ATUs by SeqATU are verified by experimental TSSs and TTSs

An experimental TSS dataset of E. coli from SEnd-seq (7) and a TF binding site dataset of E. coli from the experimental dataset of RegulonDB (19) were used to further verify the reliability of SeqATU and were named dataset 1 and dataset 2, respectively. There were 5,512 experimental TSSs in dataset 1 and 3,220 experimental TF binding sites in dataset 2. We considered the 5’-end genes and no 5’-end genes of the ATUs predicted by SeqATU. A gene that is not the 5’-end gene of any predicted ATU is named a no 5’-end gene. We identified 2,177/2,005 5’-end genes and 1,266/1,160 no 5’-end genes of the predicted ATUs for M9Enrich_Seq/RiEnrich_Seq. A gene is considered validated by experimental TSSs or TF binding sites if it is the immediate downstream gene of an experimental TSS or TF binding site. As a result, the proportion of 5’-end genes of the predicted ATUs that were validated by experimental TSSs or TF binding sites was over 1.7 times greater than that of the no 5’-end genes (Table 1).
Specifically, the proportion of 5’-end genes (29%/30% for M9Enrich_Seq/RiEnrich_Seq) validated by experimental TF binding sites was over three times greater than that of the no 5’-end genes (9.2%/9.0% for M9Enrich_Seq/RiEnrich_Seq). These results further verified the reliability of the ATUs predicted by SeqATU at the TSS level. In addition, four other experimental TSS or promoter datasets from RegulonDB (19), dRNA-seq (14), and Cappable-seq (13) were examined. The results are shown in table S3; again, a higher proportion of 5’-end genes than of no 5’-end genes of the predicted ATUs was validated by experimental TSSs or promoters.

We also used two experimental TTS datasets of E. coli from SEnd-seq (7) and RegulonDB (19) to verify the reliability of the ATUs predicted by SeqATU at the TTS level. These two experimental TTS datasets were named dataset 3 and dataset 4, respectively. There were 1,540 experimental TTSs in dataset 3 and 367 experimental TTSs in dataset 4. We considered the 3’-end genes and no 3’-end genes of the ATUs predicted by SeqATU. A gene that is not the 3’-end gene of any predicted ATU is named a no 3’-end gene. We identified 2,290/2,187 3’-end genes and 1,153/978 no 3’-end genes of the predicted ATUs for M9Enrich_Seq/RiEnrich_Seq. A gene is considered validated by experimental TTSs if it is the immediate upstream gene of an experimental TTS.
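The validation rule can be illustrated with a short sketch. It assumes forward-strand coordinates only, takes "immediate upstream gene" to mean the gene whose 3’ end lies closest to, and not beyond, the TTS position, and applies no distance cutoff; these simplifications, the function name, and the toy coordinates are illustrative assumptions rather than the authors' implementation.

```python
from bisect import bisect_right


def genes_validated_by_tts(gene_ends, tts_positions):
    """Return genes that are the immediate upstream gene of some TTS.

    gene_ends:     dict mapping gene name -> 3'-end coordinate (forward strand).
    tts_positions: iterable of experimental TTS coordinates (forward strand).
    """
    ordered = sorted(gene_ends.items(), key=lambda item: item[1])
    ends = [end for _, end in ordered]
    validated = set()
    for pos in tts_positions:
        # Index of the last gene whose 3' end is at or before the TTS.
        idx = bisect_right(ends, pos) - 1
        if idx >= 0:
            validated.add(ordered[idx][0])
    return validated


# Hypothetical toy data: three genes and three TTSs.
toy_gene_ends = {"geneA": 1000, "geneB": 2500, "geneC": 4000}
toy_tts = [1050, 4100, 500]

validated = genes_validated_by_tts(toy_gene_ends, toy_tts)
print(sorted(validated))
print(len(validated) / len(toy_gene_ends))
```

The proportion of validated genes would then be computed separately for the 3’-end genes and the no 3’-end genes and compared. A reverse-strand version would mirror the coordinate comparison.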
As a result, the proportion of 3’-end genes of the predicted ATUs that were validated by experimental TTSs was over two times greater than that of no 3’-end genes (Table 2). Specifically, the proportion of 3’-end genes (51%/53% for M9Enrich_Seq/RiEnrich_Seq) validated by experimental TTSs from SEnd-seq was over three times greater than that of no 3’-end genes (15%/14% for M9Enrich_Seq/RiEnrich_Seq). These results further verified the reliability of the ATUs predicted by SeqATU at the TTS level. In addition, two computationally predicted TTS datasets from the works by Nadiras et al. (40) and Kingsford et al. (41) were examined. The results are shown in table S4: the proportion of 3’-end genes (63%/62% for M9Enrich_Seq/RiEnrich_Seq) validated by computationally predicted Rho-independent TTSs was over two times greater than that of no 3’-end genes (29%/29% for M9Enrich_Seq/RiEnrich_Seq).

Please place Table 1 here.

Please place Table 2 here.

The gene pairs frequently encoded in the same ATUs are more functionally related than those that can belong to two distinct ATUs

Functional analysis was conducted by integrating GO terms from the Gene Ontology (GO) database (42). In detail, we measured the level of functional relatedness for two types of consecutive gene pairs, following a definition similar to that in the work by Mao et al. (38).
The two types of consecutive gene pairs were (i) gene pairs each consisting of a 5’-end gene of an ATU and the gene immediately upstream of it on the same strand and (ii) all the other gene pairs inside an ATU (Fig. 7A). In addition, we used the scoring scheme of Wu et al. (43) to measure the GO-based functional similarity between a pair of genes. That study developed a GO similarity score and showed that the larger the score, the more likely it is that two genes are functionally related. In brief, the GO similarity score of a gene pair $g_i$ and $g_j$ is denoted as $\mathrm{Sim}(g_i, g_j)$:

$$\mathrm{Sim}(g_i, g_j) = \max_{t_i \in T(g_i),\; t_j \in T(g_j)} s(t_i, t_j)$$

where $t_i$ and $t_j$ are the GO terms assigned to $g_i$ and $g_j$, respectively; $s(t_i, t_j)$ is the maximal number of common terms between paths in the two GO graphs induced by the GO terms $t_i$ and $t_j$. As a result, the mean GO similarity score was higher for type-ii gene pairs than for type-i gene pairs (5.97 versus 4.04 for M9Enrich_Seq and 5.86 versus 3.91 for RiEnrich_Seq). A total of 574/524 type-ii gene pairs had GO similarity scores greater than four (64%/63% of a total of 899/834), while only 461/404 type-i gene pairs had GO similarity scores greater than four (36%/34% of a total of 1,274/1,179) for M9Enrich_Seq/RiEnrich_Seq. We also applied a χ²-test (44) to determine whether the distribution of $\mathrm{Sim}(g_i, g_j)$ was different for the type-i gene pairs and type-ii gene pairs.
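As a rough illustration of such a test, the sketch below applies a Pearson χ² test to a 2×2 table built only from the M9Enrich_Seq counts reported above (gene pairs with a score greater than four versus the rest). This dichotomized table is a simplifying assumption; the test on the full score distribution described in the text may bin the scores differently. For one degree of freedom, the P-value equals erfc(sqrt(χ²/2)).

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns the chi-square statistic and its P-value for 1 degree of freedom.
    """
    observed = ((a, b), (c, d))
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 dof.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# M9Enrich_Seq counts from the text: 574 of 899 type-ii pairs and
# 461 of 1,274 type-i pairs have a GO similarity score greater than four.
chi2, p_value = chi_square_2x2(574, 899 - 574, 461, 1274 - 461)
print(f"chi2 = {chi2:.1f}, P = {p_value:.2e}")
```

Even this coarse two-category version yields a very large statistic, in line with the strong difference between the two types of gene pairs.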
The χ²-statistics corresponded to a P-value less than 10��, which revealed that the distribution of $\mathrm{Sim}(g_i, g_j)$ for the type-ii gene pairs was significantly different from that for the type-i gene pairs. Fig. 7B shows the distribution of $\mathrm{Sim}(g_i, g_j)$ for the type-i gene pairs and the type-ii gene pairs. These results strongly indicated that the type-ii gene pairs had a higher degree of GO similarity than the type-i gene pairs, suggesting that the gene pairs frequently encoded in the same ATUs (type-ii gene pairs) are more functionally related than those that can belong to two distinct ATUs (type-i gene pairs).

We also carried out a similar analysis of the two types of gene pairs based on KEGG enrichment analysis (see more details in method S9) and found that the proportion of type-ii gene pairs whose two genes were contained in the same KEGG pathway (59%/57% for M9Enrich_Seq/RiEnrich_Seq) was higher than the corresponding proportion of type-i gene pairs (32%/28% for M9Enrich_Seq/RiEnrich_Seq) (Fig. 7C). The distribution of the KEGG similarity scores of the two types of gene pairs is shown in Fig. 7D, suggesting that the genes of type-ii gene pairs have a higher probability of participating in the same KEGG pathway than those of type-i gene pairs.

Please place Fig. 7 here.

DISCUSSION

We developed SeqATU, the first computational method for genome-scale ATU prediction by analyzing next- and third-generation RNA-Seq data, using a CQP model. Linear constraints provided by the bias
rate of read distribution were, for the first time, integrated into the CQP model. Positional bias refers to the non-uniform distribution of reads over different positions of a transcript (33, 35); it is usually handled by learning non-uniform read distributions from the given RNA-Seq reads (32) or by modeling RNA degradation (45). The bias rate function we proposed addresses the non-uniform read distribution along mRNA transcripts and is also suitable for standard next-generation RNA-Seq data that involve more degraded mRNAs, as the exponential function has been used to model the degradation of mRNA transcripts (45). As a result, a total of 2,973 distinct ATUs for M9Enrich_Seq and 2,767 distinct ATUs for RiEnrich_Seq were identified by SeqATU. The precision and recall reached 0.67/0.64 and 0.67/0.68, respectively, based on perfect matching and 0.77/0.74 and 0.75/0.76, respectively, based on relaxed matching for M9Enrich_Seq/RiEnrich_Seq. We further validated the predicted ATUs using experimental transcription factor binding sites and transcription termination sites from RegulonDB and SEnd-seq. The proportion of the 5’- or 3’-end genes of predicted ATUs that were validated by experimental transcription factor binding sites and transcription termination sites was over three times greater than that of no 5’- or 3’-end genes, demonstrating the high reliability of the predicted ATUs. Gene pairs frequently encoded in the same ATUs were more functionally related than those that can belong to two distinct ATUs according to GO and KEGG enrichment analyses. These results demonstrated the reliability and accuracy of our predicted ATUs, implying the ability of SeqATU to reveal the transcriptional architecture of the bacterial genome.
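Precision and recall pairs such as those above can be summarized by a single F-score. The sketch below assumes the conventional definition, the harmonic mean of precision and recall; this section does not restate the definition used by the authors, so treat the formula as an assumption.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the usual F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the text for
# M9Enrich_Seq and RiEnrich_Seq under perfect and relaxed matching.
pairs = {
    "M9Enrich_Seq, perfect": (0.67, 0.67),
    "RiEnrich_Seq, perfect": (0.64, 0.68),
    "M9Enrich_Seq, relaxed": (0.77, 0.75),
    "RiEnrich_Seq, relaxed": (0.74, 0.76),
}

for label, (precision, recall) in pairs.items():
    print(f"{label}: F = {f_score(precision, recall):.2f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot obtain a high F-score by trading away recall for precision or the reverse.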
In fact, the ATU architecture of bacteria is much more complex than that determined with currently used experimental techniques. We investigated the 5’-end genes and no 5’-end genes of the experimental ATUs identified by SMRT-Cappable-seq (6) using a combination of experimental TSSs from RegulonDB (19), dRNA-seq (14), Cappable-seq (13), and SEnd-seq (7). We found that the proportion of 5’-end genes (99%) validated by experimental TSSs was not significantly different from that of no 5’-end genes (92%). The high percentage of no 5’-end genes validated by experimental TSSs implies that the ATUs identified by experimental techniques are only a small proportion of the comprehensive set of ATUs in bacterial organisms, owing to the dynamic mechanisms of ATUs. These results further confirm the necessity of developing robust computational methods for ATU identification.

SeqATU provides not only a powerful tool for understanding the transcription mechanism of bacteria but also a fundamental tool for guiding the reconstruction of genome-scale transcriptional regulatory networks. First, the ATU structure can help us to make new functional predictions, as genes in an ATU tend to have related functions. Second, ATUs can elucidate condition-specific uses of alternative sigma factors (8, 46). For example, the thrLABC operon is regulated by transcriptional attenuation. Totsuka et al.
found that under the log phase growth condition, thrLABC is the only transcript, while two transcripts, thrLABC and thrBC, are found under the stationary phase growth condition. As validated experimentally, $\sigma^{S}$ can regulate the additional promoter located in front of thrB under the stationary phase growth condition and thus separately regulate thrBC, which elucidates the condition-specific uses of $\sigma^{S}$ (8). Third, understanding the ATU structure is of great help in constructing transcriptional and translational regulatory networks, such as the σ-TUG (σ-factor-transcription unit gene) network (47). The transcription regulatory network consists of nodes (ATUs and regulatory proteins) and links (interactions) (48), and the comprehensive ATU structure can provide a nearly complete set of nodes, which can improve the accuracy of regulatory prediction.

Although SeqATU has obtained satisfactory prediction results, several challenges remain in the computational prediction of ATUs. On the one hand, due to the influence of the 3’ untranslated region (UTR) and 5’ UTR in the intergenic regions, the expression value of intergenic regions cannot be reproduced perfectly by the same calculation used for the expression value of genic regions. Without accurate reproduction, it is difficult to obtain the best expression combination of ATUs by the programming model based on the expression values of genic and intergenic regions.
On the other hand, due to the lack of strand-specific RNA-Seq data, it is difficult to determine whether the expression of an intergenic region between two consecutive genes on the same strand derives from ATUs containing these two genes or from antisense RNAs (asRNAs) (6, 49). These challenges and the great significance of ATU prediction encourage us to seek more information for determining the ATU structure in bacteria. For example, we plan to add high-confidence TSS and TTS information to our programming model in the future. Additionally, since the microbiome is increasingly recognized as a critical component in human diseases, such as inflammatory bowel disease (50), antibiotic-associated diarrhoea (51), neurological disorders (52), and cancer (53, 54), predicting new ATUs of uncultured species from metagenomic and metatranscriptomic data is of great significance for uncovering new regulatory pathways and metabolic products during the development of diseases (55). However, because the majority of species within a microbial community have unknown genomes or genome annotations, ATU prediction from metagenomic and metatranscriptomic data is still a challenging task, which encourages us to pay more attention to it.

REFERENCES

1. F. Jacob, D. Perrin, C. Sanchez, J. Monod, Operon: a group of genes with the expression coordinated by an operator. C R Hebd. Seances. Acad. Sci 250, 1727-1729 (1960).
2. F. Jacob, J.
Monod, Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318-356 (1961).
3. Z. Liu, J. Feng, B. Yu, Q. Ma, B. Liu, The functional determinants in the organization of bacterial genomes. Brief. Bioinform., doi.org/10.1093/bib/bbaa1172 (2020).
4. W.-C. Chou, Q. Ma, S. Yang, S. Cao, D. M. Klingeman, S. D. Brown, Y. Xu, Analysis of strand-specific RNA-seq data using machine learning reveals the structures of transcription units in Clostridium thermocellum. Nucleic Acids Res. 43, e67-e67 (2015).
5. S.-Y. Niu, B. Liu, Q. Ma, W.-C. Chou, rSeqTU—a machine-learning based R package for prediction of bacterial transcription units. Frontiers in Genetics 10, 374 (2019).
6. B. Yan, M. Boitano, T. A. Clark, L. Ettwiller, SMRT-Cappable-seq reveals complex operon variants in bacteria. Nat. Commun. 9, 3676 (2018).
7. X. Ju, D. Li, S. Liu, Full-length RNA profiling reveals pervasive bidirectional transcription terminators in bacteria. Nature Microbiology 4, 1907-1918 (2019).
8. K. Totsuka, K. Totsuka, The Transcription Unit Architecture of the Escherichia coli Genome. Nat. Biotechnol. 27, 1043-1049 (2009).
9. A. H. Bhat, D. Pathak, A. Rao, The alr-groEL1 operon in Mycobacterium tuberculosis: an interplay of multiple regulatory elements. Scientific Reports 7, 43772 (2017).
10. C. M. Sharma, S. Hoffmann, F. Darfeuille, J. Reignier, S. Findeiß, A. Sittka, S. Chabas, K. Reiche, J. Hackermüller, R.
Reinhardt, The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 464, 250-255 (2010).
11. J. M. Durand, G. R. Bjork, Putrescine or a combination of methionine and arginine restores virulence gene expression in a tRNA modification-deficient mutant of Shigella flexneri: a possible role in adaptation of virulence. Mol. Microbiol. 47, 519-527 (2010).
12. L. E. Wroblewski, R. M. Peek, K. T. Wilson, Helicobacter pylori and gastric cancer: factors that modulate disease risk. Clin. Microbiol. Rev. 23, 713-739 (2010).
13. L. Ettwiller, J. Buswell, E. Yigit, I. Schildkraut, A novel enrichment strategy reveals unprecedented number of novel transcription start sites at single base resolution in a model prokaryote and the gut microbiome. BMC Genomics 17, 199-199 (2016).
14. M. K. Thomason, T. Bischler, S. K. Eisenbart, K. U. Forstner, A. Zhang, A. Herbig, K. Nieselt, C. M. Sharma, G. Storz, Global transcriptional start site mapping using differential RNA sequencing reveals novel antisense RNAs in Escherichia coli. J. Bacteriol. 197, 18-28 (2015).
15. T. Bischler, H. S. Tan, K. Nieselt, C. M. Sharma, Differential RNA-seq (dRNA-seq) for annotation of transcriptional start sites and small RNAs in Helicobacter pylori. Methods 86, 89-101 (2015).
16. D. Dar, M. Shamir, J. Mellin, M. Koutero, N. Stern-Ginossar, P. Cossart, R. Sorek, Term-seq reveals abundant ribo-regulation of antibiotics resistance in bacteria. Science 352, 6282 (2016).
17. J. Clauwaert, G.
Menschaert, W. Waegeman, An in-depth evaluation of annotated transcription start sites in E. coli using deep learning. bioRxiv, doi: https://doi.org/10.1101/2020.03.16.993501, 4 November 2020, pre-print: not peer-reviewed (2020).
18. S. Goodwin, J. D. Mcpherson, W. R. Mccombie, Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333-351 (2016).
19. A. Santos-Zavaleta, H. Salgado, S. Gama-Castro, M. Sánchez-Pérez, L. Gómez-Romero, D. Ledezma-Tejeida, J. S. García-Sotelo, K. Alquicira-Hernández, L. J. Muñiz-Rascado, P. Peña-Loredo, RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Res. 47, D212-D220 (2018).
20. N. Sierro, Y. Makita, M. J. L. De Hoon, K. Nakai, DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. 36, 93-96 (2008).
21. P. S. Dehal, M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, C. Dylan, G. D. Friedland, K. H. Huang, K. Keith, P. S. Novichkov, MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 38, D396-D400 (2010).
22. H. Cao, Q. Ma, X. Chen, Y. Xu, DOOR: a prokaryotic operon database for genome analyses and functional inference. Brief. Bioinform. 20, 1568-1577 (2019).
23. X. Mao, Q. Ma, C. Zhou, X. Chen, H. Zhang, J. Yang, F. Mao, W. Lai, Y.
Xu, DOOR 2.0: presenting operons and their functions through dynamic and integrated views. Nucleic Acids Res. 42, D654-D659 (2013).
24. K. Chetal, S. C. Janga, OperomeDB: A Database of Condition-Specific Transcription Units in Prokaryotic Genomes. Biomed Research International 2015, 1-10 (2015).
25. J. Yang, X. Chen, A. Mcdermaid, Q. Ma, DMINDA 2.0: integrated and systematic views of regulatory DNA motif identification and analyses. Bioinformatics 33, 2586-2588 (2017).
26. T. Blanca, C. Ricardo, C. E. Martinez-Guerrero, M. Enrique, ProOpDB: Prokaryotic Operon DataBase. Nucleic Acids Res. 40, D627-D631 (2012).
27. R. McClure, D. Balasubramanian, Y. Sun, M. Bobrovskyy, P. Sumby, C. A. Genco, C. K. Vanderpool, B. Tjaden, Computational analysis of bacterial RNA-Seq data. Nucleic Acids Res. 41, e140-e140 (2013).
28. X. Chen, W. Chou, Q. Ma, Y. Xu, SeqTU: A Web Server for Identification of Bacterial Transcription Units. Scientific Reports 7, 43925 (2017).
29. I. A. Garanina, G. Y. Fisunov, V. M. Govorun, BAC-BROWSER: The Tool for Visualization and Analysis of Prokaryotic Genomes. Frontiers in Microbiology 9, 2827 (2018).
30. B. Taboada, K. Estrada, R. Ciria, E. Merino, Operon-mapper: a web server for precise operon identification in bacterial and archaeal genomes. Bioinformatics 34, 4118-4120 (2018).
31. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760 (2009).
32. Z. Wu, X. Wang, X.
Zhang, Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics 27, 502-508 (2011).
33. A. Roberts, C. Trapnell, J. Donaghey, J. L. Rinn, L. Pachter, Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12, 1-14 (2011).
34. R. Bohnert, G. Rätsch, rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic Acids Res. 38, W348-W351 (2010).
35. W. Li, T. Jiang, Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28, 2914-2921 (2012).
36. B. Xiong, Y. Yang, F. R. Fineis, J.-P. Wang, DegNorm: normalization of generalized transcript degradation improves accuracy in RNA-seq analysis. Genome Biol. 20, 75 (2019).
37. J. Chaitanya, Degradation of mRNA in Escherichia coli. IUBMB Life 54, 315-321 (2010).
38. X. Mao, Q. Ma, B. Liu, X. Chen, H. Zhang, Y. Xu, Revisiting operons: an analysis of the landscape of transcriptional units in E. coli. BMC Bioinformatics 16, 356 (2015).
39. B. Marie, K. H. Thilo, F. Thierry, T. Mikael, R. Adriana, V. D. Christian, Metabolic pathways of Pseudomonas aeruginosa involved in competition with respiratory bacterial pathogens. Frontiers in Microbiology 6, 321 (2015).
40. C. Nadiras, E. Eveno, A. Schwartz, N. Figueroa-Bossi, M. Boudvillain, A multivariate prediction model for Rho-dependent termination of transcription. Nucleic Acids Res. 46, 8245-8260 (2018).
41. C. L. Kingsford, K. Ayanbule, S. L.
Salzberg, Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. 8, R22 (2007).
42. M. Ashburner, S. Lewis, On Ontologies for Biologists: The Gene Ontology—Untangling the Web. Novartis Found. Symp. 247, 66-80; discussion 80-63, 84-90, 244-252 (2002).
43. H. Wu, Z. Su, F. Mao, V. Olman, Y. Xu, Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Res. 33, 2822-2837 (2005).
44. S. A. Teukolsky, B. P. Flannery, W. Press, W. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge (1992).
45. L. Wan, X. Yan, T. Chen, F. Sun, Modeling RNA degradation for RNA-Seq with applications. Biostatistics 13, 734-747 (2012).
46. C. Yanofsky, Attenuation in the control of expression of bacterial operons. Nature 289, 751 (1981).
47. B. K. Cho, D. Kim, E. M. Knight, K. Zengler, B. O. Palsson, Genome-scale reconstruction of the sigma factor network in Escherichia coli: topology and functional states. BMC Biol. 12, 4-4 (2014).
48. B.-K. Cho, P. Charusanti, M. J. Herrgård, Microbial regulatory and metabolic networks. Curr. Opin. Biotechnol. 18, 360-364 (2007).
49. A. Toledo-Arana, O. Dussurget, G. Nikitas, N. Sesto, H. Guet-Revillet, D. Balestrino, E. Loh, J. Gripenland, T. Tiensuu, K. Vaitkevicius, The Listeria transcriptional landscape from saprophytism to virulence.
Nature 459, 950-956 (2009).
50. B. Yue, X. Luo, Z. Yu, S. Mani, Z. Wang, W. Dou, Inflammatory bowel disease: a potential result from the collusion between gut microbiota and mucosal immune system. Microorganisms 7, 440 (2019).
51. B. H. Mullish, H. R. Williams, Clostridium difficile infection and antibiotic-associated diarrhoea. Clin. Med. 18, 237 (2018).
52. M. Maguire, G. Maguire, Gut dysbiosis, leaky gut, and intestinal epithelial proliferation in neurological disorders: towards the development of a new therapeutic using amino acids, prebiotics, probiotics, and postbiotics. Rev. Neurosci. 30, 179-201 (2019).
53. S. Vivarelli, R. Salemi, S. Candido, L. Falzone, M. Santagati, S. Stefani, F. Torino, G. L. Banna, G. Tonini, M. Libra, Gut microbiota and cancer: from pathogenesis to therapy. Cancers 11, 38 (2019).
54. G. Cammarota, G. Ianiro, A. Ahern, C. Carbone, A. Temko, M. J. Claesson, A. Gasbarrini, G. Tortora, Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nature Reviews Gastroenterology & Hepatology 17, 635-648 (2020).
55. S. S. A. Zaidi, X. Zhang, Computational operon prediction in whole-genomes and metagenomes. Briefings in Functional Genomics 16, 181-193 (2017).
ACKNOWLEDGEMENTS

Funding: This work was supported by the National Natural Science Foundation of China (NSFC) [61772313 to B.L., 11931008 to B.L.]; Interdisciplinary Science Innovation Group Project of Shandong University (2019); and the Innovation Method Fund of China [2018IM020200 to B.L.]. The authors would like to thank Yang Li for his assistance in language polishing. Authors’ contributions: B.L., Q.M. and W.C. conceived the basic idea and designed the overall analyses. Q.W. carried out most of the computational analysis and data interpretation. All the authors wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: The raw data and source code of SeqATU and a detailed tutorial can be found at https://github.com/OSU-BMBL/SeqATU.

FIGURES AND TABLES

Table 1. Results of predicted ATUs verified by experimental TSSs or TF binding sites. Overview of the experimental TSS and TF binding site datasets (dataset 1 and dataset 2) and the proportion of 5’-end genes and no 5’-end genes of the predicted ATUs by SeqATU for M9Enrich_Seq and RiEnrich_Seq, which were validated by experimental TSSs or TF binding sites.

                                  dataset 1        dataset 2
Source                            Ju et al. (7)    RegulonDB TF binding sites
Technique                         SEnd-seq         Collection
TSSs/TF binding sites             5,512            3,220
M9Enrich_Seq   5’-end genes       83%              29%
               no 5’-end genes    47%              9.2%
RiEnrich_Seq   5’-end genes       89%              30%
               no 5’-end genes    44%              9.0%

Table 2.
Results of predicted ATUs verified by experimental TTSs. Overview of the experimental TTS datasets (dataset 3 and dataset 4) and the proportion of 3’-end genes and no 3’-end genes of the predicted ATUs by SeqATU for M9Enrich_Seq and RiEnrich_Seq, which were validated by experimental TTSs.

                                  dataset 3        dataset 4
Source                            Ju et al. (7)    RegulonDB TTSs
Technique                         SEnd-seq         Collection
TTSs                              1,540            3,67
M9Enrich_Seq   3’-end genes       51%              11%
               no 3’-end genes    15%              5.2%
RiEnrich_Seq   3’-end genes       53%              11%
               no 3’-end genes    14%              4.8%

Fig. 1. Schematic overview of SeqATU. The blue arrow and orange line denote gene and RNA-Seq read, respectively. The preprocessing stage requires RNA-Seq data in the FASTQ format, the reference genome sequence in the FASTA format, and gene annotations in the GFF format, generating linear constraints for the next convex quadratic programming (CQP) stage.
There are two steps in the preprocessing stage: (i) calculating the expression values of the genetic regions and intergenic regions and (ii) modelling non-uniform read distribution along mRNA transcripts; specifically, we acquired a bias rate function using nonlinear regression and then constructed genetic or intergenic region bias rate vectors. The maximal ATU cluster data determined by rSeqTU and the linear constraints from preprocessing are both taken as inputs of CQP. CQP seeks the optimum expression combination of all of the to-be-identified ATUs to minimize the gap between the predicted ATU expression profile and the genetic and intergenic region expression profile. Finally, the output of CQP is the predicted ATUs.

Fig. 2. Results of modelling non-uniform read distribution along mRNA transcripts. The four bias rate functions fitted by nonlinear regression had similar coefficients across the four datasets M9Enrich_1, M9Enrich_2, RiEnrich_1 and RiEnrich_2.

Fig. 3. Overall evaluation results of SeqATU.
(A) Precision and recall based on perfect matching and relaxed matching for M9Enrich_Seq (left) and RiEnrich_Seq (right) using evaluated ATUs from SMRT-Cappable-seq. (B) Average precision based on perfect matching for M9Enrich_Seq (left) and RiEnrich_Seq (right) using evaluated ATUs from SMRT-Cappable-seq (black) and evaluated ATUs from SMRT-Cappable-seq and SEnd-seq (red). The size of each point denotes the number of maximal ATU clusters with the same size. (C) Average number of ATUs across different sizes of SMRT maximal ATU clusters for M9Enrich_Seq (left) and RiEnrich_Seq (right).

Fig. 4. Comparative analysis of the performance between SeqATU and SeqATU without the bias rate constraints for SMRT maximal ATU clusters. (A) Precision, recall and F-score based on perfect matching for M9Enrich_Seq and RiEnrich_Seq. (B) Precision, recall and F-score based on relaxed matching for M9Enrich_Seq and RiEnrich_Seq.
Fig. 5. Comprehensive analysis of the predicted ATUs by SeqATU. (A) Number of ATUs across different sizes. The size of an ATU is the number of its component genes. (B) Distribution of the number of ATUs per gene.

Fig. 6. Integrative Genomics Viewer (IGV) representation of the mapping and ATUs. Mapping and ATUs of M9Enrich_Seq (orange) and RiEnrich_Seq (blue) are shown for the maximal ATU cluster containing the bioB, bioF, bioC, bioD and uvrB genes.

Fig. 7. Interpretation and results of the functional relatedness of different gene pairs based on GO and KEGG enrichment analyses. (A) Illustration of two different gene pairs i and ii.
(B) Functional relatedness results based on GO enrichment analysis for M9Enrich_Seq (left) and RiEnrich_Seq (right). (C) The proportion of two different gene pairs whose genes are contained in the same KEGG pathway for M9Enrich_Seq (left) and RiEnrich_Seq (right). (D) The functional relatedness results based on KEGG enrichment analysis for M9Enrich_Seq (left) and RiEnrich_Seq (right).

10_1101-2021_01_04_425250 ---- A read count-based method to detect multiplets and their cellular origins from snATAC-seq data

Asa Thibodeau1*, Alper Eroglu1*, Nathan Lawlor1, Djamel Nehar-Belaid1, Romy Kursawe1, Radu Marches1, George A. Kuchel2, Jacques Banchereau1, Michael L. Stitzel1,3,4, A. Ercument Cicek5,6, Duygu Ucar1,3,4

1 The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
2 University of Connecticut Center on Aging, UConn Health Center, Farmington, CT, 06030, USA
3 Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, 06030, USA
4 Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, 06030, USA
5 Computer Engineering Department, Bilkent University, Ankara, 06800, Turkey
6 Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
* These authors contributed equally to this work.
Correspondence: duygu.ucar@jax.org

ABSTRACT

As with other droplet-based single cell assays, single nucleus ATAC-seq (snATAC-seq) data harbor multiplets that confound downstream analyses. Detecting multiplets in snATAC-seq data is particularly challenging due to its sparsity and trinary nature (0 reads: closed chromatin, 1: open in one allele, 2: open in both alleles), yet this trinary nature offers a unique opportunity to infer multiplets when >2 uniquely aligned reads are observed at multiple loci. Here, we implemented the first read count-based multiplet detection method, ATAC-DoubletDetector, that detects multiplets independently of cell type. Using PBMC and pancreatic islet datasets, ATAC-DoubletDetector captured simulated heterotypic multiplets (different cell types) with ~0.60 recall, showing ~24% improvement over the state of the art. ATAC-DoubletDetector detected homotypic multiplets with ~0.61 recall, representing the first method to detect multiplets originating from the same cell type. Using our novel clustering-based algorithm, multiplets were annotated to their cellular origins with ~85% accuracy. Application of ATAC-DoubletDetector will improve downstream analysis of snATAC-seq.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.04.425250; this version posted January 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

MAIN

Single nucleus ATAC-seq (snATAC-seq)1–3 technology is widely used to study epigenomes of diverse cells and tissues with increased resolution3,4. However, as with other droplet-based single cell technologies, snATAC-seq data harbor multiplet nuclei5.
The presence of multiplets can confound downstream analyses by introducing combined epigenomic profiles that originate from two or more nuclei, increasing the difficulty of clustering and comparing different cell types within a sample. Compared to other single cell assays, the difficulty of detecting multiplets in snATAC-seq is further increased due to data sparsity and the trinary nature of chromatin accessibility levels (e.g., 0 reads: closed chromatin, 1: open in one allele, 2: open in both alleles).

The current state of the art for detecting multiplets in snATAC-seq data adapts detection methods developed for single cell RNA-seq (scRNA-seq). Notably, two snATAC-seq data analysis packages, SnapATAC6 and ArchR7, either employ or implement a method similar to multiplet detection methods for scRNA-seq (i.e., DoubletFinder8 and Scrublet9). In these methods, synthetic heterotypic multiplets (i.e., originating from different cell types) are simulated by combining profiles of two or more cells, which are then used to detect putative multiplets based on cluster similarity. Such algorithms assume that multiplets and singlets exhibit distinct genomic profiles, which becomes problematic when true singlets share genomic profiles with two or more cell types. Under this assumption, these methods will fail to detect homotypic multiplets (i.e., originating from the same cell type) since their overall genomic profile is considered to be similar to that of the underlying cell type. However, homotypic multiplets are characterized by increased read counts compared to singlets, suggesting that new methods that utilize read counts can detect them. In order to overcome the limitations of existing methods and detect both homotypic and heterotypic multiplets, we developed a novel multiplet detection method, ATAC-DoubletDetector, that exploits read count distributions to infer multiplets in snATAC-seq data.
ATAC-DoubletDetector’s efficacy was tested in two snATAC-seq datasets generated from peripheral blood mononuclear cell (PBMC) samples (n=2) and pancreatic islet tissues (n=2). We identified multiplets in these tissues and quantified the algorithm’s efficacy using simulated homotypic and heterotypic multiplets. We found that when snATAC-seq samples were adequately sequenced (e.g., >20k valid read pairs per cell), ATAC-DoubletDetector proved very effective for detecting both homotypic and heterotypic multiplets (recall ranging from 0.74 to 0.89 in PBMCs). In addition, ATAC-DoubletDetector includes a novel clustering-based algorithm that accurately annotates the cellular origins of detected multiplets (85% average accuracy in our simulations), providing further data quality insights. ATAC-DoubletDetector is provided as a user-friendly computational framework with documentation and source code freely available at: https://github.com/UcarLab/ATAC-DoubletDetector.

Results

ATAC-DoubletDetector leverages the fact that the expected number of uniquely aligned reads for a given locus ranges from 0 to 2 per nucleus in snATAC-seq data: 0 = closed chromatin, 1 = open in one allele (i.e., from either maternal or paternal chromosomes), 2 = open in two alleles (i.e., both maternal and paternal chromosomes) (Fig. 1a). A locus can have more than two reads (>2) when: 1) it contains repetitive sequences; 2) there are sequencing or alignment errors; or 3) reads stem from multiplet nuclei.
In the case of multiplets, we expect to observe many loci with >2 reads since their epigenomic profiles are derived from two or more nuclei, resulting in increased accessible DNA. ATAC-DoubletDetector identifies all loci with >2 reads for each cell/nucleus (Fig. 1b) by utilizing sorted read alignments to detect their overlapping read intervals (22-39 bp on average across all samples). A unified list of these loci across all nuclei is then generated to quantify the number of occurrences where >2 reads align to a locus in a given nucleus (Fig. 1c). As a proof of concept, highly significant multiplets (P-values < 10^-324) can be clearly seen harboring many more loci with >2 reads (924-1054 loci) than average (~23 loci per nucleus) (Extended Data Fig. 1). Random occurrences of loci with >2 reads (i.e., due to sequencing or alignment errors) were modeled with the Poisson cumulative distribution function using the mean number of overlaps detected across all cells. Nuclei that harbor significantly more loci with >2 reads are identified as multiplets based on their deviations from the distribution using False Discovery Rate (FDR) (Fig. 1c). To trace multiplets back to their cellular origins, we employed a clustering-based algorithm as part of the ATAC-DoubletDetector framework. Marker peaks are detected to generate reference accessibility profiles for each cell type using single cell clustering. Epigenomic similarity scores at marker peaks are then used to compare multiplet profiles with singlet profiles to differentiate between heterotypic and homotypic multiplets and annotate them.

We demonstrate the utility and performance of our computational framework by applying our methods in PBMC and islet sample datasets (Fig. 1d). First, we simulated artificial multiplets in PBMC and islet samples and quantified ATAC-DoubletDetector’s ability to identify and annotate these multiplets.
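The Poisson-based calling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ATAC-DoubletDetector implementation: it assumes the per-nucleus p-value is the upper Poisson tail P(X >= count) with the mean taken over all nuclei, and it assumes the FDR correction is the Benjamini-Hochberg procedure, which the text does not specify. The function names (poisson_sf, benjamini_hochberg, call_multiplets) and the toy counts are hypothetical.

```python
import math


def poisson_sf(k, mu):
    """Upper tail P(X >= k) for X ~ Poisson(mu), mu > 0, summed term by term."""
    if k <= 0:
        return 1.0
    total = 0.0
    i = k
    while True:
        # Each term is computed in log space so large counts do not overflow.
        term = math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1))
        total += term
        # Past the mode the terms shrink monotonically; stop once negligible.
        if i > mu and term <= total * 1e-17:
            break
        i += 1
    return min(total, 1.0)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda j: pvalues[j])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, pvalues[j] * n / rank)
        adjusted[j] = running_min
    return adjusted


def call_multiplets(loci_counts, fdr=0.01):
    """Flag nuclei whose number of loci with >2 reads exceeds Poisson expectation."""
    mu = sum(loci_counts) / len(loci_counts)  # mean over all nuclei
    pvalues = [poisson_sf(c, mu) for c in loci_counts]
    return [q < fdr for q in benjamini_hochberg(pvalues)]


# Toy example: nine nuclei near the background level and one extreme nucleus.
flags = call_multiplets([20, 25, 22, 23, 24, 21, 26, 23, 22, 950])
```

Because the mean is computed over all nuclei, multiplets themselves inflate the background estimate slightly; the sketch keeps that behaviour because the text describes the mean as taken across all cells.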
Second, we compared ATAC-DoubletDetector to ArchR, measuring their overall performances and their ability to detect simulated heterotypic and homotypic multiplets. Finally, we measured the efficacy of our annotation method and analyzed multiplet cellular origins to understand whether cell type influences the rate of multiplet occurrences.

ATAC-DoubletDetector detects heterotypic and homotypic multiplets in PBMC and islet samples. We generated snATAC-seq libraries from two human PBMC and two human pancreatic islet samples using the 10x Genomics Chromium platform3. Sequence reads were preprocessed using the Cell Ranger ATAC pipeline (Methods), resulting in an average of 5,559 and 6,173 nuclei per sample and an average of 24,393 and 16,625 valid read pairs per cell for PBMC and islet samples, respectively (Fig. 2a). Valid read pairs refer to all pairs of paired-end reads that align to autosomes and pass quality control flags/thresholds (Methods). Despite deeper sequencing for islet samples, fewer valid read pairs were observed in islet samples compared to PBMC samples (Fig. 2b), which can be explained by increased mitochondrial reads in islets (114,821,502 and 47,522,248 total reads aligned to chrM) compared to PBMCs (2,610,761 and 947,233 total reads aligned to chrM).

Nuclei clustering using an in-house implementation (Methods) of a two-pass clustering method3 for snATAC-seq data identified 16 and 15 clusters for PBMC1 and PBMC2. Correlating pseudo-bulk accessibility profiles of these clusters with accessibility maps from sorted bulk ATAC-seq data10 (Extended Data Fig.
2a,b) grouped them into 5 major cell types: myeloid (including CD14+, CD16 monocytes and conventional dendritic cells), B, CD4+ T, CD8+ T, and NK cells (Extended Data Fig. 2c,d). These annotations were confirmed based on chromatin accessibility patterns at cell-specific marker genes (Extended Data Fig. 3a,b). The same clustering procedure identified 14 and 12 distinct clusters for islet1 and islet2, which were then annotated as alpha, beta, delta, and ductal cells by integrating their accessibility profiles with in-house islet scRNA-seq data (Extended Data Fig. 4a,b). These annotations were confirmed by analyzing the chromatin accessibility patterns at known cell-specific marker genes11 (Extended Data Fig. 4c,d).

We applied ATAC-DoubletDetector on PBMCs and human islet samples using an FDR cutoff of 0.01 (Methods). Nuclei detected as multiplets were distributed throughout all clusters (Fig. 2c-d, Extended Data Fig. 5) and in one case (PBMC1) multiplets formed their own distinct cluster (see selected multiplets in Fig. 2d). The percentage of detected multiplets was higher in PBMCs (7%, 10.84%) compared to islets (5% for both samples) (Fig. 2e), which is likely due to the lower number of valid read pairs per nucleus in islets as previously mentioned (Fig. 2b).

To further study the biological relevance of these detected multiplets, we selected a cluster which exclusively encompassed multiplets (Fig. 2d; PBMC1 selected multiplets) and analyzed their chromatin accessibility profiles (Fig. 2f).
The selected multiplets were characterized by high chromatin accessibility at the promoters of both CD3G (T cell marker gene) and LYZ (monocyte marker gene), suggesting T cell-monocyte multiplets. These results demonstrate how read count distribution information from snATAC-seq can be used to effectively detect multiplets.

ATAC-DoubletDetector effectively detects simulated heterotypic and homotypic multiplets. To quantify the efficacy of ATAC-DoubletDetector, we generated artificial multiplets by randomly selecting 5% of nuclei in a sample and pairing them together to artificially form multiplets (repeated 10 times per sample). This resulted in artificial multiplets at 2.5% of the total number of nuclei within a sample. These artificial multiplets serve as positive multiplet examples and enable us to measure recall (i.e., the fraction of detected artificial multiplets among all artificial multiplets introduced in the sample). We first evaluated ATAC-DoubletDetector’s ability to detect heterotypic multiplets, homotypic multiplets, and a combination of both multiplet types. We then compared its performance with that of another method, ArchR7.

ATAC-DoubletDetector detected heterotypic multiplets introduced in PBMC samples with high recall (average recall 0.80 for PBMC1 and 0.90 for PBMC2 over 10 runs), outperforming ArchR (0.23 and 0.24, respectively) (Fig. 3a). Average recall for ATAC-DoubletDetector was lower in islet1 and islet2 than in PBMCs (0.37 and 0.34 average recall, respectively), whereas the average recall showed improvement for ArchR (0.68 and 0.30 average recall, respectively). The decreased performance of ATAC-DoubletDetector in islets can be explained by the low number of valid read pairs per nucleus in islet samples compared to PBMCs (Fig. 2b).
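The simulation and recall measurement described above can be illustrated with a short Python sketch. It is a hypothetical outline rather than the authors' code: the helper names (make_artificial_multiplets, recall) and the barcode labels are assumptions, and merging the reads of each selected pair into one profile is left out because the text does not describe how it is done.

```python
import random


def make_artificial_multiplets(barcodes, fraction=0.05, seed=0):
    """Randomly select a fraction of nuclei and pair them into artificial multiplets."""
    rng = random.Random(seed)
    n_selected = round(len(barcodes) * fraction)
    n_selected -= n_selected % 2  # an even number is needed to form pairs
    chosen = rng.sample(list(barcodes), n_selected)
    return [(chosen[i], chosen[i + 1]) for i in range(0, n_selected, 2)]


def recall(detected, artificial):
    """Fraction of artificial multiplets that were detected."""
    artificial = set(artificial)
    if not artificial:
        raise ValueError("no artificial multiplets to evaluate")
    return len(artificial & set(detected)) / len(artificial)


# 200 nuclei: 5% (10 nuclei) are selected and paired into 5 multiplets,
# i.e. 2.5% of the total number of nuclei.
barcodes = [f"nucleus_{i}" for i in range(200)]
pairs = make_artificial_multiplets(barcodes)

# Suppose a detector recovered the first two artificial multiplets.
example_recall = recall(pairs[:2], pairs)
```

Repeating the call with different seeds corresponds to the repeated runs per sample mentioned in the text; restricting the pool to one cell type, or to two different cell types, would give homotypic or heterotypic simulations.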
Notably, ATAC-DoubletDetector was equally effective for detecting homotypic multiplets (average recall 0.82 and 0.91 for PBMC1 and PBMC2, 0.38 and 0.31 for islet1 and islet2) (Fig. 3b), demonstrating the utility of using read counts to detect multiplets. As expected, ArchR had low recall for detecting homotypic multiplets (average between 0.07 and 0.11 for all samples), as this algorithm identifies multiplets with distinct genomic profiles from singlets. Finally, we measured the efficacy of simultaneously detecting both types of multiplets by introducing a more realistic 1:1 ratio of heterotypic to homotypic multiplets (Extended Data Fig. 6a). As expected, ATAC-DoubletDetector’s average recall values were similar (0.82 and 0.92 for PBMC1 and PBMC2, 0.34 and 0.33 for islet1 and islet2, respectively), while those of ArchR were lower (0.13 and 0.16 for PBMC1 and PBMC2, 0.40 and 0.17 for islet1 and islet2), likely due to its poor homotypic multiplet detection performance.

To further study how the number of valid read pairs influences ATAC-DoubletDetector’s performance, we generated artificial multiplets using nuclei with varying numbers of reads per nucleus (Fig. 3c-d, Extended Data Fig. 6b). We observed a noticeable increase in average recall (> 0.96 recall) for ATAC-DoubletDetector when the number of valid read pairs was above 47.2k, corresponding to an average of 23.6k valid read pairs per nucleus. In contrast, ArchR did not show significant differences in performance with respect to the number of valid read pairs per nucleus (Extended Data Fig. 6b), as it relies more on genomic profile similarity to detect multiplets.
More exhaustive analyses of 100 repetitions per sample further confirmed that the majority (96%, 98% for PBMC1 and PBMC2 and 83%, 72% for islet1 and islet2) of multiplets with >40k valid read pairs (i.e., multiplets formed from nuclei with 20k valid read pairs each) were detected with this method (Extended Data Fig. 7). Together, these analyses suggest that when >20k valid read pairs are captured per nucleus, ATAC-DoubletDetector is very effective in detecting both homotypic and heterotypic multiplets from snATAC-seq data.

To compare ATAC-DoubletDetector and ArchR performances, we ran ArchR with recommended parameter settings (i.e., k=10 nearest neighbors and 1.5 filter ratio). Only 38 to 78 multiplets across all samples were detected by both methods (Fig. 3e-f, Extended Data Fig. 8, Extended Data Fig. 9a-b) and the majority of these multiplets were among the ones that formed their own clusters (i.e., heterotypic multiplets). For example, the majority of the selected multiplets in the cluster shown in Fig. 2d were detected by both methods (Extended Data Fig. 8); these multiplets have unique epigenomic profiles and are hence easier to detect with the synthetic multiplet-based method employed by ArchR. Notably, 47.35% of delta cells were identified as multiplets by ArchR for islet1 (Fig. 3f, Extended Data Fig. 8). Delta cells resemble both alpha and beta cells in their genomic profile, hence these cells were mistakenly detected as multiplets by ArchR, demonstrating a pitfall for synthetic multiplet-based methods. Multiplets are expected to have higher read counts than singlets since they combine chromatin accessibility profiles of more than one nucleus.
In alignment with this, multiplets detected by ATAC-DoubletDetector had significantly higher valid read pair counts compared to singlets (average valid read pairs of 46,980 for multiplets and 18,561 for singlets for all samples) (P-values < 1.375 x 10^-152). In contrast, read counts for ArchR multiplets were significantly lower (average P-values < 1.016 x 10^-57) than those for ATAC-DoubletDetector multiplets, with read counts closer to those of singlets (average read count per cell 23,703 for ArchR multiplets and 19,951 for singlets) (Extended Data Fig. 9c). In summary, these analyses showed that when there is a sufficient number of valid read pairs per cell (> 20k), count-based methods are advantageous over synthetic multiplet-based methods as they can accurately detect both homotypic and heterotypic multiplets.

Marker peaks can effectively annotate cellular origins of multiplets. Cellular origin annotations of multiplets were inferred using a three-step algorithm (Fig. 4a). First, nuclei were clustered and annotated to their respective cell types. Second, marker peaks were detected for each cluster/cell type. Third, we calculated the epigenomic similarity of each multiplet to different cell types by counting marker peak reads for the multiplet and the k=15 nearest neighbor nuclei (Methods). Cluster similarity scores were then used to annotate multiplets. For example, in PBMCs, for each multiplet we calculated 5 scores, where each score represents the similarity of the multiplet epigenome to that of one of the five studied clusters (Fig. 4b).
The distribution of these similarity scores are used to 177 first distinguish heterotypic and homotypic multiplets, by comparing their profiles to annotated singlets (Methods). 178 For example, in PBMC1, nuclei in B cell cluster (cluster 5) had high similarity score for B cell marker peaks and 179 low scores for all other cell types (Figure 4b). In contrast, nuclei in cluster 13 had high similarity scores for NK, 180 CD4+ T, CD8+ T and myeloid cells, a signature of heterotypic multiplets (Fig. 4b). Once the multiplet type is 181 identified, their cellular origins are annotated using the highest scoring cell type(s). 182 We evaluated the efficacy of this annotation pipeline using artificial multiplets, where cells were randomly 183 selected and paired together to form both heterotypic and homotypic multiplets. Using these artificial multiplets, 184 we categorized multiplets as homotypic or heterotypic and annotated multiplets with respect to the number of 185 cell types associated with them. We identified the cellular origins of both types of multiplets with an average 186 accuracy of 82.47%, 85.87% in PBMC1, PBMC2 and 85.7%, 85.5% in islet1, islet2 (Fig. 4c). For example, in 187 PBMC1, 96% of all simulated B and myeloid multiplets were correctly annotated. Cell types that have similar 188 functions, hence similar epigenomes, observed lower annotation accuracies; such as 86% for simulated NK and 189 CD8+ T cells. Our annotations were equally effective for annotating both homotypic and heterotypic multiplets, 190 showing 83.65% accuracy on average to annotate homotypic multiplets and 85.59% accuracy to annotate 191 heterotypic multiplets. 192 193 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. 
Multiplet cell-type compositions reflect cellular compositions of the underlying tissue. Using ATAC-DoubletDetector’s annotation pipeline, we annotated all detected multiplets in PBMCs and islets. Inspection of aggregate accessibility profiles at marker gene promoters (MS4A1, CD3G, CD4, CD8A, TREM1, NKG7, and KLRF1) for each cell type in PBMC2 (Fig. 5a) revealed that annotated multiplets have accessibility at the relevant marker gene promoters. For instance, homotypic B cell multiplets had a strong signal at the promoter of the B cell marker gene MS4A1, whereas heterotypic multiplets originating from CD8+ T cells and B cells had high accessibility signals for both the B cell marker gene MS4A1 and the CD8+ T cell marker gene CD8A.

As expected, homotypic multiplets clustered together with the underlying cell type, whereas heterotypic multiplets typically formed their own clusters (Fig. 5b-c, Extended Data Fig. 10a-b). The majority of heterotypic multiplets for islet1 were found between major cell type clusters and near the delta cell cluster, while homotypic multiplets resided within the boundaries of single cell type clusters (Fig. 5d). For PBMC1, the majority of multiplets resided within the multiplet cluster we previously identified and as a subcluster of CD8+ T cells (Fig. 5e). As before, homotypic multiplets were found within the corresponding cell type clusters. Overall, the majority of detected multiplets were homotypic (76.7-84.3% in islets, 63-78.7% in PBMCs), with cell types distributed according to their cell proportions for both homotypic and heterotypic multiplet types (Fig. 5d-e, Extended Data Fig. 10c-d).
Indeed, in both tissues, the propensity of a cell type to form a multiplet was positively correlated with the percentage of that cell type within the tissue (Pearson’s R = 0.824 and 0.897, P-value < 0.087 and 0.04 for PBMC1 and PBMC2; Pearson’s R = 0.931 and 0.475, P-value < 0.07 and 0.525 for islet1 and islet2, respectively) (Fig. 5f-g, Extended Data Fig. 10e-f), suggesting that snATAC-seq multiplets are more likely to occur randomly than through specific interactions between nuclei. For example, the most abundant cell type in islet1 was beta cells (46.62% of the cell population), which contributed to 51.96% of multiplets (Fig. 5f). Heterotypic multiplet annotations in islet samples mostly originated from alpha, beta and delta cells. In PBMCs, the most frequent heterotypic multiplets were those stemming from CD4+ T and CD8+ T cells (Fig. 5f, Extended Data Fig. 10e).

DISCUSSION

Detecting and discarding multiplets from snATAC-seq data is a critical step for improving data quality, as multiplets can form their own clusters and confound downstream analyses. ATAC-DoubletDetector exploits read count distributions for a given nucleus to effectively detect and eliminate multiplets without requiring prior knowledge of cell-type information. It accomplishes this by first efficiently counting loci with >2 uniquely aligned reads per nucleus and then identifying nuclei with read count distributions deviating from expectations.
Unlike other methods that utilize artificial multiplet examples to identify putative multiplets (e.g., ArchR), ATAC-DoubletDetector is capable of detecting both homotypic multiplets (i.e., multiplets originating from the same cell type) and heterotypic multiplets (i.e., multiplets originating from different cell types). Eliminating heterotypic multiplets is essential for improved clustering and differential analyses between clusters and samples, whereas homotypic multiplets introduce bias in allele-specific analyses. Hence, detecting and removing both types of multiplets will improve downstream analyses.

The number of valid read pairs per cell is the most important factor affecting the performance of ATAC-DoubletDetector. When read depth per nucleus is sufficiently high (e.g., >20k read pairs per nucleus), ATAC-DoubletDetector is very effective in detecting both heterotypic and homotypic multiplets (average recall = 0.836 to detect artificial multiplets in PBMCs). Since ATAC-DoubletDetector does not depend on artificial multiplet examples, it is not inherently biased towards cell types that resemble others. For example, in islets, delta cells transcriptionally resemble alpha and beta cells, hence artificial multiplets generated by combining alpha and beta cells have genomic profiles that resemble delta cells. These instances are particularly challenging for methods that depend on artificial multiplet examples (e.g., ArchR7 for snATAC-seq, DoubletFinder8 and Scrublet9 for scRNA-seq). In alignment with this, ArchR categorized 47.35% of delta cells as multiplets in islet1. Given the success of ATAC-DoubletDetector in identifying multiplets from snATAC-seq data with enough reads per nucleus, it can also be effective in detecting and eliminating multiplets in recent multi-ome transcriptome and epigenome assays12.
Epigenomic signal at marker peaks is an effective way to annotate cellular origins of multiplets, with which we achieved 84.69% accuracy on average in simulations. Annotations of detected multiplets showed that the majority are homotypic. Furthermore, the propensity of nuclei to form multiplets was positively correlated with the abundance of that cell type within the tissue. Since cells are lysed and nuclei are profiled in snATAC-seq protocols3, these assays are unlikely to be prone to biological multiplets caused by cell-cell interactions. Therefore, snATAC-seq multiplets likely occur randomly among all cells; hence the most abundant cells are the most likely to form multiplets.

Quantifying the efficacy of multiplet detection methods is a challenging task since true examples of singlets and multiplets are not known. To overcome this challenge, we evaluated ATAC-DoubletDetector’s ability to capture multiplets by simulating artificial multiplets, enabling us to measure recall. ATAC-DoubletDetector identified 5-10.84% of cells as multiplets in islet and PBMC samples, which was in alignment with expectations. Hence, we believe false positive calls are also limited in our method. Although we quantified our method by forming artificial multiplets, the ATAC-DoubletDetector pipeline can be easily extended to capture and annotate multiplets that include data from multiple nuclei.

Multiplets are inevitable in single cell sequencing, and better data analyses call for their removal.
ATAC-DoubletDetector introduces a novel and effective count-based solution for detecting multiplets and provides a framework for annotating their cellular origins, improving future downstream analyses. ATAC-DoubletDetector code and documentation are freely available at https://github.com/UcarLab/ATAC-DoubletDetector, providing an easy-to-use interface for users of all backgrounds. Our multiplet detection algorithm is fast and can be incorporated into data analysis pipelines; processing an average library (i.e., ~5,886 cells at ~20,508 valid read pairs per cell) takes <30 minutes.

METHODS

snATAC-seq cell labeling, capture, library preparation, and sequencing. For single nucleus ATAC sequencing (snATAC-seq) experiments, viable single cell suspensions from each sample were used to generate snATAC-seq data using the 10X Chromium platform according to the manufacturer’s protocols (Demonstrated Protocol Nuclei Isolation for ATAC Sequencing Document CG000169; Chromium Single Cell ATAC_User Guide RevB Document CG000168). Briefly, >100,000 cells of interest were centrifuged, the supernatant was removed without disrupting the cell pellet, and Lysis Buffer was added for 5 minutes on ice to generate isolated and permeabilized nuclei, followed by quenching by dilution with Wash Buffer. After centrifugation to pellet the washed nuclei, Diluted Nuclei Buffer was used to re-suspend nuclei at the desired nuclei concentration, as determined using a Countess II FL Automated Cell Counter, and the nuclei were combined with ATAC Buffer and ATAC Enzyme to form a Transposition Mix.
Transposed nuclei were immediately combined with Barcoding Reagent, Reducing Agent B and Barcoding Enzyme and loaded onto a 10X Chromium Chip E for droplet generation, followed by library construction. The barcoded sequencing libraries were subjected to bead clean-up and checked for quality on an Agilent 4200 TapeStation, quantified by qPCR (KAPA Biosystems Library Quantification Kit for Illumina platforms), and pooled for sequencing on an Illumina NovaSeq 6000 S2 flow cell (paired-end libraries, 2x50 bp).

Human islet isolation

Human islets were obtained through partnerships with the Integrated Islet Distribution Program (IIDP, http://iidp.coh.org/). Assessment of human islet function was performed by islet GSIS static incubation assay on the day after arrival, following the IIDP protocol. Primary human islets were cultured in Prodo media (PIM-S + supplements PIM-G + PIM-ABS) in 5% CO2 at 37 °C for ~24 hours prior to beginning studies. In preparation of single cell suspensions for the 10x platform, human islets were dispersed with StemPro Accutase (Thermo Fisher Scientific), 1 ml/1000 IEq, for 10 min at 37 °C. The islet single cell suspension was washed three times in PBS-0.03% BSA, and the cell number was determined using a Countess II FL Automated Cell Counter (Life Tech). Nuclei isolation for single cell ATAC sequencing was performed following the 10x protocol (https://assets.ctfassets.net/an68im79xiti/5g035d2ngCW1aB9DFqPphO/71445a59fb282ea273a866c26cb5d319/CG000169_DemonstratedProtocol_NucleiIsolation_ATAC_Sequencing_RevD.pdf, based on the OMNI nuclei prep by Corces et al.13).
Identifying snATAC-seq loci with >2 reads. Position-sorted paired-end read alignments from snATAC-seq data are compared to detect all loci with >2 unique reads per nucleus. To avoid instances where reads overlap due to technical reasons, we retained only read pairs that satisfy the following criteria, evaluated with the HTSJDK14 library: 1) ReadPairedFlag = True, 2) ReadUnmappedFlag = False, 3) MateUnmappedFlag = False, 4) SecondaryOrSupplementary = False, 5) DuplicateReadFlag = False, and 6) ReferenceIndex = MateReferenceIndex (i.e., both reads of the pair map to the same chromosome). To reduce overlaps due to alignment errors, reads are excluded if they have i) mapping quality scores less than or equal to 30, or ii) insert sizes (i.e., the end-to-end distance between 5’ and 3’ read positions) greater than 900 bp (~6 nucleosomes).

To identify instances of >2 reads overlapping at any specific locus, all intervals are first identified for which an overlap was observed for at least two valid read pairs. Reads defining each interval are then compared to one another to identify all subintervals that exceed the specified overlap threshold (i.e., 2). To efficiently identify these subintervals, for each subset, interval breakpoints are defined at the start and end positions of each paired-end read. A value of 1 is assigned to all breakpoints originating from start positions, and -1 to all breakpoints originating from end positions. Interval breakpoints are then visited in position-sorted order to generate a cumulative sum of the assigned values. The cumulative sum indicates the total number of overlaps between two consecutive interval breakpoints and efficiently identifies all subintervals with a number of overlaps greater than the specified threshold.
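The breakpoint sweep described above can be illustrated with a short Python sketch. This is not the authors' implementation: the function name `subintervals_exceeding`, the half-open coordinate convention, and the example fragments are assumptions made purely for illustration, and the sketch handles one nucleus on one chromosome at a time.

```python
from itertools import groupby


def subintervals_exceeding(fragments, threshold=2):
    """Return sub-intervals covered by more than `threshold` fragments.

    `fragments` is a list of (start, end) pairs for one nucleus on one
    chromosome. Coordinates are treated as half-open [start, end), which is
    an assumption of this sketch.
    """
    # +1 at every fragment start, -1 at every fragment end.
    breakpoints = []
    for start, end in fragments:
        breakpoints.append((start, 1))
        breakpoints.append((end, -1))
    breakpoints.sort()

    result = []
    depth = 0          # cumulative sum: fragments covering the current stretch
    open_start = None  # start of the stretch currently above the threshold
    # Apply all breakpoints at the same position before testing the depth.
    for position, group in groupby(breakpoints, key=lambda b: b[0]):
        depth += sum(delta for _, delta in group)
        if depth > threshold and open_start is None:
            open_start = position
        elif depth <= threshold and open_start is not None:
            result.append((open_start, position))
            open_start = None
    return result


# Hypothetical fragments: three of them overlap between positions 180 and 200.
fragments = [(100, 200), (150, 250), (180, 300), (400, 500)]
print(subintervals_exceeding(fragments))
```

For simplicity the sketch sorts the breakpoints explicitly; grouping reads by barcode with a dictionary before calling it would correspond to the hash-based lookup of nuclei mentioned in the runtime discussion.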
Once all subintervals satisfying the threshold are identified for a subset of reads, the algorithm repeats this process for the remaining paired-end read subsets. Each step is performed using a linear time algorithm (i.e., O(n), where n is the total number of reads), with an additional O(log(m)) overhead per read (where m is the number of nuclei) to identify the read’s nucleus of origin, resulting in an O(n*log(m)) runtime. The runtime can be reduced to an expected O(n) by instead using an appropriate hash function for cell identifiers/barcodes. Note that this analysis assumes that reads are sorted beforehand; otherwise, the runtime is dominated by the time it takes to sort reads by chromosome and start position (i.e., O(n*log(n))).

Detecting significant multiplets from snATAC overlap counts. Loci with >2 reads were first filtered using simple repeats, segmental duplications, RepeatMasker and blacklist regions obtained from the UCSC Genome Browser15 and ENCODE16,17. Next, filtered regions from all nuclei were merged if they overlapped by at least one base pair. Using this unified list of loci, a binary matrix was generated in which rows represent loci with >2 reads for at least one nucleus, and columns represent the individual cells within the sample. Values within the matrix were set to 1 if the cell and genomic region combination had >2 overlapping reads, and 0 otherwise.
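As a rough illustration, the binary matrix described above and the Poisson test applied to its row and column sums (detailed in the text that follows) might be sketched in plain Python as shown here. The function names, the dictionary input format, and the use of the upper tail P(X >= k) as the Poisson probability are assumptions of this sketch rather than details taken from the ATAC-DoubletDetector code; only the Benjamini-Hochberg correction and the 0.01 cutoff come from the text.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term in log space."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    total, i = 0.0, k
    while True:
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Terms shrink monotonically once i > lam, so the sum has converged.
        if i > lam and term <= total * 1e-16:
            return min(total, 1.0)
        i += 1


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def poisson_outliers(counts, alpha=0.01):
    """Indices whose count is significantly above the mean count."""
    if not counts:
        return set()
    mean = sum(counts) / len(counts)
    adjusted = benjamini_hochberg([poisson_upper_tail(c, mean) for c in counts])
    return {i for i, q in enumerate(adjusted) if q < alpha}


def call_multiplets(loci_per_nucleus, alpha=0.01):
    """`loci_per_nucleus` maps a barcode to its set of merged loci with >2 reads."""
    nuclei = sorted(loci_per_nucleus)
    loci = sorted(set().union(*loci_per_nucleus.values()))
    # Binary matrix: rows are loci, columns are nuclei.
    matrix = [[int(locus in loci_per_nucleus[n]) for n in nuclei] for locus in loci]

    # Row sums flag loci shared by unexpectedly many nuclei (inferred repeats).
    repeats = poisson_outliers([sum(row) for row in matrix], alpha)
    kept = [row for r, row in enumerate(matrix) if r not in repeats]

    # Column sums flag nuclei with unexpectedly many >2-read loci (multiplets).
    column_sums = [sum(row[c] for row in kept) for c in range(len(nuclei))]
    return [nuclei[c] for c in sorted(poisson_outliers(column_sums, alpha))]


# Toy input (hypothetical): 19 nuclei with 2 loci each and one with 30 loci.
toy = {f"n{i}": {f"L{i}", f"L{i + 1}"} for i in range(19)}
toy["doublet"] = {f"D{j}" for j in range(30)}
print(call_multiplets(toy))
```

A dense list-of-lists matrix is used only to keep the example readable; a real sample with thousands of nuclei would call for a sparse representation.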
From this matrix, multiplets can be detected using column sums (i.e., the total number of >2 read overlap instances for each nucleus), while repetitive element sequences can be inferred using row sums (i.e., the total number of cells with >2 reads at the same locus).

The events of observing >2 reads overlapping within the same region for multiple cells, or across multiple regions within the same cell, can be modeled using the Poisson distribution. Occurrences of these events are independent, are counted within set intervals (i.e., counting regions across the entire genome within cells or counting cells within the same genomic regions), are either present or not within these intervals, and have a constant average rate of occurring, satisfying the assumptions of the Poisson distribution. We therefore detected significant multiplets and inferred repetitive sequences using the Poisson cumulative distribution function, using the respective mean row and column sums as the expected values to calculate Poisson probabilities. In this process, we first use Poisson probabilities to infer repetitive sequences, where a significant number of nuclei have >2 reads at the same genomic region. All inferred repetitive sequence loci are removed from further analysis. Next, we calculate the Poisson probability of observing more loci with >2 reads than expected in a nucleus (i.e., multiplets) using column sums. Poisson probabilities for both repetitive sequence inference and multiplet detection were corrected using the Benjamini-Hochberg procedure to adjust for multiple hypothesis testing. Repetitive sequences and multiplets were called by selecting regions or cells with adjusted Poisson probabilities less than 0.01.

Multiplet annotation pipeline.
Detected multiplets are annotated using clusters identified for snATAC-seq samples, merging them with respect to the specific cell types present in the cell population. In our study, PBMC clusters were merged to represent CD4+ T, CD8+ T, natural killer (NK), myeloid and B cells, and islet clusters were merged to represent alpha, beta, delta and ductal cells. Marker peaks for all cell type clusters with at least 150 cells were identified with the FindMarkers function in Seurat18, using the logistic regression setting. For consistency, the top 100 marker peaks are then identified for each cell type cluster based on the Bonferroni-adjusted p-value of average log fold changes.

To account for data sparsity in snATAC-seq data, aggregate read profiles are calculated for each cell and marker peak. Aggregate read profiles are found by taking average read counts over each cell’s 15 nearest neighbors, identified using the top 50 singular value decomposition (SVD) components. The empirical cumulative distribution function in R (i.e., ecdf) is then used to find the abundance of reads for each cluster’s marker peaks. Distribution scores represent the percent of each cell type’s accessibility profiles present within the cell. To distinguish multiplet types (i.e., heterotypic or homotypic), singlet profiles were calculated for each cell type in the sample. For each cell type’s singlet cells, abundance scores at every marker peak were averaged to find the representative abundance score profile for that cell type. Multiplets that have a profile close to their most abundant cell type’s singlet profile were classified as homotypic.
Euclidean distance was used to measure the similarity between the profiles of multiplets and singlets. Mixture models were then fitted to the distances with the Mclust R package19 to group multiplets by their closeness to the corresponding cell type’s singlet profile. Multiplets in the group with the largest distance to the singlet profile are considered heterotypic. Multiplets are then annotated using the top 1 (for homotypic) or top 2 (for heterotypic) abundance scores.

snATAC-seq nuclei clustering. To cluster nuclei from snATAC-seq data, we employed an in-house implementation (https://github.com/UcarLab/snATACClusteringPipeline) of a previously described two-pass clustering method3, with notable differences. First, we restrict the number of 2.5 kb bins in the first-pass clustering to the top 50k bins, up from 20k bins. For second-pass clustering, we increase the number of peaks to include all peaks identified in pass 1, up to 200k.

Integration of scRNA-seq and snATAC-seq data. Integrative clustering and analysis of single cell transcriptomes and single nucleus epigenomes was performed using the R package Seurat18,20. First, gene activity scores were derived from the resultant snATAC-seq peak count matrix using the CreateGeneActivityMatrix function with default parameters. Next, single nuclei with < 5,000 total read counts were discarded from analyses. The resultant single nuclei and gene activity scores were log-normalized and scaled. Using the processed scRNA-seq data (also analyzed with Seurat), we identified anchors between the
snATAC-seq gene activity score matrix and the scRNA-seq gene expression matrix following the methodology described by Butler et al. (2018)18. After identifying anchors between the datasets, cell-type labels from the scRNA-seq dataset were transferred to the snATAC-seq dataset, and a prediction and a confidence score were assigned to each cell.

Simulating artificial multiplets to measure multiplet detection performances. To measure recall for detecting multiplets, artificial multiplets were simulated by combining accessibility profiles of nuclei within each sample population tested. For each sample, a number of cells equal to 5% of the total cell population was randomly selected, and these cells were paired together to introduce artificial multiplets equivalent to 2.5% of the total population. Introducing 2.5% artificial multiplets ensured that they did not outnumber the real multiplets (5-11% of cells across all samples) present in the data. Cell pairs were randomly reselected until they formed heterotypic multiplets, homotypic multiplets, or a 1:1 ratio of heterotypic and homotypic multiplets based on cell type annotations. Simulations measuring the effect of the number of valid read pairs per nucleus did not have restrictions based on cell type; cells were selected based on read depth when stratifying by number of valid read pairs (i.e., Fig. 3c-d, Extended Data Fig. 6b) or completely at random (i.e., Extended Data Fig. 7). Once cell pairs were identified, artificial multiplets were introduced by generating modified barcode mappings (for ATAC-DoubletDetector) or barcodes in fragment files (for ArchR7), which assigned artificial multiplet reads to the same cell identifier (i.e., that of the first nucleus in the pair). Artificial multiplets were simulated in 10 or 100 runs depending on the analysis.
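The barcode-remapping step of this simulation can be sketched as follows. The function name `simulate_multiplet_barcodes`, its parameters, and the toy barcodes are hypothetical, and the sketch deliberately omits the cell-type and read-depth restrictions described above; it only shows how pairing 5% of cells yields artificial multiplets amounting to 2.5% of the population.

```python
import random


def simulate_multiplet_barcodes(barcodes, fraction=0.05, seed=0):
    """Pair randomly chosen cells and remap the second barcode onto the first.

    Returns (mapping, pairs): `mapping` sends every original barcode to the
    identifier its reads should carry after simulation, and `pairs` lists the
    (first, second) barcodes merged into each artificial multiplet.
    """
    rng = random.Random(seed)
    n_selected = round(len(barcodes) * fraction)
    n_selected -= n_selected % 2  # an even number of cells is needed for pairs
    chosen = rng.sample(list(barcodes), n_selected)
    pairs = list(zip(chosen[0::2], chosen[1::2]))

    mapping = {barcode: barcode for barcode in barcodes}
    for first, second in pairs:
        # Reads of the second nucleus are assigned to the first nucleus.
        mapping[second] = first
    return mapping, pairs


# Toy run with 200 hypothetical barcodes: 10 cells selected, 5 multiplets formed.
barcodes = [f"bc{i:03d}" for i in range(200)]
mapping, pairs = simulate_multiplet_barcodes(barcodes)
print(len(pairs), len(set(mapping.values())))
```

Changing `seed` between runs would give independent replicates, in the spirit of the 10 or 100 simulation runs mentioned in the text.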
CODE AVAILABILITY

ATAC-DoubletDetector is provided as a user-friendly computational framework with documentation and source code freely available at: https://github.com/UcarLab/ATAC-DoubletDetector.

REFERENCES

1. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
2. Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
3. Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
4. Rai, V. et al. Single-cell ATAC-Seq in human pancreatic islets and deep learning upscaling of rare cells reveals cell-specific type 2 diabetes regulatory signatures. Mol. Metab. 32, 109–121 (2020).
5. Lareau, C. A., Ma, S., Duarte, F. M. & Buenrostro, J. D. Inference and effects of barcode multiplets in droplet-based single-cell assays. Nat. Commun. 11, 866 (2020).
6. Fang, R. et al. SnapATAC: A Comprehensive Analysis Package for Single Cell ATAC-seq. https://www.biorxiv.org/content/10.1101/615179v3 (2020).
7. Granja, J. M. et al. ArchR: An integrative and scalable software package for single-cell chromatin accessibility analysis. http://biorxiv.org/lookup/doi/10.1101/2020.04.28.066498 (2020) doi:10.1101/2020.04.28.066498.
8. McGinnis, C. S., Murrow, L. M. & Gartner, Z. J. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst.
8, 329-337.e4 (2019).
9. Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 8, 281-291.e9 (2019).
10. Ucar, D. et al. The chromatin accessibility signature of human immune aging stems from CD8+ T cells. J. Exp. Med. 214, 3123–3144 (2017).
11. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
12. Ma, S. et al. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin. Cell 183, 1103-1116.e20 (2020).
13. Corces, M. R. et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat. Methods 14, 959–962 (2017).
14. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinforma. Oxf. Engl. 25, 2078–2079 (2009).
15. Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).
16. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
17. Davis, C. A. et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
18. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
19. Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E.
mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. R J. 8, 289–317 (2016).
20. Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888-1902.e21 (2019).

Fig. 1: Overview of detecting multiplets in snATAC-seq. a, Tn5 transposase cleaves accessible DNA at maternal and paternal chromosomes. The number of ATAC-seq reads per locus per nucleus is expected to be 0, 1, or 2. b, Instances where more than 2 (>2) reads are observed for any locus in a cell are identified using an efficient algorithm for counting the number of overlapping reads. c, The Poisson cumulative distribution function is used to detect multiplets based on deviations from the expected number of loci with >2 reads. d, Overview of downstream analyses: 1) quantification of multiplet detection performance using artificial multiplets, 2) comparison of ATAC-DoubletDetector to the alternative method ArchR, 3) annotation of the cellular origins of multiplets using a clustering-based method.

Fig. 2: ATAC-DoubletDetector identifies heterotypic and homotypic multiplets in human PBMC snATAC-seq data. a, Summary of snATAC-seq samples generated and used in this study from human PBMCs and islets. b, Valid read pair distributions for PBMC and islet snATAC-seq samples.
c, PBMC clusters were annotated based on their correlations with sorted bulk ATAC-seq data (see Extended Data Fig. 2). d, All multiplets (heterotypic and homotypic) detected by ATAC-DoubletDetector in PBMC1. Selected multiplets refer to multiplets for which aggregated profiles are shown in panel f of this figure. e, The number of cells and percentage of multiplets detected by ATAC-DoubletDetector in PBMC and islet samples. f, Chromatin accessibility profiles of CD4+ T cells, myeloid cells, and selected multiplets around the T cell marker gene (CD3G) and the myeloid cell marker gene (LYZ). CD4+ T and myeloid cells show strong accessibility signals for their relevant marker genes, while selected multiplets have accessible chromatin for both marker genes.

Fig. 3: ATAC-DoubletDetector detects multiplets with high recall when read depth is sufficient. a-b, Recall for detecting heterotypic (a) and homotypic (b) artificial multiplets. ATAC-DoubletDetector consistently detected both heterotypic and homotypic multiplets with similar recall, while ArchR was only effective for predicting heterotypic multiplets for data with high heterogeneity. c-d, Performance of detecting artificial multiplets at increasing valid read pair (insertions) distributions for PBMC1 (c) and islet1 (d). ATAC-DoubletDetector effectively detects multiplets at >40k valid read pairs per nucleus. ArchR’s performance did not show the same dependence on read depth. e, Reference annotations for islet1. Islet1 annotations correspond to alpha, beta, delta and ductal cell types.
f, Representative UMAP plots for multiplets detected by ATAC-DoubletDetector and ArchR for islet1 (other samples shown in Extended Data Fig. 8). We identified islet clusters for alpha, beta, delta, and ductal cells. The majority of multiplets detected were not shared between the two methods. Heterotypic multiplets were the most common. Note: ArchR detected the majority of delta cells as multiplets.

Fig. 4: Multiplet cell-type origins are predicted with high accuracy. a, Overview of the cell origin annotation pipeline. First, cells are clustered. Second, marker peaks are identified. Third, multiplets and their k-nearest neighbor cells are used to generate cluster similarity scores. b, Example of aggregate cluster profiles for predicting cell origin annotations. Clusters corresponding to cell types show a strong signal for their respective cell types (e.g., Cluster 5), while clusters corresponding to multiplets show a mixed profile of cell types (e.g., Cluster 13). c, Heatmaps of cell origin annotation accuracies for predicting artificial multiplets derived from cells of the specific cell type pairings. Multiplet annotations showed high accuracies for the majority of cell type compositions.

Fig. 5: Majority of multiplets are homotypic and correspond to cell type proportions. a, Accessibility maps for cell origin annotations for multiplets identified in PBMC2.
Homotypic multiplets show strong signal at their respective marker genes. Heterotypic multiplets show a combined signal at the marker genes of their annotated cell types. b-c, UMAP clustering for heterotypic and homotypic multiplet annotations in PBMC1 (b) and islet1 (c). Heterotypic multiplets are found between major cell type clusters. Homotypic multiplets are observed on the periphery of major cell type clusters. d-e, Heterotypic and homotypic multiplet cell distributions (left bars) and homotypic cell type annotations (right bars) for PBMC (d) and islet (e) samples. The majority of multiplets are annotated as homotypic. Homotypic cell type distributions are similar to the overall proportions of each cell type in their respective samples. f-g, Cell and multiplet proportions for PBMC2 (f) and islet1 (g). Multiplet cell type proportions are highly correlated with overall cell proportions.

Extended Data Fig. 1: Multiplets harbor many loci with >2 reads. The binary matrix of loci with >2 reads per cell reveals high-confidence multiplets (marked by arrows) that harbor many loci with >2 reads. These multiplets can be clearly distinguished from the other cells in the subset.

Extended Data Fig. 2: Pseudo-bulk snATAC-seq profile correlations with sorted bulk ATAC-seq revealed 5 major cell types.
a, b, Spearman correlation heatmaps between pseudo-bulk (snATAC) and sorted bulk ATAC-seq accessibility profiles for PBMC1 (a) and PBMC2 (b). Pseudo-bulk profiles cluster with five major cell types: Myeloid, B, CD4+ T, CD8+ T and Natural Killer (NK). c, d, Annotated UMAP clusters for PBMC1 (c) and PBMC2 (d). Myeloid and B cells form distinct clusters in both samples. CD4+ T, CD8+ T and NK cell types share more accessible loci and tend to cluster more closely to one another.

Extended Data Fig. 3: Annotated snATAC-seq clusters reflect accessibility at cell-specific promoters. a, b, Annotated UMAPs for PBMC1 (a) and PBMC2 (b) at the promoters of CD3G (T cell marker), CD4 (CD4+ T cell marker), CD8A (CD8+ T cell marker), MS4A1 (B cell marker), NKG7 (NK cell marker), and TREM1 (Myeloid cell marker). Accessibility was binarized to 0 or 1 based on the presence or absence of a read within these promoters. Using these markers, B and Myeloid cell types are clearly annotated with their respective markers. CD4+ T and CD8+ T cells can be identified by combining CD3G with the CD4 and CD8A markers, respectively, whereas NK cells can be identified using NKG7 and excluding nuclei with accessibility at the CD3G promoter.

Extended Data Fig. 4: Islet snATAC-seq clusters correspond to scRNA-seq and cell marker annotations.
a, b, UMAP clusters of snATAC-seq data for islet1 (a) and islet2 (b) annotated as alpha, beta, delta or ductal cells via integration with annotated scRNA-seq data. Four distinct clusters are observed, one for each of these cell types. c, d, Cell-specific clusters correspond to their respective marker peaks for both islet1 (c) and islet2 (d). Accessibility was binarized to 0 or 1 based on the presence or absence of a read within these promoters. Alpha, beta, delta and ductal cells are clearly identified with their respective marker genes: GCG (Alpha), INS (Beta), SST (Delta), and KRT19 (Ductal).

Extended Data Fig. 5: Multiplets are distributed throughout snATAC-seq clusters. Multiplet-annotated UMAP clustering of PBMC1, PBMC2, islet1 and islet2 reveals that multiplets are distributed throughout all identified clusters and in some cases form their own multiplet clusters (e.g., the center cluster in PBMC1). Multiplets between major cell type clusters are likely to be heterotypic, whereas multiplets at the periphery of annotated clusters are likely to be homotypic.

Extended Data Fig. 6: ATAC-DoubletDetector detects both homotypic and heterotypic multiplets at high read depth. a, Recall for detecting both homotypic and heterotypic artificial multiplets at a 1:1 ratio.
ATAC-DoubletDetector showed no noticeable difference in performance, owing to its robustness in detecting both multiplet types. ArchR showed reduced performance compared to heterotypic-only multiplet detection due to the inclusion of homotypic multiplets. b, Recall for multiplets stratified by read count distributions (top for each sample) and valid read pair distributions for each multiplet subset (bottom for each sample). ATAC-DoubletDetector performance increased when the number of valid read pairs exceeded ~40k per nucleus, suggesting multiplets can be reliably detected when nuclei have >20k valid read pairs each. ArchR did not show significant differences in performance due to read depth.

Extended Data Fig. 7: Artificial multiplets are detected when combined valid read pairs exceed 40k. For each sample, multiplets were detected (top left for each sample) or not detected (top right for each sample), depending on whether one or both nuclei exceeded 20k valid read pairs. Histograms of combined profiles revealed that the majority of detected multiplets (bottom left for each sample) had at least 20k valid read pairs, while multiplets not detected were those with fewer than 40k valid read pairs (bottom right for each sample). When nuclei are sequenced to 20k valid read pairs per nucleus, multiplets will harbor 40k valid read pairs and can be detected by ATAC-DoubletDetector.
Extended Data Fig. 8: ATAC-DoubletDetector and ArchR identify different multiplet subsets. UMAP clusters annotating ATAC-DoubletDetector multiplets (green), ArchR multiplets (orange), or their intersection (black). The majority of multiplets detected by both ATAC-DoubletDetector and ArchR were between major cell type clusters (i.e., heterotypic multiplets).

Extended Data Fig. 9: ATAC-DoubletDetector and ArchR multiplet comparisons reveal the nature of their underlying algorithms. a, Venn diagrams and total number of multiplets detected by ATAC-DoubletDetector and ArchR. Only a small subset of multiplets is detected by both methods. b, Total number of nuclei and multiplets detected by each method. Differences in the number of nuclei are due to differences in inputs (i.e., alignment (BAM) files for ATAC-DoubletDetector and fragment files (Cell Ranger output) for ArchR). Overall, ArchR detects more multiplets using default parameters than ATAC-DoubletDetector. c, Valid read pair distributions for multiplets and singlets detected by ATAC-DoubletDetector and ArchR. Differences in the number of valid read pairs between multiplets and singlets were more significant for ATAC-DoubletDetector than for ArchR, while the number of valid read pairs for ATAC-DoubletDetector multiplets was significantly greater than for ArchR multiplets.
Extended Data Fig. 10: Multiplet annotations correspond to cell proportions. a-b, UMAP clustering for heterotypic and homotypic multiplet annotations in PBMC2 (a) and islet2 (b). Heterotypic multiplets are found between major cell type clusters. Homotypic multiplets are observed on the periphery of major cell type clusters. c-d, Heterotypic cell type annotations for PBMC (d) and islet (e) samples. The majority of multiplets are annotated as homotypic. f-g, Cell and multiplet proportions for PBMC1 (f) and islet2 (g). Multiplet cell type proportions are highly correlated with overall cell proportions. Islet2 showed more beta cell multiplets than other cell types/samples, reducing correlation and significance for islet2.

10_1101-2020_08_28_271981 ---- Full-length de novo protein structure determination from cryo-EM maps using deep learning

Full-length de novo protein structure determination from cryo-EM maps using deep learning

Jiahua He and Sheng-You Huang∗
School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China

Abstract

Advances in microscopy instruments and image processing algorithms have led to an increasing number of cryo-EM maps. However, building accurate models for the EM maps at 3-5 Å resolution remains a challenging and time-consuming process. With the rapid growth of deposited EM maps, there is an increasing gap between the maps and reconstructed/modeled 3-dimensional (3D) structures.
Therefore, automatic reconstruction of atomic-accuracy full-atom structures from EM maps is pressingly needed. Here, we present a semi-automatic de novo structure determination method using a deep learning-based framework, named DeepMM, which builds atomic-accuracy all-atom models from cryo-EM maps at near-atomic resolution. In our method, the main-chain and Cα positions as well as their amino acid and secondary structure types are predicted in the EM map using Densely Connected Convolutional Networks. DeepMM was extensively validated on 40 simulated maps at 5 Å resolution and 30 experimental maps at 2.6-4.8 Å resolution as well as an EMDB-wide data set of 2931 experimental maps at 2.6-4.9 Å resolution, and compared with state-of-the-art algorithms including RosettaES, MAINMAST, and Phenix. Overall, our DeepMM algorithm obtained a significant improvement over existing methods in terms of both accuracy and coverage in building full-length protein structures on all test sets, demonstrating the efficacy and general applicability of DeepMM.

Availability: https://github.com/JiahuaHe/DeepMM
Supplementary information: Supplementary data are available.

∗Email: huangsy@hust.edu.cn; Phone: +86-27-87543881; Fax: +86-027-87556576

bioRxiv preprint doi: https://doi.org/10.1101/2020.08.28.271981; this version posted January 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

1 Introduction

Cryo-electron microscopy (cryo-EM) has become a widely used technique for determining macromolecular structures over the past decade1–4. Advances in microscopy instruments and image processing algorithms have led to a rapid increase in the number of solved EM maps1–3.
The ‘resolution revolution’ in cryo-EM has paved the way for the determination of high-resolution structures of previously intractable biological systems5–16. According to the statistics of the Electron Microscopy Data Bank (EMDB)17, 2435 maps were deposited in 2019, almost 4 times the 640 maps released in 2015. With the rapid growth of deposited EM maps, there is an increasing gap between the maps and reconstructed/modeled 3-dimensional (3D) structures. As of April 1, 2020, there were 10560 EMDB maps, but only 4805 associated structures were deposited in the Protein Data Bank (PDB)18. For maps determined at near-atomic resolution (3.0∼5.0 Å), it is difficult to build high-resolution models with conventional software designed for X-ray crystallography. Given that near-atomic resolution maps make up the majority of current and future released maps17, tools that can reconstruct structures de novo from EM maps without using known structures as templates19 are pressingly needed. As such, algorithms like EM-fold20, Gorgon21, Rosetta22, 23, Pathwalking24–26, Phenix27–29, and MAINMAST30, 31 have recently been presented for constructing and/or assembling structure fragments from cryo-EM maps.

Despite this progress in de novo structure building for cryo-EM maps, current approaches have various limitations. They can either only build structural fragments20, 21, 28 or have low accuracy in terms of coverage and/or sequence reproduction23, 24, 30. It remains challenging to automatically build an accurate all-atom structure from EM maps at near-atomic resolution. Recently, machine learning has been actively applied in structure determination for EM maps, such as single particle picking32, tomogram annotation33, secondary structure prediction34, and backbone tracing35. However, applying deep learning to build full-length protein structures for near-atomic resolution EM maps remains challenging.
Here, we have developed a semi-automatic de novo atomic-accuracy structure reconstruction method for EM maps at near-atomic resolution through Densely Connected Convolutional Networks (DenseNets) using a deep learning-based framework, named DeepMM. Instead of tracing the protein main-chain on the raw EM density map, DeepMM first predicted the probability of main-chain atoms (N, C, and Cα) and Cα positions near each grid point using one DenseNet36. Then, the method traced the main-chain according to the predicted main-chain probability map. The amino acid and secondary structure types were predicted by a second DenseNet. Finally, the protein sequence was aligned to the main-chain according to the predicted Cα probabilities, amino acid types, and secondary structure types for all-atom structure building.

2 Methods

2.1 Workflow of DeepMM

The workflow of DeepMM is illustrated in Figure 1a. Specifically, starting from a cryo-EM map and the target protein sequence, DeepMM first standardizes the axis order and interpolates the grid interval to 1.0 Å. Then, DeepMM cuts the entire map into small voxels of size 11Å×11Å×11Å. Afterwards, one DenseNet (say DenseNet A) is used to predict the main-chain and Cα probability on each of the voxels. All the predicted probability values form a 3D probability map. Next, possible main-chain paths are generated in the predicted main-chain probability map using a main-chain tracing algorithm30. The Cα probability values of main-chain points are interpolated from the predicted 3D Cα probability map.
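The interpolation step at the end of this workflow can be illustrated with a minimal sketch. It assumes trilinear interpolation on a 1.0 Å grid stored as nested lists; the text does not state which interpolation scheme DeepMM uses for the Cα probability map, so the function name `trilinear`, the boundary clamping, and the data layout are all hypothetical.

```python
import math


def trilinear(prob, x, y, z):
    """Interpolate a 3D grid prob[i][j][k] at a fractional point (x, y, z).

    Assumptions (not from the paper): grid spacing is 1.0, so coordinates
    are in grid units; each axis has at least 2 points; points outside the
    grid are clamped to the nearest boundary.
    """
    nx, ny, nz = len(prob), len(prob[0]), len(prob[0][0])
    # Clamp the query point into the grid.
    x = min(max(x, 0.0), nx - 1.0)
    y = min(max(y, 0.0), ny - 1.0)
    z = min(max(z, 0.0), nz - 1.0)
    # Lower corner of the enclosing 2x2x2 cell (kept inside the grid).
    i0 = min(int(math.floor(x)), nx - 2)
    j0 = min(int(math.floor(y)), ny - 2)
    k0 = min(int(math.floor(z)), nz - 2)
    fx, fy, fz = x - i0, y - j0, z - k0
    value = 0.0
    # Weighted sum over the 8 corners of the cell.
    for di, wx in ((0, 1.0 - fx), (1, fx)):
        for dj, wy in ((0, 1.0 - fy), (1, fy)):
            for dk, wz in ((0, 1.0 - fz), (1, fz)):
                value += wx * wy * wz * prob[i0 + di][j0 + dj][k0 + dk]
    return value
```

A main-chain point that has drifted off the integer grid, for example at (12.3, 7.8, 4.1), would then receive its Cα probability as `trilinear(ca_prob_map, 12.3, 7.8, 4.1)`, where `ca_prob_map` is a placeholder for the predicted map.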
Afterwards, the amino acid and secondary structure types are predicted for each main-chain point through the second DenseNet (say DenseNet B). With the predicted Cα probability, amino acid type, and secondary structure type for each main-chain point, the target protein sequence is then aligned to the main-chain paths based on the Smith-Waterman dynamic programming (DP) algorithm37. The resulting Cα models are ranked by their alignment scores. Finally, the all-atom structures are constructed from the top Cα models using the ctrip program in the Jackal modeling package38, 39 and refined by energy minimization using Amber40.

2.2 Training the DenseNets of DeepMM

Two Densely Connected Convolutional Networks (DenseNets) are embedded into our DeepMM algorithm. Figure 1b illustrates the architecture of the networks. DenseNet is a feed-forward multi-layer network which uses additional paths between earlier and later layers in a dense block. DenseNets have several compelling advantages. They alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters36. DeepMM also employs a hard parameter-sharing multi-task learning method, which can greatly reduce the risk of overfitting41. The first network (i.e. DenseNet A) is used to simultaneously predict the main-chain probability and Cα probability of a grid point. The second network (i.e. DenseNet B) is used to predict the amino acid type and secondary structure type of a main-chain local dense point (LDP).
The inputs to DenseNet A are voxels of size 11Å × 11Å × 11Å. The second network (DenseNet B) takes voxels of size 10Å × 10Å × 10Å as input because main-chain points are not always on the integer grid after mean shift. For each voxel, the density values are normalized to the range of [0, 1] according to the maximum and minimum density values in the voxel. 3D convolution and 3D pooling layers are used instead of their 2D counterparts used in traditional image processing because the density maps have three dimensions. Several dense blocks are used in both networks, each of which consists of eight densely connected layers. For DenseNet A, the first two dense blocks are shared by both tasks, whereas for DenseNet B, only one shared block is adopted. After the shared blocks, each task employs two task-specific blocks and gives the final prediction. The details of the network architecture are provided in Supplementary Table 1.

All the training parameters and procedures used for simulated EM maps are essentially the same as those used for experimental EM maps unless otherwise specified. For DenseNet A, all the grid points above a density value $D_0$ were used for training, where $D_0$ was set to 1.0 for simulated maps at 5.0 Å resolution. For experimental maps, $D_0$ was set to 1/2 of the recommended contour level. The labels (main-chain probability and Cα probability) of a grid point $\vec{a}$ were calculated as follows:

$$P^{\vec{X}}_{\vec{a}} = \min\left\{ e^{-\|\vec{a} - \vec{X}\|^2 / r_0^2},\ \forall \vec{X} \in \|\vec{a} - \vec{X}\| < r_{\mathrm{cut}} \right\} \tag{1}$$

where $X$ stands for the N, C, or Cα atoms. Here, $r_0$ is the radius at which the probability drops to 1/e. If no atom is within $r_{\mathrm{cut}}$ of a grid point, the corresponding probability is set to 0. A total of 512 voxels were trained in one batch and 30 epochs were trained for the whole data set. The Adam optimizer with an initial learning rate of 0.001 was used to minimize the mean absolute error (MAE). Learning
Learning 4 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2020.08.28.271981doi: bioRxiv preprint https://doi.org/10.1101/2020.08.28.271981 http://creativecommons.org/licenses/by-nc/4.0/ rate decay was adopted, where the learning rate was reduced to 1/10 of the current value after every 10 epochs. To avoid over-fitting, the weight decay parameter of Adam optimizer was set to 1e-6 as the L2 regularization. For DenseNet B, one point was randomly sampled within 1.0 Å for every main-chain atom in the training set. The corresponding amino acid type and second structure type marked by STRIDE42 were assigned to each point. Twenty types of amino acids were grouped into four classes according to their sizes, shapes and distributions in their EM density maps43, as illustrated in Figure 2d. Specifically, GLY, ALA, SER, CYS, VAL, THR, ILE and PRO are grouped as Class I. LEU, ASP, ASN, GLU, GLN and MET are grouped as Class II. LYS and ARG are grouped as Class III. HIS, PHE, TYR and TRP are grouped as Class IV. Residues that have structure codes of H, G, or I by STRIDE were labelled as “Helix”, those with codes of B/b or E were labelled as “Sheet”, and the other residues were labelled as “Coil”. All the training parameters were identical to those for DenseNet A except for using CrossEntropyLoss as loss function. 2.3 Tracing the main-chain path The main-chain tracing algorithm in MAINMAST30 was used to trace the main-chain path in our predicted main-chain probability map. In brief, local dense points (LDPs) are first identified using the mean shift algorithm, which iteratively shifts the initial grid points towards the local highest probability by computing the weighted average of probability values. 
Then, the shifted points that are within a threshold distance of 0.5 Å are clustered, and the point with the highest probability in the cluster is chosen as the representative, called LDP. The next step is to connect LDPs into a minimum spanning tree (MST) and iteratively refine the tree structure with a Tabu search method. After multiple steps of Tabu search, the longest path of the refined tree is traced as the main-chain path. The details of the algorithm can be found in the MAINMAST study30.

2.4 Aligning target sequence to main-chain path

The Smith-Waterman dynamic programming (DP) algorithm37 is used to align the target sequence to the predicted main-chain path. The predicted Cα probability value, amino acid type, and secondary structure type are assigned to each point of the main-chain. Instead of using 20 amino acid types, amino acids are grouped into four classes according to their sizes, shapes, and distributions in EM density maps (Figure 2d). Secondary structures are categorized into three types: Helix, Sheet, and Coil. The match between the target sequence and main-chain path is evaluated by two scoring matrices for amino acid and secondary structure, respectively (Figure 2b). Namely, a target residue is more likely to be aligned to a main-chain point with the same amino acid type, the same secondary structure type, and a higher Cα probability, and vice versa. The detailed alignment protocol is shown in Figures 2a, b and c.
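A simplified sketch of such an alignment is given below to make the idea concrete. It is not the DeepMM implementation: the scoring matrices of Figure 2b are replaced by hypothetical +1/-1 match/mismatch values, the Cα-Cα distance term is omitted, and the function names and tuple layouts are invented for illustration. Residues may not be skipped (a large gap penalty), whereas main-chain points may be skipped freely.

```python
GAP = -10000.0  # large penalty that effectively forbids unassigned residues


def match_score(residue, ldp, w_aa=1.0, w_ss=0.5, w_ca=1.6):
    """residue = (aa_class, ss_type); ldp = (aa_class, ss_type, ca_prob).

    The +1/-1 values are placeholders for the real scoring matrices.
    """
    aa = 1.0 if residue[0] == ldp[0] else -1.0
    ss = 1.0 if residue[1] == ldp[1] else -1.0
    return w_aa * aa + w_ss * ss + w_ca * ldp[2]


def align(residues, ldps):
    """Return (score, assignment); assignment[i] is the LDP index of residue i."""
    n, m = len(residues), len(ldps)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + GAP
        back[i][0] = "skip_residue"
    for j in range(1, m + 1):
        back[0][j] = "skip_ldp"  # leading main-chain points cost nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + match_score(residues[i - 1], ldps[j - 1]), "match"),
                (F[i][j - 1], "skip_ldp"),
                (F[i - 1][j] + GAP, "skip_residue"),
            ]
            F[i][j], back[i][j] = max(options, key=lambda t: t[0])
    # Trace back from the bottom-right corner to recover the assignment.
    assignment = [None] * n
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            assignment[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif move == "skip_ldp":
            j -= 1
        else:
            i -= 1
    return F[n][m], assignment
```

With three residues and five main-chain points, the sketch assigns each residue to the point that agrees in class and secondary structure and has a high Cα probability, skipping the remaining points.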
The $n$ residues $\{A_i\ (i = 1, \dots, n)\}$ in the protein are aligned to $m$ LDPs $\{L_j\ (j = 1, \dots, m)\}$ in the main-chain path. The matching score $M(i, j)$ for a pair of $A_i$ and $L_j$ is computed as follows.

$$M(i, j) = w_{AA}\, M_{AA}\big(T_{AA}(A_i), T_{AA}(L_j)\big) + w_{SS}\, M_{SS}\big(T_{SS}(A_i), T_{SS}(L_j)\big) \tag{2}$$

where $M_{AA}$ and $M_{SS}$ are the scoring matrices for amino acid and secondary structure matching43, 44, respectively. For a residue $A_i$, the amino acid type is one of the four amino acid classes ($T_{AA}(A_i) = 1, 2, 3, 4$). The predicted amino acid type for an LDP $L_j$ is also one of the four amino acid classes ($T_{AA}(L_j) = 1, 2, 3, 4$). Similarly, the secondary structure matching score is calculated using the secondary structure type predicted from the sequence ($T_{SS}(A_i) = 1, 2, 3$) by SPIDER245 and the secondary structure type predicted on LDPs ($T_{SS}(L_j) = 1, 2, 3$). The scoring matrices $M_{AA}$ and $M_{SS}$ used in the alignment are shown in Figure 2b. The weights $w_{AA}$ and $w_{SS}$ for the corresponding matching scores are set to 1.0 and 0.5, respectively. With the calculated matching score $M(i, j)$, the DP matrix $F$ is filled according to the following rule.

$$F(i, j) = \max \begin{cases} F(i-1, j) + \mathrm{gap} \\ F(i-1, j-1) - w_{C\alpha-C\alpha}\,|d_{\mathrm{std}} - d| + w_{C\alpha}\, P_{C\alpha}(j) + M(i, j) \\ F(i, j-1) \end{cases} \tag{3}$$

where gap is the gap penalty for unassigned residues in the protein sequence. To ensure a full-length structure reconstruction, gap is set to −10000.0 so as to forbid skipped residues. The term $|d_{\mathrm{std}} - d|$ is the penalty score for the Cα-Cα distance, where $d_{\mathrm{std}}$ is the standard Cα-Cα distance and $d$ is the distance between LDP $L_j$ and the last aligned LDP. $P_{C\alpha}(j)$ is the predicted Cα probability for LDP $L_j$. $w_{C\alpha-C\alpha}$ and $w_{C\alpha}$ are the weights for the corresponding scores. Here, $w_{C\alpha}$ is set to 1.6, and $w_{C\alpha-C\alpha}$ is set to 1.0, 0.7, and 0.8 for “Helix”, “Sheet”, and “Coil”, respectively. For each combination
For each combination 6 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2020.08.28.271981doi: bioRxiv preprint https://doi.org/10.1101/2020.08.28.271981 http://creativecommons.org/licenses/by-nc/4.0/ of parameters in the main-chain tracing procedure, 160 Cα models are generated. Finally, all the generated Cα models are ranked by their alignment scores. 2.5 Parameter settings of DeepMM The parameters of mean-shift, MST construction, and Tabu search are set to be the same to those in MAINMAST30, unless otherwise specified. DeepMM employs several parameter combinations to generate multiple Cα models for one EM map. For each combination of parameters, 10 trajectories of Tabu search are carried out, yielding 10 main-chain paths. Since DeepMM starts from the main-chain probability map, fewer parameter combinations are needed to reconstruct reliable 3D structures. For both simulated and experimental maps, the thresholds of probability (Φthr) and normalized probability (θthr) are both set to 0. For the 40 simulated maps, only one parameter combination is adopted. Specifically, the maximum number of Tabu search steps (Nround) is set to 100, the sphere radius of local MST (rlocal) is set to 5.0 Å, and the constraint for the length (dkeep) is set to 0.5 Å. For the 30 experimental maps, we employ the following 27 combinations of parameters: the sphere radius of local MST (rlocal=5.0, 7.5, 10.0 Å), the edge weight threshold (dkeep=0.5, 1.0, 1.5 Å), and the maximum number of the Tabu search steps (Nround=2500, 5000, 7500). 
For the extended EMDB-wide test set of 2931 maps, we employ fewer combinations of parameters so as to save computational cost: the edge weight threshold (dkeep=0.5, 1.0 Å) and the maximum number of Tabu search steps (Nround=2500, 5000). The sphere radius of local MST (rlocal) is set to 10 Å. For each of the generated main-chain paths, 16 Cα models are generated using 8 different standard Cα-Cα distances (dstd=3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8 Å) in the two sequence directions. Namely, 160 models (16 models for each of the 10 trajectories) are constructed for each parameter combination. The Cα models are ranked by their alignment scores; then, for any two structures within an RMSD of 5 Å of each other, the one with the lower alignment score is removed. Finally, the top 10 scored protein Cα models are selected to build the all-atom structures.

2.6 Datasets used

2.6.1 Training sets

Two data sets, a simulated EM map set and an experimental EM map set, were used to train our DeepMM method for simulated maps and experimental maps, respectively. For simulated EM maps, 2000 representative structures for different superfamilies in the SCOPe database46 were taken from Emap2sec34 as the training set. Structures were removed from the training set if they had a TM-score47 of over 0.5 with any structure in the test set. To save computational cost, only 100 randomly selected structures from the training set were retained.
Next, we used the e2pdb2mrc.py program from the EMAN2 package (version 2.11)48 to generate the simulated EM maps at 5.0 Å resolution and 1.0 Å grid interval for each structure in the training and test sets. The training SCOPe entries used in this study are listed in Supplementary Table 5.

For experimental EM maps, all the EM density maps at 2-5 Å resolution that have associated PDB models were downloaded from the EMDB. As of December 26, 2019, 2546 EM maps were collected. Any PDB structure and its corresponding EM map that met any of the following criteria were removed: (i) including nucleic acids, (ii) missing side-chain atoms, (iii) including “HETATM” residues, (iv) including “UNK” residues, (v) including more than 1 subunit (MODEL), and (vi) including fewer than 50 or more than 300 residues. Then, 1588 chains from the remaining 361 experimental EM maps were clustered at 50% sequence identity using CD-HIT49, yielding a total of 1340 chains. To ensure a valid evaluation, chains were removed from the training set if they had over 30% sequence identity with any chain in the test set. Each protein chain was zoned out from the whole map using a distance of 4.0 Å30. For good-quality maps, a protein chain and its associated map should have sufficient structural agreement. The cross-correlation between the experimental map and a map simulated from the structure at the same resolution was calculated using UCSF Chimera50. Only the chains with a cross-correlation of over 0.65 were kept34. The final training set consists of 100 non-redundant protein chains. The grid intervals for experimental maps were unified to 1.0 Å using trilinear interpolation. The training EM maps and their corresponding PDB chains used in this study are listed in Supplementary Table 6.
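The map-quality filter described above can be sketched as a normalized cross-correlation between two maps sampled on the same grid. The un-centered definition used here is an assumption; the text only states that UCSF Chimera was used, and Chimera's exact options (for example contour thresholds or mean subtraction) are not specified. The function names are hypothetical.

```python
import math


def cross_correlation(map_a, map_b):
    """Normalized cross-correlation of two maps given as flat lists of voxels.

    Uses sum(a*b) / sqrt(sum(a^2) * sum(b^2)); no mean is subtracted
    (an assumed definition, see the note above).
    """
    if len(map_a) != len(map_b):
        raise ValueError("maps must be sampled on the same grid")
    dot = sum(a * b for a, b in zip(map_a, map_b))
    norm = math.sqrt(sum(a * a for a in map_a) * sum(b * b for b in map_b))
    return dot / norm if norm > 0.0 else 0.0


def keep_chain(experimental, simulated, cutoff=0.65):
    """Keep a chain only if its map agrees well enough with the simulated map."""
    return cross_correlation(experimental, simulated) > cutoff
```

Two maps that differ only by a scale factor give a correlation of 1.0 and pass the filter, while maps with no overlapping density give 0.0 and are rejected.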

2.6.2 Test sets

Three test sets were used to evaluate the accuracy and general applicability of DeepMM: one simulated map set and two experimental map sets. The simulated map set was taken from the test set of 40 simulated maps used by MAINMAST (30). The maps were generated at 5.0 Å resolution with a grid spacing of 1.0 Å using the e2pdb2mrc.py program in the EMAN2 package (48). The first experimental test set is the benchmark of 30 EM maps at 2.6-4.8 Å resolution that has been used to evaluate MAINMAST (30). The corresponding EM maps were downloaded from the EMDB. For each EM map, a single subunit was zoned out from the whole density map at a distance cutoff of 4.0 Å.

In addition, to evaluate the accuracy and general applicability of DeepMM, we have also constructed a large test set of EMDB-wide experimental maps. The generation procedure of this set was similar to that for the experimental training set. Specifically, for each chain of an EMDB-associated PDB structure at 2.5-5.0 Å resolution with no more than one subunit (MODEL), a single density patch was zoned out from the whole density map at a distance cutoff of 4.0 Å. Any protein chain and its corresponding EM map patch that met any of the following criteria were removed: (i) including nucleic acids, (ii) missing side-chain atoms, (iii) including “HETATM” residues, (iv) including “UNK” residues, (v) including fewer than 50 residues, or 300 or more residues, (vi) having over 30% sequence identity to any chain in the training set.
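The exclusion criteria (i)-(vi) above amount to a simple per-chain predicate, sketched below. The record type `ChainRecord`, its field names, and the function `passes_test_set_filter` are hypothetical and introduced only for illustration; the sketch assumes that the flags and the maximum sequence identity to the training set have been computed beforehand by other tools.

```python
from dataclasses import dataclass


@dataclass
class ChainRecord:
    """Hypothetical per-chain summary; all fields are precomputed elsewhere."""
    n_residues: int
    has_nucleic_acids: bool
    missing_side_chain_atoms: bool
    has_hetatm: bool
    has_unk: bool
    max_identity_to_training: float  # fraction in [0, 1]


def passes_test_set_filter(chain: ChainRecord,
                           min_residues: int = 50,
                           max_residues: int = 300,
                           max_identity: float = 0.30) -> bool:
    """Return True if the chain survives criteria (i)-(vi) of the text."""
    if chain.has_nucleic_acids or chain.missing_side_chain_atoms:
        return False  # criteria (i) and (ii)
    if chain.has_hetatm or chain.has_unk:
        return False  # criteria (iii) and (iv)
    # Criterion (v): fewer than 50 residues, or 300 or more residues.
    if chain.n_residues < min_residues or chain.n_residues >= max_residues:
        return False
    # Criterion (vi): over 30% identity to any training chain.
    if chain.max_identity_to_training > max_identity:
        return False
    return True
```

Under these assumptions, a 120-residue chain with no flagged features and at most 30% identity to the training set is retained, while a chain of exactly 300 residues is removed by criterion (v).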
In addition, the cross-correlation between the experimental map and the density map simulated from the structure at the same resolution had to be over 0.65 (34). Each protein chain was zoned out from the whole map using a distance of 4.0 Å (30). The final test set consists of 2931 protein chains, which are listed in Supplementary Table 4.

3 Results

3.1 Model reconstruction for simulated EM maps

We first evaluated the performance of DeepMM on the test set of 40 simulated density maps at 5 Å resolution. DeepMM traces the protein main chain on the predicted main-chain probability map rather than on the raw EM density map. Thus, the Cα models generated by DeepMM are closer to the native structures, and require fewer search trajectories and steps, than those of MAINMAST. For each of the 40 maps, DeepMM built 160 Cα models, which were ranked by their alignment scores. The top-ranked model was selected as the predicted structure.

Figure 3 shows a comparison of the Cα models predicted by DeepMM and MAINMAST for protein chains of different lengths. The detailed results are provided in Supplementary Table 2. It can be seen from the figure that DeepMM performed much better than MAINMAST. As shown in Figure 3a, DeepMM built significantly more accurate Cα models, and achieved an average Cα RMSD of 0.54 Å when the top scored model was considered, compared to 1.79 Å for MAINMAST. DeepMM also generated high-quality models with less than 1.0 Å Cα RMSD for all of the 40 maps, compared with only one such model by MAINMAST.
Moreover, DeepMM achieved high-accuracy models with less than 0.5 Å RMSD for 22 of the 40 maps, whereas MAINMAST failed to generate any model with < 0.5 Å RMSD (Figure 3a). The program CLICK (51) was also used to evaluate the accuracy of the Cα models built by DeepMM and MAINMAST. The corresponding results are shown in Figure 3b. Similar to the results of the Cα RMSD comparison, DeepMM generated many more high-quality models according to the CLICK RMSD criterion and achieved an average CLICK RMSD of 0.53 Å when the top model was considered, compared to 2.18 Å for MAINMAST. In addition, DeepMM also achieved a significantly higher structure overlap than MAINMAST (Figure 3c). Except for two top scored models with 99.75% and 99.44% structure overlap, the remaining 38 top models generated by DeepMM all have a 100% structure overlap. On average, DeepMM obtained a high structure overlap of 99.98%, compared to 81.88% for MAINMAST. Figure 3 also reveals that DeepMM generated consistently high-accuracy models for proteins of all lengths, whereas MAINMAST tended to perform worse as the number of residues in the protein increased, suggesting that DeepMM is more robust than MAINMAST.

3.2 Model reconstruction for experimental EM maps

DeepMM was further tested on the benchmark of 30 experimental density maps at 2.6-4.8 Å resolution. For each of the 30 experimental density maps, DeepMM built 4320 protein Cα models, which were then ranked by their alignment scores.

Figure 4a shows a comparison of the Cα RMSDs for the models built by DeepMM and MAINMAST. The corresponding data are provided in Supplementary Table 3. It can be seen from the figure that DeepMM generated significantly more accurate models than MAINMAST. On average, DeepMM obtained a Cα RMSD of 10.7 Å for the top scored models, which is much better than the 22.4 Å obtained by MAINMAST. Moreover, 18 of the 30 top scored DeepMM models have a Cα RMSD of < 10 Å, of which 14 are within 5.0 Å. By contrast, only 7 and 4 MAINMAST models are within 10.0 Å and 5.0 Å, respectively.

Figure 4b shows a comparison of the results for the models predicted by DeepMM and RosettaES. It can be seen from the figure that DeepMM performed much better and generated many more accurate models than RosettaES. Compared to 18 models within 10 Å RMSD by DeepMM, only six of the top predictions by RosettaES are within 10.0 Å RMSD. RosettaES obtained an average Cα RMSD of 27.0 Å, which is much higher than the 10.7 Å for DeepMM.

Further examination of the predicted results also reveals that the model accuracy depends more on the quality than on the resolution of a map. Namely, compared to maps with relatively higher resolution but lower quality like EMD-3246A/B (2.8 Å) and EMD-5495 (3.5 Å), maps with relatively lower resolution but higher quality like EMD-2867 (4.3 Å) and EMD-3073 (4.1 Å) are more likely to yield a correct model (Supplementary Table 3). This phenomenon can be attributed to the fact that resolution is a global estimate and resolvability is not necessarily uniform throughout the whole map (52).

Figure 5 gives two examples of structures successfully reconstructed by DeepMM. One example, EMD-2867, which is a nucleoprotein at 4.3 Å resolution, was successfully reconstructed by DeepMM, as shown in Figure 5a.
It can be seen from the figure that the main chain predicted by DeepMM overlaps well with that of the deposited structure. Accordingly, the predicted model shows good accuracy, with a Cα RMSD of 3.1 Å. Figure 5b shows the results of another example, EMD-6272, which is the bovine rotavirus VP6 at 2.6 Å resolution. Because of the high resolution of the map, DeepMM predicted a highly accurate model with a small Cα RMSD of 1.7 Å. Correspondingly, the full-atom model constructed by DeepMM shows an excellent overlap with the deposited structure (Figure 5b).

3.3 Evaluation of DeepMM on the EMDB-wide data set

To investigate the accuracy and general applicability of DeepMM, we further evaluated its performance on a large test set of EMDB-wide experimental maps. This large test set consists of 2931 diverse EM maps at 2.6-4.9 Å resolution from the EMDB that have associated structures in the PDB (see the Methods section). For each of the 2931 test cases, DeepMM was run with four combinations of parameters, yielding 640 models for each case. Figure 6 shows a summary of the results predicted by DeepMM. The corresponding data are provided in Supplementary Table 4. Two metrics, RMSD and TM-score, were used to evaluate the overall accuracy of the predicted models. On average, DeepMM achieved a Cα RMSD of 9.8 Å for the top prediction and 8.4 Å for the top 10 predictions on this test set of 2931 maps.
The corresponding average TM-scores are 0.648 and 0.694 for the top 1 and top 10 predictions, respectively, suggesting the high accuracy of DeepMM. Figure 6a shows the percentage of the predicted models at different Cα RMSD cutoffs. It can be seen from the figure that 53.6% of the top models built by DeepMM are within 10 Å Cα RMSD. For the top 10 scored predictions, 59.9% of the cases have an RMSD of less than 10 Å. The percentages of the models at different TM-score cutoffs are shown in Figure 6b. It can be seen from the figure that 65.6% of the top models built by DeepMM have a TM-score of > 0.5. When the top 10 models were considered, the corresponding percentage increased to 73.6%. Comparing the results in Figures 6a and 6b also reveals that the percentages for TM-score are considerably higher than those for Cα RMSD, suggesting that the models built by DeepMM often still share the same fold as the native structure even if they have a large Cα RMSD.

Figure 6c shows the percentage of correctly predicted top models (i.e. within 10 Å Cα RMSD) at different resolutions. For EM maps at 2.5-3.0 Å resolution, DeepMM performed very well, achieving a success rate of 95.3% and 96.2% for the top 1 and top 10 scored models, respectively. The performance of DeepMM decreased as the map resolution worsened. Specifically, for the EM maps with a resolution of 3.0-3.5 Å, 3.5-4.0 Å, and 4.0-4.5 Å, DeepMM obtained a success rate of 82.5%/87.3%, 53.2%/62.1%, and 22.9%/29.8% for the top 1/10 predictions, respectively. For EM maps with a resolution of 4.5 Å or worse, it is challenging for DeepMM to build correct models. On average, for the maps at 3-5 Å resolution, DeepMM gave a success rate of 50.3% and 57.0% in reconstructing a correct model within 10 Å Cα RMSD for the top 1 and top 10 predictions, respectively.

Figure 6d shows the percentage of correctly predicted top models using the criterion of TM-score > 0.5 in different resolution ranges. Trends similar to those in Figure 6c can be observed in Figure 6d. Specifically, for the maps with a resolution of 2.5-3.0 Å, 3.0-3.5 Å, 3.5-4.0 Å, 4.0-4.5 Å, and 4.5-5.0 Å, DeepMM achieved correct models with a TM-score of > 0.5 for 99.1%/99.5%, 90.7%/94.2%, 69.5%/79.6%, 36.9%/49.1%, and 15.0%/23.5% of the test cases when the top 1/10 predictions were considered, respectively. On average, for the maps at 3-5 Å resolution, DeepMM obtained a success rate of 63.0% and 71.6% in building a model with TM-score > 0.5 for the top 1 and top 10 predictions, respectively.

Next, DeepMM was compared with Phenix on this test set, where the Phenix models were generated using the phenix.map_to_model tool (28) in the Phenix package (version 1.18.2-3874). Two metrics calculated by phenix.chain_comparison were used to evaluate the accuracy of a model. One is the fraction of the CA atoms in one model matching the CA atoms in another model within 3.0 Å regardless of their residue names (i.e. coverage or residue match). The other is the percentage of the sequence in the target structure reproduced by the query model (i.e. specificity of sequence match). It should be noted that our sequence match is computed using all 20 amino acid types. A model with a high percentage of residue match may have a very low percentage of sequence match because of mismatched residue names.
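The two metrics can be made concrete with a small sketch. This is not the phenix.chain_comparison algorithm, whose internals are not described here; it is a simplified nearest-neighbour version under stated assumptions: both models are already in the same coordinate frame, each target CA atom is matched to its nearest query CA atom without enforcing a one-to-one pairing, and the sequence match is expressed as a fraction of all target residues. The type alias `CaAtom` and the function `match_metrics` are hypothetical names.

```python
import math
from typing import List, Tuple

# (three-letter residue name, (x, y, z) coordinates in Å)
CaAtom = Tuple[str, Tuple[float, float, float]]


def match_metrics(target: List[CaAtom], query: List[CaAtom],
                  cutoff: float = 3.0) -> Tuple[float, float]:
    """Return (residue_match, sequence_match) as fractions of target residues.

    residue_match: target CA atoms with a query CA atom within `cutoff` Å,
                   regardless of residue name.
    sequence_match: the subset of those whose nearest query CA atom also
                    carries the same residue name (assumed definition).
    """
    if not target:
        raise ValueError("target model contains no CA atoms")
    matched = 0
    same_name = 0
    for target_name, target_xyz in target:
        nearest_name = None
        nearest_dist = math.inf
        for query_name, query_xyz in query:
            dist = math.dist(target_xyz, query_xyz)
            if dist < nearest_dist:
                nearest_dist = dist
                nearest_name = query_name
        if nearest_dist <= cutoff:
            matched += 1
            if nearest_name == target_name:
                same_name += 1
    return matched / len(target), same_name / len(target)


if __name__ == "__main__":
    deposited = [("ALA", (0.0, 0.0, 0.0)), ("GLY", (3.8, 0.0, 0.0)),
                 ("SER", (7.6, 0.0, 0.0)), ("LEU", (11.4, 0.0, 0.0))]
    built = [("ALA", (0.5, 0.0, 0.0)), ("VAL", (3.8, 1.0, 0.0)),
             ("SER", (20.0, 0.0, 0.0))]
    print(match_metrics(deposited, built))  # (0.5, 0.25)
```

In the toy example, two of the four deposited CA atoms have a built CA atom within 3.0 Å, but only one of them carries the correct residue name, so the residue match (50%) exceeds the sequence match (25%). This is the same pattern, in miniature, as a model that traces the chain well but assigns wrong residue names.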
Figures 7a and 7b show the percentages of protein residues and of the sequence reproduced by DeepMM and Phenix at different resolutions. Figures 7c and 7d give the histograms of the corresponding average values at different resolutions. It can be seen from the figure that DeepMM performed significantly better than Phenix in both residue match and sequence match, especially for maps at low resolutions. For the maps at resolutions better than 3.0 Å, 94.2% of the protein residues in the deposited structures were reproduced by DeepMM, compared to 84.7% by Phenix. The corresponding average sequence match is 78.0% for DeepMM, which is much higher than the 59.7% for Phenix. For the maps at 3-5 Å resolution, the average residue match for DeepMM is 80.7%, compared with 65.0% for Phenix. The corresponding average sequence match is 38.1% for DeepMM, which is much higher than the 19.2% for Phenix. Given that achieving a high sequence match is much more challenging than achieving a high residue match, the much better sequence match of DeepMM relative to Phenix demonstrates the atomic accuracy of the models built by DeepMM.

It is worth mentioning that DeepMM can build fully connected, full-length all-atom protein models, whereas Phenix is designed to build initial models of structure fragments. Figure 8 shows the protein models built by DeepMM and Phenix for one example, Chain A of 6DW1, part of a GABAA receptor at 3.1 Å resolution. The deposited structure with its associated EM density map (EMD-8923) is displayed in panel a.
Figures 8b and 8c show the Phenix model and its superimposition with the deposited structure, respectively. It can be seen from the figures that the model built by Phenix consists of multiple fragments without any secondary structures, as expected. The model predicted by Phenix for this map had a residue match of 86.7%, but a very low sequence match of 9.9%. Therefore, although Phenix recovered most parts of the target protein structure from the EM density map, it assigned wrong residue names to most of the modeled fragments, as reflected by its low sequence match (Figure 8c). In contrast, DeepMM built an excellent all-atom structure for this map, with a near-perfect residue match of 97.1% and a high sequence match of 86.8%. Therefore, the model predicted by DeepMM reproduced most of the secondary structures and had an almost identical chain trace to the deposited structure (Figure 8d). The corresponding amino acid names were also assigned correctly by DeepMM (Figure 8e).

4 Conclusion

In summary, we have developed a semi-automatic de novo structure determination method for near-atomic resolution cryo-EM maps using a deep learning-based framework, named DeepMM. DeepMM can reconstruct complete all-atom protein structures for EM maps with atomic accuracy. DeepMM was extensively validated on diverse benchmarks and compared with state-of-the-art approaches including RosettaES, MAINMAST, and Phenix. DeepMM has also been evaluated on a large EMDB-wide test set of 2931 experimental maps at 2.6-4.9 Å resolution. Overall, DeepMM was able to reconstruct protein models with TM-score > 0.5 for over 60% of the test cases. DeepMM is fast and able to reconstruct an all-atom structure from an EM map within 1 hr on a single-GPU machine for an average-length protein chain of 300 amino acids.
Given the high computational efficiency and all-atom accuracy, it is anticipated that DeepMM will serve as an indispensable tool for semi-automatic atomic-accuracy structure determination for near-atomic-resolution cryo-EM maps.

Acknowledgements

The authors acknowledge Professor Daisuke Kihara and his students Genki Terashi and Sai Raghavendra Maddhuri Venkata Subramaniya from Purdue University for providing their datasets. This work was supported by the National Natural Science Foundation of China (grant Nos. 62072199 and 31670724) and the startup grant of Huazhong University of Science and Technology.

Competing interests

The authors declare no competing interests.

References

(1) Nogales E. The development of cryo-EM into a mainstream structural biology technique. Nat Methods. 2016;13(1):24-7.
(2) Frank J. Advances in the field of single-particle cryo-electron microscopy over the last decade. Nat Protoc. 2017;12(2):209-212.
(3) Cheng Y. Single-particle cryo-EM-How did it get here and where will it go. Science. 2018;361(6405):876-880.
(4) Raunser S. Cryo-EM Revolutionizes the Structure Determination of Biomolecules. Angew Chem Int Ed Engl. 2017;56(52):16450-16452.
(5) Safdari HA, Pandey S, Shukla AK, Dutta S. Illuminating GPCR Signaling by Cryo-EM. Trends Cell Biol. 2018;28(8):591-594.
(6) Luque D, Castón JR. Cryo-electron microscopy for the study of virus assembly. Nat Chem Biol. 2020;16(3):231-239.
(7) Li X, Mooney P, Zheng S, Booth CR, Braunfeld MB, Gubbens S, Agard DA, Cheng Y. Electron counting and beam-induced motion correction enable near-atomic-resolution single-particle cryo-EM. Nat Methods. 2013;10(6):584-90.
(8) Punjani A, Rubinstein JL, Fleet DJ, Brubaker MA. cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nat Methods. 2017;14(3):290-296.
(9) Scheres SH. RELION: implementation of a Bayesian approach to cryo-EM structure determination. J Struct Biol. 2012;180(3):519-30.
(10) Adams PD, Afonine PV, Bunkóczi G, Chen VB, Davis IW, Echols N, Headd JJ, Hung LW, Kapral GJ, Grosse-Kunstleve RW, McCoy AJ, Moriarty NW, Oeffner R, Read RJ, Richardson DC, Richardson JS, Terwilliger TC, Zwart PH. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr. 2010;66(Pt 2):213-21.
(11) Zhang B, Zhang X, Pearce R, Shen HB, Zhang Y. A New Protocol for Atomic-Level Protein Structure Modeling and Refinement Using Low-to-Medium Resolution Cryo-EM Density Maps. J Mol Biol. 2020;432(19):5365-5377.
(12) Xie R, Chen YX, Cai JM, Yang Y, Shen HB. SPREAD: A Fully Automated Toolkit for Single-Particle Cryogenic Electron Microscopy Data 3D Reconstruction with Image-Network-Aided Orientation Assignment. J Chem Inf Model. 2020;60(5):2614-2625.
(13) Yin S, Zhang B, Yang Y, Huang Y, Shen HB. Clustering Enhancement of Noisy Cryo-Electron Microscopy Single-Particle Images with a Network Structural Similarity Metric. J Chem Inf Model. 2019;59(4):1658-1667.
(14) Yang YJ, Wang S, Zhang B, Shen HB. Resolution Measurement from a Single Reconstructed Cryo-EM Density Map with Multiscale Spectral Analysis. J Chem Inf Model. 2018;58(6):1303-1311.
(15) Kim DN, Gront D, Sanbonmatsu KY. Practical Considerations for Atomistic Structure Modeling with Cryo-EM Maps. J Chem Inf Model. 2020;60(5):2436-2442.
(16) Joseph AP, Lagerstedt I, Jakobi A, Burnley T, Patwardhan A, Topf M, Winn M. Comparing Cryo-EM Reconstructions and Validating Atomic Model Fit Using Difference Maps. J Chem Inf Model. 2020;60(5):2552-2560.
(17) Patwardhan A. Trends in the Electron Microscopy Data Bank (EMDB). Acta Crystallogr D Struct Biol. 2017;73(Pt 6):503-508.
(18) Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235-42.
(19) Alnabati E, Kihara D. Advances in Structure Modeling Methods for Cryo-Electron Microscopy Maps. Molecules. 2019;25(1):82.
(20) Lindert S, Staritzbichler R, Wötzel N, Karakaş M, Stewart PL, Meiler J. EM-fold: De novo folding of alpha-helical proteins guided by intermediate-resolution electron microscopy density maps. Structure. 2009;17(7):990-1003.
(21) Baker ML, Abeysinghe SS, Schuh S, Coleman RA, Abrams A, Marsh MP, Hryc CF, Ruths T, Chiu W, Ju T. Modeling protein structure at near atomic resolutions with Gorgon. J Struct Biol. 2011;174(2):360-73.
(22) Wang RY, Kudryashev M, Li X, Egelman EH, Basler M, Cheng Y, Baker D, DiMaio F. De novo protein structure determination from near-atomic-resolution cryo-EM maps. Nat Methods. 2015;12(4):335-8.
(23) Frenz B, Walls AC, Egelman EH, Veesler D, DiMaio F. RosettaES: a sampling strategy enabling automated interpretation of difficult cryo-EM maps. Nat Methods. 2017;14(8):797-800.
(24) Baker MR, Rees I, Ludtke SJ, Chiu W, Baker ML. Constructing and validating initial Cα models from subnanometer resolution density maps with pathwalking. Structure. 2012;20(3):450-63.
(25) Chen M, Baldwin PR, Ludtke SJ, Baker ML. De Novo modeling in cryo-EM density maps with Pathwalking. J Struct Biol. 2016;196(3):289-298.
(26) Chen M, Baker ML. Automation and assessment of de novo modeling with Pathwalking in near atomic resolution cryoEM density maps. J Struct Biol. 2018;204(3):555-563.
(27) Terwilliger TC, Adams PD, Afonine PV, Sobolev OV. A fully automatic method yielding initial models from high-resolution cryo-electron microscopy maps. Nat Methods. 2018;15(11):905-908.
(28) Terwilliger TC, Adams PD, Afonine PV, Sobolev OV. Cryo-EM map interpretation and protein model-building using iterative map segmentation. Protein Sci. 2020;29(1):87-99.
(29) Afonine PV, Poon BK, Read RJ, Sobolev OV, Terwilliger TC, Urzhumtsev A, Adams PD. Real-space refinement in PHENIX for cryo-EM and crystallography. Acta Crystallogr D Struct Biol. 2018;74(Pt 6):531-544.
(30) Terashi G, Kihara D. De novo main-chain modeling for EM maps using MAINMAST. Nat Commun. 2018;9(1):1618.
(31) Terashi G, Kagaya Y, Kihara D. MAINMASTseg: Automated Map Segmentation Method for Cryo-EM Density Maps with Symmetry. J Chem Inf Model. 2020;60(5):2634-2643.
(32) Tegunov D, Cramer P. Real-time cryo-electron microscopy data preprocessing with Warp. Nat Methods. 2019;16(11):1146-1152.
(33) Chen M, Dai W, Sun SY, Jonasch D, He CY, Schmid MF, Chiu W, Ludtke SJ. Convolutional neural networks for automated annotation of cellular cryo-electron tomograms. Nat Methods. 2017;14(10):983-985.
(34) Maddhuri Venkata Subramaniya SR, Terashi G, Kihara D. Protein secondary structure detection in intermediate-resolution cryo-EM maps using deep learning. Nat Methods. 2019;16(9):911-917.
(35) Si D, Moritz SA, Pfab J, Hou J, Cao R, Wang L, Wu T, Cheng J. Deep Learning to Predict Protein Backbone Structure from High-Resolution Cryo-EM Density Maps. Sci Rep. 2020;10(1):4282.
(36) Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, 2261-2269.
(37) Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195-7.
(38) Xiang Z, Honig B. Extending the accuracy limits of prediction for side-chain conformations. J Mol Biol. 2001;311(2):421-30.
(39) Petrey D, Xiang Z, Tang CL, Xie L, Gimpelev M, Mitros T, Soto CS, Goldsmith-Fischman S, Kernytsky A, Schlessinger A, Koh IY, Alexov E, Honig B. Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins. 2003;53 Suppl 6:430-5.
(40) Case DA, Cheatham TE 3rd, Darden T, Gohlke H, Luo R, Merz KM Jr, Onufriev A, Simmerling C, Wang B, Woods RJ. The Amber biomolecular simulation programs. J Comput Chem. 2005;26(16):1668-88.
(41) Ruder S. An overview of multi-task learning in deep neural networks. arXiv preprint. 2017 Jun 15;arXiv:1706.05098.
(42) Heinig M, Frishman D. STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res. 2004;32(Web Server issue):W500-2.
(43) Ho CM, Li X, Lai M, Terwilliger TC, Beck JR, Wohlschlegel J, Goldberg DE, Fitzpatrick AWP, Zhou ZH. Bottom-up structural proteomics: cryoEM of protein complexes enriched from the cellular milieu. Nat Methods. 2020;17(1):79-85.
(44) Wen Z, He J, Huang SY. Topology-independent and global protein structure alignment through an FFT-based algorithm. Bioinformatics. 2020;36(2):478-486.
(45) Heffernan R, Dehzangi A, Lyons J, Paliwal K, Sharma A, Wang J, Sattar A, Zhou Y, Yang Y. Highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins. Bioinformatics. 2016;32(6):843-9.
(46) Fox NK, Brenner SE, Chandonia JM. SCOPe: Structural Classification of Proteins–extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 2014;42(Database issue):D304-9.
(47) Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33(7):2302-9.
(48) Tang G, Peng L, Baldwin PR, Mann DS, Jiang W, Rees I, Ludtke SJ. EMAN2: an extensible image processing suite for electron microscopy. J Struct Biol. 2007;157(1):38-46.
(49) Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150-2.
(50) Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE. UCSF Chimera–a visualization system for exploratory research and analysis. J Comput Chem. 2004;25(13):1605-12.
(51) Nguyen MN, Tan KP, Madhusudhan MS. CLICK–topology-independent comparison of biomolecular 3D structures. Nucleic Acids Res. 2011;39(Web Server issue):W24-8.
(52) Pintilie G, Zhang K, Su Z, Li S, Schmid MF, Chiu W. Measurement of atom resolvability in cryo-EM maps with Q-scores. Nat Methods. 2020;17(3):328-334.

Figure 1: Workflow of our DeepMM method. (a) The flowchart of DeepMM. DeepMM first predicts the main-chain and Cα probability of each density voxel using a Densely Connected Convolutional Network (DenseNet), and then traces the protein’s main-chain path on the predicted main-chain probability map. Next, the amino acid and secondary structure types for each main-chain point are predicted by a second DenseNet. The Cα models are generated by aligning the target sequence to the main-chain paths. Finally, the all-atom structures are constructed from the Cα models using the ctrip program and refined by an Amber energy minimization. (b) The multi-task deep DenseNet architecture used in DeepMM. Starting from an input EM density voxel, two dense blocks are shared by both tasks in DenseNet A, while only one dense block is shared by both tasks in DenseNet B. Each prediction task employs two task-specific dense blocks and gives the final prediction.

Figure 2, panel b: scoring matrices.

Amino acid class (AA):
        I      II     III    IV
I       0.7   -0.4   -0.8   -1.4
II     -0.4    0.6   -0.5   -1.2
III    -0.8   -0.5    1.1   -1.2
IV     -1.4   -1.2   -1.2    1.2

Secondary structure (SS):
        Helix  Sheet  Coil
Helix    1.5   -1.0   -0.5
Sheet   -1.0    1.0   -0.5
Coil    -0.5   -0.5    0.5

Figure 2: Alignment protocol between the target sequence and the predicted main-chain for DeepMM. (a) DeepMM runs alignments of the target sequence of the EM map against each candidate main-chain path. Each sphere represents a predicted local dense point (LDP) on the main-chain path. Predicted information including the Cα probability (on the top), secondary structure (in the middle) and amino acid class (at the bottom) of LDPs is utilized during alignment. For the target sequence, its secondary structure is predicted by the SPIDER2 program, as illustrated in the sequence colored in azure under the amino acid sequence. (b) Scoring matrices for amino acid type matching and secondary structure matching. (c) The generated Cα models are ranked by their alignment score. (d) Twenty amino acids are grouped into four classes according to the similarity of their side-chain EM densities.
[Figure 3 graphic omitted. Three scatter plots of DeepMM versus MAINMAST against protein length (aa): (a) Cα RMSD (Å); (b) CLICK RMSD (Å); (c) structure overlap (%).]

Figure 3: Comparison of the results by DeepMM and MAINMAST for the protein chains with different lengths. (a) The Cα RMSDs of the top predicted models. (b) The RMSDs of matched Cα atoms within 3.5 Å by the structure alignment tool CLICK. (c) The structure overlap calculated by CLICK, which is defined as the fraction of matched Cα atoms.

[Figure 4 graphic omitted. Two scatter plots against DeepMM RMSD (Å): (a) MAINMAST RMSD (Å); (b) Rosetta RMSD (Å).]

Figure 4: Comparison of the top models for DeepMM and two other approaches on the test set of 30 experimental maps. The solid line in the figure is the plot of y = x, and the dashed line stands for y = 10. (a) Comparison of the models by DeepMM and MAINMAST in terms of Cα RMSD. (b) Comparison of the models by DeepMM and Rosetta in terms of Cα RMSD.
[Figure 5 graphic omitted. Panels (a) and (b) show structure renderings.]

Figure 5: Examples of the models generated by DeepMM for experimental EM maps. The EM density map (transparent grey) and its associated native protein structure (green) are displayed on the left side. The Cα chains of the DeepMM model (red) and the native structure (green) are shown in ball-and-stick format on the predicted main-chain probability map (transparent yellow) in the middle. The full-atom structure generated by DeepMM (red) and the native protein structure (green) are displayed on the right side. (a) The nucleoprotein at 4.3 Å map resolution (EMD-2867). The top ranked model by DeepMM has a Cα RMSD of 3.1 Å. (b) The bovine rotavirus VP6 at 2.6 Å map resolution (EMD-6272). The top model by DeepMM has a Cα RMSD of 1.7 Å.
[Figure 6 graphic omitted. Four plots for the Top 1 and Top 10 models: (a) percentage (%) against RMSD (Å); (b) percentage (%) against TM-score; (c) % of RMSD < 10 Å against resolution range (Å); (d) % of TM-score > 0.5 against resolution range (Å). Resolution ranges: 2.5-3.0, 3.0-3.5, 3.5-4.0, 4.0-4.5, 4.5-5.0, and All.]

Figure 6: Test results of DeepMM on the 2931 experimental test cases. (a) The percentage of the top scored models at different Cα RMSD cutoffs. (b) The percentage of the top scored models at different TM-score cutoffs. (c) The percentages of top scored models within 10 Å RMSD in different map resolution ranges. (d) The percentages of the top scored models with a TM-score above 0.5 in different map resolution ranges.

[Figure 7 graphic omitted. Four plots of DeepMM versus Phenix against resolution (Å): (a) residue match (%); (b) sequence match (%); (c) average residue match (%); (d) average sequence match (%).]

Figure 7: Comparison of the models by DeepMM and Phenix on the large test set of 2931 experimental maps at different resolutions.
The results for Phenix are colored in orange, and those for DeepMM are colored in royal blue. (a) Percentages of the protein residues in the deposited structures reproduced by DeepMM and Phenix. (b) Percentages of the sequence of the deposited structure reproduced by DeepMM and Phenix. (c) Average percentage of residue match by DeepMM and Phenix. (d) Average percentage of sequence match by DeepMM and Phenix.

[Figure 8 graphic omitted. Panels (a) to (e) show structure renderings for Phenix and DeepMM.]

Figure 8: Protein models reconstructed by DeepMM and Phenix for Chain A of 6DW1 and its associated EM density map at 3.1 Å resolution (EMD-8923). (a) The native structure overlapped with its associated EM density map. (b) The model predicted by Phenix, which has a residue match of 86.7% and a sequence match of 9.9%. (c) The Phenix model (orange) overlapped with the native structure (green). The enlarged box on the right side shows that the residue names assigned by the Phenix model are different from those of the native structure. (d) The model predicted by DeepMM, which has a residue match of 97.1% and a sequence match of 86.8%. (e) The DeepMM model (royal blue) overlapped with the native structure (green). The enlarged view of the top region of the protein on the right side shows that the sequence assigned by DeepMM is close to that of the native structure.
Introduction; Methods; Workflow of DeepMM; Training the DenseNets of DeepMM; Tracing the main-chain path; Aligning target sequence to main-chain path; Parameter settings of DeepMM; Datasets used; Training sets; Test sets; Results; Model reconstruction for simulated EM maps; Model reconstruction for experimental EM maps; Evaluation of DeepMM on the EMDB-wide data set; Conclusion

10_1101-2021_01_04_425285 ---- debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data

Title: debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data

Authors
Cameron M. Nugent1,2,*
Tyler A. Elliott2
Sujeevan Ratnasingham2
Paul D. N. Hebert2
Sarah J. Adamowicz1

1 Department of Integrative Biology, University of Guelph. Guelph, Ontario, Canada
2 Centre for Biodiversity Genomics, University of Guelph. Guelph, Ontario, Canada
*Corresponding author: nugentc@uoguelph.ca

bioRxiv preprint, this version posted January 5, 2021; doi: https://doi.org/10.1101/2021.01.04.425285. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract

DNA barcoding and metabarcoding are now widely used to advance species discovery and biodiversity assessments. High-throughput sequencing (HTS) has expanded the volume and scope of these analyses, but elevated error rates introduce noise into sequence records that can inflate estimates of biodiversity.
Denoising (the separation of biological signal from instrument, or technical, noise) of barcode and metabarcode data currently employs abundance-based methods, which do not capitalize on the highly conserved structure of the cytochrome c oxidase subunit I (COI) region employed as the animal barcode. This manuscript introduces debar, an R package that utilizes a profile hidden Markov model to denoise indel errors in COI sequences introduced by instrument error. In silico studies demonstrated that debar recognized 95% of artificially introduced indels in COI sequences. When applied to real-world data, debar reduced indel errors in circular consensus sequences obtained with the Sequel platform by 75%, and those generated on the Ion Torrent S5 by 94%. The false correction rate was less than 0.1%, indicating that debar is receptive to the majority of true COI variation in the animal kingdom. In conclusion, the debar package improves DNA barcode and metabarcode workflows by aiding the generation of more accurate sequences, which in turn aids the characterization of species diversity.

Keywords: COI, DNA barcode, metabarcode, denoising, Markov model, biodiversity

Introduction

Motivated by global biodiversity decline, conservation policies and strategies are being implemented to mitigate extinction rates (Driscoll et al. 2018; Baynham-Herd et al. 2018).
Accurate assessments of biodiversity and its change over time are critical to support conservation strategies, to remediate environmental damage, and to manage natural resources, but this information is lacking for most ecosystems (Sogin et al. 2006; Hajibabaei et al. 2016; Hebert et al. 2016; D’Souza & Hebert 2018).

DNA barcoding provides a technological solution to the problem of identifying organisms and characterizing biodiversity (Hebert et al. 2003; Hubert & Hanner 2015). Instead of identifying specimens through morphological study, standardized DNA regions (termed DNA barcodes) are used to identify specimens belonging to known species and to recognize new taxa. Reflecting advances in sequencing technology, DNA barcode studies are expanding in scale from analyzing single specimens to characterizing bulk samples (an approach termed metabarcoding) and to multi-marker and metagenomic approaches (Taberlet et al. 2012; Cristescu 2014; Hajibabaei et al. 2016; Wilson et al. 2019). These advances are providing newly detailed information on species diversity in different geographic regions and habitats (Hajibabaei et al. 2012; Hebert et al. 2016; Delabye et al. 2019; Lopez-Vaamonde et al. 2019) while also aiding the identification of invasive species (Brown et al. 2016; Xu et al. 2017), food web analysis (Wirta et al. 2014; Kanuisto et al. 2017), and environmental monitoring (Hajibabaei et al. 2016; Stat et al. 2017; Cordier et al. 2019).

Despite the broad adoption of DNA barcoding and metabarcoding, a fundamental problem persists. Efforts to quantify biodiversity from barcode and metabarcode data can be
strongly affected by analytical methodology (Clare et al. 2016; Braukmann et al. 2019). For example, if high-throughput sequence (HTS) data are cleaned suboptimally, the estimated number of taxa may be grossly inflated because variation introduced by sequencing (technical) errors is interpreted as biological variation (Hardge et al. 2018).

To reduce the impact of technical errors, sequence reads are often clustered into operational taxonomic units (OTUs) at specific identity thresholds (Elbrecht et al. 2018). Several software packages have attempted to increase the accuracy of this OTU method by separating biological signal from technical noise (Rosen et al. 2012; Callahan et al. 2016; Edgar 2016; Amir et al. 2017; Elbrecht et al. 2018; Kumar et al. 2018; Nearing et al. 2018). Many standard denoisers, such as DADA2 (Callahan et al. 2016), Deblur (Amir et al. 2017), and UNOISE (Edgar 2016), utilize cluster-based approaches, custom error models, or pre-clustering algorithms to account for and correct technical errors. Comparative studies have shown that all three of these methods outperform threshold-based OTU-clustering approaches (Nearing et al. 2018). It has also been shown that they produce similar estimates of species richness and relative abundance, but significantly different values for alpha diversity (intra-habitat diversity) and the number of unique exact sequence variants (ESVs) (Nearing et al. 2018). When a highly conserved protein-coding region, such as cytochrome c oxidase subunit I (COI), is employed as the barcode, structural information can be leveraged to improve denoising.
The adoption of this approach can improve the accuracy of alpha-diversity estimates and the quality of identified barcode sequences by ensuring barcodes conform to biological reality. Additionally, rare sequences or important intra-species variants need not be discarded based solely on their abundance and can be retained with higher confidence if they conform to the expected gene structure. This latter benefit will be particularly valuable for work on hyper-diverse communities (e.g. tropical insects) and for analyses of metabarcode data, where uneven sampling is often the norm and the resolution of intra-species variation is challenging (Elbrecht et al. 2018; Nearing et al. 2018; Braukmann et al. 2019; Zizka et al. 2020).

Hidden Markov models (HMMs) are probabilistic representations of sequences that allow unobserved (hidden) states to be inferred through the observation of a series of non-hidden states (Durbin et al. 1998; Wilkinson 2019). HMMs have been applied widely in the analysis of biological sequences, in areas such as sequence alignment and annotation (Durbin et al. 1998; Eddy 1998). Profile hidden Markov models (PHMMs) are a variant well suited for the representation of biological sequences with a shared evolutionary origin (Durbin et al. 1998; Eddy 1998, 2009).
They are probabilistic models that contain position-specific information about the likelihood of potential characters (base pairs or amino acid residues) at a given position in the sequence (emission probabilities) and the likelihood of each hidden state given the preceding hidden state (transition probabilities). Once a PHMM is trained on a set of sequences, the Viterbi algorithm can be used to obtain the path of hidden states that aligns a novel sequence to the PHMM (Durbin et al. 1998). The Viterbi path comprises hidden match states (indicating that the observed character matches a position in the PHMM) and non-match states: either inserts or deletions. In the context of error correction, hidden non-match states identify the most likely positions at which novel sequences deviate from the PHMM’s statistical profile. In this manner, individual sequences can be queried for evidence of insertion or deletion (indel) errors and adjusted in a statistically informed manner. The conserved protein-coding structure of the most common animal barcode gene, COI, and the wealth of available training sequences (Ratnasingham & Hebert 2007) for this region have allowed PHMMs to be successfully applied in the detection of technical errors in novel barcode sequences (Nugent et al. 2020).
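To make the notion of a Viterbi path concrete, the following is a minimal Python sketch of the Viterbi algorithm for a small, fully connected HMM. It is an illustration only: the two states, their probabilities, and the function name `viterbi` are assumptions made for this example, and a profile HMM such as the one used for COI additionally has position-specific match, insert, and delete states that are not modelled here.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for `obs` (log-space)."""
    if not obs:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at step t
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: an AT-rich and a GC-rich hidden state
STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

print(viterbi("AATTGGCC", STATES, START, TRANS, EMIT))
```

In a PHMM the same dynamic-programming idea applies, but the recovered path is read as a series of match, insert, and delete states, which is what allows indel positions to be located.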
Correction of technical indel errors in data from protein-coding barcode sequences is an important development as it maximizes the likelihood that both the nucleotide and amino acid sequences correspond to the true biological sequence. Mitigation of indels arising due to technical errors also makes sequence reads from a given specimen more directly comparable, allowing low-frequency point mutations to be eliminated when multiple reads are available for a given biological sequence. Here, we aim to extend the use of PHMMs in COI data processing to allow for the sequence-by-sequence correction (denoising) of technical errors.

This study had four primary goals: (1) design a denoising tool for COI barcode data that utilizes PHMMs to identify and correct insertion and deletion errors resulting from technical error; (2) test the tool’s performance and optimize its default parameters by denoising a set of 10,000 barcode sequences with artificially introduced indel errors; (3) develop, implement, and evaluate a workflow for denoising DNA barcode data produced through single-molecule, real-time (SMRT) sequencing of 29,525 specimens on the Sequel platform (Pacific Biosciences); and (4) denoise a DNA metabarcode mock community data set using debar and evaluate the improvement in quality of consensus sequences and the ability to resolve intra-OTU haplotype variation. The denoiser resulting from this work, debar (DEnoising BARcodes), is a free, publicly available package written in R that is available through CRAN (https://CRAN.R-project.org/package=debar) and GitHub (https://github.com/CNuge/debar).

Materials and Methods

Implementation
The debar utility includes several customizable steps which denoise DNA barcode and metabarcode data (Figure 1; Supplementary File 1). Corrections with debar are based upon the comparison of input sequences with a nucleotide-based profile hidden Markov model (PHMM) (model training detailed in Nugent et al. 2020) using the Viterbi algorithm (Durbin et al. 1998). Briefly, debar’s PHMM was trained using a curated set of 11,387 COI-5P barcode sequences obtained from the Barcode of Life Data Systems (BOLD: www.boldsystems.org) public database that were checked to ensure: (i) the sequence was >600 bp in length, (ii) taxonomy was known to a genus level, (iii) there were no missing base pairs, (iv) the amino acid sequence did not contain stop codons, and (v) BOLD’s internal check for contaminants was negative (Nugent et al. 2020). The Viterbi path produced through alignment of the sequence to the PHMM is used to match the input sequence to the PHMM, by finding the first set of 10 consecutive match states, which indicates the absence of indels for those 10 base pairs. The read is then adjusted to account for detected insertions or deletions (Figure 1). Three consecutive nucleotide insertions or deletions are permitted (not adjusted) as sequences of this kind are more likely to reflect true biological variants than technical errors (they do not result in reading frame shifts and may reflect an insertion or deletion of an amino acid in a functional protein-coding gene). The probability of such changes through sequencing error is relatively low (i.e.
for the Pacific Biosciences Sequel platform the baseline probability of three consecutive deletions would be 0.05% (the baseline deletion probability) cubed, i.e. (5 × 10^-4)^3 = 1.25 × 10^-10, or 1.25 × 10^-8 %).

The denoising of sequences with debar is controlled using a suite of parameters (Figure 1). The censorship parameter is the most important, as it controls the size of the masks (replacement of nucleotides with placeholder N characters) applied around sequence adjustments. This option is designed to prevent the introduction of errors that would be caused if the denoising process deleted the wrong base pair or inserted a placeholder in the incorrect position. Derivation of the default value for the censorship parameter is detailed in the Methods and Results sections. The package also enables the translation of denoised sequences to amino acids to confirm that denoised outputs conform to the expected properties of the protein-coding gene region. Because debar can interface directly with fasta and fastq files, it enables file-to-file denoising in addition to denoising within an R programming environment. The default PHMM used for denoising by debar represents the complete 657 bp barcode region of COI. The package also permits the use of customized PHMMs provided by a user, which allows the denoiser to be applied to data from other gene regions or to be targeted to a specific user-defined subsection of the COI barcode.
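The adjustment and masking steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not debar's R code: the function name `adjust_read`, the encoding of the path as a string of `M`, `I`, and `D` characters, and the exact extent of the mask are all hypothetical. Inserted bases are dropped, deleted bases are filled with `N` placeholders, runs of exactly three consecutive inserts or deletes are left untouched, and `censor_length` bases on either side of each adjustment are masked.

```python
from itertools import groupby

def adjust_read(seq, path, censor_length=0):
    """Apply the indel corrections implied by a state path.

    seq  : observed nucleotides
    path : one character per state, 'M' (match), 'I' (insert), 'D' (delete);
           'M' and 'I' each consume one observed base, 'D' consumes none
    """
    out = []    # corrected read, built left to right
    edits = []  # (start, end) spans in `out` touched by a correction
    i = 0       # index of the next unread base in seq
    for state, run in groupby(path):
        n = sum(1 for _ in run)
        if state == "M":
            out.extend(seq[i:i + n])
            i += n
        elif state == "I":
            if n == 3:
                out.extend(seq[i:i + n])  # codon-sized insert: keep as is
            else:
                edits.append((len(out), len(out)))  # drop the inserted bases
            i += n
        elif state == "D":
            if n != 3:  # codon-sized deletions are left unfilled
                edits.append((len(out), len(out) + n))
                out.extend("N" * n)  # placeholder for the missing bases
        else:
            raise ValueError(f"unknown state: {state!r}")
    if i != len(seq):
        raise ValueError("path does not consume the whole read")
    # Mask censor_length bases on each side of every adjustment
    for start, end in edits:
        lo = max(0, start - censor_length)
        hi = min(len(out), end + censor_length)
        for k in range(lo, hi):
            out[k] = "N"
    return "".join(out)

print(adjust_read("ACGTTACG", "MMMMIMMM"))                   # insert removed
print(adjust_read("ACGTTACG", "MMMMIMMM", censor_length=2))  # with masking
print(adjust_read("ACGACG", "MMMDMMM"))                      # deletion filled
```

Masking trades a few known bases for safety: if the correction lands one or two positions away from the true error, the neighbouring bases that might now be wrong are reported as `N` rather than as confident calls.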
Training of a PHMM for a new barcode or gene is supported by the R package aphid (Wilkinson 2019), while sub-setting of debar’s default PHMM is enabled by the R package coil (Nugent et al. 2020). Details of the package’s components, together with a demonstration of its implementation, are available in the package’s vignette (Supplementary File 1).

Quantification of package performance

Simulated error data

The debar package was tested using a phylogenetically stratified random sample of publicly available COI-5P sequences with artificially introduced indels. This test was designed to assess the accuracy of sequence corrections and to obtain a quantitatively informed set of default parameters for the denoising process. A random sample of 10,000 animal COI-5P sequences (excluding those used in PHMM training) was obtained from BOLD and cleaned using the steps described in Nugent et al. 2020 (methods section – BOLD data acquisition). Errors were introduced into each sequence in accordance with the statistical error profile of the Pacific Biosciences Sequel, based upon the error profile for the COI barcode region in Hebert et al. (2018). This profile indicated a baseline indel rate of 0.1% (insertions and deletions equally likely), a baseline substitution rate of 0.5%, and an elevated indel rate for long homopolymers (repeat lengths of 6, 7, and 8+ with indel probabilities of 0.75%, 1.2%, and 3.8%, respectively) (Hebert et al. 2018).
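As a rough illustration of this error profile, the sketch below maps each position of a sequence to an indel probability using the rates quoted above. The function names are hypothetical, and assigning the elevated rate to every position within a long homopolymer is an assumption made for the example; the source does not specify how the rates were applied within a run.

```python
def homopolymer_run_lengths(seq):
    """For every position, the length of the homopolymer run that contains it."""
    lengths = []
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        lengths.extend([j - i] * (j - i))  # every base in the run gets its length
        i = j
    return lengths

def indel_probability(run_length):
    """Indel probability for a position, using the rates quoted from Hebert et al. (2018)."""
    if run_length >= 8:
        return 0.038
    if run_length == 7:
        return 0.012
    if run_length == 6:
        return 0.0075
    return 0.001  # baseline: insertions and deletions combined

seq = "ACTTTTTTG"
print([indel_probability(n) for n in homopolymer_run_lengths(seq)])
```

A simulator built on these probabilities would draw an error at each position and, as described in the text, re-draw from the original sequence whenever more than one indel occurred.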
The location of all errors was recorded so that accuracy of subsequent corrections could be evaluated. Sequences were iteratively processed, and errors were limited to a single insertion or deletion error of one base pair in length (with the error introduction process being repeated for the original sequence when more than one indel occurred), which allowed for the accuracy of corrections to be assessed without the need to consider interaction effects.

The resultant sequences, each with one indel, were then denoised with debar (‘denoise’ function, using the parameter censor_length = 0). The outputs of the denoise function were queried to determine the number and location of indel corrections applied by debar. This information was compared to the recorded ground truth error locations to quantify the following: 1) the frequency with which debar located and exactly corrected indels, 2) the miss distance (number of nucleotide positions) between introduced errors and corrections applied in instances where debar did not correct the indel errors in exactly the correct position, and 3) the frequency at which debar applied an incorrect number of sequence corrections (i.e. 0 corrections or 2+ corrections). If one correction was made and the distance between the correction and true indel position was 0, then the correction was considered accurate. Corrections were also considered accurate if all base pairs between the correction location and the true indel position were the same (i.e. if base pair 2 in the homopolymer "TTTTT" was an insertion, but the 5th T in the sequence was removed by debar, this is functionally an exact correction as the true sequence is restored).
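The accuracy criterion for an inserted base can be expressed as a direct comparison: a correction is functionally exact when removing the base at the corrected position yields the same sequence as removing the base at the true position. The Python sketch below shows this for insertion errors only; the function names and the 0-based indexing are assumptions made for illustration.

```python
def remove_base(seq, pos):
    """Return seq with the base at 0-based index pos removed."""
    return seq[:pos] + seq[pos + 1:]

def is_functionally_exact(noisy, true_pos, corrected_pos):
    """True if deleting at corrected_pos restores the same sequence as deleting at true_pos."""
    return remove_base(noisy, true_pos) == remove_base(noisy, corrected_pos)

def miss_distance(true_pos, corrected_pos):
    """Number of positions between the introduced error and the applied correction."""
    return abs(true_pos - corrected_pos)

# The inserted T is the second T of the run (index 3); the fifth T (index 6) is removed
noisy = "ACTTTTTGA"
print(is_functionally_exact(noisy, 3, 6))  # same base throughout the span
print(is_functionally_exact(noisy, 3, 7))  # removing the G changes the sequence
print(miss_distance(3, 7))
```

The comparison is equivalent to the rule in the text: the two deletions give identical results exactly when every base between the two positions is the same.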
All other corrections at inexact positions were considered inaccurate, and the distance (number of positions) between the correction and true indel location was recorded. The mean and standard deviation of the miss distance were determined and used to select the default censor_length parameter for the debar package, equal to the mean miss distance plus 2 standard deviations, rounded up: $\text{censor\_length} = \lceil \mu_{\text{miss\_distance}} + 2\,\sigma_{\text{miss\_distance}} \rceil$. This value was selected as it would be expected to avoid the introduction of an error for > 95% of inexact corrections. Sequences where no corrections or multiple corrections were made had their outputs inspected further to determine if other parts of the denoising pipeline (e.g. the check for stop codons in the translated amino acid sequence or trimming of sequence edges in the framing process) removed the error or led to the complete rejection of the sequence.

False correction rate

The performance of debar on sequences with no indel errors was also quantified to determine the frequency and cause of erroneous corrections applied to cleaned, publicly available COI-5P barcode sequences with no known technical errors.
A random sample of 10,000 sequences was obtained from all the animal COI-5P barcode sequences available on BOLD (Supplementary File 2), with each sequence meeting the following criteria: 1) the barcode was publicly available on the BOLD database, 2) the barcode was > 600 bp in length, 3) the barcode did not contain missing characters (“N”) in the Folmer region, 4) the corresponding amino acid sequence did not contain stop codons, 5) the result of BOLD’s internal check for contaminants was negative, and 6) the sequence was not used in PHMM training or in the simulated error dataset. Sequences were processed using debar’s denoise function (censor_length = 0). All sequences that had corrections applied, or that were flagged for rejection, were counted and examined in detail to search for evidence of the proximal cause of the false correction. To search for evidence of taxonomic bias, the taxonomy associated with all falsely corrected sequences was tallied at the order level and manually examined.

Denoising PacBio Sequel data

We quantified the performance of debar on raw DNA barcode sequence data by interfacing with the existing mBRAVE workflow (http://www.mbrave.net) used to process DNA barcode circular consensus sequences (CCS) obtained with the Sequel platform. A custom analysis pipeline (Supplementary File 3) was constructed to analyze and denoise the final set of CCS barcodes produced by the mBRAVE workflow (one CCS per OTU) (Figure 2).
The pipeline was designed to search the final barcodes produced by mBRAVE for evidence of indel errors (by considering the translated amino acid sequence with the R package coil (Nugent et al. 2020)), denoise all CCS associated with barcodes in which errors were detected using the debar package, and then regenerate a consensus barcode sequence from the denoised data to produce a final, denoised barcode sequence for each specimen (Figure 2).

The outputs of this analysis were examined to determine whether the debar pipeline decreased the number of technical errors in the barcode sequences and whether those barcode sequences resulted in likely amino acid sequences when translated. Initial quantification of the improvement was conducted by comparing the number of barcode sequences whose amino acid sequences were flagged by the R package coil (Nugent et al. 2020, default parameters) before and after denoising. Barcodes are flagged by coil when they possess a stop codon when translated to amino acids or when the resultant amino acid sequence is improbable, both indicating that the sequence likely possesses an indel error.

Since the coil and debar packages both employ the same nucleotide profile hidden Markov model (coil also utilizes an amino acid PHMM), an independent test of pipeline effectiveness was also conducted. The effectiveness of the denoising pipeline was quantified by submitting both the original and denoised barcode sequences to BOLD.
BOLD was used to determine the number of original barcodes and denoised barcodes with evidence of stop codons after aligning the sequences using BOLD’s hidden Markov model (a model developed independently of the debar PHMM) and translating the sequence using the appropriate translation table corresponding to the taxonomic information accompanying the sequence record. Comparison of these numbers made it possible to quantify the increase in barcode-compliant sequences (i.e. those with no stop codon) produced by debar. Additionally, the Sequence Quality Report on BOLD was examined to determine the number of unknown nucleotides (“N”) in the barcode sequences after denoising. The report categorizes barcode quality as: high (<1% Ns), medium (<2% Ns), low (<4% Ns), or unreliable (>4% Ns), and the number of barcodes in these different categories was recorded.

Denoising metabarcode data

To characterize debar’s performance on metabarcode data, we analyzed a metabarcode dataset for a mock arthropod community (Braukmann et al. 2019). These data were derived from a single sequencing run on an Ion Torrent S5 of COI amplicons generated from pooled DNA extracts from abdomens from single specimens of 369 arthropod species (methods described in detail in Braukmann et al. 2019). Sequences were from a 407bp fragment of the COI barcode region targeted using the primers MlepF1 and LepR1 (Hebert et al. 2004; Braukmann et al. 2019). Following amplification and sequencing on the Ion S5, quality control, sequence dereplication, chimeric read filtering, matching to reference sequences, and clustering were performed on mBRAVE (Braukmann et al. 2019). Two sets of data resulted from this process: a set of 123,926 unique sequences that were assigned to 398 different Barcode Index Numbers (BINs) (Ratnasingham and Hebert 2013) through comparison to reference sequences (matched at >98% similarity), and a set of 2,199 unique sequences not matching available references that were clustered into an additional 1,255 OTUs at a 97% similarity threshold (using the clustering algorithm described in Braukmann et al. 2019).

All sequences were denoised using debar’s denoise_list function and a custom nucleotide PHMM. The custom PHMM was a 398bp subset of the complete COI PHMM (PHMM profile positions 250 – 648), corresponding to a segment of the Folmer (Folmer et al. 1994) region targeted by the metabarcoding primers. The PHMM was created using coil’s ‘subsetPHMM’ function (Nugent et al. 2020). After denoising, two tests were conducted to determine if denoising improved the quality of the metabarcode pipeline’s output data.

First, for each BIN and OTU, consensus sequences were generated using denoised sequences and the debar function ‘consensus_sequence’. These consensus sequences were assessed for evidence of stop codons using coil and the same custom PHMMs used in denoising (function coi5p_pipe with the additional parameter: trans_table = 5). This test revealed the number of denoised consensus sequences that contained a stop codon when translated to amino acids, indicating that an indel error persisted in the nucleotide sequence. The centroid sequences for the BINs and OTUs were used as a baseline metric for the number of barcode-compliant sequences.
For each BIN, centroid sequences were obtained by clustering the sequences in the group using the R package kmer’s ‘otu’ function (parameters: k = 4, threshold = 0.95) (Wilkinson 2018, Version 1.0.0). For the OTUs, centroids were obtained from data generated by mBRAVE. All centroids were assessed with coil (Nugent et al. 2020, Version 1.0), and the number of barcode-compliant representative sequences for the original centroids and the final consensus sequences was compared.

Secondly, the individual sequences within each BIN and OTU were analyzed with coil to determine the number that were likely error free, as evidenced by the absence of stop codons after translation. This assessment was repeated on the denoised reads to determine the effectiveness of debar in correcting errors in individual sequences and to reveal if the denoising process improved the resolution of ESVs for subsequent analysis of intra-species genetic variation by placing the ESVs in reading frame and reducing the frequency of identified indel errors.

Results

Quantification of package performance

Simulated error data

Debar was used to correct 10,000 barcodes, each with a single indel error (Supplementary File 1). The denoised sequences and associated data were compared to the ground truth error locations to determine the accuracy of corrections applied by debar (Figure 3).
For 9,459 sequences (94.59%), a single correction was applied by debar, indicating that the package correctly identified the type of error in these sequences. However, debar either failed to recognize an indel or made too many corrections (2+) in the other 541 sequences. No correction was made for most (426) of these sequences, meaning that debar’s PHMM did not identify the indel error. The overlooked indels were largely restricted to the terminal regions of the sequence; 75% (329/426) of them were positioned within 20 base pairs of the read termini (Figure 4), regions that comprised only about 6% (40bp/650bp) of the sequences. This occurs because the debar denoising algorithm uses the first observation of 10 consecutive bp matching the PHMM to establish the corrective window. Errors on the periphery of sequences therefore lead to trimming of the sequence (via the keep_flanks function) instead of indel correction. A substantial fraction of the remaining uncorrected indel errors (43) occurred between positions 452 and 465 (Figure 4), a region associated with a 3bp indel present in some animal groups and absent in others. Its presence reduced the PHMM’s indel detection ability in this region due to greater true variability.
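The corrective-window rule described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the alignment of a read to the PHMM is available as a string of per-base states ("M" for match, "I" for insert, "D" for delete); the function name and this encoding are hypothetical and are not taken from the debar source code.

```python
def first_match_run(path, run_length=10):
    """Return the index where the first run of `run_length` consecutive
    match states ('M') begins, or None if no such run exists.

    `path` is a hypothetical string encoding of a PHMM state path;
    this is an illustration of the rule, not debar's implementation.
    """
    run = 0
    for i, state in enumerate(path):
        # Extend the current run on a match, reset it otherwise.
        run = run + 1 if state == "M" else 0
        if run == run_length:
            return i - run_length + 1
    return None


# Hypothetical read with an insertion 3 bp from its start.
path = "MMM" + "I" + "M" * 20
window_start = first_match_run(path)

# The insertion (index 3) lies before the window start (index 4), so under
# this rule it would fall in the flank rather than in the corrective window.
indel_in_flank = path.index("I") < window_start
```

Under these assumptions, an indel close to a read terminus precedes the first qualifying run of matches, which is consistent with the observation that most overlooked indels lay within 20 bp of the read ends.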
Not all unidentified indels were retained in the final output sequences, as the double checks of debar (employing the keep_flanks and aa_check parameters) identified many (266/426, 62%) of the uncorrected sequences and either omitted the problem region or flagged the sequence as likely to contain an error. Therefore, debar’s double checks allow many false negatives to be trimmed or flagged as problematic.

For 119 sequences (1.2%), two or more corrections were applied by debar when only a single indel existed (Figure 3). In contrast to the false negatives, debar’s double checks captured only three of the false positives. Many of the false corrections appeared to result from indels near codons that are not present in all animals. Due to true biological variation in the training data, these regions of the PHMM have higher probabilities of transitioning from a match state to an insert or delete state, and therefore indels in these locations are sometimes handled incorrectly (e.g. the sequence is characterized as having two deleted base pairs, when there was a 1bp insertion). Because false corrections of this type result in sequences that conform to the structure of the protein-coding gene region (i.e. a lack of stop codons in the amino acid sequence), they are not identified by debar’s aa_check function.

The 9,459 sequences for which the presence of a single indel was correctly identified were further analyzed to determine how accurately they were located (Figure 3).
The analysis showed that debar was able to exactly locate and correct 5,847 (61.81% of sequences in the single correction category) of the indel errors in the dataset. For the other 3,612 sequences (38.19% of the single correction category), the indel corrections were not placed in exactly the correct position (Figure 5). For these sequences, the average distance between the true indel location and the applied correction was 2.31 base pairs (standard deviation = 1.9767).

These results were used to select a default censorship value for debar to ensure that inexactly identified indel errors are masked in most sequences (Figure 1). A default censorship length of 7 (the average miss distance plus two times the standard deviation, rounded up) was selected in order to mask the true error in >95% of instances where inexact corrections were applied. This successfully denoises sequences, albeit with some associated loss of information, which can be overcome by building a consensus sequence when multiple reads are available for an individual.

Overall, denoising of the 10,000 barcodes with the default censorship parameter (censor_length = 7) resulted in 9,309/10,000 (93.09%) of sequences with errors being successfully denoised. The additional double check parameters (aa_check = True, keep_flanks = False) captured, but did not correct, 269 (2.69%) errors. The debar package thereby corrected or removed 95.74% of sequences with indel errors (Figure 3).

False correction rate

A set of 10,000 barcode sequences with no known indel errors was analyzed with debar to determine the incidence of erroneous corrections. Nearly all sequences (99.91%) were neither altered nor flagged as erroneous. Nine sequences were erroneously corrected, and none were flagged for rejection. These sequences included a single sequence from each of five orders and four sequences from the order Diptera (flies). Interestingly, the four Diptera sequences that were incorrectly altered all belonged to the same genus: Culicoides. They represented 4/58 of all sequences from the family Ceratopogonidae that were in the dataset, indicating that the performance issue was isolated to this single genus.

These results indicate that debar deals well with variation in COI sequences across most of the animal kingdom, but that it displays some taxonomic bias in performance. This is a limitation of debar, as any genus with a COI profile that systematically deviates from the COI PHMM used in debar will be erroneously denoised. The benefit of the conservative censorship approach used in the package is that although these reads are erroneously adjusted, the corrections made are masked by Ns, and the entire sequence is not rejected. Rather, only a small section of the sequence is lost, as if it contained an indel error. Most of any falsely corrected sequence can thereby be recovered, and in most instances this would be sufficient to identify associated taxonomy and inform biological conclusions.

Denoising PacBio Sequel data

We applied debar in the analysis of real DNA barcode data by developing a processing pipeline (Figure 2; hereafter ‘the debar pipeline’) and compared the amount of technical noise in the barcodes before and after processing.
A set of 29,525 consensus barcode sequences derived from processing data from four Sequel runs was obtained from mBRAVE and re-processed with the debar pipeline (Table 1).

Analysis of the consensus barcodes with coil (step ii. of the debar pipeline) flagged 3,495 (11.8% of total) of the consensus sequences due to the detection of a stop codon in the translated sequence or due to the presence of an unexpected amino acid (log likelihood score below the default threshold). The large number of flagged sequences likely reflects false positives (sequences without indel errors that coil flagged because the reading frame was established incorrectly). Within the flagged set, 2,418 sequences (8.1% of total, 69.2% of flagged sequences) were flagged due to the presence of a stop codon, and 1,282 of them (4.3% of total, 36.7% of flagged sequences) contained a stop codon in all three forward reading frames, providing extremely strong evidence of an indel error (i.e. a low likelihood of being a false positive).

After denoising, the output sequences were again assessed with coil (step viii. of the debar pipeline) and this analysis revealed that debar had corrected many indel errors (Table 1, Table 2). Only 1,123 (3.8%) of the final barcode sequences were flagged by coil’s coi5p_pipe function, suggesting that 66.8% (2,335) of the flagged sequences were successfully denoised.
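The "stop codon in all three forward reading frames" criterion can be sketched as follows. This is an illustrative check only and is not how coil establishes reading frame or scores sequences; it assumes the invertebrate mitochondrial genetic code (NCBI translation table 5, the table named for the arthropod data in the Methods), under which TAA and TAG are the stop codons. The function name is hypothetical.

```python
# Stop codons under NCBI translation table 5 (invertebrate mitochondrial).
# Assumption: table 5 applies; other taxa would need a different set.
STOP_CODONS = {"TAA", "TAG"}


def frames_with_stop(seq):
    """Return the forward reading frames (0, 1, 2) that contain a stop codon."""
    seq = seq.upper()
    hits = []
    for frame in range(3):
        # Step through complete codons only, starting at the frame offset.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if any(codon in STOP_CODONS for codon in codons):
            hits.append(frame)
    return hits


def stop_in_all_frames(seq):
    """True when every forward reading frame contains at least one stop codon."""
    return frames_with_stop(seq) == [0, 1, 2]
```

A sequence that is interrupted in every forward frame cannot be explained by a misplaced reading frame alone, which is why this subset is treated as the most conservative evidence of an indel.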
When comparison was restricted to the 2,418 sequences with stop codons, only 176 were still flagged as containing stop codons, indicating that 92.7% (2,242/2,418) of the sequences in this subcategory were effectively denoised. A more conservative estimate of correction success was provided by the subset of flagged sequences with stop codons in all reading frames. Of these sequences, 1,106/1,282 (86.27%) passed the coil check following denoising, suggesting the successful correction of an indel error and improved representation of the true sequence.

External quantification of the debar pipeline’s denoising ability was obtained by the submission of pre- and post-pipeline barcode sequences to BOLD (http://www.boldsystems.org). The sample size for this test was smaller as BOLD requires taxonomic designations and this information was only provided by mBRAVE for 27,041 sequences. The total number of original sequences flagged by BOLD due to its detection of a stop codon was 1,515 (6.3%), a considerably lower frequency than reported by coil on the initial pipeline inputs. Of the 1,515 sequences with initial evidence of stop codons, 14 were rejected outright by the debar pipeline, 223 were flagged but not successfully corrected, 147 were unflagged and not corrected, and 1,131 had no evidence of errors following denoising (Table 3). Based on this assessment with BOLD, the debar pipeline produced a 75% reduction in the number of errors in the dataset from 6.3% (1,515) to 1.6% (384).
Of the remaining 384 errors, the majority (223) were detected as problematic and flagged as erroneous by debar. As a consequence, the debar pipeline reduced the number of unidentified errors by >90% (from 1,515 to 147) in the barcode dataset (Table 3).

The denoising of the barcodes with the debar pipeline did not result in sequences with large amounts of missing information. Of the 29,525 output barcodes, 28,802 were high quality (<1% Ns), 11 were medium quality (<2% Ns), 498 were low quality (<4% Ns), and 214 were unreliable (>4% Ns). There was a strong negative relationship between the number of CCS available for a sample and the amount of missing information in the final barcode sequence (Figure 6).

Denoising metabarcode data

Consensus sequence quality

Metabarcode data from a mock arthropod community were also denoised, and the original sequences were compared to the denoised consensus sequences to determine if debar improved sequence quality (Table 4). Of the original centroid sequences for the 398 BINs, 125/398 (31.4%) contained evidence of indel errors when analyzed with coil. Following denoising and consensus sequence generation via debar, the number of barcode-compliant outputs was considerably higher, with only 7/394 (1.8%) displaying evidence of indel errors. Four BINs had all their component sequences rejected by debar, so no consensus sequences were generated for them.
The rate of apparent indel errors was higher in the centroids of the 1,255 OTUs; 681 (54%) displayed evidence of a stop codon when analyzed with coil, suggesting the presence of indels in more than half of the OTU representative sequences. The consensus sequences produced through denoising and consensus sequence generation with debar were of apparently higher quality, as only 134 (10.6%) displayed evidence of a stop codon when analyzed with coil. An additional 31 OTUs (2.5%) failed to produce a valid consensus sequence after denoising because all their component sequences were rejected by debar.

The corrections did cause some loss of information; 46/394 (11.7%) of the consensus sequences for the BIN groups contained at least one ‘N’ due to ambiguous or censored base pairs in their component reads, and 861/1,255 (68.6%) of the OTU consensus sequences contained at least one ‘N’. The number of ‘Ns’ per sequence was generally low for the BINs (median = 0; 12 sequences with 14 or more ‘Ns’) but was higher for the OTUs (median number of ‘Ns’ = 15), indicating there was on average one correction per OTU (correction of an indel plus the seven bp mask in either direction results in 14 (insertion) or 15 (deletion) consecutive ‘Ns’). There was a positive relationship between the number of sequences within an OTU and the completeness of information in the final consensus sequence.

ESV data quality

Data analysis on mBRAVE revealed 398 BINs represented by 123,926 unique dereplicated reads as well as 1,255 OTUs lacking taxonomic assignment that were represented by 2,199 unique sequence reads. When the original sequences were checked with coil, 61,351/123,926 (49.5%) of the BIN sequences and 1,310/2,199 (59.57%) of the OTU sequences displayed strong evidence of an indel error, as they contained a stop codon when translated. By contrast, following denoising with debar the incidence of stop codons was far lower, as just 2,858/122,349 (2.3%) of the BIN sequences and 418/2,145 (19.49%) of the OTU sequences had evidence of indels. This result indicated that denoising of individual sequences reduced the incidence of apparent indel errors by over 95% for the BINs (58,493 fewer indel errors) and by 68% for the OTUs (892 fewer indel errors). Most sequences were subjected to at least one indel correction by debar, with 85,298/122,349 (69.7%) of the final BIN sequences and 1,387/2,145 (64.7%) of the final OTU sequences containing at least one ‘N’ character. Low abundance OTUs in the data set represented by biologically valid sequences need not be discarded solely due to their low abundance and could be further inspected for putative evidence of rare community members.
Discussion

This manuscript introduces debar, a PHMM-based denoiser, and demonstrates how it can improve the quality of sequence data used for both DNA barcode library construction and for metabarcode studies by correcting indels introduced by sequencing error. We first evaluated its effectiveness through an in silico study that tested its capacity to recognize and repair reference barcodes with artificially introduced indels. Debar was shown to be effective, as it corrected >95.7% of the errors and applied erroneous adjustments to less than 0.1% of correct sequences. This strong performance extended to real-world data sets. Debar reduced the rate of frameshift indels by 75% in sequence records generated by the long-read Sequel platform, generating more barcode-compliant sequences, most with little or no missing information. Debar also improved the quality of metabarcode data generated by the Ion S5, allowing ESVs to be considered with higher confidence and higher-quality representative sequences to be recovered for OTUs.

Denoising sequences with artificial errors and known ground truths showed that the corrections performed by debar were imperfect, with the exact indel location being identified only 61.8% of the time. The application of a default 7bp censorship on both sides of putative indel corrections proved to be an effective means of masking most errors, improving the denoiser’s error removal rate to >95.75%. This high error removal rate involves a tradeoff, as sequence adjustments are accompanied by a loss of 14 base pairs of information. This information loss is an acceptable cost, as it ensures that all remaining base pairs can be considered with high confidence.
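The arithmetic behind the default censorship length, and the resulting 14 bp mask, can be reproduced with a short sketch. The rule (mean miss distance plus two standard deviations, rounded up) and the values 2.31 and 1.9767 are those reported in the Results; the function names and the simple string-masking approach are assumptions made for illustration and do not reflect debar's internal implementation.

```python
import math


def censor_length(mean_miss, sd_miss):
    """Mean miss distance plus two standard deviations, rounded up."""
    return math.ceil(mean_miss + 2 * sd_miss)


def censor(seq, pos, width):
    """Mask `width` bases on each side of the correction boundary `pos` with 'N'.

    Hypothetical helper: `pos` is the index immediately after the corrected
    site, so the masked span is seq[pos - width : pos + width].
    """
    start = max(0, pos - width)
    end = min(len(seq), pos + width)
    return seq[:start] + "N" * (end - start) + seq[end:]


# Values reported for the inexactly placed corrections.
default_width = censor_length(2.31, 1.9767)  # 2.31 + 2 * 1.9767 = 6.2634, rounded up

# Toy 30 bp read with a correction boundary at index 15.
masked = censor("A" * 30, 15, default_width)
```

With a width of 7, a single correction away from the read ends masks 14 bases, which matches the information loss discussed above. Lowering the width trades fewer masked bases for a higher chance that a misplaced correction leaves the true error exposed.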
The nature of high-throughput sequence data, namely that there are usually multiple sequencing reads available for a given specimen, can help mitigate the loss of information. Corrected sequences from a specimen or OTU can be used in conjunction with one another, filling in the different censored locations and overcoming the loss of information. The censorship of bases adjacent to indel corrections is an optional parameter that users may alter to suit their needs. Smaller censorship values, or no censorship at all, would result in less loss of information per sequence, but would come at the cost of more errors remaining in the final data.

Denoising of real DNA barcode data obtained from sequencing of specimens on the Pacific Biosciences Sequel platform resulted in higher-quality output sequences. An exact metric quantifying the improvement is, however, difficult to state with certainty, as the ground truth of the sequences is not known. The independent tests of the sequences through submission of consensus sequences to BOLD before and after denoising provided a conservative estimate of the debar package’s effectiveness. Conservatively, this test showed a 75% reduction in the number of barcode sequences with technical indel errors after application of the debar pipeline and a low false negative rate (147 unidentified errors out of 1,515 total putative errors).
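The two headline figures from the BOLD comparison follow directly from the outcome tallies reported in the Results (Table 3). The sketch below simply recomputes them; the dictionary keys are descriptive labels chosen here, not names used by debar or BOLD.

```python
# Fate of the 1,515 sequences that BOLD initially flagged for stop codons,
# as reported in the Results (Table 3). Key names are illustrative labels.
outcomes = {
    "rejected_by_pipeline": 14,
    "flagged_not_corrected": 223,
    "unflagged_not_corrected": 147,
    "no_error_after_denoising": 1131,
}

initial_errors = sum(outcomes.values())
remaining_errors = initial_errors - outcomes["no_error_after_denoising"]

# Share of initially flagged sequences that no longer show an error.
error_reduction = 1 - remaining_errors / initial_errors

# Share of initial errors that are no longer unidentified (either fixed,
# rejected, or at least flagged for the user).
unidentified_reduction = 1 - outcomes["unflagged_not_corrected"] / initial_errors

print(f"{initial_errors} initial, {remaining_errors} remaining")
print(f"error reduction: {error_reduction:.1%}")
print(f"unidentified error reduction: {unidentified_reduction:.1%}")
```

The first ratio corresponds to the roughly 75% reduction quoted above, and the second to the >90% reduction in unidentified errors.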
This is an important improvement because the Pacific Biosciences Sequel platform is used at the Centre for Biodiversity Genomics to produce high-quality reference barcodes for the barcoding research community (Hebert et al. 2018). Accuracy of these sequences is therefore important; the debar package is shown to improve sequence quality, yielding more biologically likely and therefore reliable outputs. The generation of barcode sequences is also made more efficient. By increasing the rate of barcode-compliant outputs from 93.7% to 98%, fewer samples require reprocessing or resequencing.

Within-species genetic diversity is an essential metric for characterizing community health. High intra-species genetic diversity is assumed to indicate healthy ecosystems, comprised of large and stable populations with the standing genetic variation needed to survive environmental stressors (Zizka et al. 2020). The characterization of ESVs within OTUs can provide intra-species diversity measures for member species of a community (Frøslev et al. 2017). The initial check of the sub-OTU sequence data from the mock community sequenced with Ion Torrent revealed a high rate of putative indel errors (54% of sequences), which would lead to a gross overestimation of the number of ESVs within the OTUs.
The reduction of the error rate after denoising with debar allows for a more accurate examination of intra-OTU ESVs and therefore allows for more accurate assessments of intra-species diversity and community health, despite the fact that debar is not capable of eliminating non-indel errors from sequences. Even with the improvements to ESV quality by debar, intra-species diversity estimates will likely remain inflated to some extent, as the sequence-by-sequence corrections applied by debar exclusively account for indel errors while substitution errors could persist within the data.

We have demonstrated that debar is an effective means of reducing technical errors in DNA barcode and metabarcode data, but the package is not without limitations. The package is designed to correct insertion and deletion errors, but these are not the only technical issues that can lead to inflated biodiversity estimates. The program is not an effective means of identifying or correcting chimeric sequences or non-animal COI biological contaminants, and should these exist within an input data set they are likely to go uncorrected. Additionally, debar does not have the ability to correct substitution errors on a sequence-by-sequence basis. Because of indel correction, denoised sequences are aligned, and nucleotide positions become directly comparable across different sequences from a given specimen or OTU. Random point substitution errors can thereby be corrected in consensus sequence generation, through the ‘majority rule’ approach debar uses in base calling. However, if systematic errors exist (i.e. most sequences possess the same substitution), few sequences are available for consensus sequence generation, or ESVs are being examined, then substitution errors may persist in the data. An additional source of error unaccounted for by debar is contaminant sequences. It has been demonstrated previously that the PHMM utilized in debar is not an effective means of separating animal barcode sequences from off-target barcodes derived from bacteria, plant, fungi, or other origins (Nugent et al. 2020). Taken together, these limitations show that debar cannot single-handedly address the technical challenges associated with DNA barcoding. The tool is likely most effective when applied in conjunction with existing barcode and metabarcode workflows, and it improves the quality of final sequences if the inputs have been filtered based on quality, had primers removed, and been cleaned of chimeric and contaminant sequences. The sequence-by-sequence denoising approach of debar means that it is a flexible tool capable of integrating into analysis pipelines for sequencing data from various sources. Application of debar in tandem with conventional, clustering-based denoising tools would likely lead to the highest quality assessment of biodiversity. Following OTU generation with other tools, using debar to denoise all reads within a given OTU prior to consensus sequence generation would maximize accuracy of the consensus sequence while conforming to the conserved structure of the COI barcode region. The removal of intra-OTU noise can also improve the accuracy of alpha-diversity estimates.
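The way a majority-rule consensus can both fill censored positions and outvote random substitutions is illustrated by the sketch below. It assumes the denoised reads are already aligned and of equal length, treats 'N' as missing data, and breaks ties arbitrarily; it is a generic illustration and not the implementation of debar's ‘consensus_sequence’ function.

```python
from collections import Counter


def majority_consensus(reads):
    """Column-wise majority vote over aligned, equal-length reads.

    'N' is treated as missing and ignored; a column with no informative
    base yields 'N'. Ties are resolved by first occurrence (arbitrary).
    """
    consensus = []
    for column in zip(*reads):
        counts = Counter(base for base in column if base != "N")
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


# Three hypothetical denoised reads from one specimen: two carry censored
# stretches at different positions and one carries a substitution at the end.
reads = [
    "ACGTNNNA",
    "ACNNTTGA",
    "ACGTTTGC",
]
consensus = majority_consensus(reads)
```

Because the censored stretches fall at different positions in different reads, every column retains at least one informative base, and the lone substitution in the last column is outvoted. A substitution shared by most reads would survive this procedure, which is the systematic-error caveat noted above.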
Additionally, application of debar in the denoising of rare, low-abundance sequences not present in the OTUs would allow these data to be further examined with higher confidence, revealing biological insights that would be overlooked in conventional workflows.

The PHMM denoising technique used by debar is an effective barcode-focused framework that can be extended to fit a variety of needs. Data from only two sequencing platforms were tested in this study: the Pacific Biosciences Sequel and Thermo IonTorrent S5. Since the PHMM used in debar is barcode specific and not sequencer specific, debar can be effectively applied in denoising of barcode data obtained from any sequencing platform. However, the effectiveness of the denoiser will depend on the types and rates of technical errors associated with a given platform. When applied to data from sequencers such as the Illumina MiSeq, the rate of technical errors corrected by debar will be lower, as this platform is more prone to introduction of substitution, as opposed to indel, errors (Schirmer et al. 2015). Although the debar package contains a PHMM for only the common animal barcode COI, the denoising algorithm can in the future be extended and applied in the correction of data for other DNA barcodes with conserved structures.
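As a rough illustration of the two correction steps discussed in this section, the following Python sketch applies Viterbi-path-guided indel adjustment to a single read and then takes a majority-rule consensus over aligned reads. It is not the debar implementation (debar is an R package). The function names adjust_read and majority_consensus, the (sequence, rejected) return value, the use of '1' as the match label, the handling of runs other than exactly three labels, and the choice to ignore 'N' placeholders when voting are all assumptions made for illustration. The '0' and '2' labels, the skipping of triple inserts and deletes, the adjustment limit of 5, and the 7 bp mask follow the description in the Figure 1 caption.

```python
from collections import Counter
from itertools import groupby


def adjust_read(seq, path, adjust_limit=5, mask_length=7):
    """Correct indels in one read using a Viterbi path (illustrative only).

    path is a string with one label per step: '1' match (assumed label),
    '2' inserted base, '0' deleted base.
    Returns (corrected_sequence, rejected).
    """
    if sum(label in "12" for label in path) != len(seq):
        raise ValueError("path must consume every base of seq")

    out = []      # corrected sequence under construction
    spans = []    # (start, end) positions in out touched by a correction
    pos = 0       # index of the next unread base of seq
    n_adjust = 0  # number of single-base adjustments applied

    for label, run in groupby(path):
        k = len(list(run))
        if label == "1":
            out.extend(seq[pos:pos + k])
            pos += k
        elif label == "2":
            if k == 3:
                # Triple insert: possibly a real amino acid indel, so keep it.
                out.extend(seq[pos:pos + k])
            else:
                spans.append((len(out), len(out)))  # bases are dropped here
                n_adjust += k
            pos += k
        elif label == "0":
            if k != 3:  # triple deletes are likewise left alone
                spans.append((len(out), len(out) + k))
                out.extend("N" * k)  # placeholder for each missing base
                n_adjust += k
        else:
            raise ValueError(f"unexpected path label: {label!r}")

    # Too many adjustments suggests noise other than instrument error.
    if n_adjust > adjust_limit:
        return None, True

    # Mask the flanks of every correction, since its position may be inexact.
    for start, end in spans:
        lo = max(0, start - mask_length)
        hi = min(len(out), end + mask_length)
        out[lo:hi] = "N" * (hi - lo)
    return "".join(out), False


def majority_consensus(reads):
    """Majority-rule consensus of aligned reads; ties give 'N'.

    'N' placeholders do not vote (an assumption of this sketch).
    """
    consensus = []
    for i in range(max(len(r) for r in reads)):
        votes = Counter(r[i] for r in reads if i < len(r) and r[i] != "N")
        top = votes.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            consensus.append("N")
        else:
            consensus.append(top[0][0])
    return "".join(consensus)
```

Because every corrected read shares the same coordinate frame, a random substitution present in one read is simply outvoted at its column, whereas a substitution shared by most reads survives the vote, which mirrors the limitation noted above for systematic errors.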
Conclusion

This study has described debar, an R package for denoising DNA barcode data, and demonstrated its ability to correct indels caused by instrument error in both barcode and metabarcode sequences. In each dataset, debar improved sequence quality. It reduced the apparent number of indels by 75% in data generated by Sequel, increasing the proportion of sequences that met the quality standards required to qualify as a reference barcode. The merits of debar for metabarcode analysis were twofold: it allowed more likely consensus sequences to be obtained for OTUs, and it allowed intra-OTU variation to be quantified with higher confidence. Overall, debar is a robust utility for identifying deviations from the highly conserved protein-coding sequence of the COI barcode region. Corrections informed by its use improve the separation of true biological variation from technical noise, with low frequencies of false corrections. Integration of debar into the workflows for processing barcode and metabarcode data will allow biological variation to be characterized with higher accuracy.
Acknowledgements

This research was supported by grants from Genome Canada through Ontario Genomics and from the Ontario Ministry of Economic Development, Job Creation and Trade. The funders played no role in study design or decision to publish. This research was enabled in part by resources provided by Compute Canada (www.computecanada.ca). We thank Tony Kuo and Thomas Braukmann for aid with data acquisition and interpretation and Tony for helpful comments on the manuscript.

References

Amir, A., McDonald, D., Navas-Molina, J. A., Kopylova, E., Morton, J. T., Xu, Z. Z., ... & Knight, R. (2017). Deblur rapidly resolves single-nucleotide community sequence patterns. MSystems, 2(2).

Baynham-Herd, Z., Amano, T., Sutherland, W. J., & Donald, P. F. (2018). Governance explains variation in national responses to the biodiversity crisis. Environmental Conservation, 45(4), 407-418.

Braukmann, T. W., Ivanova, N. V., Prosser, S. W., Elbrecht, V., Steinke, D., Ratnasingham, S., ... & Hebert, P. D. N. (2019). Metabarcoding a diverse arthropod mock community. Molecular Ecology Resources, 19(3), 711-727.

Brown, E. A., Chain, F. J., Zhan, A., MacIsaac, H. J., & Cristescu, M. E. (2016). Early detection of aquatic invaders using metabarcoding reveals a high number of non-indigenous species in Canadian ports. Diversity and Distributions, 22(10), 1045-1059.

Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: high-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581.

Clare, E. L., Chain, F. J., Littlefair, J. E., & Cristescu, M. E. (2016). The effects of parameter choice on defining molecular operational taxonomic units and resulting ecological analyses of metabarcoding data. Genome, 59(11), 981-990.

Cordier, T., Lanzén, A., Apothéloz-Perret-Gentil, L., Stoeck, T., & Pawlowski, J. (2019). Embracing environmental genomics and machine learning for routine biomonitoring. Trends in Microbiology, 27(5), 387-397.

Cristescu, M. E. (2014). From barcoding single individuals to metabarcoding biological communities: towards an integrative approach to the study of global biodiversity. Trends in Ecology & Evolution, 29(10), 566-571.

Delabye, S., Rougerie, R., Bayendi, S., Andeime-Eyene, M., Zakharov, E. V., deWaard, J. R., ... & Mavoungou, J. F. (2019). Characterization and comparison of poorly known moth communities through DNA barcoding in two Afrotropical environments in Gabon. Genome, 62(3), 96-107.

Driscoll, D. A., Bland, L. M., Bryan, B. A., Newsome, T. M., Nicholson, E., Ritchie, E. G., & Doherty, T. S. (2018). A biodiversity-crisis hierarchy to evaluate and refine conservation indicators. Nature Ecology & Evolution, 2(5), 775-781.

Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press.

Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics (Oxford, England), 14(9), 755-763.

Eddy, S. R. (2009). A new generation of homology search tools based on probabilistic inference. In Genome Informatics 2009: Genome Informatics Series Vol. 23 (pp. 205-211).

Edgar, R. C. (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. BioRxiv, 081257.

Elbrecht, V., Vamos, E. E., Steinke, D., & Leese, F. (2018). Estimating intraspecific genetic diversity from community DNA metabarcoding data. PeerJ, 6, e4644.

Folmer, O., Black, M., Hoeh, W., Lutz, R., & Vrijenhoek, R. (1994). DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol Mar Biol Biotechnol, 3(5), 294-299.

Frøslev, T. G., Kjøller, R., Bruun, H. H., Ejrnæs, R., Brunbjerg, A. K., Pietroni, C., & Hansen, A. J. (2017). Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates. Nature Communications, 8(1), 1-11.

Hajibabaei, M., Spall, J. L., Shokralla, S., & van Konynenburg, S. (2012). Assessing biodiversity of a freshwater benthic macroinvertebrate community through non-destructive environmental barcoding of DNA from preservative ethanol. BMC Ecology, 12(1), 28.

Hajibabaei, M., Baird, D. J., Fahner, N. A., Beiko, R., & Golding, G. B. (2016). A new way to contemplate Darwin’s tangled bank: how DNA barcodes are reconnecting biodiversity science and biomonitoring. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1702), 20150330.

Hebert, P. D. N., Cywinska, A., Ball, S. L., & deWaard, J. R. (2003). Biological identifications through DNA barcodes. Proceedings of the Royal Society of London. Series B: Biological Sciences, 270(1512), 313-321.

Hebert, P. D. N., Ratnasingham, S., Zakharov, E. V., Telfer, A. C., Levesque-Beaudin, V., Milton, M. A., ... & deWaard, J. R. (2016). Counting animal species with DNA barcodes: Canadian insects. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1702), 20150333.

Hebert, P. D. N., Braukmann, T. W., Prosser, S. W., Ratnasingham, S., deWaard, J. R., Ivanova, N. V., ... & Zakharov, E. V. (2018). A Sequel to Sanger: amplicon sequencing that scales. BMC Genomics, 19(1), 219.

Hubert, N., & Hanner, R. (2015). DNA barcoding, species delineation and taxonomy: a historical perspective. DNA Barcodes, 3(1), 44-58.

Kaunisto, K. M., Roslin, T., Sääksjärvi, I. E., & Vesterinen, E. J. (2017). Pellets of proof: First glimpse of the dietary composition of adult odonates as revealed by metabarcoding of feces. Ecology and Evolution, 7(20), 8588-8598.

Kumar, V., Vollbrecht, T., Chernyshev, M., Mohan, S., Hanst, B., Bavafa, N., ... & Golden, M. (2019). Long-read amplicon denoising. Nucleic Acids Research, 47(18), e104.

Lopez-Vaamonde, C., Sire, L., Rasmussen, B., Rougerie, R., Wieser, C., Allaoui, A. A., ... & Lees, D. C. (2019). DNA barcodes reveal deeply neglected diversity and numerous invasions of micromoths in Madagascar. Genome, 62(3), 108-121.

Nearing, J. T., Douglas, G. M., Comeau, A. M., & Langille, M. G. (2018). Denoising the Denoisers: an independent evaluation of microbiome sequence error-correction approaches. PeerJ, 6, e5364.

Nugent, C. M., Elliott, T. A., Ratnasingham, S., & Adamowicz, S. J. (2020). Coil: an R package for cytochrome C oxidase I (COI) DNA barcode data cleaning, translation, and error evaluation. Genome, 63(6), 291-305.

Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-based registry for all animal species: the Barcode Index Number (BIN) system. PloS One, 8(7).

Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., & Sokhansanj, B. (2008). Metagenome Fragment Classification Using N-Mer Frequency Profiles. Advances in Bioinformatics, 2008.

Schirmer, M., Ijaz, U. Z., D’Amore, R., Hall, N., Sloan, W. T., & Quince, C. (2015). Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Research, 43(6), e37.

Sogin, M. L., Morrison, H. G., Huber, J. A., Welch, D. M., Huse, S. M., Neal, P. R., ... & Herndl, G. J. (2006). Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proceedings of the National Academy of Sciences, 103(32), 12115-12120.

Stat, M., Huggett, M. J., Bernasconi, R., DiBattista, J. D., Berry, T. E., Newman, S. J., ... & Bunce, M. (2017). Ecosystem biomonitoring with eDNA: metabarcoding across the tree of life in a tropical marine environment. Scientific Reports, 7(1), 1-11.

Taberlet, P., Coissac, E., Hajibabaei, M., & Rieseberg, L. H. (2012). Environmental DNA. Molecular Ecology, 21(8), 1789-1793.

Wilkinson, S. P. (2018). kmer: an R package for fast alignment-free clustering of biological sequences. R package version 1.0.0. https://cran.r-project.org/package=kmer

Wilkinson, S. P. (2019). aphid: an R package for analysis with profile hidden Markov models. Bioinformatics, 35(19), 3829-3830.

Wilson, J. J., Brandon-Mong, G. J., Gan, H. M., & Sing, K. W. (2019). High-throughput terrestrial biodiversity assessments: mitochondrial metabarcoding, metagenomics or metatranscriptomics? Mitochondrial DNA Part A, 30(1), 60-67.

Wirta, H. K., Hebert, P. D. N., Kaartinen, R., Prosser, S. W., Várkonyi, G., & Roslin, T. (2014). Complementary molecular information changes our perception of food web structure. Proceedings of the National Academy of Sciences, 111(5), 1885-1890.

Zizka, V. M., Weiss, M., & Leese, F. (2020). Can metabarcoding resolve intraspecific genetic diversity changes to environmental stressors? A test case using river macrozoobenthos. BioRxiv.
Data Accessibility Statement

DNA barcode sequences used in training of the Profile Hidden Markov Models are available in the Supplementary data of the following paper: https://doi.org/10.1139/gen-2019-0206. DNA barcode sequences used in model testing are available in this manuscript’s Supplementary files. The R source code for the debar package is available on GitHub: https://github.com/CNuge/debar. Additional data and code are available on request from the authors.

Author Contributions

The study was conceived and designed by SJA, PDNH, SR, and CMN. The programming of the debar package was performed by CMN. Analyses of package performance were performed by CMN with resources, design, and other assistance provided by TAE, SR, and SJA. The initial draft of the manuscript was written by CMN and SJA. All authors contributed to the editing of the manuscript.

Tables and Figures

Table 1. Summary of the results for the 29,525 barcode sequences (produced from PacBio Sequel data analyzed using the mBRAVE platform) after processing with the debar pipeline.
| PacBio Sequel run | Run 1 | Run 2 | Run 3 | Run 4 | Total |
| --- | --- | --- | --- | --- | --- |
| Consensus sequences generated | 7,518 | 7,373 | 7,235 | 7,399 | 29,525 |
| Consensus sequences flagged by coil for indel error | 869 | 817 | 900 | 909 | 3,495 (11.8%) |
| Rejected by debar denoising | 8 | 4 | 16 | 9 | 37 (0.1%) |
| Sequences flagged by coil post-denoising | 256 | 285 | 305 | 277 | 1,123 (3.8%) |
| Sequences corrected | 605 | 528 | 579 | 623 | 2,335 (66.8% of flagged sequences) |

Table 2. Assessment of the correction ability of the debar pipeline for the subset of sequences in the high-confidence error set. This set of sequences was flagged by coil and produced a stop codon when translated within all reading frames. The top half of the table indicates the number of sequences flagged by coil as likely to be erroneous, based on the log likelihood values of the sequences. Results are shown for sequences both before and after the denoising process. The bottom half of the table contains the number of sequences flagged by coil as likely to be erroneous, based on the presence of a stop codon in the amino acid sequence resulting from the censored translation of the framed nucleotide sequence. The high success for the stop-codon metric (86.3% of errors removed) indicates that the pipeline is an effective means of correcting frameshift-causing insertion or deletion errors. The relatively lower success at correcting sequences with low log likelihood values suggests that frameshift-causing errors are not the only set of errors being flagged by coil, and that non-frameshift errors are not effectively corrected by the debar pipeline.

| PacBio Sequel run | Run 1 | Run 2 | Run 3 | Run 4 | Total |
| --- | --- | --- | --- | --- | --- |
| Original flagged | 551 | 547 | 609 | 610 | 2,317 |
| Flagged post-denoising | 254 | 280 | 300 | 271 | 1,105 |
| Corrected | 53.9% | 48.8% | 50.7% | 55.6% | 52.3% |

| PacBio Sequel run | Run 1 | Run 2 | Run 3 | Run 4 | Total |
| --- | --- | --- | --- | --- | --- |
| Original stop codon | 319 | 295 | 318 | 350 | 1,282 |
| Stop codon post-denoising | 43 | 42 | 36 | 55 | 176 |
| Corrected stop codons | 86.5% | 85.7% | 88.7% | 84.2% | 86.3% |

Table 3. Result of the BOLD Data System evaluation of the debar denoising workflow’s effectiveness. The table reports the number of sequences identified by BOLD as containing stop codons before and after processing with the denoising pipeline (Figure 2). Only the 27,041 specimens with barcodes and taxonomic information produced through the processing of PacBio Sequel data on the mBRAVE platform were considered, as BOLD requires taxonomic information for assessing the presence of stop codons. The rows break the sequences down into categories, which indicate the source of the post-denoising sequence that was submitted to BOLD for assessment.
| Sequence category | Total sequence count | Stop codon count (original) | Stop codon count (post-denoising) | Percent error reduction |
| --- | --- | --- | --- | --- |
| Unaltered | 23,992 | 88 | 88† | - |
| Denoised, altered | 2,265 | 1,190 | 59† | 95% |
| Flagged for potential error, unaltered | 701 | 223 | 223* | - |
| Flagged and rejected | 16 | 14 | 14 | - |
| Labelled as Wolbachia by mBRAVE | 67 | 0 | 0 | - |
| Total | 27,041 | 1,515 (6.3%) | 384 (1.6%) | 74.66% |
| Total, non-flagged only | 26,257 | 1,278 (4.8%) | 147 (0.6%) | 88.5% |

† The sum of these categories (shown in the final row of the column) represents the false negative rate for the denoising pipeline. These are the 0.6% (147/27,041) of sequences that appear to contain true stop codons that were not flagged for denoising, or that were denoised unsuccessfully and not flagged as potential errors.

* The false positive rate of the denoising pipeline is the number of sequences in this category that do not in fact contain a stop codon. There are a total of 478 (701 - 223) false positives and an overall false positive rate of 1.8% (478/27,041). Since this set of sequences is flagged for potential errors, as opposed to being outright rejected, additional inspection of sequences in this category can separate the unsuccessfully denoised sequences with true errors from those that do not contain an error.

Table 4. Assessment of the sequence quality for data from a mock community of arthropods sequenced in bulk using a Thermo Fisher Ion Torrent and processed on the mBRAVE platform.
Sequencing and processing result in two sets of data: groups of sequences assigned to BINs and groups of sequences clustered into OTUs. The representative sequences (centroids before denoising, consensus after denoising) and all individual sequences were checked with the R package coil for evidence of frameshifts (stop codons in the amino acid sequence) before and after denoising to see if processing the data with the debar package resulted in higher quality barcode sequences.

| Sequences analyzed | Sequence data source | Original: total count | Original: stop codon count | After debar denoising: total count | After debar denoising: stop codon count |
| --- | --- | --- | --- | --- | --- |
| Representative sequences | Assigned to BINs | 398 | 125 (31.4%) | 394 | 7 (1.8%) |
| Representative sequences | OTUs | 1,255 | 681 (54%) | 1,224 | 134 (10.6%) |
| ESVs | Assigned to BINs | 123,926 | 61,351 (49.5%) | 122,349 | 2,858 (2.3%) |
| ESVs | OTUs | 2,199 | 1,310 (59.57%) | 2,145 | 418 (19.49%) |

Figure 1. Diagram demonstrating the debar package’s denoising workflow. Blue indicates nucleotides that are part of the barcode region and orange nucleotides in bold font indicate technical errors or sequence from outside of the barcode region.

A. The debar package operates on a sequence-by-sequence basis, taking each input and constructing a custom DNAseq object. A DNAseq object can receive a DNA sequence, an identifier, and optionally a sequence of corresponding PHRED quality scores. Although not utilized in the denoising, indel-correcting adjustments to the sequence are applied to the PHRED scores as well, so that quality information can be carried from input to output.

B. Following DNAseq object construction, the sequence is compared to the PHMM using the Viterbi algorithm. By default, the full length (657 bp) COI-5P PHMM contained in debar is used to evaluate the sequence. When required, a user may pass a custom PHMM corresponding to a subsection of the COI-5P region (specified using the coil package’s subsetPHMM function) or a custom PHMM trained on user-defined data (Wilkinson 2019). The frame function isolates the correction window, which is the section of the sequence matching the PHMM (the first 10 consecutive base pairs matching to the PHMM on the leading and trailing edges of the sequence establish the section of the input on which subsequent corrections are applied).

C. The adjust function traverses the section of the sequence and Viterbi path defined by the frame function. When evidence of an inserted base pair (a ‘2’ label in the Viterbi path) is encountered, the corresponding base pair is removed. When evidence of a deleted base pair (a ‘0’ label in the Viterbi path) is encountered, a placeholder ‘N’ nucleotide is inserted.
Exceptions are made for triple inserts or triple deletes (three consecutive ‘0’ or ‘2’ labels), which are skipped by the adjustment algorithm, as they are indicative of mutations that would not have a large impact on the structure of the protein-coding gene region and could reflect biological amino acid indels. The total number of adjustments made by debar is limited by the parameter ‘adjust_limit’ (default = 5); sequences requiring adjustments in excess of this number are flagged for rejection, as this high frequency of indels is likely not the result of technical error, but rather of other sources of noise such as pseudogenes. Following adjustment, a mask of placeholder ‘N’ nucleotides is applied to base pairs flanking the corrected indel (default is 7 bp in each direction; see Figure 3 for derivation of the default). Masking of 7 bp flanks adjacent to each correction allows imprecise corrections to effectively correct sequence length and also mask true indel locations in the majority of instances.

D. Following adjustment, the denoised sequences are output by debar. By default, the outputs will include trailing sequence outside of the correction window. Leading information outside of the correction window is dropped, so that sequences are aligned with a common starting position. A user can choose to keep only the correction window, or have both flanking regions appended back on to the sequence output.

E. If multiple denoised sequences are available (for either a given specimen in the case of barcoding or a given OTU in metabarcoding) then the consensus of the denoised sequences can be taken. The consensus function assumes the sequences have been denoised and their left flanks removed; as a result, they are aligned to one another.
The modal base pair for each position is then taken to generate a consensus sequence, and in the case of ties, a placeholder “N” character is added to the consensus.

Figure 2. Diagram of the denoising workflow used to improve the quality of barcodes produced by processing Pacific Biosciences Sequel circular consensus data on the mBRAVE platform.
(i) Pacific Biosciences Sequel data are processed on the mBRAVE platform, and an initial set of barcode sequences is produced.
(ii) The set of consensus barcode sequences produced by the mBRAVE platform is obtained and analyzed with the coil package, using the ‘coi5p_pipe’ function (default parameters). Sequences displaying evidence of an indel (either the presence of a stop codon when translated to amino acids or an amino acid sequence with a low likelihood score) are retained for further denoising.
(iii) For each barcode with evidence of an error, all component CCS reads (and associated metadata) derived from the given specimen are obtained from mBRAVE.
(iv) Based on the mBRAVE metadata, sequences are trimmed to remove primers, MID tags, and adapter sequence. The reverse complement of a read is taken when required.
(v) The ‘denoise_list’ function of debar is used to denoise all CCS reads (options: dir_check = FALSE, keep_flanks = ‘right’, censor_length = 7). Rejected reads (those flagged by the denoise_list function) are removed from the dataset.
(vi) For each specimen, the reads are clustered into OTUs using the R package kmer (clustering threshold = 0.975). This is done to mitigate the influence of outlier CCS or contaminant sequences.
(vii) For each OTU, a consensus sequence is generated using debar’s ‘consensus’ function. For each specimen, OTUs are ranked based on the number of component CCS reads they contain.
(viii) The consensus sequences are reassessed with coil. If the top-ranked consensus sequence now passes the coil check, it is deemed to have been successfully denoised, and it is selected as the output barcode. If not, the check is repeated for the second-ranked consensus sequence (when available), and this output is retained if it is barcode compliant. If neither the first nor second highest ranked consensus sequence passes the coil check, then the original (pre-denoising process) barcode is retained, as no meaningful improvement was made. In this situation the barcode is flagged as likely to contain an error.

Figure 3. The debar package’s denoising of 10,000 COI sequences containing single insertion or deletion errors. So that exact error positions were known, errors were artificially introduced in accordance with known probabilities for COI DNA barcode data from the PacBio Sequel platform (Hebert et al. 2018). Denoising was accomplished through altering sequences in accordance with the Viterbi path yielded by comparison to the PHMM.
The 895 correct number of adjustments was made for 9,455 sequences, and 61.8% of these corrections 896 located the indel exactly. Masking of 7bp flanks adjacent to each correction allowed 897 imprecise corrections to correct sequence length and mask the true indel location 96% of the 898 time. For the 545 instances where an incorrect number of adjustments were made, 269 were 899 caught through query of the amino acid sequence for stop codons and the trimming of 900 spurious matches at the edge of sequences. Overall, 95.74% of errors were effectively 901 corrected or identified as erroneous. 902 903 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425285 http://creativecommons.org/licenses/by-nc/4.0/ 42 904 905 Figure 4. Histogram indicating the position in the COI-5P region of the 426 uncorrected indel 906 errors from the 10,000-sequence artificial error dataset. The x axis indicates the base pair 907 position in the COI-5P profile, and the y axis displays the number of sequences that contained an 908 uncorrected error at the given range of positions (bins of 10 base pair positions). 909 910 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425285doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425285 http://creativecommons.org/licenses/by-nc/4.0/ 43 911 912 Figure 5. 
Histogram showing the number of base pairs between inexact corrections applied by debar and the ground truth error location for the given sequence. In total 3,612 sequences (36.12%) had errors that were denoised inexactly, and corrections were an average of 2.31 bp (sd = 1.9767) away from the exact ground truth error location.

Figure 6. Relationship between the amount of missing data in the final denoised barcode sequences (number of Ns divided by the total length of the sequence) and the number of CCS reads that contributed to the generation of the barcode. The figure displays only the 1,008 denoised barcode sequences submitted to BOLD that contained at least one “N” (the remaining 28,517 barcode sequences in the BOLD submission did not contain an “N”).

Supplementary Information

Supplementary File 1 ('S1-single_errors_in_10k_sequences.csv') The 10,000 COI barcode sequences with single introduced indel errors that were used to test debar and calibrate the default parameters.

Supplementary File 2 ('S2-control_denoising_no_errors.csv') The 10,000 COI barcode sequences with no known indel errors used to assess the false correction rate of debar.

Supplementary File 3 ('S3-single_file_pipeline') Scripts and example data for the denoising pipeline developed to process COI DNA barcode sequence data produced using the Pacific BioSciences Sequel sequencer and mBRAVE platform.

Supplementary File 4 Scripts and example data for the denoising pipeline developed to process COI DNA metabarcode sequence data produced using the IonTorrent S5 sequencer and the mBRAVE platform.

Supplementary File 5 Vignette demonstrating the functionality of the debar package. The vignette is also available as part of the R package (https://github.com/CNuge/debar/tree/master/vignettes).

10_1101-2021_01_04_425288 ---- Predicting chemotherapy response using a variational autoencoder approach

Bioinformatics
doi.10.1093/bioinformatics/xxxxxx
Advance Access Publication Date: Day Month Year
Original Paper

Topic Area: Biomedical Informatics

Predicting chemotherapy response using a variational autoencoder approach

Qi Wei 1∗ and Stephen A. Ramsey 2∗
1School of EECS, Oregon State University, Corvallis, Oregon 97333, USA
2Department of Biomedical Sciences and School of EECS, Oregon State University, Corvallis, Oregon 97333, USA.
∗To whom correspondence should be addressed.
Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract

Motivation: Multiple studies have shown the utility of transcriptome-wide RNA-seq profiles as features for machine learning-based prediction of response to chemotherapy in cancer. While tumor transcriptome profiles are publicly available for thousands of tumors for many cancer types, a relatively modest number of tumor profiles are clinically annotated for response to chemotherapy. The paucity of labeled examples and high dimension of the feature data limit performance for predicting therapeutic response using fully-supervised classification methods. Recently, multiple studies have established the utility of a deep neural network approach, the variational autoencoder (VAE), for generating meaningful latent features from original data. Here, we report the first study of a semi-supervised approach using VAE-encoded tumor transcriptome features and regularized gradient boosted decision trees (XGBoost) to predict chemotherapy drug response for five cancer types: colon adenocarcinoma, pancreatic adenocarcinoma, bladder carcinoma, sarcoma, and breast invasive carcinoma.

Results: We found: (1) VAE-encoding of the tumor transcriptome preserves the cancer type identity of the tumor, suggesting preservation of biologically relevant information; and (2) as a feature-set for supervised classification to predict response-to-chemotherapy, the unsupervised VAE encoding of the tumor’s gene expression profile leads to better area under the receiver operating characteristic curve (AUROC) classification performance than either the original gene expression profile or the PCA principal components of the gene expression profile, in four out of five cancer types that we tested.

Availability: github.com/ATHED/VAE_for_chemotherapy_drug_response_prediction
Contact: ramseyst@oregonstate.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.04.425288; this version posted January 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

© The Author 2021. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

1 Introduction

Although chemotherapy is a mainstay of treatment for aggressive cancers, many agents have serious side effects (Airley, 2009). Whether or not chemotherapy will provide a net benefit to a patient depends in large part on whether the malignancy responds to the treatment. Chemotherapy is often administered in cycles (Skeel, 2003), leading to multiple opportunities where treatment appropriateness may be (re-)assessed (Chabner and Longo, 2005). Currently, the medical cost-benefit of chemotherapy (versus a non-pharmaceutical approach) is assessed in light of patient health status, expected therapeutic tolerance, and tumor pathological classification (Kaestner and Sewell, 2007; Gurney, 2002). For many cancer types, there is a broad spectrum of cases where the decision of whether or not to undergo or continue chemotherapy is difficult (Corrie, 2008; Whelan et al., 2003; Malfuson et al., 2008). The development of a quantitative model that could predict, based on a specific tumor’s molecular signature, whether or not the tumor will respond to chemotherapy would have significant clinical utility and would potentially improve patient quality-of-life. Moreover, an advance in machine-learning methods for the response-to-chemotherapy prediction problem (Chiu et al., 2019; Geeleher et al., 2014) would have potential crossover benefits for other prediction problems in precision medicine.

Oncogenesis is driven by alterations in the somatic genome and epigenome in cancer cells (Weir et al., 2004); however, the somatic genetic or epigenetic determinants of response to chemotherapy are also thought to exert measurable effects on gene expression in the tumor. Consistent with this theory, studies of various cancer types have demonstrated that biomarkers identified from systematic measurement of the patient’s cancer transcriptome or proteome correlate with the probability that a tumor will respond to chemotherapy, for example, a five-protein signature in breast cancer (Gámez-Pozo et al., 2017), 13- and 14-gene signatures in rectal cancer (Casado et al., 2011; Del Rio et al., 2007), and a 63-gene signature in liver cancer (Kurokawa et al., 2004).

Taken together, the findings from such “omics” biomarker studies suggest that transcriptome measurements based on RNA sequencing (RNA-seq; Wang et al., 2009) of tumor samples labeled with clinical response can be used to train machine-learning classifiers for predicting response to chemotherapy. However, the accuracy of such models is presently limited by the small number of available training cases that are labeled for clinical outcome, given the large size of the transcriptome (∼60k genes; Frankish et al., 2018) and the significant intertumoral variance of gene expression. For typical cancers, most of the profiled tumor transcriptomes are not labeled for chemotherapeutic response; the ratio of such unlabeled to labeled tumor datasets in the Cancer Genome Atlas (TCGA) dataset (Hutter and Zenklusen, 2018) ranges from 10 to 20, depending on the cancer type.
While using (exclusively) supervised learning methods for the response-to-chemotherapy prediction problem has been a sensible first step, unlabeled data are a substantial resource that could, in the context of a semi-supervised approach, reveal multivariate structure or patterns that could ultimately improve predictive accuracy. Semi-supervised approaches that fuse unsupervised data reduction methods (such as principal components analysis, or PCA) for low-dimensional embedding with supervised methods (such as decision trees) for prediction have proved beneficial in problems where large unlabeled datasets are available; for example, a PCA-XGBoost method has been previously used in finance (Wen and Huang, 2020), and an independent components analysis-based method has been used to classify electroencephalographic signals (Qin et al., 2006).

Multiple studies (An and Cho, 2015; Li and She, 2017; Bouchacourt et al., 2017; Kipf and Welling, 2016) have established the power of the variational autoencoder (VAE; Kingma and Welling, 2013; Jimenez Rezende et al., 2014), an unsupervised nonlinear data embedding model with two deep neural networks oppositely connected through a low-dimensional probabilistic latent space, for finding meaningful and useful latent features in high-dimensional data. In the context of cancer bioinformatics, VAEs have been variously used to (i) model cancer gene expression and capture biologically-relevant features using the TCGA Pan-cancer Project RNA-seq dataset (Way and Greene, 2018); (ii) find encodings that correlate with biological features such as patient sex and tumor type (Titus et al., 2018); (iii) find encodings that can be used to predict gene inactivation in cancer (Way and Greene, 2017); and (iv) obtain an encoding that is predictive of chemotherapy resistance (George and Lio, 2019).
Based on their exploration of multiple VAE architectures for predicting gene inactivation in a pan-cancer dataset, Way and Greene (2017) reported biological insights obtained from the latent-space embeddings learned by VAEs. George and Lio (2019) used a VAE-based, fully unsupervised approach to encode ovarian tumor transcriptomes and obtained latent-space features that were associated with response to chemotherapy. These studies suggest that a tumor transcriptome VAE may be broadly useful for the response-to-chemotherapy prediction problem and they set the stage for the present multi-cancer investigation of the utility of the tumor transcriptome VAE in precision oncology.

Given previous reports of success using a VAE to obtain useful low-dimensional encodings of transcriptome data (Dong et al., 2020; Way and Greene, 2018; Way and Greene, 2017), in this work, we first sought to ascertain to what extent a VAE encoding of tumor transcriptome data would preserve biological characteristics (spanning multiple genes at a time that have coordinated variation across tumors) that are associated with distinct cancer types. To answer this question, we trained a pan-cancer transcriptome VAE and used it to encode TCGA tumor RNA-seq data from 9,310 tumors comprising 32 different cancer types, focusing on the top 5,000 most variable genes. We trained the VAE using an efficient contemporary optimization engine (Adam) to find the VAE coefficient values that together balance reconstruction loss and desired latent-space distributional shape. We applied an unsupervised two-dimensional embedding method (t-distributed stochastic neighbor embedding, or t-SNE) directly to the tumor transcriptome data and to the VAE-embedded tumor transcriptome data, and mapped clusters of tumors by cancer type across the two t-SNE embeddings. We found (Sec. 2.1) that the VAE preserves the clustering of tumors of the same cancer type, suggesting biological fidelity in the components of the VAE embedding.
Next, to set the stage for a semi-supervised approach for predicting cancer response to chemotherapy, we selected five cancer types (breast, bladder, colon, pancreatic, and sarcoma) based on sufficient availability of clinically labeled data and then defined three different VAE architectures: VAE-1, which we used to obtain feature data for bladder, breast, and pancreatic cancer; VAE-2, for sarcoma; and VAE-3, for colon cancer. In order to train a VAE, it is necessary to specify a reconstruction loss function; both L2 and L1 reconstruction loss have been used for training VAEs in machine learning, and we sought to clarify which is best for this application. Thus, we trained each of the three VAE architectures on 2,606 tumor transcriptomes from TCGA, in an unsupervised fashion, separately using L1 loss and L2 loss. Next, in order to label tumors for response to chemotherapy, we analyzed the available TCGA clinical data regarding the outcome of pharmaceutical therapy (in most cases including chemotherapy) for each of the patients, and thereby assigned a label “responded” or “progressive” to 806 out of the 2,606 tumors (Sec. 2.2); the remainder of the tumors were unlabeled and thus used only during VAE training. For the 806 labeled tumors, we used the VAE-encoded latent vectors as feature data for supervised prediction of the binary label using gradient boosted decision trees (XGBoost; Chen and Guestrin, 2016). Using this semi-supervised “VAE-XGBoost” approach, we found (Sec. 2.3) that a VAE trained using L1 reconstruction loss yields features that result in better classification performance (by area under the receiver operating characteristic, AUROC) than a VAE trained using L2 reconstruction loss.
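
To make the distinction between the two reconstruction losses concrete, the following is a minimal sketch in plain Python; it is not the authors' implementation. It assumes the standard VAE objective, i.e., a reconstruction term plus the closed-form Kullback-Leibler divergence between a diagonal Gaussian and a standard normal prior. The function names, the toy vectors, and the unweighted sum of the two terms are illustrative assumptions; the formulation actually used in this work is the one given in Sec. 5.

```python
import math


def l1_reconstruction_loss(x, x_rec):
    """Sum of absolute differences between input and reconstruction."""
    return sum(abs(a - b) for a, b in zip(x, x_rec))


def l2_reconstruction_loss(x, x_rec):
    """Sum of squared differences between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


def vae_loss(x, x_rec, mu, sigma, reconstruction="l1"):
    """Hypothetical unweighted objective: reconstruction term + KL term."""
    rec = l1_reconstruction_loss if reconstruction == "l1" else l2_reconstruction_loss
    return rec(x, x_rec) + kl_to_standard_normal(mu, sigma)


# Toy vectors for illustration only (not data from this study).
x = [0.0, 0.5, 1.0]
x_rec = [0.1, 0.5, 0.0]
loss_l1 = vae_loss(x, x_rec, mu=[0.0], sigma=[1.0], reconstruction="l1")
loss_l2 = vae_loss(x, x_rec, mu=[0.0], sigma=[1.0], reconstruction="l2")
```

The only difference between the two variants is how a reconstruction error is penalized: linearly in its magnitude for L1 and quadratically for L2.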
In the main part of this work, using XGBoost, we measured response-to-chemotherapy prediction performance for each of three tumor transcriptome-derived feature sets: (i) expression levels of the top 20% of genes, by intertumoral variance (a fully supervised approach); (ii) the first 387 principal components of expression levels of “top 20%” genes (“semi-supervised PCA-XGBoost”); and (iii) VAE-encoded expression levels of the top 20% genes (“semi-supervised VAE-XGBoost”, our new method, Fig. 1). Within a cross-validation framework for AUROC performance evaluation, we found (Sec. 2.4) that for four out of five cancer types, the semi-supervised VAE-XGBoost approach outperformed the fully-supervised approach. Moreover, for four out of the five cancer types, semi-supervised VAE-XGBoost outperformed semi-supervised PCA-XGBoost. Finally, for the one cancer type for which PCA-XGBoost outperformed VAE-XGBoost, we investigated their relative performance through the lens of XGBoost feature importance (Sec. 2.5). Below, we describe our results (Sec. 2) and the VAE-XGBoost method in detail (Sec. 5).

Fig. 1 diagram labels: gene expression data (original input, x); encoder network (Eq. 2); mean vector (µ_θ̂) and variance vector (σ_θ̂); reparameterized sampling (Eqs. 5 and 6); sampled latent vector (z); decoder network (g_φ̂); reconstructed gene expression data (output, x̃); labeled input (y); latent vector and label as input (z, y); XGBoost classifier (Eqs. 12 and 13); probability of predicted label, P(ỹ | z).

Fig. 1: Overview of the VAE-XGBoost method that we used for predicting tumor response to chemotherapy. For each tumor t, the encoder’s input vector x_t contains the levels of the top 20% of genes by intertumoral gene expression variance (Sec. 5.1). Each network has multiple fully connected dense layers (Sec. 5.5). The encoder outputs two vectors of configurable latent variable dimension h ≪ m (Sec. 5.5): a vector of means µ and a vector of standard deviations σ that parameterize the multivariate normal latent-space vector Z|x_t (Sec. 5.3). The sampled encoding Z|x_t = z_t is passed to the decoding neural network (decoder), whose architecture is identical to (with inversion) that of the encoder network. The sampled latent-space vector z_t is passed to XGBoost for supervised classification to predict response to chemotherapy (training label y, prediction ỹ).

2 Results

2.1 VAE encoding preserves cancer type features

Given multiple reports (Dolezal et al., 2018; Esteva et al., 2017) that t-SNE can be used to visualize the grouping of cancer types from high-dimensional molecular tumor data, we investigated the extent to which VAE encoding of tumor transcriptomes preserves data-space features that determine cancer type-specific groupings. In order to do so, we obtained (Sec. 5.1) from the TCGA data portal RNA-seq transcriptome data for 9,310 tumors labeled for 32 different cancer types (listed in Fig. 2). As a baseline view of transcriptome-based cancer type groupings, we generated a two-dimensional embedding of the 9,310 tumor samples by applying t-SNE (Sec. 5.2) to the expression levels of the top 5,000 most variable genes, yielding 32 distinct clusters (Fig. 2A). Next, we trained (Sec. 5.3) a VAE to encode the expression levels of the 5,000 most variable genes in each of 9,310 tumors into 9,310 points in a 50-dimensional latent space.
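
As an aside on how an encoder's outputs become a point in latent space, the sampling step named in the Fig. 1 caption (means µ and standard deviations σ parameterizing a normal latent vector) can be sketched in plain Python as below. This is a minimal illustration of the standard reparameterization z = µ + σ·ε with ε drawn from a standard normal; it is assumed, not confirmed, to correspond to the sampling equations of Sec. 5, and the function name and toy values are hypothetical.

```python
import random


def sample_latent(mu, sigma, rng):
    """Reparameterized draw: z_i = mu_i + sigma_i * eps_i, eps_i ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Toy 3-dimensional latent space (the pan-cancer VAE in the text uses 50).
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
mu = [0.2, -1.0, 0.5]
sigma = [0.1, 0.3, 0.0]  # a zero standard deviation pins that coordinate to mu
z = sample_latent(mu, sigma, rng)
```

Because the randomness enters only through ε, the sampled z remains a differentiable function of µ and σ, which is what allows such an encoder to be trained with a gradient-based optimizer such as Adam.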
An unsupervised t-SNE visualization (Fig. 2B) of the VAE-encoded tumor transcriptome data was remarkably similar in structure to the t-SNE visualization of the 5,000-dimensional original dataset, with intercluster distances for all pairs of clusters correlated between the two t-SNE plots (R = 0.49; see Fig. S1). This analysis indicated that the VAE encoding preserves data-space features that distinguish individual cancer types.

2.2 Obtaining a labeled tumor transcriptome dataset

Having demonstrated that the VAE can efficiently encode tumor transcriptomes while preserving features that distinguish different cancer types, and to set the stage for implementing a semi-supervised approach for predicting response to chemotherapy, we obtained a five-cancer-type tumor transcriptome dataset with a significant subset of the tumors labeled for “response to chemotherapy”, as described below. We obtained transcriptomes of 806 tumors across five cancer types [colon adenocarcinoma (COAD), pancreatic adenocarcinoma (PAAD), bladder carcinoma (BLCA), sarcoma (SARC), and breast invasive carcinoma (BRCA); see Table 1] that we selected based on availability of a sufficient amount of labeled data in TCGA (see Sec. 5.1) and generated binary clinical labels for them corresponding to “responded” or “progressive” (see Sec. 5.4). Among these tumors, the class balance ratio, i.e., the ratio of responding tumors to progressive disease tumors, ranged from a low of 0.77 for pancreatic cancer to a high of 8.61 for breast cancer.

2.3 L1 loss is better than L2 loss for this application

Having obtained 2,606 tumor transcriptomes across five cancer types with 806 of the tumors labeled for response to chemotherapy, we next sought to determine which type of VAE reconstruction loss function (L1 loss or L2 loss) would yield transcriptome encodings that are most amenable to accurate XGBoost-based prediction of response to chemotherapy.
On the 2,606 tumor transcriptomes, we trained two sets of cancer type-specific VAEs (see Sec. 5.5) using L1 and L2 loss functions, respectively. We used the L1 and L2 VAEs to encode the 806 labeled tumor transcriptomes (the top 20% most variable genes in each cancer type, merged across the five cancers, for a total of 13,584 genes) spanning the five cancer types, yielding (for each cancer type) two feature matrices (one for L1 loss and one for L2 loss) that we separately evaluated for XGBoost prediction (Sec. 5.6) of the binary response-to-chemotherapy class label. By test-set area under the receiver operating characteristic (AUROC; Sec. 5.7), averaged across the five cancers, we found (Fig. 3) that the features that were generated by the L1 VAEs led to 6.2% better (p < 10^-9, Welch’s t-test) classification performance than the features generated by the L2 VAEs, and thus, for all subsequent analyses, we used VAEs trained with L1 loss.

2.4 Chemotherapy drug response classification result

Having selected L1 reconstruction loss for training VAEs to encode tumor transcriptomes for predicting response-to-chemotherapy, we focused on the key question of whether (and to what extent) a semi-supervised approach using the VAE can outperform (in terms of predictive accuracy) a fully supervised approach or a semi-supervised approach based on a traditional dimensional reduction technique (principal components analysis, PCA). In brief, our VAE-based semi-supervised approach involves three steps: (i) training a VAE to encode clinically unlabeled tumor transcriptomes (for the top 20% most variable genes) for a single cancer type, into a low-dimensional space (Sec.
5.5); (ii) using that VAE to obtain latent-space encodings for the tumor transcriptomes that are labeled for a relevant clinical endpoint (in this work, response to chemotherapy); and (iii) training and testing a supervised classifier (in this work, XGBoost binary classification) using the latent-space encodings as feature data.

Table 1. Table of numbers of samples with chemotherapy response record for each cancer type (n.b., the total number of labeled tumor samples exceeds the total number of patients because some patients had multiple tumors). After each cancer type, its TCGA abbreviation is shown in parentheses.

cancer type | total number of samples (labeled and unlabeled) | number of labeled samples | proportion of labeled samples | class balance ratio (responding/progressive)
breast invasive carcinoma (BRCA) | 1,217 | 394 | 0.324 | 8.61
colon adenocarcinoma (COAD) | 512 | 117 | 0.229 | 1.72
bladder carcinoma (BLCA) | 430 | 115 | 0.267 | 0.95
pancreatic adenocarcinoma (PAAD) | 182 | 115 | 0.632 | 0.77
sarcoma (SARC) | 265 | 65 | 0.245 | 0.82
sum | 2,606 | 806 | |

Table 2. Quantitative AUROC performances of XGBoost (“Raw data”), PCA-XGBoost (“PCA”), and VAE-XGBoost (“VAE”), along with pairwise comparisons.

Cancer type | AUROC (mean), VAE | AUROC (mean), PCA | AUROC (mean), Raw data | p (Welch’s t-test), VAE versus Raw data | p (Welch’s t-test), VAE versus PCA | p (Wilcoxon signed-rank test), VAE versus Raw data | p (Wilcoxon signed-rank test), VAE versus PCA
BRCA | 0.674 | 0.614 | 0.649 | 8.07 × 10^-4 | 3.80 × 10^-12 | 6.08 × 10^-4 | 5.59 × 10^-9
COAD | 0.694 | 0.726 | 0.674 | 3.74 × 10^-3 | 1.38 × 10^-3 | 6.64 × 10^-3 | 2.37 × 10^-3
BLCA | 0.630 | 0.593 | 0.626 | 4.26 × 10^-1 | 1.13 × 10^-4 | 5.05 × 10^-1 | 1.53 × 10^-4
PAAD | 0.738 | 0.710 | 0.694 | 6.99 × 10^-6 | 5.04 × 10^-3 | 3.24 × 10^-6 | 4.67 × 10^-3
SARC | 0.704 | 0.682 | 0.679 | 3.49 × 10^-2 | 2.91 × 10^-2 | 6.06 × 10^-2 | 3.82 × 10^-2

To address the question of whether this VAE-based, semi-supervised (VAE-XGBoost) approach can outperform a fully supervised approach, we compared the performance of the VAE-XGBoost method to a fully supervised approach in which we applied XGBoost directly to the tumor expression levels of the top 20% most variable genes (13,584 genes) as feature data. In the same analysis, to address the question of whether the VAE-XGBoost method could outperform a semi-supervised approach based on PCA dimensional reduction, we compared the VAE-XGBoost method to the PCA-XGBoost method. We carried out this analysis for each of the five cancer types separately, using the set of cancer type-specific labeled tumors (totaling 806 labeled tumors). We measured performance using test-set AUROC in a cross-validation framework (Sec. 5.7). For four out of five cancer types (breast, colon, pancreatic, and sarcoma), in terms of test-set AUROC, the VAE-XGBoost approach outperformed the fully-supervised approach of applying XGBoost directly to the expression levels of the tumors’ top 20% most variable genes (Fig. 4), by both Welch’s t-test and Wilcoxon’s signed-rank test (Table 2); for BLCA, the semi-supervised VAE-XGBoost and fully-supervised models’ performances were statistically indistinguishable. Additionally, for four out of five cancer types (bladder, breast, pancreatic, and sarcoma), the semi-supervised VAE-XGBoost method significantly outperformed the semi-supervised PCA-XGBoost method (Fig. 4 and Table 2). The five-cancer average AUROC for VAE-XGBoost was 0.682, a performance gain of 5.4% over the five-cancer average AUROC for PCA-XGBoost (0.646) and a gain of 3.6% over the fully-supervised model’s average (0.658).
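
For readers unfamiliar with the unequal-variance comparison reported in Table 2, the Welch's t statistic and its Welch-Satterthwaite degrees of freedom can be computed as in the following minimal sketch, which uses only the Python standard library. The function name and the toy samples are illustrative assumptions (the samples are not AUROC values from this study), and converting the statistic to a p-value additionally requires the Student t distribution, which is omitted here.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df): Welch's t statistic and Welch-Satterthwaite df."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy samples with unequal variances, for illustration only.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(sample_a, sample_b)
```

In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic together with its p-value.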
Notably, a single deep VAE architecture (VAE-1, which had a 50-dimensional latent space and six layers in the encoder; see Sec. 5.5) yielded latent-space encodings that outperformed semi-supervised PCA-XGBoost for three cancer types (bladder, breast, and pancreatic).

2.5 PCA & VAE feature importance scores, for COAD

Having established that the semi-supervised VAE-XGBoost outperforms the semi-supervised PCA-XGBoost approach for tumor transcriptome-based prediction of response to chemotherapy for four out of five cancer types, we sought to understand the basis for the higher performance of PCA-XGBoost over VAE-XGBoost on the fifth cancer type, colon adenocarcinoma (COAD). Specifically, we investigated whether the strong performance of PCA-XGBoost on COAD is attributable to differences in the distributions of XGBoost feature importance scores (Sec. 5.6) of the PCA features versus VAE latent-space features. We found that the distribution of feature importance scores (as a function of rank) was more sharply peaked at the top-ranked features in the VAE than in the PCA (Fig. 5), suggesting that the performance gain with PCA reflects a broader spectrum of informative features for that particular cancer type.

3 Discussion

As far as we are aware, this work is the first report of a broad (five-cancer) investigation of the potential for a VAE-based, semi-supervised approach for predicting response to chemotherapy. Across the five cancer types that we studied, the ratio of responding tumors to progressive disease tumors ranged from a low of 0.77 for pancreatic cancer to a high of 8.61 for breast cancer, reflecting a broad range of resistances to standard-of-care chemotherapy. Our results clearly demonstrate the utility of the VAE for compressing high-dimensional data to a continuous, low-dimensional latent space while retaining features that are essential for distinguishing different cancer types and for predicting response to chemotherapy.
Nevertheless, three limitations of this work bear noting. The first limitation concerns the type(s) of tumor “omics” data from which features are derived for the predictive model. While in this work we focused on tumor transcriptome data, which can be measured with high precision over a wide dynamic range of transcript abundances by RNA-seq, we note that TCGA datasets of tumor somatic mutations and copy number alteration events are also available (Hutter and Zenklusen, 2018). Given the voluminous literature on the use of tumor somatic genomic data for precision cancer diagnosis (Mitchel et al., 2019; Zhang et al., 2020; Lee et al., 2019), tumor DNA datasets are fertile ground for developing a semi-supervised, multi-omics model for predicting response to chemotherapy.

Second, we noted that for decision tree-based response-to-chemotherapy prediction, the performance of VAE-encoded transcriptome features is somewhat sensitive to the type of normalization used for the input data (data not shown). We explored various types of normalization for the RNA-seq data, including standardization of log counts and using FPKM data; we ultimately chose min-max-normalized log2 total-count-normalized counts (Sec. 5.1) for the gene expression levels to be used to derive features. However, there are additional transcript quantification methods (Evans et al., 2017) that could be explored in the context of finding optimal tumor transcriptome VAE encodings for precision oncology. A similar comment applies to the specific form of the reconstruction loss function: in our analysis, features from the VAE trained with L1 loss clearly (across five cancers) outperformed those from the VAE trained with L2 loss, and thus, consistent with Way and Greene (2017), we used L1 loss for the VAE that we used to address the main question of this work (Sec. 2.4) as well as the pan-cancer t-SNE analysis (Sec.
2.1).

Fig. 2: Marks represent tumor transcriptomes visualized using t-SNE, with colors representing cancer types. (A) Original gene expression data of the top 5,000 most variable genes. (B) VAE compressed gene expression data. Red rectangles denote the five cancer types selected for chemotherapy response classification (Sec. 2.4).

Fig. 3: Average AUROC results over five different types of cancer, by loss type (horizontal axis: L1_loss, L2_loss; vertical axis: AUROC). Squares, mean values; bars, 95% confidence interval (c.i.).

The third limitation relates to the VAE architecture. While it is promising that a single deep VAE architecture (VAE-1, with a 50-dimensional latent space and six fully-connected layers) yielded features that outperformed PCA and the original RNA-seq feature data for three different cancer types (bladder, breast, and pancreatic), for colon cancer and sarcoma, it was necessary to use shallower (two-layer) VAE architectures with bigger latent space dimensions (650 and 500, respectively). Of the five cancers studied, colon cancer and sarcoma had the lowest proportions of labeled samples (0.229 and 0.245 respectively; see Table 1). Our findings suggest that for some cancers, a deep, low-latent-dimension VAE architecture yields optimal features for predicting response, while for other cancers, a shallow, medium-sized-latent-dimension VAE architecture is more effective.
More study with larger datasets will be required in order to determine whether a single VAE architecture could be successfully used for general-purpose tumor transcriptome feature extraction for precision oncology. While our results show promise for the VAE in the context of a semi-supervised approach for response-to-chemotherapy prediction, for colon cancer, the VAE-XGBoost method did not outperform PCA-XGBoost (though it did outperform the fully supervised approach of XGBoost trained on the unencoded gene expression data). One possible explanation for the colon cancer-specific superior performance of PCA features over VAE features is that, for COAD, feature importance for the VAE features is sharply peaked for the first few features and falls off fairly rapidly with feature rank, whereas the PCA features have a much flatter distribution of relative feature importance (Fig. 5). Follow-on studies with larger datasets will be required to delineate under what circumstances transcriptome VAE encodings will prove superior to linear principal components.

4 Conclusions
For four of the five cancer types that we studied, the semi-supervised VAE-XGBoost approach significantly outperformed a semi-supervised PCA-XGBoost approach for tumor transcriptome-based prediction of response to chemotherapy, reaching a top AUROC of 0.738 for pancreatic adenocarcinoma. For four out of five cancer types, the semi-supervised VAE-XGBoost approach significantly outperformed a fully-supervised approach consisting of XGBoost applied to the expression levels of the top 20% most variably expressed genes.
Given high-dimensional “omics” data, the VAE is a powerful tool for obtaining a nonlinear low-dimensional embedding; it yields features that retain biological patterns that distinguish between different types of cancer and that enable more accurate tumor transcriptome-based prediction of response to chemotherapy than would be possible using the original data or their principal components.

Fig. 4: Test-set performance of the three models for predicting response to chemotherapy, across five cancer types (panels: SARC, COAD, PAAD, BLCA, BRCA; groups per panel: Raw(13,584), PCA(387), VAE(n); measured axis: AUROC). Group abbreviations: “PCA(387)”, the PCA-XGBoost semi-supervised method (387: number of principal components used as features); “Raw(13,584)”, the fully-supervised XGBoost method (13,584: number of genes used as features); and “VAE(n)”, the VAE-XGBoost semi-supervised method (n: dimension of the latent feature space; 500 for SARC, 650 for COAD, and 50 for PAAD, BLCA, and BRCA). Marks correspond to individual replications of five-fold cross-validation; solid squares denote mean; bars indicate 95% c.i.; colors denote the type of feature-set (Sec. 5.5): red, “PCA”; olive, “Raw”; cyan, VAE-1; magenta, VAE-2; green, VAE-3.
5 Methods
We carried out all data processing and machine-learning tasks on a Dell XPS 8700 workstation equipped with an Nvidia Titan RTX GPU and running the Ubuntu GNU/Linux operating system version 16.04. All of the analysis code that we implemented was executed in Python version 3.5.5, except that we used R version 3.3.3 for statistical analysis of AUROC values (Sec. 5.7), gene-level MAD calculations (Sec. 5.1) and plotting (Sec. 5.2). We carried out all statistical tests using the R computing environment (version 3.3.3) and using the R software package stats version 3.4.4.

Fig. 5: Bars indicate the sum (over 30 replications) of XGBoost feature importance scores (axes: rank of features, 1 to 20, and sum of importance). “Group” indicates the low-dimensional embedding method used (VAE or PCA). Bars separately ordered from highest to lowest (only top 20 most important features shown).

5.1 Gene expression data
From the Xena data portal (Goldman et al., 2019), we obtained TCGA Level 3 tumor RNA-seq transcriptome data of 32 cancer types (totaling 9,310 tumors) and, for the response-to-chemotherapy prediction problem, five cancer types [colon adenocarcinoma (COAD), pancreatic adenocarcinoma (PAAD), bladder carcinoma (BLCA), sarcoma (SARC), and breast invasive carcinoma (BRCA)] totaling 2,606 tumors. We selected the five cancer types based on two criteria: (i) a sufficient number (at least 65) of paired tumor-transcriptome and clinical data samples available for the cancer type; and (ii) a sufficient number (at least 180) of tumor transcriptome samples available (regardless of the clinical data availability) for the cancer type. We obtained both the RNA-seq (gene-level) total-read-count-normalized log2(1 + C) read counts and normalized (fragments per kilobase of transcript per million mapped reads, FPKM (Dillies et al., 2013)) expression data for 60,483 human genes.
To focus the machine-learning on the portion of the tumor transcriptome that had the most variation from tumor to tumor, we identified the top 20% most variable genes as measured by the median absolute deviation (MAD), across tumors, of gene expression in terms of FPKM (we used FPKM for this purpose in order to mitigate bias due to read length and tumor-specific depth of sequencing). For deriving feature-sets for XGBoost prediction directly based on transcript abundances or based on VAE or PCA encoding, the 20% criterion applied to each of the five cancer types yielded a set of 13,584 genes. We computed MAD using the R package stats version 3.4.4 (R Core Team, 2013) with default parameters. After the variance-filtering step, we used the log2(1 + C) of total-count-normalized count values for the top-20% highest-variance genes (selected as described above) to obtain (or encode) feature values. We compared different feature scaling methods (no scaling, min-max normalization, and standardization (Kreyszig et al., 2011)) in terms of minimizing the VAE reconstruction loss (see Sec. 5.3) and selected min-max normalization as the method that we used to rescale gene-level count data for input into the VAE.

5.2 t-distributed stochastic neighbor embedding (t-SNE)
We computed t-SNE embedding components of the tumors using the function sklearn.manifold.TSNE from the Python software package scikit-learn version 0.19.1 with parameters init = "pca", perplexity = 20, learning_rate = 300, and n_iter = 400. For plotting the tumor transcriptome t-SNE embeddings, we used the R software package ggplot2 version 3.1.1.

5.3 Variational autoencoder (VAE)
An autoencoder is a type of model that combines “encoder” and “decoder” neural networks to learn a low-dimensional continuous data encoding from which the input signal can be approximately reconstructed (Kramer, 1991).
A key advantage of an autoencoder is that it is unsupervised, i.e., it can be trained without labeled examples. Unlike classical autoencoders (e.g., sparse or denoising autoencoders), the variational autoencoder (VAE) is a generative probabilistic model which maps an input vector to a latent-space random variable (r.v.). Below, we mathematically define the VAE. Let $T$ denote the set of tumors for which the VAE is to be fit to the tumor transcriptomes (with $n \equiv |T|$) and let $m$ denote the number of genes for which transcript abundances are used to represent the tumor transcriptome. After min-max transformation of the tumor transcriptome measurements (Sec. 5.1), each tumor’s transcriptome is represented as a vector $x \in [0, 1]^m$. Let $X$ denote the random variable representing the population distribution from which tumor transcriptomes are sampled, and let $\mathbf{X} \in [0, 1]^{m \times n}$ represent the composite matrix of all sampled tumor transcriptomes. We aim to learn a VAE that will comprise an encoder and decoder, with the encoder consisting of mean and variance functions $\mu : [0, 1]^m \to \mathbb{R}^h$ and $\sigma : [0, 1]^m \to \mathbb{R}^h_+$, respectively. Together, $\mu$ and $\sigma$ map the tumor transcriptome vector $x_t$ to an $h$-dimensional r.v. $Z|x_t$,

$$Z|x_t \sim \mathcal{N}\big(\mu(x_t),\ \mathrm{diag}(\sigma(x_t))\big), \qquad (1)$$

where $\mathrm{diag}(\mathbf{m})$ is a matrix whose diagonal elements are the elements of the vector $\mathbf{m}$.
The decoder is a function $g : \mathbb{R}^h \to [0, 1]^m$ that, for an outcome $Z|x_t = z_t \in \mathbb{R}^h$, maps

$$g : z_t \mapsto g(z_t) \equiv \tilde{x}_t; \qquad (2)$$

the tilde on $\tilde{x}_t$ denotes that it is the decoded data for the tumor transcriptome $x_t$. A good autoencoder should have low reconstruction error $L$, which is convenient to define in terms of the $p$-norm of the difference between the tumor transcriptome data $x_t$ and the reconstructed data $\tilde{x}_t$, i.e., $\|x_t - \tilde{x}_t\|_p^p$, where $\|\cdot\|_p$ denotes the $p$-norm. However, this definition of the reconstruction error is only deterministic in the context of a specific outcome $Z|x_t = z_t$. Thus, it is conventional to define the reconstruction error as an expectation value over outcomes of $Z|x_t$,

$$L|(X = x_t) \equiv \mathbb{E}_{Z|x_t = z_t}\big(\|x_t - g(z_t)\|_p^p\big), \qquad (3)$$

where $\mathbb{E}_\Omega$ represents an expectation value over a space of outcomes $\Omega$. It should be noted that the above representation of the reconstruction error is in terms of the outcome, $z_t$, of an r.v. ($Z|x_t$) whose distributional parameter functions $\mu$ and $\sigma$ have parameters (neural network coefficients) that will be fitted. Because Eq. 3 is ill-suited to backpropagation, it is helpful to recast it in terms of a new random variable $E_t$ that depends on $Z|x_t$ by

$$E_t \equiv \big(\mathrm{diag}(\sigma(x_t))\big)^{-\frac{1}{2}} \big(Z|x_t - \mu(x_t)\big). \qquad (4)$$

It follows from Eq. 4 and Eq. 1 that $E_t$ is standard multivariate normal,

$$E_t \sim \mathcal{N}(0, I), \qquad (5)$$

where $I$ is the $h \times h$ identity matrix, and thus, $E_t$ does not depend on $\mu$, $\sigma$, or $t$. We therefore drop the subscript $t$ and simply denote the rescaled latent-space random variable as $E$. Solving Eq. 4 for $Z|x_t$ and applying it to Eq. 3, the reconstruction error $L|(X = x_t)$ can be represented by

$$L|(X = x_t) = \mathbb{E}_{E}\Big(\big\|x_t - g\big(\mu(x_t) + \sqrt{\mathrm{diag}(\sigma(x_t))}\, E\big)\big\|_p^p\Big), \qquad (6)$$

which is amenable to backpropagation because the only r.v. in it is $E$, whose distributional parameters do not depend on the neural network coefficients that we will be varying.
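The reparameterization in Eqs. 4 to 6 can be illustrated with a minimal NumPy sketch. It is not the Keras implementation used in this work; the function name `sample_latent` and the numerical values of the mean and variance vectors are hypothetical, chosen only to show that drawing from the standard normal r.v. and rescaling reproduces the distribution in Eq. 1.

```python
import numpy as np


def sample_latent(mu, var, n_samples, rng):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, I).

    mu, var: 1-D arrays of length h (mean and variance of Z | x_t).
    Returns (z, eps), each of shape (n_samples, h).
    """
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    # Element-wise scaling is equivalent to multiplying by sqrt(diag(var)).
    z = mu + np.sqrt(var) * eps
    return z, eps


rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])   # hypothetical encoder output mu(x_t)
var = np.array([0.25, 1.0, 4.0])  # hypothetical encoder output sigma(x_t)
z, eps = sample_latent(mu, var, 200_000, rng)

# Empirical moments of z approach mu and var; eps carries all randomness.
print(z.mean(axis=0))
print(z.var(axis=0))
```

Because the random draw is confined to `eps`, the latent sample is a deterministic, differentiable function of the encoder outputs, which is what makes gradient-based fitting possible.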
In practice, rather than computing the multivariate integral over outcomes of $E$, $L|(X = x_t)$ is typically approximated by averaging over a limited number $J$ of samples from $E$,

$$L|(X = x_t) \simeq \Big\langle \big\|x_t - g\big(\mu(x_t) + \sqrt{\mathrm{diag}(\sigma(x_t))}\, \epsilon_j\big)\big\|_p^p \Big\rangle_j, \qquad (7)$$

where $\langle\cdot\rangle_j$ denotes the average over $j \in \{1, \ldots, J\}$ and $\epsilon_j$ is sample $j$ from $E$. Following Way and Greene (2017), we used a number of samples that is equivalent to the dimension of the transcriptome, i.e., $J = m$. For the case of $p = 2$ (i.e., L2 norm), minimizing $L|(X = x_t)$ as defined above is equivalent to maximizing the expectation value of the log-likelihood $\log(P(g(Z) = x_t \mid X = x_t))$. However, following Way and Greene (2017) and consistent with empirical evidence (Sec. 2.3), for our five-cancer study of the utility of a VAE-based approach for response-to-chemotherapy prediction, as well as for the pan-cancer t-SNE analysis (Sec. 2.1), we chose to use L1 reconstruction loss, i.e., $p = 1$ in Eq. 3. The reconstruction loss measures bias error, whose minimization must be balanced against the simultaneous goal of controlling variance error through regularization. In the VAE, regularization requires incentivizing (in the learning of $\mu$, $\sigma$, and $g$) the latent space distributions of $Z|x$ to be close to standard multivariate normal. This is accomplished by assigning a penalty based on the Kullback-Leibler divergence between the distribution of $Z|x_t$ and the target distribution of $E$, represented by $D_{KL}(P(Z|x_t)\,\|\,P(E))$. This regularization is analytically tractable (Duchi, 2007), and for a given tumor $t$ yields (see Supplementary Note, Eq. S2) the following regularization function:

$$D_{KL}\big(P(Z|x_t)\,\big\|\,P(E)\big) = \|\mu(x_t)\|_2^2 + \|\sigma(x_t)\|_2^2 - \|\log(\sigma(x_t))\|_1 - 1, \qquad (8)$$

where $\log(\sigma(x_t))$ denotes an element-wise log and $\|\cdot\|_1$ is the L1 norm.
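The per-tumor objective, a Monte Carlo L1 reconstruction term (Eq. 7 with p = 1) plus a Kullback-Leibler penalty, can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the implementation used here: the KL term uses the textbook closed form for a diagonal Gaussian with variance vector `var`, including its factor of 1/2, so its constants and scaling may follow a different convention from Eq. 8; the decoder `toy_decoder`, its weight matrix `W`, and all numerical inputs are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, var):
    """Textbook KL( N(mu, diag(var)) || N(0, I) ) for a variance vector var."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))


def vae_loss_one_sample(x, mu, var, decoder, n_draws, rng):
    """Monte Carlo L1 reconstruction loss plus the KL penalty for one input."""
    eps = rng.standard_normal((n_draws, mu.shape[0]))
    z = mu + np.sqrt(var) * eps                      # reparameterized draws
    recon = np.array([decoder(z_j) for z_j in z])    # shape (n_draws, m)
    l1 = np.abs(x - recon).sum(axis=1).mean()        # average over draws
    return l1 + kl_to_standard_normal(mu, var)


# Hypothetical toy decoder: a fixed linear map squashed into [0, 1].
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))


def toy_decoder(z):
    return 1.0 / (1.0 + np.exp(-W @ z))


x = np.array([0.2, 0.9, 0.5, 0.1])   # hypothetical min-max-scaled input
mu = np.array([0.3, -0.1])           # hypothetical encoder mean
var = np.array([0.8, 1.2])           # hypothetical encoder variance
loss = vae_loss_one_sample(x, mu, var, toy_decoder, n_draws=64, rng=rng)
print(loss)
```

The KL term vanishes only when the mean is zero and the variance is one in every latent dimension, so it pulls each tumor's latent distribution toward the standard normal while the reconstruction term pulls it toward an informative encoding.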
Fitting the VAE to $\mathbf{X}$ requires selecting $\mu$, $\sigma$, and $g$ from their respective function spaces; in practice, we search over functions that can be represented using a neural network for $\mu$ and $\sigma$ (parameterized by the vector $\theta$; see Footnote 1) and a neural network for the function $g$ (parameterized by the vector $\phi$). Exploring the space of functions $\mu_\theta$, $\sigma_\theta$, and $g_\phi$ corresponds to computationally searching for the vector pair $(\hat{\theta}, \hat{\phi})$ that together minimize the joint (over all tumors) sum of the tumor-specific reconstruction loss and the regularization penalty,

$$(\hat{\theta}, \hat{\phi}) = \underset{(\theta, \phi)}{\arg\min} \sum_{t \in T} \Big[ L|(X = x_t) + D_{KL}\big(P(Z|x_t)\,\big\|\,P(E)\big) \Big]. \qquad (9)$$

Applying Eqs. 6, 7, and 8, and setting $p = 1$ as discussed above, we obtain the explicit formula for fitting a VAE to $\mathbf{X}$,

$$(\hat{\theta}, \hat{\phi}) = \underset{(\theta, \phi)}{\arg\min} \sum_{t \in T} \Big[ \frac{1}{J} \sum_{j=1}^{J} \big\|x_t - g_\phi\big(\mu_\theta(x_t) + \sqrt{\mathrm{diag}(\sigma_\theta(x_t))}\, \epsilon_j\big)\big\|_1 + \|\mu_\theta(x_t)\|_2^2 + \|\sigma_\theta(x_t)\|_2^2 - \|\log(\sigma_\theta(x_t))\|_1 - 1 \Big]. \qquad (10)$$

We implemented Eq. 10 in TensorFlow version 1.4.1 with Keras version 2.1.3 as the model-level library. We solved Eq. 10 using the Adam optimization algorithm (Kingma and Ba, 2014) (with batch normalization) from the Python package keras-gpu version 2.1.3 with parameters learning_rate = 2 × 10^−3, beta_1 = 0.9, and beta_2 = 0.999, to obtain $(\hat{\theta}, \hat{\phi})$. Then, for each tumor $t$, we used a single sample $Z|x_t = z_t$ from the distribution $\mathcal{N}(\mu_{\hat{\theta}}(x_t), \mathrm{diag}(\sigma_{\hat{\theta}}(x_t)))$ as the final latent-space encoding of the tumor to be used for supervised learning (Sec. 5.6).

Footnote 1: Note, functions $\mu$ and $\sigma$ are just two different outputs of the encoding neural network, differing only at the final layer, and thus for simplicity of notation we represent them as having a common parameter vector $\theta$.

5.4 Labeling tumors based on response to chemotherapy
From Xena and cBioPortal (Cerami et al., 2012; Gao et al., 2013), we obtained and combined TCGA clinical data (where available) for the patients whose tumor transcriptomes we acquired (see Sec. 5.1). From Xena, we extracted the variables submitter_id.samples, therapy_type, and measure_of_response; from cBioPortal, we extracted the variables Sample_ID, Disease.Free.Status, and Pharmaceutical.Therapy.Indicator. We co-analyzed the Xena- and cBioPortal-obtained clinical data to label tumors “responded” (y = 0) or “progressive” (y = 1). We assigned y = 0 to tumors whose clinical records had the value Complete response or partial response in the measure_of_response column of the Xena clinical data, or the value DiseaseFree in the Disease.Free.Status column of the cBioPortal clinical data, while the therapy type was recorded as Chemotherapy in both. We assigned y = 1 to tumors whose clinical records had the value Radiographic progressive disease, Clinical progressive disease, or stable disease in the Xena clinical data column measure_of_response, or the value Recurred/progressed in the cBioPortal data column Disease.Free.Status, while the therapy_type was recorded as Chemotherapy in both files. This yielded 806 labeled tumors out of 2,606 total. A total of 39 different drugs were used to treat the 794 patients (see Supplementary Note, Table S1).

5.5 VAE model architectures
We trained six transcriptome-encoding VAEs based on four VAE architectures: the pan-cancer VAE architecture (for the 32-cancer unsupervised analysis, see Sec. 2.1) and three cancer type-specific VAE architectures for response-to-chemotherapy prediction (Sec. 2.4) (one of which was used for three different cancer types, BLCA, BRCA, and PAAD, and the others of which were cancer type specific for COAD and SARC).
For the pan-cancer VAE, we used a latent space dimension h = 50 and three fully connected layers each for the encoder and decoder. For the cancer type-specific VAE architectures, we again used the same number of fully-connected layers in the encoder as in the decoder (Table 3).

Table 3. VAE architectures used for predicting chemotherapy response (h, latent space dimension; “Layers”, number of layers used in the encoder/decoder).
Name     Cancer types        h     Layers
VAE-1    BLCA, BRCA, PAAD    50    Six
VAE-2    COAD                650   Two
VAE-3    SARC                500   Two

5.6 Regularized gradient boosted decision trees (XGBoost)
For predicting whether or not (based on its transcriptome-derived feature-set: raw, PCA, or VAE) a tumor would respond to chemotherapy, we used XGBoost (Chen and Guestrin, 2016), an efficient implementation of regularized gradient boosted decision trees. We used the binary classifier function XGBClassifier from the Python software package xgboost version 0.80, with gamma = 0. We tuned eight hyperparameters (Table 4) by exhaustive grid-search with five-fold cross-validation, using sklearn.model_selection.GridSearchCV from scikit-learn version 0.19.1. To obtain feature importance scores, we used get_score with importance_type = "cover".

5.7 Area Under ROC Curve (AUROC)
For computing the AUROC (i.e., the area under the curve of sensitivity versus false positive rate), we used the function metrics.roc_auc_score from the Python software package scikit-learn version 0.19.1 with parameter average = "weighted". We logit-transformed AUROC values before testing (using the two-tailed Welch’s t-test and the Wilcoxon signed rank test). For the L1 vs. L2 analysis (Sec. 2.3), we carried out 30 replications of five-fold cross-validation; within each replication, across the five folds, we obtained prediction scores for each tumor from the fold in which the tumor was in the test set, enabling us to compute an overall AUROC within each replication.
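The replication scheme just described (pooling each tumor's held-out prediction score across the five folds, then computing one AUROC per replication) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code used in this work: the AUROC is computed from ranks rather than with scikit-learn, the folds are unstratified, `centroid_scorer` is a hypothetical stand-in for a trained XGBoost classifier, and the data are synthetic.

```python
import numpy as np


def auroc(y_true, scores):
    """Rank-based AUROC (Mann-Whitney U), with average ranks for ties."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):          # average the ranks of tied scores
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def centroid_scorer(X_train, y_train, X_test):
    """Hypothetical stand-in for a trained classifier's prediction scores."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    # Larger score means closer to the class-1 centroid.
    return np.linalg.norm(X_test - c0, axis=1) - np.linalg.norm(X_test - c1, axis=1)


def pooled_oof_auroc(X, y, fit_predict, n_folds, seed):
    """One replication: pool each sample's held-out score, then one AUROC."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    oof = np.empty(len(y))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        oof[test_idx] = fit_predict(X[train_mask], y[train_mask], X[test_idx])
    return auroc(y, oof)


# Synthetic data (hypothetical): two overlapping Gaussian classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 5)) + 0.5 * y[:, None]

# 30 replications that differ only in the seed used to split the data.
aurocs = np.array([pooled_oof_auroc(X, y, centroid_scorer, 5, seed)
                   for seed in range(30)])
logit_aurocs = np.log(aurocs / (1.0 - aurocs))   # logit transform before testing
print(aurocs.mean())
```

Pooling held-out scores gives a single AUROC per replication computed over every labeled sample, rather than an average of five small-sample fold-level AUROCs.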
For each training data set, we performed 30 replications of five-fold cross-validation by varying the random seed used to assign the split of the data during cross-validation. We followed the same procedure for the five different cancer types (BLCA, BRCA, COAD, PAAD, SARC), as shown in the panel names of Figure 4.

5.8 Principal component analysis (PCA)
For PCA, we used the function decomposition.PCA (with parameters svd_solver = "full" and n_components = 0.9, i.e., 90% variance, yielding 387 components) from the Python package scikit-learn version 0.19.1. For plotting, we used matplotlib version 2.1.2.

Funding
SAR acknowledges support from the Animal Cancer Foundation.

References
Airley, R. (2009). Cancer chemotherapy. Wiley-Blackwell, NY, NY.
An, J. and Cho, S. (2015). Variational Autoencoder based Anomaly Detection using Reconstruction Probability. Technical Report SNUDM-TR-2015-03, Seoul National University.
Bouchacourt, D. et al. (2017). Multi-Level Variational Autoencoder: Learning Disentangled Representations from Grouped Observations. arXiv:1705.08841.
Casado, E. et al. (2011). A Combined Strategy of SAGE and Quantitative PCR Provides a 13-Gene Signature that Predicts Preoperative Chemoradiotherapy Response and Outcome in Rectal Cancer. PLOS ONE, 17, 4145–54.
Cerami, E. et al. (2012). The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery, 2, 401.
Chabner, B. A. and Longo, D. L. (2005). Cancer Chemotherapy and Biotherapy: Principles and Practice. Lippincott Williams & Wilkins, Philadelphia, PA, fourth edition.
Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754.
Chiu, Y.-C. et al. (2019). Predicting drug response of tumors from integrated genomic profiles by deep neural networks. BMC Medical Genomics, 12(1), 18.
Corrie, P. G. (2008). Cytotoxic chemotherapy: clinical aspects. Medicine, 36(1), 24–28.
Del Rio, M. et al. (2007).
Gene expression signature in advanced colorectal cancer patients select drugs and response for the use of leucovorin, fluorouracil, and irinotecan. Journal of Clinical Oncology: official journal of the American Society of Clinical Oncology, 25(7), 773–78.
Dillies, M.-A. et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671–683.
Dolezal, J. M. et al. (2018). Diagnostic and prognostic implications of ribosomal protein transcript expression patterns in human cancers. BMC Cancer, 18(1), 275.
Dong, H. et al. (2020). Variational Autoencoder for Anti-Cancer Drug Response Prediction. arXiv:2008.09763.
Duchi, J. (2007). Derivations for linear algebra and optimization. Technical report, Stanford University.

Table 4. XGBoost classification algorithm hyperparameters and hyperparameter ranges used in grid-search tuning.
Hyperparameter name    Hyperparameter description                                   Hyperparameter range
n_estimators           number of trees to fit                                       (1, 2, 3, ..., 40)
max_depth              maximum tree depth                                           (1, 2, 3, ..., 10)
learning_rate          boosting learning rate                                       (0.05, 0.1, 0.2, 0.4, 0.6, 0.8)
min_child_weight       minimum sum of instance weight needed in a child             (1, 2, 3, ..., 10)
subsample              sub-sample ratio of the training instance                    (0.1, 0.2, 0.3, ..., 1.0)
colsample_bytree       sub-sample ratio of columns when constructing each tree      (0.1, 0.2, 0.3, ..., 1.0)
reg_alpha              coefficient of L1 regularization for the node weights        (0, 1, 2, 3)
reg_lambda             coefficient of L2 regularization for the node weights        (1, 2, ..., 100)

Esteva, A. et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115–118.
Evans, C. et al. (2017). Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics, 19(5), 776–792.
Frankish, A. et al. (2018). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research, 47, D766–D773.
Gao, J. et al. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6, 11.
Geeleher, P. et al. (2014). Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biology, 15(3), R47.
George, T. M. and Lio, P. (2019). Unsupervised Machine Learning for Data Encoding applied to Ovarian Cancer Transcriptomes. bioRxiv; doi:10.1101/855593.
Goldman, M. et al. (2019). The UCSC Xena platform for public and private cancer genomics data visualization and interpretation. bioRxiv; doi:10.1101/326470.
Gurney, H. (2002). How to calculate the dose of chemotherapy. British Journal of Cancer, 86, 1297–1302.
Gámez-Pozo, A. et al. (2017). Prediction of adjuvant chemotherapy response in triple negative breast cancer with discovery and targeted proteomics. PLOS ONE, 12, 6.
Hutter, C. and Zenklusen, J. C. (2018). The Cancer Genome Atlas: Creating Lasting Value beyond Its Data. Cell, 173(2), 283–285.
Jimenez Rezende, D. et al. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. arXiv:1401.4082.
Kaestner, S. A. and Sewell, G. J. (2007). Chemotherapy Dosing Part I: Scientific Basis for Current Practice and Use of Body Surface Area. Clinical Oncology, 19, 23–37.
Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization.
arXiv:1412.6980.
Kingma, D. P. and Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114.
Kipf, T. N. and Welling, M. (2016). Variational Graph Auto-Encoders. arXiv:1611.07308.
Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2), 233–243.
Kreyszig, E. et al. (2011). Advanced Engineering Mathematics. Wiley, Hoboken, NJ, tenth edition.
Kurokawa, Y. et al. (2004). Molecular Prediction of Response to 5-Fluorouracil and Interferon-α Combination Chemotherapy in Advanced Hepatocellular Carcinoma. AACR, 10(18), 6029–38.
Lee, K. et al. (2019). CPEM: Accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network. Scientific Reports, 9(1), 16927.
Li, X. and She, J. (2017). Collaborative Variational Autoencoder for Recommender Systems. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 305–314, New York, NY. ACM.
Malfuson, J.-V. et al. (2008). Risk factors and decision criteria for intensive chemotherapy in older patients with acute myeloid leukemia. Haematologica, 93(12), 1806–1813.
Mitchel, J. et al. (2019). A translational pipeline for overall survival prediction of breast cancer patients by decision-level integration of multi-omics data. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1573–1580.
Qin, J. et al. (2006). ICA based semi-supervised learning algorithm for BCI systems. In J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, editors, Independent Component Analysis and Blind Signal Separation, pages 214–221, Berlin. Springer.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation, Vienna, Austria. ISBN 3-900051-07-0.
Skeel, R. T. (2003). Handbook of Cancer Chemotherapy. Lippincott Williams & Wilkins, Philadelphia, PA, sixth edition.
Titus, A. J. et al. (2018).
Unsupervised deep learning with variational autoencoders applied to breast tumor genome-wide DNA methylation data with biologic feature extraction. bioRxiv; doi:10.1101/433763. Wang, Z. et al. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1), 57–63. Way, G. P. and Greene, C. S. (2017). Evaluating deep variational autoencoders trained on pan-cancer gene expression. arXiv:1711.04828. Way, G. P. and Greene, C. S. (2018). Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pacific Symposium on Biocomputing, 23, 80–91. Weir, B. et al. (2004). Somatic alterations in the human cancer genome. Cancer Cell, 6(5), 433–438. Wen, H. and Huang, F. (2020). Personal loan fraud detection based on hybrid supervised and unsupervised learning. In 2020 5th IEEE International Conf. on Big Data Analytics (ICBDA), pages 339–343. Whelan, T. et al. (2003). Helping Patients Make Informed Choices: A Randomized Trial of a Decision Aid for Adjuvant Chemotherapy in Lymph Node-Negative Breast Cancer. JNCI: Journal of the National Cancer Institute, 95(8), 581–587. Zhang, Y. et al. (2020). A novel xgboost method to identify cancer tissue- of-origin based on copy number variations. Front Genet, 11, 1319. .CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. 
; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425288 http://creativecommons.org/licenses/by/4.0/ Introduction Results VAE encoding preserves cancer type features Obtaining a labeled tumor transcriptome dataset L1 loss is better than L2 loss for this application Chemotherapy drug response classification result PCA & VAE feature importance scores, for COAD Discussion Conclusions Methods Gene expression data t-distributed stochastic neighbor embedding (t-SNE) Variational autoencoder (VAE) Labeling tumors based on response to chemotherapy VAE model architectures Regularized gradient boosted decision trees (XGBoost) Area Under ROC Curve (AUROC) Principal component analysis (PCA) 10_1101-2021_01_04_425315 ---- Sample-wise unsupervised deconvolution of complex tissues 1 Sample-wise unsupervised deconvolution of complex tissues Lulu Chen1, Chia-Hsiang Lin2,*, Chiung-Ting Wu1,*, Robert Clarke3, Guoqiang Yu1, Jennifer E. Van Eyk4, David M. Herrington5, and Yue Wang1,# * Equal contribution 1Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA; 2Department of Electrical Engineering, National Cheng Kung University, Tainan City, Taiwan 70101; 3The Hormel Institute, University of Minnesota, Austin, MN 55912; 4Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, CA 90048, USA; 5Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, USA Article Format: bioRxiv Running title: Sample-wise unsupervised deconvolution Open source software: R code is available at https://github.com/Lululuella/swCAM #Author for correspondence: Yue Wang, Ph.D. Virginia Polytechnic Institute and State University 900 N. Glebe Road, Arlington, VA 22203 E-mail: yuewang@vt.edu (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425315doi: bioRxiv preprint mailto:yuewang@vt.edu https://doi.org/10.1101/2021.01.04.425315 2 Abstract We report a sample-wise fully unsupervised deconvolution method, namely sample-wise Convex Analysis of Mixtures (swCAM), that can estimate constituent proportions and subtype-specific expressions in individual samples using tissue-level bulk data (Chen 2019). The swCAM software tool enables statistically-principled subtype-level downstream analyses, such as detecting subtype- specific differentially expressed genes (sDEG) and differential dependency networks (DDN) (Zhang, Li et al. 2009, Chen, Lu et al. 2020). Significantly different from population-level deconvolution, individual-level deconvolution is mathematically an underdetermined problem because there are more variables than observations. We therefore extend the existing CAM framework by adding an extra term of between-sample variations and formulate swCAM as a nuclear-norm regularized low-rank matrix factorization problem (Wang, Hoffman et al. 2016). We determine hyperparameter value by random entry exclusion based cross-validation scheme and obtain swCAM solution using a modified efficient alternating direction method of multipliers (ADMM). Experimental results on realistic simulation data sets show that swCAM can accurately perform sample-wise unsupervised deconvolution of complex tissues and successfully recover subtype-specific correlation networks that are otherwise unobtainable using existing methods. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. 
Introduction
Decades of research on molecular regulatory mechanisms have provided a rich framework with which we can extract molecular expression patterns to gain insight into the organization and structure of large biological networks (Zhang and Horvath 2005, Zhang, Li et al. 2009, Tian, Zhang et al. 2014). However, most discoveries are drawn from molecular expression measured in heterogeneous tissues, where changes in the constituent components can obscure molecular regulation and corrupt the inference of networks that exist only in particular tissue subtypes. Inferring subtype-specific molecular expression patterns is therefore essential for understanding complex molecular functions and the role of each subtype during dynamic biological processes, such as cell fate specification (Sonawane, Platig et al. , Chasman and Roy 2017, Gal, London et al. 2017). The ability to obtain sample-wise expression variation in each subtype is a prerequisite for inferring subtype-specific molecular networks (Shen-Orr, Tibshirani et al. 2010, Junttila and de Sauvage 2013). Single-cell expression profiling techniques have become popular for investigating cell-type-specific networks, but they may lose critical information about cell-cell interactions and are prone to cell-cycle/state confounders (Buettner, Natarajan et al. 2015, Gal, London et al. 2017). While the current CAM tool can dissect the mixed signals of multiple samples into the 'averaged' expression profiles of subtypes, many subsequent molecular analyses of complex tissues require sample-specific signal deconvolution, where each sample is a mixture of 'individualized' subtype-specific expression profiles. Here we propose a new algorithm called swCAM, an extension of CAM, to solve the sample-specific Blind Source Separation (sBSS) problem.
The sBSS problem is ill-posed and underdetermined because the number of variables is much larger than the number of observations. As a result, simple yet highly regularized approaches often become the methods of choice (Hastie, Tibshirani et al. 2001). sBSS problems have received increasing interest in hyperspectral imaging, where spatial smoothness and variation sparsity regularization can be exploited to unmix spectral signals (Thouvenin, Dobigeon et al. 2016). In biological processes, transcriptional regulatory networks connect regulatory proteins, such as transcription factors (TFs) and signaling proteins, to target genes and thus form co-expressed gene sets that act as function modules in each subtype. Based on these underlying cellular mechanisms, we impose and exploit a low-rank assumption on the between-sample variations of molecular expression in each subtype. While estimating subtype-specific signals from a single mixture for each gene independently is an underdetermined problem, the low-rank assumption aggregates information from gene sets within function modules to help find a biologically plausible solution.
General rank minimization is a challenging nonconvex optimization problem for which all existing finite-time algorithms have at least doubly exponential running times in both theory and practice (Recht, Fazel et al. 2010). Minimizing the nuclear norm, that is, the sum of the singular values of the matrix, over the affine subset has multiple advantages.
The nuclear norm is a convex function that can be optimized efficiently, and it provides the best convex approximation of the rank function over the unit ball of matrices with norm less than one (Recht, Fazel et al. 2010, Candes, Sing-Long et al. 2013). Nuclear norm regularization has been successfully applied in many practical applications with low-rank modeling, such as image denoising (Candes, Sing-Long et al. 2013) and matrix completion (Cai, Candès et al. 2010). swCAM adopts nuclear norm regularization to estimate the between-sample variations in each subtype and thereby recover sample-specific signals for each subtype.
This work introduces the mathematical model of sample-specific deconvolution and the optimization solver used in the swCAM algorithm, followed by validation on simulations and a discussion of possible further improvement through sparsity regularization.
Method
Problem formulation and nuclear norm regularization
A fundamental assumption of the conventional linear mixing model, $\mathbf{X} = \mathbf{A}\mathbf{S} + \mathbf{E}$, is that all mixture samples share a common source matrix $\mathbf{S}$. However, in addition to sample-specific subtype proportions, each sample may have its own 'individualized' sample-specific sources (Fig. 1):
$$\mathbf{S}_i = \bar{\mathbf{S}} + \Delta\mathbf{S}_i, \quad i = 1, \dots, M.$$
The associated sample-specific BSS (sBSS) model is given by
$$\mathbf{x}_i = \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i) + \mathbf{n}_i \in \mathbb{R}^{L}, \quad \forall i = 1, \dots, M \tag{1}$$
where $\mathbf{a}_i$ is the $i$th row vector of the proportion matrix $\mathbf{A}$, and $\mathbf{n}_i$ is the noise term. $\Delta\mathbf{S}_i$ is expected to have small-valued entries so that the source matrices of different samples are similar.
Fig. 1. Sample-specific deconvolution problem formulation and the assumption of hidden low-rank pattern in each source.
(For convenient illustration, the $\mathbf{T}$ matrices in all figures are the transposed versions of those in the text and equations.)
If some expression units (molecular features) are highly correlated in a particular source, the rank of the matrix consisting of the $\Delta\mathbf{S}_i$ entries in this source will be low, leading to the following swCAM objective function:
$$\min_{\mathbf{A},\{\Delta\mathbf{S}_i\}_{i=1}^{M}} \; \frac{1}{2}\sum_{i=1}^{M}\left\|\mathbf{x}_i - \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i)\right\|_2^2 + \lambda\sum_{k=1}^{K}\|\mathbf{T}_k\|_* \tag{2}$$
$$\text{s.t.} \quad \mathbf{a}_i \succeq \mathbf{0}_K, \quad \mathbf{a}_i\mathbf{1}_K = 1, \quad \bar{\mathbf{S}} + \Delta\mathbf{S}_i \succeq \mathbf{0}_{K\times L},$$
$$\mathbf{T}_k = [\Delta\mathbf{S}_1^T(k), \dots, \Delta\mathbf{S}_M^T(k)] \in \mathbb{R}^{L\times M}, \quad k = 1, \dots, K,$$
where $\mathbf{T}_k$ consists of the $k$th column of every $\Delta\mathbf{S}_i^T$ (equivalently, the $k$th row of every $\Delta\mathbf{S}_i$), $i = 1, \dots, M$, and represents the between-sample variation in source $k$, namely the variation matrix of the $k$th subtype. $\|\mathbf{T}_k\|_*$ is the nuclear norm of $\mathbf{T}_k$, and $\lambda > 0$ is the regularization parameter of the nuclear norms. The hyperparameter $\lambda$ can differ among sources $k$, so the regularizer in (2) can be replaced by $\sum_{k=1}^{K}\lambda_k\|\mathbf{T}_k\|_*$ if necessary. Please see the supplementary information for the reason to select subtype-specific regularization terms.
Optimization of swCAM objective function
The objective function in (2) is bi-convex w.r.t. the two block-wise variables, i.e. $\mathbf{A} \triangleq [\mathbf{a}_1^T, \dots, \mathbf{a}_M^T]^T$ and $\mathbf{T} \triangleq [\mathbf{T}_1^T, \dots, \mathbf{T}_K^T]^T \in \mathbb{R}^{KL\times M}$. Accordingly, we can solve (2) by alternately solving the following two convex subproblems until convergence:
$$\mathbf{T}^{p+1} \in \arg\min_{\Delta\mathbf{S}_i \succeq -\bar{\mathbf{S}},\, \forall i} \mathcal{J}(\mathbf{A}^{p}, \mathbf{T}) \tag{3}$$
$$\mathbf{A}^{p+1} \in \arg\min_{\mathbf{A} \succeq \mathbf{0}_{M\times K},\, \mathbf{A}\mathbf{1}_K = \mathbf{1}_M} \mathcal{J}(\mathbf{A}, \mathbf{T}^{p+1}) \tag{4}$$
where
$$\mathcal{J}(\mathbf{A}, \mathbf{T}) \triangleq \frac{1}{2}\sum_{i=1}^{M}\left\|\mathbf{x}_i - \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i)\right\|_2^2 + \lambda\sum_{k=1}^{K}\|\mathbf{T}_k\|_*.$$
The CAM-estimated subtype-specific expression matrix serves as the initial reference $\bar{\mathbf{S}}$.
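To make the roles of the data-fit term and the nuclear-norm penalty in (2) concrete, the following minimal sketch evaluates the objective for given estimates. It is an illustration under stated assumptions, not the swCAM implementation: the function name `swcam_objective`, the array layout (samples stacked along the first axis) and the toy data are all hypothetical.

```python
import numpy as np

def swcam_objective(X, A, S_bar, dS, lam):
    """Evaluate the penalized least-squares objective of (2).

    X     : (M, L) observed mixtures, one sample per row
    A     : (M, K) mixing proportions, one sample per row
    S_bar : (K, L) baseline subtype expression
    dS    : (M, K, L) sample-specific deviations, dS[i] plays the role of Delta S_i
    lam   : nuclear-norm weight (lambda)
    """
    S = S_bar[None, :, :] + dS                   # S_i = S_bar + Delta S_i
    recon = np.einsum("ik,ikl->il", A, S)        # row i equals a_i (S_bar + Delta S_i)
    fit = 0.5 * np.sum((X - recon) ** 2)
    # T_k is L x M: its i-th column is the k-th row of Delta S_i
    nuc = sum(np.linalg.norm(dS[:, k, :].T, ord="nuc") for k in range(dS.shape[1]))
    return fit + lam * nuc

# Toy check (hypothetical data): a rank-one variation in subtype 0, no noise.
rng = np.random.default_rng(0)
M, K, L = 6, 2, 5
A = rng.dirichlet(np.ones(K), size=M)
S_bar = rng.uniform(1.0, 10.0, size=(K, L))
c, g = rng.normal(size=M), rng.normal(size=L)
dS = np.zeros((M, K, L))
dS[:, 0, :] = np.outer(c, g)
X = np.einsum("ik,ikl->il", A, S_bar[None] + dS)
value = swcam_objective(X, A, S_bar, dS, lam=2.0)
```

Because the toy mixtures are noise-free, the fit term vanishes and `value` reduces to the penalty alone; for a rank-one variation matrix the nuclear norm equals the product of the two vector norms.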
Note that in (3) and (4) we have implicitly used the following relationship for concise representation: $\mathbf{T} \triangleq [\mathrm{vec}(\Delta\mathbf{S}_1^T), \dots, \mathrm{vec}(\Delta\mathbf{S}_M^T)]$. Subproblem (4) can be decoupled w.r.t. each row of $\mathbf{A}$:
$$\mathbf{a}_i^{p+1} \in \arg\min_{\mathbf{a}_i \succeq \mathbf{0}_K,\, \mathbf{a}_i\mathbf{1}_K = 1} \frac{1}{2}\left\|\mathbf{x}_i - \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i^{p+1})\right\|_2^2,$$
which can be solved using quadratic programming. If a prior proportion matrix or the CAM-estimated proportion matrix is already of high quality, we can skip the alternating optimization over $\mathbf{A}$ and obtain $\mathbf{T}$ by solving subproblem (3) only once.
To solve (3), we note that the main bottleneck is the huge dimension of its variables (typically, $L$ is in the tens of thousands), which prevents conventional convex solvers from being readily applicable. We propose to solve (3) by adapting the alternating direction method of multipliers (ADMM), which has been widely applied to large-scale problems in areas such as statistical learning, image processing and computational biology (Boyd, Parikh et al. 2011). ADMM naturally allows decoupling the non-smooth regularization term from the smooth loss term, which is computationally advantageous. Specifically, we reformulate (3) in a form in which the primal variable can be "split" into several parts, with the associated objective function "separable" across this splitting (Boyd, Parikh et al. 2011).
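We will use the following definitions:
$$\mathbf{T} \triangleq [\mathrm{vec}(\Delta\mathbf{S}_1^T), \dots, \mathrm{vec}(\Delta\mathbf{S}_M^T)] = \begin{bmatrix}\mathbf{T}_1 \\ \vdots \\ \mathbf{T}_K\end{bmatrix} \in \mathbb{R}^{KL\times M}$$
$$\mathbf{S} \triangleq [\mathrm{vec}(\mathbf{S}_1^T), \dots, \mathrm{vec}(\mathbf{S}_M^T)] \in \mathbb{R}^{KL\times M}$$
$$\mathbf{V} \triangleq \mathbf{X}^T \in \mathbb{R}^{L\times M}$$
$$\mathbf{W} \triangleq \begin{bmatrix}\mathbf{T} \\ \mathbf{S}\end{bmatrix} \in \mathbb{R}^{2KL\times M}$$
$$\mathbf{C}_1 \triangleq \begin{bmatrix}\mathbf{I}_{KL} \\ \mathbf{I}_{KL}\end{bmatrix} \in \mathbb{R}^{2KL\times KL}$$
$$\mathbf{C}_2 \triangleq -\mathbf{I}_{2KL} \in \mathbb{R}^{2KL\times 2KL}$$
$$\mathbf{C}_3 \triangleq \begin{bmatrix}\mathbf{1}_M^T \otimes \mathrm{vec}(\bar{\mathbf{S}}^T) \\ \mathbf{0}_{KL\times M}\end{bmatrix} \in \mathbb{R}^{2KL\times M}$$
$$\mathbf{B}_0 \triangleq [\mathbf{0}_{KL\times KL}, \mathbf{I}_{KL}] \in \mathbb{R}^{KL\times 2KL}$$
$$\mathbf{B}_k \triangleq [\mathbf{0}_{L\times (k-1)L}, \mathbf{I}_L, \mathbf{0}_{L\times (K-k)L}, \mathbf{0}_{L\times KL}] \in \mathbb{R}^{L\times 2KL}, \quad k = 1, \dots, K$$
Then we can rewrite (3) in the equivalent form:
$$\min_{\mathbf{U}\in\mathbb{R}^{KL\times M},\, \mathbf{W}\in\mathbb{R}^{2KL\times M}} \; \frac{1}{2}\|\mathcal{A}(\mathbf{U}) - \mathbf{V}\|_F^2 + \lambda\sum_{k=1}^{K}\|\mathbf{B}_k\mathbf{W}\|_* + I_+(\mathbf{B}_0\mathbf{W}) \tag{5}$$
$$\text{s.t.} \quad \mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} = \mathbf{C}_3,$$
where $I_+(\cdot)$ is the indicator function of the non-negative orthant: $I_+(\mathbf{B}_0\mathbf{W}) = I_+(\mathbf{S}) = 0$ if $\mathbf{S} \succeq \mathbf{0}_{KL\times M}$, and $I_+(\mathbf{S}) = +\infty$ otherwise. The linear transformation in the first term is $\mathcal{A}(\mathbf{U}) = \mathcal{A}([\mathbf{u}_1, \dots, \mathbf{u}_M]) = [\mathbf{H}_1\mathbf{u}_1, \dots, \mathbf{H}_M\mathbf{u}_M]$ with $\mathbf{H}_i = [\mathbf{a}_i^{p} \otimes \mathbf{I}_L]$, $i = 1, \dots, M$. Note that (5) is in ADMM form w.r.t. the two split block variables $\mathbf{U}$ and $\mathbf{W}$, and, once (5) is solved, the solution of (3) is obtained as $\mathbf{T}^{p+1} = [\mathbf{I}_{KL}, \mathbf{0}_{KL\times KL}]\mathbf{W}^{*}$.
Given a penalty parameter $\gamma > 0$ (empirically, $\gamma := 1$ generally gives good convergence speed), the augmented Lagrangian of problem (5), ignoring some irrelevant terms, is defined by
$$\mathcal{L}(\mathbf{U}, \mathbf{W}, \mathbf{Z}) = \frac{1}{2}\|\mathcal{A}(\mathbf{U}) - \mathbf{V}\|_F^2 + \lambda\sum_{k=1}^{K}\|\mathbf{B}_k\mathbf{W}\|_* + I_+(\mathbf{B}_0\mathbf{W}) + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} - \mathbf{C}_3 - \mathbf{Z}\|_F^2$$
where "$-\gamma\mathbf{Z}$" $\in \mathbb{R}^{2KL\times M}$ is the dual variable (or Lagrange multiplier) associated with the constraint $\mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} = \mathbf{C}_3$. ADMM then solves (5) via the following iterative procedure:
$$\mathbf{U}^{q+1} \in \arg\min_{\mathbf{U}\in\mathbb{R}^{KL\times M}} \mathcal{L}(\mathbf{U}, \mathbf{W}^{q}, \mathbf{Z}^{q}) \tag{6a}$$
$$\mathbf{W}^{q+1} \in \arg\min_{\mathbf{W}\in\mathbb{R}^{2KL\times M}} \mathcal{L}(\mathbf{U}^{q+1}, \mathbf{W}, \mathbf{Z}^{q}) \tag{6b}$$
$$\mathbf{Z}^{q+1} = \mathbf{Z}^{q} - (\mathbf{C}_1\mathbf{U}^{q+1} + \mathbf{C}_2\mathbf{W}^{q+1} - \mathbf{C}_3) \tag{6c}$$
where $\mathbf{W}^{0}$ can be initialized by $[\mathbf{T}_0^T, \mathbf{U}_0^T]^T$ with $\mathbf{T}_0 = \mathbf{0}_{KL\times M}$ and $\mathbf{U}_0 = \mathbf{1}_M^T \otimes \mathrm{vec}(\bar{\mathbf{S}}^T)$; $\mathbf{Z}^{0}$ can simply be initialized by $\mathbf{0}_{2KL\times M}$.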
As we will show, both (6a) and (6b) can be solved with closed-form expressions, thanks to the decomposability of ADMM.
Fig. 2. The objective function of swCAM for sample-specific deconvolution problem and its reformulation by ADMM. (For convenient illustration, the $\mathbf{T}$ matrices in all figures are the transposed versions of those in the text and equations.)
Notice that (6a) is a column-wise separable optimization problem, so we can decouple it w.r.t. each column of $\mathbf{U}$:
$$\mathbf{u}_i^{q+1} \in \arg\min_{\mathbf{u}_i\in\mathbb{R}^{KL}} \frac{1}{2}\|\mathbf{H}_i\mathbf{u}_i - \mathbf{v}_i\|_2^2 + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{u}_i + \mathbf{y}_i^{q}\|_2^2 \tag{7}$$
where $[\mathbf{y}_1^{q}, \dots, \mathbf{y}_M^{q}] \triangleq \mathbf{C}_2\mathbf{W}^{q} - \mathbf{C}_3 - \mathbf{Z}^{q}$. Subproblem (7) is an unconstrained quadratic problem, which is solved by
$$\mathbf{u}_i^{q+1} = (\mathbf{H}_i^T\mathbf{H}_i + \gamma\mathbf{C}_1^T\mathbf{C}_1)^{-1}(\mathbf{H}_i^T\mathbf{v}_i - \gamma\mathbf{C}_1^T\mathbf{y}_i^{q}). \tag{8}$$
The matrix inversion can be sped up using
$$(\mathbf{H}_i^T\mathbf{H}_i + \gamma\mathbf{C}_1^T\mathbf{C}_1)^{-1} = \left((\mathbf{a}_i^{p})^T\mathbf{a}_i^{p} + 2\gamma\mathbf{I}_K\right)^{-1} \otimes \mathbf{I}_L.$$
The right-hand term in (8) can also be simplified as
$$\mathbf{H}_i^T\mathbf{v}_i - \gamma\mathbf{C}_1^T\mathbf{y}_i^{q} = (\mathbf{a}_i^{p})^T \otimes \mathbf{x}_i^T - \gamma\left(\overline{\mathbf{y}}_i^{q} + \underline{\mathbf{y}}_i^{q}\right),$$
where $\mathbf{y}_i^{q} = [(\overline{\mathbf{y}}_i^{q})^T, (\underline{\mathbf{y}}_i^{q})^T]^T$, with $\overline{\mathbf{y}}_i^{q} \in \mathbb{R}^{KL}$ and $\underline{\mathbf{y}}_i^{q} \in \mathbb{R}^{KL}$ being the first and second half of $\mathbf{y}_i^{q}$, respectively.
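The Kronecker-structured inverse above replaces a KL x KL solve with a K x K one. The sketch below checks this numerically on small random inputs: it solves (8) directly and then again through the K x K inverse acting on the reshaped right-hand side. All sizes and variable names are illustrative assumptions, and the two halves of y are simply the first and last KL entries.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, gamma = 3, 4, 1.0
a = rng.dirichlet(np.ones(K))[None, :]      # 1 x K row vector, stands in for a_i^p
x = rng.normal(size=L)                      # v_i = x_i^T
y = rng.normal(size=2 * K * L)              # stands in for y_i^q

H = np.kron(a, np.eye(L))                   # H_i = a_i (x) I_L, shape L x KL
C1 = np.vstack([np.eye(K * L), np.eye(K * L)])

# Direct solution of (8): one KL x KL linear solve.
u_direct = np.linalg.solve(H.T @ H + gamma * C1.T @ C1,
                           H.T @ x - gamma * C1.T @ y)

# Structured solution: only a K x K matrix is inverted.
rhs = np.kron(a.ravel(), x) - gamma * (y[:K * L] + y[K * L:])
R = rhs.reshape(K, L).T                     # "devec" into an L x K matrix
G_inv = np.linalg.inv(a.T @ a + 2.0 * gamma * np.eye(K))
u_fast = (R @ G_inv).T.ravel()              # "vec": stack the K columns again
```

The agreement follows from the identity (G^{-1} (x) I_L) vec(R) = vec(R G^{-1}) for symmetric G, so the cost per sample is dominated by an L x K times K x K product rather than a solve of size KL.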
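Finally, the column vectors of $\mathbf{U}^{q+1}$ in (6a) can be computed quickly by
$$\mathbf{u}_i^{q+1} = \mathrm{vec}\left\{\mathrm{devec}\left\{(\mathbf{a}_i^{p})^T \otimes \mathbf{x}_i^T - \gamma\left(\overline{\mathbf{y}}_i^{q} + \underline{\mathbf{y}}_i^{q}\right) \,\middle|\, L, K\right\}\left((\mathbf{a}_i^{p})^T\mathbf{a}_i^{p} + 2\gamma\mathbf{I}_K\right)^{-1}\right\} \tag{9}$$
where $\overline{\mathbf{y}}_i^{q}$ and $\underline{\mathbf{y}}_i^{q}$ denote the first and second half of $\mathbf{y}_i^{q}$.
To solve (6b), we remove some irrelevant terms from its objective function:
$$\min_{\mathbf{W}\in\mathbb{R}^{2KL\times M}} \; \lambda\sum_{k=1}^{K}\|\mathbf{B}_k\mathbf{W}\|_* + I_+(\mathbf{B}_0\mathbf{W}) + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{U}^{q+1} + \mathbf{C}_2\mathbf{W} - \mathbf{C}_3 - \mathbf{Z}^{q}\|_F^2. \tag{10}$$
Then, defining $\mathbf{U}_k^{q+1} \in \mathbb{R}^{L\times M}$, $k = 1, \dots, K$, as the block matrices from top to bottom in $\mathbf{U}^{q+1} \in \mathbb{R}^{KL\times M}$, and $\mathbf{Z}_k \in \mathbb{R}^{L\times M}$, $k = 1, \dots, K$, and $\mathbf{Z}_0 \in \mathbb{R}^{KL\times M}$ as the block matrices from top to bottom in $\mathbf{Z} \in \mathbb{R}^{2KL\times M}$ (i.e., $\mathbf{Z} \triangleq [\mathbf{Z}_1^T, \dots, \mathbf{Z}_K^T, \mathbf{Z}_0^T]^T$), we decouple the objective function (10) into functions of $\mathbf{T}_k$, $k = 1, \dots, K$, and $\mathbf{S}$:
$$\min_{\mathbf{W}\in\mathbb{R}^{2KL\times M}} \; \sum_{k=1}^{K}\left\{\lambda\|\mathbf{T}_k\|_* + \frac{\gamma}{2}\left\|\mathbf{U}_k^{q+1} - \mathbf{T}_k - \mathbf{1}_M^T \otimes \bar{\mathbf{s}}_k - \mathbf{Z}_k^{q}\right\|_F^2\right\} + \left\{I_+(\mathbf{S}) + \frac{\gamma}{2}\left\|\mathbf{U}^{q+1} - \mathbf{S} - \mathbf{Z}_0^{q}\right\|_F^2\right\}$$
where $\bar{\mathbf{s}}_k \in \mathbb{R}^{L}$ is the $k$th block of $\mathrm{vec}(\bar{\mathbf{S}}^T)$. Therefore, $\mathbf{W}^{q+1}$ can be solved by the proximal point algorithm (PPA) (Parikh and Boyd 2014). Specifically, we have $\mathbf{W}^{q+1} = [(\mathbf{T}_1^{q+1})^T, \dots, (\mathbf{T}_K^{q+1})^T, (\mathbf{S}^{q+1})^T]^T$, in which
$$\mathbf{T}_k^{q+1} \in \arg\min_{\mathbf{T}_k\in\mathbb{R}^{L\times M}} \lambda\|\mathbf{T}_k\|_* + \frac{\gamma}{2}\left\|\mathbf{U}_k^{q+1} - \mathbf{T}_k - \mathbf{1}_M^T \otimes \bar{\mathbf{s}}_k - \mathbf{Z}_k^{q}\right\|_F^2 \tag{11a}$$
$$\mathbf{S}^{q+1} \in \arg\min_{\mathbf{S}\in\mathbb{R}^{KL\times M}} I_+(\mathbf{S}) + \frac{\gamma}{2}\left\|\mathbf{U}^{q+1} - \mathbf{S} - \mathbf{Z}_0^{q}\right\|_F^2 \tag{11b}$$
Note that (11a) and (11b) are exactly the proximal operators of $\|\mathbf{T}_k\|_*$ and $I_+(\mathbf{S})$, respectively (Parikh and Boyd 2014), and their closed-form solutions are given by
$$\mathbf{T}_k^{q+1} = \sum_{\ell=1}^{r}\left(\sigma_{k\ell} - \frac{\lambda}{\gamma}\right)_+ \boldsymbol{\mu}_{k\ell}\boldsymbol{\nu}_{k\ell}^T, \quad k = 1, \dots, K, \tag{12}$$
$$\mathbf{S}^{q+1} = \left[\mathbf{U}^{q+1} - \mathbf{Z}_0^{q}\right]_+, \tag{13}$$
where the singular value decomposition (SVD) of the proximal argument is performed ahead of the computation of (12), i.e. $\mathbf{U}_k^{q+1} - \mathbf{1}_M^T \otimes \bar{\mathbf{s}}_k - \mathbf{Z}_k^{q} = \sum_{\ell=1}^{r}\sigma_{k\ell}\boldsymbol{\mu}_{k\ell}\boldsymbol{\nu}_{k\ell}^T$.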
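The two closed-form updates (12) and (13) are standard proximal operators and can be sketched in a few lines. The function names `svt` and `project_nonneg` are illustrative, not taken from the swCAM code; `tau` plays the role of the threshold lambda/gamma.

```python
import numpy as np

def svt(D, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm at D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)        # (sigma - tau)_+ as in (12)
    return (U * s_shrunk) @ Vt                 # scale the columns of U, then recombine

def project_nonneg(D):
    """Projection onto the non-negative orthant, i.e. [D]_+ as in (13)."""
    return np.maximum(D, 0.0)

# Small example: singular values 3 and 1 thresholded at tau = 2.
D = np.diag([3.0, 1.0])
T_new = svt(D, 2.0)
S_new = project_nonneg(np.array([[1.5, -0.2], [-3.0, 0.0]]))
```

Thresholding removes every singular direction weaker than `tau`, which is how a larger lambda (or a smaller gamma) yields a lower-rank variation matrix.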
A reasonable termination criterion is that the primal residual, $pri = \|\mathbf{C}_1\mathbf{U} + \mathbf{C}_2\mathbf{W} - \mathbf{C}_3\|_2$, and the dual residual, $dual = \|\gamma\mathbf{C}_1^T\mathbf{C}_2(\mathbf{W}^{q+1} - \mathbf{W}^{q})\|_2$, are both smaller than a predefined tolerance.
Model parameter tuning
In noisy scenarios, the setting of the penalty parameter $\lambda$ is critical because it determines how much variation is preserved as patterns of interest and how much is ignored as noise. An extremely large $\lambda$ forces the individual variation to zero. Decreasing $\lambda$ allows more subtype-specific patterns to be detected, up to the point of overfitting. Cross-validation is a popular parameter tuning strategy for balancing underfitting and overfitting. One round of cross-validation excludes a certain portion of the samples and uses the model learned from the other samples to predict the excluded ones. Each model is then assessed by summarizing prediction performance across multiple rounds. However, our sample-specific deconvolution estimates the individual expression of each sample in each subtype, which cannot be used to predict excluded samples directly. We therefore propose to randomly exclude entries rather than samples of the $\mathbf{X}$ matrix (Fig. 3), similar to the strategy used in missing value imputation. This works because the low-rank patterns in the $\mathbf{T}_k$ matrices are detectable from only a portion of the $\mathbf{X}$ entries and can predict the excluded $\mathbf{X}$ entries.
Fig. 3. 10-fold cross-validation strategy for model parameter tuning. A part of the entries is randomly removed before applying swCAM. The removed entries are reconstructed from the estimated $\mathbf{T}$ matrix and compared to the observed expressions to compute the RMSE and decide the optimal parameter $\lambda$.
Specifically, we fix $\mathbf{A}$ and $\bar{\mathbf{S}}$ at their initialization values (from CAM estimation or a priori knowledge) and randomly remove entries of the $\mathbf{X}$ matrix, leading to the following objective function w.r.t. $\Delta\mathbf{S}_i$, $i = 1, \dots, M$:
$$\min_{\{\Delta\mathbf{S}_i\}_{i=1}^{M}} \; \frac{1}{2}\sum_{i=1}^{M}\left\|P_{\Omega_i}(\mathbf{x}_i) - P_{\Omega_i}\left(\mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i)\right)\right\|_2^2 + \lambda\sum_{k=1}^{K}\|\mathbf{T}_k\|_* \tag{14}$$
$$\text{s.t.} \quad \bar{\mathbf{S}} + \Delta\mathbf{S}_i \succeq \mathbf{0}_{K\times L}, \quad \mathbf{T}_k = [\Delta\mathbf{S}_1^T(k), \dots, \Delta\mathbf{S}_M^T(k)] \in \mathbb{R}^{L\times M}, \quad k = 1, \dots, K,$$
where $P_{\Omega_i}(\mathbf{x}_i) \in \mathbb{R}^{L}$ denotes the vector with the entries in $\Omega_i$ left unchanged and all other entries set to zero. The workflow of our proposed 10-fold cross-validation strategy is:
(1) Randomly split all entries into 10 folds;
(2) Remove one fold of entries and use the remaining 9 folds of entries to solve (14) with different $\lambda$ values $[\lambda_1, \lambda_2, \dots]$;
(3) Use the estimated $\Delta\mathbf{S}_i(\lambda_\theta)$, $i = 1, \dots, M$, $\theta = 1, 2, \dots$, together with the fixed $\mathbf{A}$ and $\bar{\mathbf{S}}$ matrices, to reconstruct the $\mathbf{X}$ matrix, and record the reconstructed values only for the removed entries of $\mathbf{X}$;
(4) Repeat Steps (2)-(3) for each fold to obtain a reconstructed matrix $\tilde{\mathbf{X}}(\lambda_\theta)$ in which every entry is reconstructed by an optimization run with $\lambda = \lambda_\theta$ from which its original value was absent;
(5) Calculate the Root Mean Square Error (RMSE) by
$$RMSE(\lambda_\theta) = \sqrt{\frac{1}{ML}\sum_{i=1}^{M}\sum_{j=1}^{L}\left(\mathbf{X}_{ij} - \tilde{\mathbf{X}}_{ij}(\lambda_\theta)\right)^2} \tag{15}$$
(6) Choose the $\lambda_\theta$ yielding the minimum RMSE.
A warm start can be used in Step (2) with decreasing parameters $\lambda_1 > \lambda_2 > \cdots$, where the estimate obtained with $\lambda_\theta$ initializes the next optimization with $\lambda_{\theta+1}$. The optimization problem (14) can be solved with an ADMM algorithm similar to the one in (5)-(13) used to solve (3).
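The bookkeeping of Steps (1)-(6) can be sketched independently of the solver. In the minimal sketch below, `fit_and_reconstruct` is a hypothetical callback standing in for "solve (14) without the held-out entries and rebuild X"; the toy callback used in the example is not swCAM, and warm starts are omitted for brevity.

```python
import math
import random

def split_entries(M, L, n_folds=10, seed=0):
    """Step (1): randomly assign every (i, j) entry of an M x L matrix to one fold."""
    entries = [(i, j) for i in range(M) for j in range(L)]
    random.Random(seed).shuffle(entries)
    return [entries[f::n_folds] for f in range(n_folds)]

def rmse(X, X_tilde):
    """Step (5): root mean square error over all M * L entries, as in (15)."""
    M, L = len(X), len(X[0])
    sq = sum((X[i][j] - X_tilde[i][j]) ** 2 for i in range(M) for j in range(L))
    return math.sqrt(sq / (M * L))

def cross_validate(X, lambdas, fit_and_reconstruct, n_folds=10, seed=0):
    """Steps (2)-(6): return the lambda with the smallest held-out RMSE."""
    M, L = len(X), len(X[0])
    folds = split_entries(M, L, n_folds, seed)
    scores = {}
    for lam in lambdas:
        X_tilde = [[0.0] * L for _ in range(M)]
        for fold in folds:
            recon = fit_and_reconstruct(X, set(fold), lam)   # hypothetical solver call
            for i, j in fold:                                # keep held-out entries only
                X_tilde[i][j] = recon[i][j]
        scores[lam] = rmse(X, X_tilde)
    return min(scores, key=scores.get), scores

def toy_fit(X, held_out, lam):
    """Toy stand-in: predict each entry by the shrunken mean of the visible row entries."""
    recon = []
    for i, row in enumerate(X):
        visible = [v for j, v in enumerate(row) if (i, j) not in held_out]
        recon.append([sum(visible) / len(visible) / (1.0 + lam)] * len(row))
    return recon

X_demo = [[2.0] * 6 for _ in range(5)]
best_lam, cv_scores = cross_validate(X_demo, [0.0, 1.0], toy_fit)
```

Because every entry belongs to exactly one fold, each entry of the reconstructed matrix comes from a fit that never saw its observed value, which is what makes the RMSE in (15) an out-of-sample error.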
The only modification is that (7) becomes
$$\mathbf{u}_i^{q+1} \in \arg\min_{\mathbf{u}_i\in\mathbb{R}^{KL}} \frac{1}{2}\left\|P'_{\Omega_i}(\mathbf{H}_i\mathbf{u}_i) - P'_{\Omega_i}(\mathbf{v}_i)\right\|_2^2 + \frac{\gamma}{2}\|\mathbf{C}_1\mathbf{u}_i + \mathbf{y}_i^{q}\|_2^2 \tag{16}$$
where $P'_{\Omega_i}(\cdot) = [\mathbf{1}_K^T \otimes P_{\Omega_i}(\cdot)^T]^T \in \mathbb{R}^{KL}$ makes all variables related to excluded entries be optimized by the second term only. This is still an unconstrained quadratic problem that can be solved easily. The remaining variables, unrelated to excluded entries, can still be optimized following (8)-(9).
Sparsity regularization
In addition to the low-rank assumption, we could also reasonably assume that only a limited number of genes are involved in functional modules, and thus impose a row-sparsity regularization by $\ell_{2,1}$-norm minimization. The alternative swCAM formulation is:
$$\min_{\mathbf{A},\{\Delta\mathbf{S}_i\}_{i=1}^{M}} \; \frac{1}{2}\sum_{i=1}^{M}\left\|\mathbf{x}_i - \mathbf{a}_i(\bar{\mathbf{S}} + \Delta\mathbf{S}_i)\right\|_2^2 + \lambda\sum_{k=1}^{K}\|\mathbf{T}_k\|_* + \delta\|\mathbf{T}\|_{2,1} \tag{17}$$
where $\delta > 0$ is the regularization parameter of the $\ell_{2,1}$ norm of $\mathbf{T}$, defined as
$$\|\mathbf{T}\|_{2,1} \triangleq \sum_{i=1}^{KL}\|\mathbf{t}_i\|_2,$$
with $\mathbf{t}_i$ the $i$th row of $\mathbf{T}$, which accounts for the row-sparsity of $\mathbf{T}$. If necessary, the parameter $\delta$ can be varied across rows based on the characteristics of each gene, such as the mean-variance trend. The supplementary information gives more details on the optimization of (17) by ADMM. $\ell_1$- or $\ell_2$-norm minimization, as commonly used regularization methods, could instead be imposed on the individual entries of the $\mathbf{T}$ matrix. We also provide ADMM optimization for sample-specific deconvolution with $\ell_1$- or $\ell_2$-norm minimization, which could be useful in other sBSS problems.
Results
As swCAM focuses on estimating subtype-specific variation, simulating the biological variance within each subtype and the technical variance of each observation is important for validating swCAM performance. We conduct two sets of simulations.
The first is an ideal scenario where the variance is not related to the mean value. The second is more realistic: genes with a larger mean usually have a larger variance.
Validation on ideal simulations
In the first set of simulations, we design twelve function modules, four in each of three subtypes. The observations for 300 genes in 50 samples were simulated with the subtype-specific expression baseline, $\bar{\mathbf{S}}$, sampled from the purified cell populations in the real benchmark microarray gene expression data set GSE19380 (Kuhn, Thu et al. 2011). The proportions $\mathbf{a}_i$, $i = 1, \dots, M$, are drawn randomly from a flat Dirichlet distribution. The between-sample variation, $\Delta\mathbf{S}_i(k, j)$, $i = 1, \dots, M$, for the $k$th subtype and $j$th gene was drawn from the normal distribution $\mathcal{N}(0, \sigma_{kj}^{(s)})$ if the $j$th gene was involved in a function module in the $k$th subtype, and set to zero otherwise (Fig. 4a). Genes in the same function module have pairwise correlation coefficients equal to one, thus generating a highly correlated gene set in each module. The $\sigma_{kj}^{(s)}$ are drawn from the uniform distribution $U[50, 300]$. The technical noise, $\mathbf{n}_i$, $i = 1, \dots, M$, was drawn from a zero-mean normal distribution with variance $\sigma_{ij}^{(n)} = 10$.
The twelve functional modules can be recognized in the variation matrix from swCAM when $\lambda$ falls into a certain range (Fig. 4b~4i). Increasing the penalty parameter of the nuclear norm filters out more noise, but at the cost of possibly missing the true variation signal. The RMSE derived by the 10-fold cross-validation strategy is relatively small for $\lambda$ between 1 and 50 and reaches its minimum at $\lambda = 5$ (Fig. 5a). The estimated variation matrices look quite similar when $1 \le \lambda \le 50$ (Fig. 4e~4g), with 12 clear patterns and some artifacts.
The artifacts form when the signal variation in one subtype spreads to other subtypes for the same genes; they are much weaker than the detected true signals unless $\lambda$ is extremely small. (As shown in the supplementary information, nuclear norm minimization on each subtype's variation matrix is a good option for reducing artifacts compared to other regularization terms.) Interestingly, $\lambda = 5$ is also the point where both primal and dual residuals surge in the ADMM algorithm (Fig. 5c~5f). This is because a larger $\lambda$ tends to train an over-simplified model, for which ADMM approaches the optimum more easily.
The recovery of sample-specific signals in a subtype is also affected by the mixing proportion of this subtype within the sample. When a subtype accounts for a very small portion of a certain sample, its true signal in this sample will be very weak and thus underestimated (green points in Fig. 6). In contrast, the major subtype in a sample can be estimated very well by CAM-SS (red points in Fig. 6).
Fig. 4. Heatmap of estimated $\mathbf{T}$ matrix with varied $\lambda$ parameters compared to ground truth in the ideal simulation. Increasing the penalty parameter of the nuclear norm will filter more noise but at the cost of the possibility of missing true signal variation.
Fig. 5. 10-fold cross-validation results under different $\lambda$ parameter in the ideal simulation.
(a) RMSE; (c) Residuals for primal feasibility condition; (e) Residuals for dual feasibility condition; (b), (d), (f) are zoomed curves of (a), (c), (e).
Fig. 6. Estimated $\mathbf{T}$ matrix versus ground truth when $\lambda = 5$ in the ideal simulation. The mixing proportions associated with the estimated entries are colored to show that the sample-specific expressions of high-proportion subtypes can be estimated more accurately than those of low-proportion ones.
Validation on realistic simulations
A mean-variance trend is widespread in molecular expression data. In our second simulation, all settings are the same as above except that the variance of subtype-specific expression, $\sigma_{kj}^{(s)}$, and the technical variance of observations, $\sigma_{ij}^{(n)}$, are proportional to the subtype-specific expression mean and the mixed expression level, respectively. The coefficient of variation (CV), the ratio of the standard deviation to the mean, is drawn from the uniform distributions $U[0.15, 0.3]$ and $U[0.02, 0.05]$, respectively.
The 10-fold cross-validation strategy still obtains the minimum RMSE at $\lambda = 5$ (Fig. 8a~8b), where both primal and dual residuals also surge (Fig. 8c~8f). However, the variation matrix estimated by swCAM is blurred by artifacts trained from noise (Fig. 7). Some highly expressed genes have relatively large variance, which could be falsely modeled as subtype-specific signal variations.
As shown in Fig. 9, the entries with zero value in the ground truth variation matrix could be overestimated.
Though the absolute expression values estimated by swCAM could deviate from the ground truth, we can still clearly detect 12 functional modules, defined by Weighted Gene Correlation Network Analysis (WGCNA) (Zhang and Horvath 2005, Langfelder and Horvath 2008), in the estimated sample-specific expressions (Fig. 10). WGCNA constructs weighted networks based on correlation patterns among genes across samples and thus detects function modules of highly correlated gene sets. In Fig. 10, the second and third subtypes recover exactly the four true modules, with very few genes missed. The first subtype yields an extra false module, but it is a less significant pattern than the other modules and becomes undetectable with a stricter tree-height cut threshold. More importantly, without swCAM-based deconvolution (Fig. 10d), WGCNA on the mixture expression profiles finds none of the true modules, only three false modules that are related to the mixing process of the three subtypes.
Incorporation of L21-norm regularization
In the above simulations, the deconvoluted sample-specific signals contain artifacts trained from the signals of other subtypes and artifacts trained from noise (Fig. 4 and Fig. 7). We can use an $\ell_{2,1}$-norm regularization to enforce the sparsity of genes that have signal variation across samples. It is expected to reduce artifacts, and it also follows the assumption that only a limited number of genes contribute to source variation in hidden modules.
Figure 11 shows the alleviated artifacts with $\lambda = 5$ and $\delta = 10$, 1, or 0.1. The true function modules are correctly detected with $\lambda = 5$ and $\delta = 1$ or 0.1, and the false module in the first subtype is suppressed when $\delta = 1$ (Fig. 12). Increasing the penalty parameter $\delta$ forces more genes to have zero variance, which suppresses the artifacts and false function modules but brings the risk of missing the true signals.
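How an ℓ2,1 penalty forces whole genes to zero variance can be seen from its proximal operator, which shrinks each row of T by its Euclidean norm. The sketch below uses the standard row-wise (group) soft-thresholding formula; this is an assumption for illustration, since the ADMM updates for the ℓ2,1 term are described only in the supplementary information, and the function names are hypothetical.

```python
import math

def l21_norm(T):
    """Sum of the Euclidean norms of the rows of T (the l2,1 norm)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in T)

def prox_l21(T, tau):
    """Row-wise soft-thresholding: proximal operator of tau * l2,1 norm at T."""
    out = []
    for row in T:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows with norm <= tau are set to zero; the others are shrunk toward zero.
        scale = max(0.0, 1.0 - tau / norm) if norm > 0.0 else 0.0
        out.append([scale * v for v in row])
    return out

# One strongly varying row (gene) and one weak row, thresholded at tau = 1.
T_demo = [[3.0, 4.0], [0.3, 0.4]]
T_sparse = prox_l21(T_demo, 1.0)
```

In this toy example the weak row is removed entirely while the strong row keeps its direction with a reduced norm, mirroring the trade-off noted above: a larger threshold suppresses more artifacts but can also erase weak true signals.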
A parameter tuning method for $\delta$ is therefore critical. However, the cross-validation strategy of randomly excluding entries, used for tuning $\lambda$, relies on the low-rank assumption: the hidden low-rank patterns can be trained from a part of the entries and then used to reconstruct the remaining entries. This strategy is not applicable to the selection of $\delta$, which needs further study.
Fig. 7. Heatmap of estimated $\mathbf{T}$ matrix scaled by associated means compared to ground truth in the realistic simulation with varied $\lambda$ parameters. Increasing the penalty parameter of the nuclear norm will filter more noise but at the cost of the possibility of missing true variation signal.
Fig. 8. 10-fold cross-validation results under different $\lambda$ parameter in the realistic simulation. (a) RMSE; (c) Residuals for primal feasibility condition; (e) Residuals for dual feasibility condition; (b), (d), (f) are zoomed curves of (a), (c), (e).
Fig. 9. Estimated $\mathbf{T}$ matrix scaled by associated means versus ground truth in the realistic simulation ($\lambda = 5$).
The mixing proportions associated with the estimated entries are colored to show that the sample-specific expressions of high-proportion subtypes can be estimated more accurately than those of low-proportion ones.
Fig. 10. Gene co-expressed function modules detected by WGCNA on swCAM estimated sample-specific expression for each subtype (a~c) or on originally observed expressions without deconvolution (d). (Network interconnectedness is measured by topological overlap; cutHeight = 0.7; minSize = 8.)
Fig. 11. Heatmap of estimated $\mathbf{T}$ matrix scaled by associated means compared to ground truth in the realistic simulation with $\lambda = 5$ and varied $\delta$. Increasing the penalty of the L21 norm will enforce more zero columns in the $\Delta\mathbf{S}_k$ matrix.
Fig. 12. Gene co-expressed function modules detected by WGCNA on swCAM estimated sample-specific expression for each subtype with $\lambda = 5$ and $\delta = 1$ or 0.1. (Network interconnectedness is measured by topological overlap; cutHeight = 0.7; minSize = 8.)
; https://doi.org/10.1101/2021.01.04.425315doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425315 25 Discussion Most existing tissue deconvolution methods ignore the expression variability of subtypes across individual samples. swCAM will significantly expand the utility of CAM by producing subtype- specific expression profiles in each sample. The success of swCAM depends on the low-rank assumption, which takes advantage of biologically expected cooperation among genes and thus sheds light on solving the seemingly underdetermined sample-specific deconvolution problem. The low-rank assumption holds naturally in molecule expression data when there exist activated functional modules required by particular biological processes or pathways in different subtypes. The detection of such subtype-specific associations or networks is one of the major targets in the analysis of molecule expression profiles. After our sample-specific deconvolution by swCAM, conventional network analysis methods can be applied directly to the estimated sample-subtype- specific signals to construct subtype-specific networks, e.g. weighted correlation network analysis (WGCNA (Zhang and Horvath 2005, Langfelder and Horvath 2008)) and differential dependency network analysis (DDN (Zhang, Li et al. 2009, Zhang, Tian et al. 2011, Tian, Zhang et al. 2014, Tian, Zhang et al. 2015)). The cross-validation strategy of excluding entries randomly is inspired by the similar ideas in matrix imputation methods that commonly assume the matrix to be recovered has a low rank. Our results consistently show a U-curve over parameter 𝜆, demonstrating the feasibility of the proposed cross-validation strategy. Meanwhile, CAM is not sensitive to the choice of 𝜆, as the U-curve has a wide platform where the recovered sample-subtype-specific signals are similar and detected modules are close. It is also reasonable to assume that genes involved in biological associations or networks are sparse. 
Therefore, using ℓ2,1-norm regularization to reduce artifacts and improve function module detection deserves further study.

When group information is available, we can also apply the basic CAM algorithm to each group to obtain group-wise expression profiles of subtypes. Compared to sample-specific deconvolution, group-specific deconvolution aims at a lower resolution of underlying subtype signals and thus could obtain more robust results. If grouping is fine enough, group-specific deconvolution can also acquire signal variation in each subtype and thus help detect function modules and construct biological networks.

Although swCAM can theoretically solve a seemingly underdetermined problem based on a low-rank assumption, it still needs improvement and validation. First, the improvement of swCAM by sparsity regularization. The sparsity assumption is practically reasonable, and we already show some preliminary results after imposing ℓ2,1 norm regularization. However, introducing one more regularization term will increase the difficulty of parameter tuning. Besides, the current cross-validation strategy with matrix entry sampling is not applicable to selecting the coefficient of the ℓ2,1 norm term. Therefore, the integration of sparsity regularization still needs further study. Second, the improvement of function module detection based on swCAM estimated sample-specific signals in each subtype. Recovering the exact values of sample-specific signals is impossible without stronger assumptions. Luckily, our goal is to detect function modules or networks from the between-sample variations in each subtype.
Thus, increasing the accuracy of estimated intercorrelations among molecules can be regarded as the target of our further efforts. Third, the validation of swCAM in real data analysis. We have demonstrated the capacity of swCAM to estimate sample-specific signals in each subtype using simulations where the between-sample variation matrices are low-rank. Validation of swCAM in real molecule expression data would be difficult, as benchmark datasets with true subtype-specific signals are unavailable. One possible direction is to verify the constructed subtype-specific networks through biological experiments.

Conclusion

We propose a sample-specific deconvolution algorithm to estimate sample-specific molecule expressions for each subtype, from which between-sample variation can be used to detect biological associations and construct networks in each subtype. The contributions of this work include: We formulate the objective function for swCAM with a penalty term to minimize the nuclear norm of the between-sample variation matrix in each subtype, based on our expectation of the existence of subtype-specific networks. We design an efficient method based on ADMM to solve swCAM’s optimization problem in large-scale biological data. We design a 10-fold cross-validation strategy to select the coefficient of the nuclear norm term, and demonstrate its feasibility in simulations where a U-curve of RMSE is obtained to determine the optimal selection. We validate swCAM in simulations to demonstrate that sample-specific signals can be well estimated when the low-rank assumption holds.
Even though artificial signal variances exist in swCAM estimations, the intercorrelations among genes can still be well preserved for function module detection and biological network construction. We propose to use extra ℓ2,1 norm regularization to enforce the sparsity of genes involved in networks and thus reduce the artifacts trained from noise or from signals of other subtypes.

ACKNOWLEDGMENTS

This work has been supported by the National Institutes of Health under Grants HL111362-05A1, HL133932, NS115658-01, and the Department of Defence under Grant W81XWH-18-1-0723 (BC171885P1).

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

Reference

Boyd, S., N. Parikh, E. Chu, B. Peleato and J. Eckstein (2011). "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers." Found. Trends Mach. Learn. 3(1): 1-122. Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni and O. Stegle (2015). "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nat Biotechnol 33(2): 155-160. Cai, J.-F., E. J. Candès and Z. Shen (2010). "A Singular Value Thresholding Algorithm for Matrix Completion." SIAM Journal on Optimization 20(4): 1956-1982. Candes, E. J., C. A. Sing-Long and J. D. Trzasko (2013). "Unbiased Risk Estimates for Singular Value Thresholding and Spectral Estimators." Trans. Sig. Proc. 61(19): 4643-4657. Chasman, D. and S. Roy (2017). "Inference of cell type specific regulatory networks on mammalian lineages."
Current Opinion in Systems Biology 2(Supplement C): 130-139. Chen, L. (2019). Mathematical Modeling and Deconvolution for Molecular Characterization of Tissue Heterogeneity. Ph.D. Doctoral Dissertation, Virginia Polytechnic Institute and State University. Chen, L., Y. Lu, C.-T. Wu, R. Clarke, G. Yu, J. E. Van Eyk, D. Herrington and Y. Wang (2020). "Data-driven detection of subtype-specific differentially expressed genes." Scientific Reports. Gal, E., M. London, A. Globerson, S. Ramaswamy, M. W. Reimann, E. Muller, H. Markram and I. Segev (2017). "Rich cell-type-specific network topology in neocortical microcircuitry." Nature Neuroscience 20: 1004. Hastie, T., R. Tibshirani and J. Friedman (2001). The Elements of Statistical Learning. New York, NY, USA, Springer New York Inc. Junttila, M. R. and F. J. de Sauvage (2013). "Influence of tumour micro-environment heterogeneity on therapeutic response." Nature 501: 346. Kuhn, A., D. Thu, H. J. Waldvogel, R. L. Faull and R. Luthi-Carter (2011). "Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain." Nat Methods 8(11): 945-947. Langfelder, P. and S. Horvath (2008). "WGCNA: an R package for weighted correlation network analysis." BMC Bioinformatics 9: 559. Parikh, N. and S. Boyd (2014). "Proximal Algorithms." Foundations and Trends® in Optimization 1(3): 127-239. Recht, B., M. Fazel and P. A. Parrilo (2010). "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization." SIAM Review 52(3): 471-501. Shen-Orr, S. S., R. Tibshirani, P. Khatri, D. L. Bodian, F. Staedtler, N. M. Perry, T. Hastie, M. M. Sarwal, M. M. Davis and A. J. Butte (2010). "Cell type-specific gene expression differences in complex tissues." Nat Methods 7(4): 287-289. Sonawane, A. R., J. Platig, M. Fagny, C.-Y. Chen, J. N. Paulson, C. M. Lopes-Ramos, D. L. DeMeo, J. Quackenbush, K. Glass and M. L. Kuijjer "Understanding Tissue-Specific Gene Regulation." Cell Reports 21(4): 1077-1088. 
Thouvenin, P. A., N. Dobigeon and J. Y. Tourneret (2016). "Hyperspectral Unmixing With Spectral Variability Using a Perturbed Linear Mixing Model." IEEE Transactions on Signal Processing 64(2): 525-538. Tian, Y., B. Zhang, E. P. Hoffman, R. Clarke, Z. Zhang, I.-M. Shih, J. Xuan, D. M. Herrington and Y. Wang (2014). "Knowledge-fused differential dependency network models for detecting significant rewiring in biological networks." BMC Systems Biology 8(1): 87. Tian, Y., B. Zhang, E. P. Hoffman, R. Clarke, Z. Zhang, M. Shih Ie, J. Xuan, D. M. Herrington and Y. Wang (2015). "KDDN: an open-source Cytoscape app for constructing differential dependency networks with significant rewiring." Bioinformatics 31(2): 287-289. Wang, N., E. P. Hoffman, L. Chen, L. Chen, Z. Zhang, C. Liu, G. Yu, D. M. Herrington, R. Clarke and Y. Wang (2016). "Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues." Scientific Reports 6: 18909. Zhang, B. and S. Horvath (2005). "A general framework for weighted gene co-expression network analysis." Stat Appl Genet Mol Biol 4: Article17. Zhang, B., H. Li, R. B. Riggins, M. Zhan, J. Xuan, Z. Zhang, E. P. Hoffman, R. Clarke and Y. Wang (2009). "Differential dependency network analysis to identify condition-specific topological changes in biological networks." Bioinformatics 25(4): 526-532. Zhang, B., Y. Tian, L. Jin, H. Li, M. Shih Ie, S. Madhavan, R. Clarke, E. P. Hoffman, J. Xuan, L. Hilakivi-Clarke and Y. Wang (2011). "DDN: a caBIG(R) analytical tool for differential network analysis." Bioinformatics 27(7): 1036-1038.

10_1101-2021_01_05_425266 ---- DeepStrain: A Deep Learning Workflow for the Automated Characterization of Cardiac Mechanics

Manuel A. Morales, Maaike van den Boomen, Christopher Nguyen, Jayashree Kalpathy-Cramer, Bruce R. Rosen, Collin M. Stultz, David Izquierdo-Garcia*, and Ciprian Catana*

Abstract—Myocardial strain analysis from cinematic magnetic resonance imaging (cine-MRI) data could provide a more thorough characterization of cardiac mechanics than volumetric parameters such as left-ventricular ejection fraction, but sources of variation including segmentation and motion estimation have limited its wide clinical use. We designed and validated a deep learning (DL) workflow to generate both volumetric parameters and strain measures from cine-MRI data, including strain rate (SR) and regional strain polar maps, consisting of segmentation and motion estimation convolutional neural networks developed and trained using healthy and cardiovascular disease (CVD) subjects (n=150). DL-based volumetric parameters were correlated (>0.98) and without significant bias relative to parameters derived from manual segmentations in 50 healthy and CVD subjects. Compared to landmarks manually-tracked on tagging-MRI images from 15 healthy subjects, landmark deformation using DL-based motion estimates from paired cine-MRI data resulted in an end-point-error of 2.9 ± 1.5 mm. Measures of end-systolic global strain from these cine-MRI data showed no significant biases relative to a tagging-MRI reference method.
On 4 healthy subjects, intraclass correlation coefficient for intra-scanner repeatability was excellent (>0.95) for strain, moderate to excellent for SR (0.690-0.963), and good to excellent (0.826-0.994) in most polar map segments. Absolute relative change was within ~5% for strain, within ~10% for SR, and <1% in half of polar map segments. In conclusion, we developed and evaluated a DL-based, end-to-end fully-automatic workflow for global and regional myocardial strain analysis to quantitatively characterize cardiac mechanics of healthy and CVD subjects based on ubiquitously acquired cine-MRI data.

Index Terms—cardiac cine-MRI, deep learning, motion estimation, myocardial strain, segmentation.

Submitted for review on Dec 20, 2020. This work was supported in part by the U.S. National Cancer Institute under Grant 1R01CA218187-01A1. (Asterisk indicates D. Izquierdo-Garcia and C. Catana contributed equally to this work). (Corresponding authors: D. Izquierdo-Garcia; C. Catana). M.A. Morales, D. Izquierdo-Garcia and B.R. Rosen, with Athinoula A. Martinos Center for Biomedical Imaging, MGH, HMS, 149 13th St, Boston, MA 02129 (email: moralesq@mit.edu; davidizq@nmr.mgh.harvard.edu; brrosen@mgh.harvard.edu) and with Harvard-MIT Health Science and Technology, 77 Massachusetts Ave, Cambridge, MA, 02139. M.V.D. Boomen and C. Nguyen, with Cardiovascular Research Center and Martinos Center for Biomedical Imaging, MGH, HMS, 149 13th St, Boston, MA 02129, with Department of Radiology, and M.V.D. Boomen also with University Medical Center Groningen, 9713 GZ Groningen (email: mvandenboomen@mgh.harvard.edu; Christopher.nguyen@mgh.havard.edu). C.M. Stultz, with Electrical Engineering and Computer Science, with Harvard-MIT Health Science and Technology, 77 Massachusetts Ave, Cambridge, MA, 02139, and with Division of Cardiology, MGH, 55 Fruit St, Boston, MA, 02114 (cmstultz@mit.edu). J. Kalpathy-Cramer, and Ciprian Catana, with Athinoula A.
Martinos Center for Biomedical Imaging, MGH, HMS, 149 13th St, Boston, MA 02129 (jkalpathy-cramer@mgh.harvard.edu; ccatana@mgh.harvard.edu).

I. INTRODUCTION

CARDIAC mechanics reflects the precise interplay between myocardial architecture and loading conditions that is essential for sustaining the blood pumping function of the heart. The ejection fraction (EF) is often used as a left-ventricular (LV) functional index, but its value is limited when mechanical impairment occurs without an EF reduction [1]. Alternatively, tissue tracking approaches for strain analysis provide a more thorough characterization through non-invasive evaluation of myocardial deformation from echocardiography or cinematic magnetic resonance imaging (cine-MRI) data [2], and could be used to identify dysfunction before EF is reduced [3]. Unfortunately, various sources of discrepancies have limited the wide clinical applicability of these techniques, including factors related to imaging modality, algorithm, and operator [4]. More accurate measures could be obtained from tagging-MRI data widely regarded as the reference standard for strain quantification [5], [6], but collection of these data requires highly specialized and complex sequences that have mainly remained research tools, whereas echocardiography and cine-MRI data are ubiquitously acquired in clinical practice.

Irrespective of algorithm or modality, e.g., speckle tracking for echocardiography or feature tracking for cine-MRI, the main challenge is to estimate motion within regions along the myocardial wall [2]. Operator-related discrepancies are introduced when the myocardial wall borders are delineated manually, a time-consuming process that requires considerable expertise and results in significant inter- and intra-observer variability [7], [8].
Automatic delineation approaches have been implemented within computational pipelines [9], but other factors related to motion tracking algorithms also influence strain assessment, including the appropriate selection of tuneable parameters whose optimal values can differ between patient cohorts and acquisition protocols (e.g., the size of the search region in block-matching methods [10]). Further, these algorithms often make assumptions about the properties of the myocardial tissue (e.g., incompressible and elastic [11], [12]), or use registration methods to drive the solution towards an expected geometry. However, recent evidence has shown that the validity of these assumptions varies between healthy and diseased myocardium [13], [14], suggesting these approaches may not accurately reflect the underlying biomechanical motion [14]. Lastly, modality-related image quality could complicate interpretation of abnormal strain values since these could reflect either real dysfunction or artifact-related inaccuracies, leading to some degree of subjectivity or non-conclusive results [3].

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.05.425266; this version posted January 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Deep Learning (DL) methods have demonstrated the advantage of allowing real-world data to guide the learning of abstract representations that can be used to accomplish pre-specified tasks, and have been shown to be more robust to image artifacts than non-learning techniques for some applications [15], [16].
DL segmentation methods have been proposed [17]–[20] and implemented within strain computational pipelines [21], [22], and recent studies have shown that cardiac motion estimation can also be recast as a learnable problem [23]–[26]. These methods usually consist of an intensity-based loss function and a constraint term [23], [27], the latter using common machine learning techniques (e.g., L2 regularization of all learnable parameters [24]) or direct regularization of the motion estimates (e.g., smoothness penalty [23], anatomy-aware [26]). However, because ground-truth cardiac motion is challenging to acquire, whether these constraints improve the accuracy of motion or strain estimates is not yet clear. Further, the added value of DL-based regional strain estimation has not been demonstrated.

We have recently developed a learning method for cardiac motion estimation that produces more accurate estimates than various techniques, including B-spline, diffeomorphic, and mass-preserving algorithms [28], and showed these estimates could potentially be used to detect regional dysfunction. Thus, incorporating our method within a strain analysis framework could potentially enable accurate, user-independent, and quantitative characterization of cardiac mechanics at both a global and a regional level. Once trained, such a method would not necessitate further parameter tuning or optimization, which is time-consuming for larger datasets and daily clinical practice. While this framework could be based on echocardiography images [29], these data remain limited for strain mapping tasks by their low reproducibility of acquisition planes [4] and temporal stability of tracking patterns [30]. In contrast, cine-MRI offers the most accurate and reproducible assessment of cardiac anatomy and function, thus providing a more thorough set of data for learning-based motion models.
We propose DeepStrain, an automated workflow that derives global and regional strain measures from cine-MRI data by decoupling motion estimation and segmentation tasks. After verifying the effects of smoothing and anatomical regularizers on motion and strain, convolutional neural networks for pre-processing (i.e., centering and cropping), segmentation, and motion estimation were implemented, trained, validated, and compared to state-of-the-art methods. Finally, accuracy of strain values was assessed using a tagging-MRI algorithm as reference standard, intra-scanner repeatability was measured using subjects with repeated scans, and potential clinical applications of global and regional myocardial strain measures were demonstrated on patient populations.

II. METHOD

A. Datasets

For development we used the Automated Cardiac Diagnosis Challenge (ACDC) dataset [31], consisting of cine-MRI data from 150 subjects evenly divided into five groups: healthy and patients with hypertrophic cardiomyopathy (HCM), abnormal right ventricle (ARV), myocardial infarction with reduced ejection fraction (MI), and dilated cardiomyopathy (DCM). These data were publicly available as train (n=100) and test (n=50) sets, with manual segmentations included for the train set only. For validation of motion and strain measures we used the Cardiac Motion Analysis Challenge (CMAC) dataset [32], consisting of paired tagging- and cine-MRI data from 15 healthy subjects. To assess intra-scanner repeatability, four healthy volunteers were recruited to undergo repeated scans on a 3T MRI scanner. All cine-MRI frames and corresponding segmentations were resampled to a 256×256×16 volume grid with 1.25 mm × 1.25 mm in-plane resolution and variable slice thickness (4-7 mm). See supplementary section S1 for acquisition protocol.

B. Myocardial Strain Definitions

Strain represents percent change in myocardial length per unit length.
The three-dimensional (3D) analog for MRI is given by the Lagrange strain tensor

$$\boldsymbol{\epsilon}(t) = \frac{1}{2}\left(\nabla\mathbf{u}(t) + \nabla\mathbf{u}(t)^{T} + \nabla\mathbf{u}(t)^{T}\,\nabla\mathbf{u}(t)\right), \qquad (1)$$

where $\mathbf{u}(t)$ denotes myocardial displacement from a fully-relaxed end-diastolic phase at t=0, to a contracted frame at t>0. Radial and circumferential strain are the diagonal components of the tensor $\boldsymbol{\epsilon}$ evaluated in cylindrical coordinates. Strain rate (SR) is the time derivative of (1). Global strain is defined as the average of $\boldsymbol{\epsilon}$ over the whole LV myocardium (LVM) volume. Regional strain is defined as the average of $\boldsymbol{\epsilon}$ over the volume of specific LVM segments defined by the American Heart Association (AHA) polar map [33], which requires labels of the right ventricle to construct. Specific parameters based on timing and magnitude are extracted from the measures evaluated over a whole cardiac cycle: end-systolic strain (ESS), defined as the global strain value at end-systole; systolic strain rate (SRs), defined as the peak (i.e., maximum) absolute value of global SR during systole; early-diastolic strain rate (SRe), defined as the peak absolute value of global SR during diastole.

C. Centering, Segmentation, and Motion Estimation

DeepStrain (Fig. 1) consists of a series of convolutional neural networks that perform three tasks: a ventricular centering network (VCN) for automated centering and cropping, a cardiac motion estimation network (CarMEN) to generate $\mathbf{u}$, and a cardiac segmentation network (CarSON) to generate tissue labels.
Estimates of $\mathbf{u}$ are used to calculate myocardial strain, and segmentations are used to derive volumetric parameters, identify a cardiac coordinate system for strain analysis, and generate tissue labels used for anatomical regularization of the motion estimates at training time. Let $V_t$ be a cine-MRI frame at time t defined over a 3D spatial domain $\Omega \subset \mathbb{R}^3$. Using a pair of frames $V_0, V_t$ as an input, VCN centers and crops the images around the center of mass of the LV, CarSON generates segmentations $M_0, M_t$ of the LV, RV, and LVM, and CarMEN estimates the motion $\mathbf{u}_t$ of the heart from $V_0$ to $V_t$. Thus, for each voxel $p \in \Omega$, $\mathbf{u}_t(p)$ is an approximation of the myocardial displacement during contraction such that $V_0(p)$ and $(\mathbf{u}_t \circ V_t)(p)$ correspond to similar cardiac regions. The operator $\circ$ refers to application of a spatial transform to $V_t$ using $\mathbf{u}_t$ via trilinear interpolation [34].

1) Architectures

All networks have a common encoder-decoder architecture consisting primarily of convolution, batch normalization [35], and PReLU [36] layers with residual connections [37] (see supplementary section S2). Briefly, VCN is a 3D architecture that uses a single-channel array $V$ with size 256×256×16 to generate a single-channel array $G_{pred}$ of equal size, where $G_{pred}$ corresponds to a Gaussian distribution with mean defined as the LVM center of mass. $V$ is centered and cropped around the voxel with the highest value in $G_{pred}$ to generate a new cropped array of size 128×128×16, which is then the input to segmentation and motion estimation networks. CarSON is a two-dimensional (2D) architecture that uses images of size 128×128 to generate a 4-channel segmentation $M_{pred}$ of equal size, each channel corresponding to a label. CarMEN uses a 2-channel input volume, consisting of two concatenated arrays with size 128×128×16, to generate a 3-channel array $\mathbf{u}$ of equal size. Each channel in $\mathbf{u}$ represents the $x$, $y$ and $z$ components of motion.
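As a rough illustration of how a 3-channel displacement array like the one CarMEN outputs feeds the strain definition in (1), the sketch below evaluates the Lagrange strain tensor with finite differences in NumPy. The function names `lagrange_strain` and `global_strain`, the array layout (x, y, z, component), and the use of `np.gradient` are assumptions made for illustration; they are not taken from the DeepStrain implementation, and the conversion to radial and circumferential components in cylindrical coordinates is omitted.

```python
import numpy as np


def lagrange_strain(u, spacing=(1.0, 1.0, 1.0)):
    """Lagrange strain (grad u + grad u^T + grad u^T grad u) / 2 per voxel.

    u: displacement array of shape (X, Y, Z, 3), in the same units as spacing.
    Returns an array of shape (X, Y, Z, 3, 3).
    """
    # grad[..., i, j] approximates d u_i / d x_j by finite differences.
    grad = np.stack(
        [np.stack(np.gradient(u[..., i], *spacing, axis=(0, 1, 2)), axis=-1)
         for i in range(3)],
        axis=-2,
    )
    grad_t = np.swapaxes(grad, -1, -2)
    return 0.5 * (grad + grad_t + grad_t @ grad)


def global_strain(strain, mask):
    """Average the strain tensor over the voxels selected by a boolean mask."""
    return strain[mask].mean(axis=0)


# Hypothetical example: a uniform 10% stretch along x on a small grid.
xs = np.meshgrid(np.arange(6.0), np.arange(5.0), np.arange(4.0), indexing="ij")[0]
u = np.zeros(xs.shape + (3,))
u[..., 0] = 0.1 * xs
eps = lagrange_strain(u)
mask = np.ones(xs.shape, dtype=bool)   # stand-in for an LVM label map
print(global_strain(eps, mask)[0, 0])  # xx component of the averaged tensor
```

For data resampled as described in the Datasets section, `spacing` would presumably be set to the in-plane resolution and slice thickness, e.g. `(1.25, 1.25, slice_thickness)`; a rigid rotation of the grid is a useful sanity check, since it should give zero strain.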
Fig. 1. Overview of proposed DeepStrain workflow. VCN centers and crops the input pair of cine-MRI frames. Tissue labels generated by CarSON are used to build an anatomical model. Motion estimates derived from CarMEN are used to calculate strain measures, and these estimates are combined with the anatomical model to enable global and regional strain analyses.

2) Loss Functions

VCN was evaluated using the mean square error

$$\mathcal{L}_{MSE}(G_{gt}, G_{pred}) = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left(G_{gt}(p) - G_{pred}(p)\right)^{2}. \qquad (2)$$

For CarSON, we implemented a multi-class Dice coefficient function

$$\mathcal{L}_{seg}(M_{gt}, M_{pred}) = -\ldots \qquad (3)$$

… $\mathcal{L}_{intensity}$ that evaluates CarMEN using the input volumes and generated motion estimates

$$\mathcal{L}_{intensity}(V_0, V_t, \mathbf{u}_t) = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left|V_0(p) - (\mathbf{u}_t \circ V_t)(p)\right|. \qquad (4)$$

Second, we used a supervised function $\mathcal{L}_{anatomical}$ that leverages segmentations of the input volumes at training time to impose an anatomical constraint on the estimates

$$\mathcal{L}_{anatomical}(M_0, M_t, \mathbf{u}_t) = \mathcal{L}_{seg}(M_0, \mathbf{u}_t \circ M_t). \qquad (5)$$

Third, smooth estimates were encouraged by using a diffusion regularizer

$$\mathcal{L}_{smoothness}(\mathbf{u}_t) = \sum_{p \in \Omega} \left\|\nabla \mathbf{u}_t(p) \cdot dr\right\|^{2}, \qquad (6)$$

where $dr$ is the spatial resolution of $V$ and accounts for differences between in-plane and slice resolution. Thus, the loss function for CarMEN is a linear combination of (4), (5), and (6), weighted by $\lambda_i$, $\lambda_a$, $\lambda_s$, respectively. We conducted optimization experiments using synthetic data [38], [39] to assess the impact of smoothing and anatomical regularization on motion and strain estimates (supplementary section S3).
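The way the three CarMEN terms (4) to (6) combine can be sketched in NumPy as follows. This is only a sketch under stated assumptions: the warped frame and warped labels are taken as already computed (the trilinear warping itself is not shown), the intensity term uses an absolute difference, the smoothness term uses a squared forward difference scaled by $dr$, `soft_dice` is a generic multi-class Dice form standing in for (3), and all function names and the placeholder weights are illustrative rather than taken from the DeepStrain code.

```python
import numpy as np


def intensity_loss(v0, vt_warped):
    """Mean absolute intensity difference between V_0 and the warped V_t."""
    return np.mean(np.abs(v0 - vt_warped))


def soft_dice(m_a, m_b, eps=1e-6):
    """Negative mean Dice over label channels (last axis); a generic form."""
    axes = tuple(range(m_a.ndim - 1))
    overlap = np.sum(m_a * m_b, axis=axes)
    total = np.sum(m_a, axis=axes) + np.sum(m_b, axis=axes)
    return -np.mean((2.0 * overlap + eps) / (total + eps))


def smoothness_loss(u, dr=(1.0, 1.0, 1.0)):
    """Sum over voxels of squared displacement differences scaled by dr.

    Scaling by dr is one reading of the 'grad u . dr' term (an assumption).
    """
    total = 0.0
    for axis, step in enumerate(dr):
        d = np.diff(u, axis=axis) * step   # forward difference along one axis
        total += np.sum(d ** 2)
    return total


def carmen_loss(v0, vt_warped, m0, mt_warped, u, w_i, w_a, w_s, dr=(1.0, 1.0, 1.0)):
    """Weighted sum of the intensity, anatomical and smoothness terms."""
    return (w_i * intensity_loss(v0, vt_warped)
            + w_a * soft_dice(m0, mt_warped)
            + w_s * smoothness_loss(u, dr))


# Hypothetical sanity check: identical frames, identical labels, zero motion.
v = np.random.default_rng(0).random((8, 8, 4))
m = np.eye(4)[np.zeros((8, 8, 4), dtype=int)]   # one-hot labels, 4 channels
u = np.zeros((8, 8, 4, 3))
loss = carmen_loss(v, v, m, m, u, w_i=1.0, w_a=1.0, w_s=1.0)  # placeholder weights
```

In this degenerate case the intensity and smoothness terms vanish and the Dice term is at its minimum, so the total reduces to the negative of the anatomical weight, which is the behaviour one would want from an [end-diastolic, end-diastolic] training pair.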
These experiments showed that smoothness improves the accuracy of the direction of the motion vectors, and that anatomical regularization improves the magnitude of the vectors relative to the ground-truth motion (see supplementary Fig. S1 and S2). The optimal values $\lambda_i = 0.01$, $\lambda_a = 0.5$, $\lambda_s = 0.1$ were used to train CarMEN.

3) Training and Testing

Networks were trained in TensorFlow ver. 2.0 using the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$), a batch size of 80 (5 for CarMEN), and 1000 epochs (300 for CarMEN). Ground-truth distributions for VCN were created using the manual segmentations. VCN and CarSON were trained using the end-diastolic and end-systolic frames of the train set, as only these included ground-truth segmentations. This provided 200 training samples for VCN and 3200 for CarSON, the latter having more samples since it is a 2D architecture and all frames were resampled to a volume with 16 slices. VCN was tested by five-fold cross-validation, whereas the accuracy of CarSON was assessed by submitting the results to the challenge website. Once CarSON was trained, we generated segmentations of the test set to train CarMEN using the entire ACDC dataset. Only the [end-diastolic, end-diastolic] and [end-diastolic, end-systolic] pairs were used. The former is essential for the network to adequately learn how to scale the motion vectors, i.e., motion should be exactly zero if the frames are equal. The entire cycle is analyzed at testing time by using sequential input pairs $[V_0, \cdot\,]$ that keep the end-diastolic frame constant while $V_t$ is varied over all time frames t > 0. Using this approach $\mathbf{u}_t$ was derived for all times. Data augmentation included random rotations and translations, random mirroring along the x and y axes, and gamma contrast correction. All data augmentation was performed only in the x-y plane.

D.
Evaluation

1) Segmentation and Motion Estimation

CarSON and manual segmentations were compared using the Hausdorff distance (HD) and Dice Similarity Coefficient (DSC) metrics at both end-diastole and end-systole. Accuracy of LV volumetric measures derived from segmentations, including end-diastolic volume (EDV), EF, and LVM, was assessed using the correlation, bias, and standard deviation metrics. The mean absolute error (MAE) for the LV EDV and LVM were also computed for comparison against the intra- and inter-observer variability reported by [31]. We compared our results to top-3 ranked methods published for the ACDC test set as these appear in the leader-board of the challenge [17]–[20].

The CMAC organizers defined 12 landmarks at the intersection of gridded tagging lines at end-diastole on tagging images, one landmark $p_0$ per wall per ventricular level. These landmarks were manually-tracked by two observers over the cardiac cycle. Conversion from tagging to cine coordinates was done using DICOM header information. We used the CarMEN motion estimates $\mathbf{u}_t$ to automatically deform the landmarks at end-diastole, and the accuracy was assessed using the in-plane end-point error (EPE) between deformed $p'_t = \mathbf{u}_t \circ p_0$ and manually-tracked $p_t$ landmarks, defined by

$$EPE(p, p') = \sqrt{(p_x - p'_x)^{2} + (p_y - p'_y)^{2}}. \qquad (7)$$

Due to temporal misalignment between the tagging and cine acquisitions, EPE was evaluated only at end-systole ($t = t_{ES}$). Specifically, let $p_{ij}(t)$ denote the manually-tracked landmarks of subject $i$ at frame $t$ by observer $j$. The accuracy of CarMEN was assessed using the average EPE

$$AEPE = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{2} EPE\left(p_{ij}(t_{ES}),\ \mathbf{u}_i(t_{ES}) \circ p_0\right). \qquad (8)$$

Our results were compared to those reported by the four groups that responded to the challenge [32], MEVIS [40], IUCL [9], UPF [11], and INRIA [12], [41]. All groups submitted tagging-based motion estimates, but only UPF and INRIA provided estimates based on cine-MRI.
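The landmark error in (7) and (8) amounts to a few lines of NumPy. In the sketch below, the array layout (subjects, observers, landmarks, 2), the function names, and the toy coordinates are assumptions for illustration; averaging over landmarks as well as over subjects and observers is also an assumption, since (8) does not show the landmark index explicitly.

```python
import numpy as np


def epe(p, p_hat):
    """In-plane end-point error: Euclidean distance over the last (x, y) axis."""
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(p_hat)) ** 2, axis=-1))


def aepe(tracked, deformed):
    """Average EPE at end-systole.

    tracked:  (n_subjects, n_observers, n_landmarks, 2) manual landmarks.
    deformed: (n_subjects, n_landmarks, 2) landmarks moved by the motion estimate.
    """
    deformed = np.asarray(deformed)[:, None, :, :]   # broadcast over observers
    return float(np.mean(epe(tracked, deformed)))


# Toy numbers (not from the paper): 1 subject, 2 observers, 2 landmarks, in mm.
tracked = np.array([[[[10.0, 10.0], [20.0, 20.0]],
                     [[10.0, 12.0], [20.0, 20.0]]]])
deformed = np.array([[[13.0, 14.0], [20.0, 20.0]]])
print(aepe(tracked, deformed))
```

Keeping the observer axis explicit makes it straightforward to compute the inter-observer distance with the same `epe` helper, which is the natural yardstick for judging an automated AEPE.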
2) Strain Validation and Intra-Scanner Repeatability
The tagging-MRI method with the lowest AEPE was used as the reference for strain analysis. The tagging-MRI-based motion estimates were registered and resampled to the cine-MRI space. Global strain and SR values throughout the entire cardiac cycle were derived from the resampled estimates as described in [42]. Global and regional analyses were performed to assess the repeatability of measures from two acquisitions. Relative changes (RC) and absolute relative changes (aRC) were calculated, taking the first acquisition as the reference. ESS and SR were calculated for the global analysis; for the regional analyses, ESS values were normalized using the AHA polar map, and both RC and aRC were evaluated for each of the segments in the polar map.

3) Statistics
Bland-Altman analysis was used to quantify agreement between predicted and tagging strain measures. We used the term bias to denote the mean difference and the term precision to denote the standard deviation of the differences. Differences were also assessed using a paired t-test with Bonferroni correction for multiple comparisons. For global and regional analyses of intra-scanner repeatability, ICC estimates and their 95% confidence intervals (CI) were calculated based on a single-rating, absolute agreement, 2-way mixed-effects model. Analyses were performed in Python v3.4 [43].

.CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425266doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425266 http://creativecommons.org/licenses/by-nc-nd/4.0/

III. RESULTS
A.
Segmentation and Motion Estimation
Centering, segmentation, and motion estimation for an entire cardiac cycle (~25 frames) was accomplished in <13 s on a 12 GB GPU and in <2.2 min on a CPU with 32 GB of RAM. VCN located the LV center of mass with a median error of 1.3 mm. Correlation of CarSON and manual LV volumetric measures was >0.98 across all measures (Table 1), and biases in EF (+0.25 ± 3.2%), end-diastolic (+0.76 ± 6.7 mL) and end-systolic (+0.19 ± 5.8 mL) volumes, and mass (+1.4 ± 10.3 g) were not significant. Further, these biases were smaller than those obtained with other methods, which were positive for LV EDV (1.5 to 3.7 mL), negative for LVM (-2.1 to -2.9 g), and close to zero (±0.5%) for EF. Simantiris et al. [17] obtained the best precision for LV EF (2.7 vs. 3.2% with CarSON), EDV (4.6 vs. 6.7 mL), and LVM (6.5 vs. 10.3 g). Isensee et al. [18] obtained the best results on geometric metrics, i.e., lower HD for the LV (end-diastole 5.5 vs. 5.7 mm; end-systole 6.9 vs. 7.7 mm) and LVM (7.0 vs. 8.1 mm; 7.3 vs. 9.2 mm), and higher DSC for the LVM (0.904 vs. 0.898; 0.923 vs. 0.913). The DSC for the LV was similar for all methods (~0.967 at end-diastole, ~0.929 at end-systole). The MAE for the LV EDV and LVM was 5.3 ± 4.1 mL and 6.8 ± 6.5 g, respectively.

Fig. 2a illustrates a representative example of the tagging and cine images from a CMAC subject. Landmarks defined at end-diastole were deformed to end-systole using the CarMEN estimates and compared to manual tracking. Banding artifacts on cine images showed no clear effect on derived motion estimates or landmark deformation, as shown at end-systole (Fig. 2a, yellow arrow) or throughout the whole cardiac cycle (see supplementary video). The manual tracking inter-observer variability was 0.86 mm (Fig. 2b, dotted line). Within cine-based techniques, CarMEN (2.89 ± 1.52 mm) and UPF (2.94 ± 1.64 mm) had lower (p<0.001) AEPE relative to INRIA (3.78 ± 2.08 mm), but there was no significant difference between CarMEN and UPF. All tagging-based methods had lower AEPE compared to cine approaches, particularly MEVIS (1.58 ± 1.45 mm) and UPF (1.65 ± 1.45 mm).

TABLE I STATE-OF-THE-ART METHODS FOR LEFT-VENTRICULAR SEGMENTATION SHOWN AT END-DIASTOLE (ED) AND END-SYSTOLE (ES) ON THE ACDC TEST SET COMPARED TO PROPOSED APPROACH. RED ARE THE BEST RESULTS FOR EACH METRIC.

Fig. 2. Validation of motion and strain. (a) Landmarks at end-diastole (unfilled green) are manually tracked (green) and deformed with CarMEN to end-systole (red). Yellow arrow indicates a banding artifact. (b) Average end-point error (AEPE) was assessed and compared to other methods. (c) MEVIS- and DeepStrain-based strain (top) and strain rate (SR, bottom) measures are compared. Black arrow shows strain inaccuracies with MEVIS.

B. Strain Analysis
Table 2 shows the normal ranges (mean [95% CI]) of strain derived from cine-MRI data for all healthy subjects, including subjects from the training, validation, and repeatability cohorts. DeepStrain generated values with narrow CI for circumferential (~1%) and radial (~2%) ESS, and for circumferential (~0.15 s⁻¹) and radial (~0.25 s⁻¹) SR. Specifically, circumferential and radial values across datasets were, respectively: -16.9% [-17.6, -16.3] and 22.6% [21.4, 23.8] for ESS, -1.12 s⁻¹ [-1.19, -1.05] and 1.30 s⁻¹ [1.20, 1.40] for SRs, and 0.76 s⁻¹ [0.69, 0.83] and -1.38 s⁻¹ [-1.51, -1.24] for SRe. These values were similar to the tagging-based ones, although circumferential SRe from cine-MRI data was lower, mostly in the train set (0.7 ± 0.2 s⁻¹).
Comparison of tagging- and cine-based strain measures with matched subjects showed an overall agreement in timing and magnitude of strain and SR throughout the cardiac cycle, although tagging-based measures of radial ESS diverged after early diastole (Fig. 2c, black arrow), and there were visual differences in peak SR parameters. Visual inspection of image artifacts on cine data showed no clear evidence that these artifacts affected strain values derived with DeepStrain (see supplementary Fig. S3). Quantitative comparisons of tagging- and cine-based measures showed that biases in circumferential ESS (-14.2 ± 2.2 vs. -15.3 ± 1.5%; bias -1.17 ± 2.93%), radial ESS (18.4 ± 5.1 vs. 19.7 ± 3.4%; +1.26 ± 5.37%) and radial SRe (-1.2 ± 0.5 vs. -1.4 ± 0.3 s⁻¹; -0.21 ± 0.52 s⁻¹) were small and not significantly different from zero (see supplementary Fig. S4). However, there were larger differences (p<0.01) in radial SRs (1.0 ± 0.2 vs. 1.3 ± 0.2 s⁻¹; 0.32 ± 0.34 s⁻¹), and in circumferential SRs (-0.9 ± 0.1 vs. -1.2 ± 0.2 s⁻¹; 0.30 ± 0.22 s⁻¹) and SRe (1.2 ± 0.2 vs. 0.8 ± 0.1 s⁻¹; 0.40 ± 0.23 s⁻¹).

Representative strain measures of a single subject derived from two acquisitions are shown in Fig. 3. The AHA polar maps from both acquisitions showed comparable regional variations in ESS, particularly for circumferential ESS in the inferoseptal wall (Fig. 3a, orange arrows). Global curves throughout the entire cardiac cycle also showed visual agreement in both timing and magnitude (Fig. 3b). From these data, the global parameters were also found to be similar between acquisitions: circumferential (-14.1 vs. -14.3%) and radial (17.9 vs. 17%) ESS (Fig. 3b, purple), circumferential SRs (0.95 vs. 0.90 s⁻¹) and SRe (-0.74 vs. -0.82 s⁻¹), and radial SRs (1.03 vs. 1.08 s⁻¹) and SRe (-1.12 vs. -1.11 s⁻¹) (Fig. 3b, yellow). In addition, while not quantified in this study, the late-diastolic filling peaks were also comparable (Fig. 3b, blue). Table 3 shows the RC, aRC, ICC, and LoA across subjects for the global parameters.
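For reference, the relative change (RC) and absolute relative change (aRC) between two acquisitions can be sketched as below, using the representative ESS values of the single subject quoted above. The exact formula is not spelled out in the text; this sketch assumes RC = 100 × (second − first) / first, with the first acquisition as the reference, and the function name `relative_change` is hypothetical.

```python
import numpy as np


def relative_change(acq1, acq2):
    """Relative change in percent, taking the first acquisition as the reference.

    Assumption: RC = 100 * (acq2 - acq1) / acq1 (the formula is not given in the text).
    """
    acq1 = np.asarray(acq1, dtype=float)
    acq2 = np.asarray(acq2, dtype=float)
    return 100.0 * (acq2 - acq1) / acq1


# Representative subject: circumferential and radial ESS (%) from two acquisitions.
ess_acq1 = [-14.1, 17.9]
ess_acq2 = [-14.3, 17.0]

rc = relative_change(ess_acq1, ess_acq2)  # signed relative change
arc = np.abs(rc)                          # absolute relative change
```

Under this assumption, the circumferential ESS of the representative subject changes by about +1.4% and the radial ESS by about -5.0% between acquisitions.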
The average aRC was below 5% for ESS (circumferential: 3.1 ± 1.8%; radial: 4.3 ± 3.4%), below 7% for SRs (5.7 ± 4.4%; 6.9 ± 10.4%), and below 11% for SRe parameters (10.2 ± 8.8%; 3.8 ± 3.1%). ICC results showed repeatability was excellent for ESS (0.954; 0.968), good for SRs (0.889; 0.754), moderate for circumferential SRe (0.690), and excellent for radial SRe (0.963) values. The limits of agreement (LoA), which define the interval expected to contain 95% of the differences assuming normally distributed data, were ~1% and ~4% for circumferential and radial ESS, and <0.05 s⁻¹ for all SR measures.

The ESS, RC, and aRC maps averaged across subjects are shown in Fig. 4. Visually, these maps (Fig. 4b) showed the average RC and aRC were marginal (~1%) in more than half of the polar map segments. Specifically, values were marginal for circumferential ESS (~1%) in the anterior, anteroseptal, and anterolateral walls, but were larger in the inferior region, most notably in the basal- and mid-inferoseptal segments (7%). For radial ESS the largest changes were found in the mid-anterolateral segment (6%), whereas changes in the anteroseptal, inferior and inferolateral walls were very small (~1%). The RC and aRC per subject are provided in boxplot form in supplementary Fig. S5. These results showed that, in most of the segments, the RC and aRC were less than ~10%, although larger differences were noted in the inferoseptal wall for radial ESS, and in the anterolateral wall for circumferential ESS. Supplementary Table S1 shows the ICC and LoA per segment, including the whole-map average. For radial ESS, the ICC results showed excellent repeatability across all segments. For circumferential ESS, all segments showed good to excellent repeatability, except for the basal- and mid-inferolateral segments. LoAs showed that 95% of differences occurred within ~3% and ~4% intervals for circumferential and radial ESS.
C.
Evaluation in Patients with Cardiovascular Disease
Regional measures of ESS averaged over each patient population (see supplementary Fig. S6), as well as global values of strain and SR across the cardiac cycle (Fig. 5) for all 100 subjects in the ACDC train set, showed a progressive decline in strain values starting with HCM, followed by ARV, MI, and DCM. Specifically, relative to the healthy group, radial ESS was reduced in all patient populations. Radial systolic and early-diastolic SR were also reduced in all patient groups, except for systolic SR in HCM. Fig. 6 shows both the cine-MRI image and the circumferential ESS polar map of a healthy subject and two patients with MI. Strain values in the healthy polar map have a homogeneous distribution. In contrast, in one MI patient the map indicates a diffuse reduction, and inspection of the myocardium on the cine-MRI image shows an anteroseptal infarct that coincides in location with the segments showing more prominent decreases in strain. In a different MI patient with an infarct located in a similar septal region, strain changes are focal and localized to the anteroseptal wall.

TABLE II NORMAL RANGES OF STRAIN WITH DEEPSTRAIN IN HEALTHY SUBJECTS. TAGGING-BASED MEASURES ARE SHOWN FOR THE CMAC COHORT. DEEPSTRAIN REPEATABILITY IS SHOWN FOR TWO ACQUISITIONS (ACQ).

TABLE III INTRA-SCANNER REPEATABILITY OF GLOBAL CIRCUMFERENTIAL (CIRC) AND RADIAL (RAD) STRAIN MEASURES.

IV.
DISCUSSION
Learning-based methodologies have the potential to meet the technical challenges associated with myocardial strain analysis. In this study we developed a fast DL framework for strain analysis based on cine-MRI data that does not make assumptions about the underlying physiology, and we benchmarked its segmentation, motion, and strain estimation components against the state-of-the-art. We compared our segmentations to other DL methods, motion estimates to other non-learning techniques, and strain measures to a reference tagging-MRI technique. We also presented the intra-scanner repeatability of DeepStrain-based global and regional strain measures, and showed that these measures were robust to image artifacts in some cases. Global and regional applications were also presented to demonstrate the potential clinical utilization of our approach.

A. Volumetric Measures
Segmentation from MRI data is a task particularly well suited for convolutional networks given the excellent soft-tissue contrast, thus all top-performing methods on the ACDC test set were based on DL approaches (Table 1). Isensee et al. [18] had remarkable success on geometric metrics, but this and other approaches result in a systematic overestimation of the LV EDV and thus underestimation of LVM. In contrast, CarSON generated less biased measures of LV volumes and mass, with biases that were not significant. Although Simantiris et al. [17] obtained the most precise measures, possibly due to their extensive use of augmentation using image intensity transformations, across methods the precision of EF was within the ~3-5% [46] needed when it is used as an index of LV function in clinical trials [47]. Lastly, we showed that the error in our measures of LV EDV and LVM was almost half the inter-observer MAE (~10.6 mL, 12.0 g), and comparable to the intra-observer MAE (~4.6 mL, 6.2 g) reported in [31], but further investigations are required to assess the performance on more heterogeneous populations.
B.
Strain Measures
The application of myocardial strain to quantify abnormal deformation in disease requires accurate definition of normal ranges. However, previously reported normal ranges vary largely between modalities and techniques, particularly for radial ESS [4]. In this study we showed DeepStrain generated strain measures with narrow CI in healthy subjects from across three different datasets (Table 2). Although direct comparison with the literature is difficult due to differences in the datasets, overall our strain measures agreed with several reported results. Specifically, circumferential strain is in agreement with studies in healthy participants based on tagging (-16.6%, n=129) and speckle tracking echocardiography (-18%, n=265) datasets [48], [49], as well as a recently proposed tagging-based DL method (-16.7% basal, n=386) [42]. Radial strain is in agreement with tagging-based (26.5%, n=129; 23.8% basal, n=386) studies [42], [48], but is lower than most reported values [4]. This is a result of the smoothing regularization used during training to prevent overfitting. However, lowering the regularization without increasing the size of the training set would lead to increased EPE and wider CI. SR measures derived with DeepStrain were also in good agreement with previous tagging-based studies [48].

Fig. 3. Global and regional strain measures of a representative subject. (a) Regional end-systolic strain measures show visual agreement (orange arrow). (b) Global strain and strain rate (SR) measures also show visual agreement.
The CMAC dataset enabled us to compare our results to non-learning methods using a common dataset. We found that AEPE was lower with tagging-based techniques, reflecting the advantage of estimating cardiac motion from a grid of intrinsic tissue markers (i.e., grid tagging lines). Further, the tagging techniques also benefited from the fact that landmarks were placed at the center of the ventricle, whereas motion estimation from tagging data at the myocardial borders and in thin-walled regions of the LV is less accurate due to the spatial resolution of the tagging grid [4]. In addition, some of the tagging-MRI images did not enclose the whole myocardium and some contained imaging artifacts, which resulted in strain artifacts towards the end of the cardiac cycle (Fig. 2c, black arrow). We found that MEVIS had the lowest AEPE, which could be a result of their image term (4) that penalizes phase shifts in the Fourier domain instead of intensity values, an approach that is less affected by desaturation (i.e., fading) of the tagging grid over time. The UPF approach also achieved a low AEPE using multimodal integration and 4D tracking to leverage the strengths of both modalities and improve temporal consistency [11]. Although this approach could in principle be recast as a DL technique using recurrent neural networks [50], this would require a significant increase in the number of learnable parameters, and therefore very large datasets would be needed to avoid overfitting. Using MEVIS as the tagging reference standard, we found no significant differences in measures of radial and circumferential ESS (Fig. 2c). Temporally, we found significant differences in SR measures between the two techniques that could be due to drift errors in the MEVIS implementation, i.e., errors that accumulate in sequential implementations in which motion is estimated frame-by-frame [32].
Although we did not observe considerable improvements in AEPE compared to tagging- and cine-based methods, an important advantage of our approach is the reduced computational complexity (~13 s on GPU) relative to the MEVIS (1-2 h), IUCL (3-6 h), UPF (6 h) and INRIA (5 h) approaches [32]. Specifically, because once trained our network does not optimize for a specific test subject (i.e., it does not iterate on the cine data to generate the desired output), centering, segmentation, and motion estimation for the entire cardiac cycle can be accomplished much faster (<2 min on CPU). An additional advantage of non-iterative implementations is that we obtain deterministic results. Since this implies the exact same motion estimates are generated given the same input, we expect strain measures not to vary meaningfully if the anatomy and function remain fixed. Here we studied this property by evaluating the intra-scanner repeatability, an important aspect to consider when assessing the potential clinical utility of DeepStrain. Global measures of ESS showed excellent repeatability with narrow LoAs and with absolute RCs of less than 5% on average, and regional analyses also showed the average RC and aRC was less than 1% in more than half of the polar map segments, with the maximum difference being 7%.

Fig. 4. Intra-scanner repeatability of regional myocardial strain measures. (a) Average of subject-specific regional end-systolic strain (ESS) maps during two acquisitions. (b) Average changes between acquisitions.
Finally, all SR measures showed good to excellent repeatability, except for circumferential SRe, which was moderate.

C. Clinical Evaluation
DeepStrain could be applied in a wide range of clinical applications, e.g., automated extraction of imaging phenotypes from large-scale databases (e.g., UK Biobank [51]). Such phenotypes include global and regional strain, which are important measures in the setting of existing dysfunction with preserved EF [3]. DeepStrain generated measures of global strain and SR over the entire cardiac cycle from a cohort of 100 subjects in <2 min (Fig. 5). These results showed that radial SRe was reduced in patients with HCM and ARV, despite having a normal or increased LV EF. Decreased SRe with normal EF is suggestive of subclinical LV diastolic dysfunction, which is in agreement with previous findings [52], [53]. Our results also showed DeepStrain-based maps could be used to characterize regional differences between groups (supplementary Fig. S6). At an individual level (Fig. 6), we showed that in MI patients, polar segments with decreased circumferential strain matched myocardial regions with infarcted tissue. Further, we showed that the changes in regional strain due to MI can be both diffuse and focal. These abnormalities could be used to discriminate dysfunctional from functional myocardium [54], or as inputs for downstream classification algorithms [55]. More generally, DeepStrain could be used to extract interpretable features (e.g., strain and SR) for DL diagnostic algorithms [56], which would make understanding of the pathophysiological basis of classification more attainable [57].

D. Study Limitations
A limitation of our study was the absence of important patient information (e.g., age), which would be needed for a more complete interpretation of our strain analysis results, for example to assess the differences in strain values found between the healthy subjects from the ACDC and CMAC datasets.
However, using publicly available data enables the scientific community to more easily reproduce our findings, and compare our results to other techniques. Another limitation was the absence of longitudinal analyses, i.e., longitudinal strain was not reported because it is normally derived from long-axis cine-MRI data not available in the training dataset. The size of the datasets is another potential limitation. The number of patients used for training is much smaller than the number of trainable parameters, potentially resulting in some degree of overfitting. To correct this, the training set for motion estimation could be expanded by validating the proposed segmentation network on more heterogeneous populations. Also, while our repeatability results were promising despite testing in only a small number of subjects, repeatability in patient populations was not shown.

E. Conclusion
We developed an end-to-end learning-based workflow for strain analysis that is fast, operator-independent, and leverages real-world data instead of making explicit assumptions about myocardial tissue properties or geometry. This approach enabled us to derive strain measures from new data without further training or parameter fine-tuning, and our measures were robust to image artifacts, repeatable, and comparable to those derived from dedicated tagging data. These technical and practical attributes position DeepStrain as an excellent candidate for use in routine clinical studies or data-driven research.

ACKNOWLEDGMENT
We acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. We also thank P. Jodoin (ACDC) and C. Tobon-Gomez (CMAC) for their assistance with the challenge datasets.

REFERENCES
[1] M. A. Konstam and F. M. Abboud, “Ejection Fraction: Misunderstood and Over-rated (Changing the Paradigm in Categorizing Heart Failure),” Circulation, vol. 135, no. 8, pp. 717–719, Feb. 2017. [2] P. Claus, A. M. S. Omar, G. Pedrizzetti, P.
P. Sengupta, and E. Nagel, “Tissue Tracking Technology for Assessing Cardiac Mechanics,” JACC: Cardiovascular Imaging, vol. 8, no. 12, pp. 1444–1460, Dec. 2015. [3] O. A. Smiseth, H. Torp, A. Opdahl, K. H. Haugaa, and S. Urheim, “Myocardial strain imaging: how useful is it in clinical decision making?,” Eur Heart J, vol. 37, no. 15, pp. 1196–1207, Apr. 2016. [4] M. S. Amzulescu, M. De Craene, H. Langet, A. Pasquet, D. Vancraeynest, A. C. Pouleur, J. L. Vanoverschelde, and B. L. Gerber, “Myocardial strain imaging: review of general principles, validation, and sources of discrepancies,” European Heart Journal - Cardiovascular Imaging, Mar. 2019.

Fig. 5. Strain and strain rate measures computed on the ACDC train set.

Fig. 6. Regional strain in a healthy subject and in patients with MI. Myocardial infarction can result in diffuse (center) and focal (right) strain reduction.

[5] N. F. Osman, S. Sampath, E. Atalar, and J. L. Prince, “Imaging longitudinal cardiac strain on short-axis images using strain-encoded MRI,” Magn. Reson. Med., vol. 46, no. 2, pp. 324–334, Aug. 2001. [6] D. Kim, W. D. Gilson, C. M. Kramer, and F. H. Epstein, “Myocardial Tissue Tracking with Two-dimensional Cine Displacement-encoded MR Imaging: Development and Initial Evaluation,” Radiology, vol. 230, no. 3, pp. 862–871, Mar. 2004. [7] N. Risum, S. Ali, N. T. Olsen, C. Jons, M. G. Khouri, T. K. Lauridsen, Z. Samad, E. J. Velazquez, P. Sogaard, and J.
Kisslo, “Variability of Global Left Ventricular Deformation Analysis Using Vendor Dependent and Independent Two-Dimensional Speckle-Tracking Software in Adults,” Journal of the American Society of Echocardiography, vol. 25, no. 11, pp. 1195–1203, Nov. 2012. [8] A. Schuster, V.-C. Stahnke, C. Unterberg-Buchwald, J. T. Kowallick, P. Lamata, M. Steinmetz, S. Kutty, M. Fasshauer, W. Staab, J. M. Sohns, B. Bigalke, C. Ritter, G. Hasenfuß, P. Beerbaum, and J. Lotz, “Cardiovascular magnetic resonance feature-tracking assessment of myocardial mechanics: Intervendor agreement and considerations regarding reproducibility,” Clinical Radiology, vol. 70, no. 9, pp. 989– 998, Sep. 2015. [9] Wenzhe Shi, Xiahai Zhuang, Haiyan Wang, S. Duckett, D. V. N. Luong, C. Tobon-Gomez, KaiPin Tung, P. J. Edwards, K. S. Rhode, R. S. Razavi, S. Ourselin, and D. Rueckert, “A Comprehensive Cardiac Motion Estimation Framework Using Both Untagged and 3-D Tagged MR Images Based on Nonrigid Registration,” IEEE Trans. Med. Imaging, vol. 31, no. 6, pp. 1263–1275, Jun. 2012. [10] G. Pedrizzetti, P. Claus, P. J. Kilner, and E. Nagel, “Principles of cardiovascular magnetic resonance feature tracking and echocardiographic speckle tracking for informed clinical use,” Journal of Cardiovascular Magnetic Resonance, vol. 18, no. 1, p. 51, Dec. 2016. [11] M. De Craene, G. Piella, O. Camara, N. Duchateau, E. Silva, A. Doltra, J. D’hooge, J. Brugada, M. Sitges, and A. F. Frangi, “Temporal diffeomorphic free-form deformation: Application to motion and strain estimation from 3D echocardiography,” Medical Image Analysis, vol. 16, no. 2, pp. 427–450, Feb. 2012. [12] T. Mansi, X. Pennec, M. Sermesant, H. Delingette, and N. Ayache, “iLogDemons: A Demons-Based Registration Algorithm for Tracking Incompressible Elastic Biological Tissues,” Int J Comput Vis, vol. 92, no. 1, pp. 92–111, Mar. 2011. [13] R. Avazmohammadi, J. S. Soares, D. S. Li, T. Eperjesi, J. Pilla, R. C. Gorman, and M. S. 
Sacks, “On the in vivo systolic compressibility of left ventricular free wall myocardium in the normal and infarcted heart,” Journal of Biomechanics, vol. 107, p. 109767, Jun. 2020. [14] V. Kumar, A. J. Ryu, A. Manduca, C. Rao, R. J. Gibbons, B. J. Gersh, K. Chandrasekaran, S. J. Asirvatham, P. A. Araoz, J. K. Oh, A. C. Egbe, A. Behfar, B. A. Borlaug, and N. S. Anavekar, “Cardiac MRI demonstrates compressibility in healthy myocardium but not in myocardium with reduced ejection fraction,” International Journal of Cardiology, vol. 322, pp. 278–283, Jan. 2021. [15] B. Zhu, J. Z. Liu, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain transform manifold learning,” arXiv:1704.08841 [cs], Apr. 2017. [16] P. Dong, B. Provencher, N. Basim, N. Piché, and M. Marsh, “Forget About Cleaning up Your Micrographs: Deep Learning Segmentation Is Robust to Image Artifacts,” Microsc Microanal, pp. 1–2, Jul. 2020. [17] G. Simantiris and G. Tziritas, “Cardiac MRI Segmentation With a Dilated CNN Incorporating Domain-Specific Constraints,” IEEE J. Sel. Top. Signal Process., vol. 14, no. 6, pp. 1235–1243, Oct. 2020. [18] F. Isensee, P. Jaeger, P. M. Full, I. Wolf, S. Engelhardt, and K. H. Maier-Hein, “Automatic Cardiac Disease Assessment on cine-MRI via Time-Series Segmentation and Domain Specific Features,” arXiv:1707.00587 [cs], vol. 10663, 2018. [19] C. Zotti, Z. Luo, A. Lalande, and P.-M. Jodoin, “Convolutional Neural Network With Shape Prior Applied to Cardiac MRI Segmentation,” IEEE J. Biomed. Health Inform., vol. 23, no. 3, pp. 1119–1128, May 2019. [20] M. Baldeon Calisto and S. K. Lai-Yuen, “AdaEn-Net: An ensemble of adaptive 2D–3D Fully Convolutional Networks for medical image segmentation,” Neural Networks, vol. 126, pp. 76–94, Jun. 2020. [21] K. Hammouda, F. Khalifa, H. Abdeltawab, A. Elnakib, G. A. Giridharan, M. Zhu, C. K. Ng, S. Dassanayaka, M. Kong, H. E. Darwish, T. M. A. Mohamed, S. P. Jones, and A. 
10_1101-2021_01_05_425384 ---- Covid19Risk.ai: An open source repository and online calculator of prediction models for early diagnosis and prognosis of Covid-19

Iva Halilaj1,2, Avishek Chatterjee1, Yvonka van Wijk1, Guangyao Wu1, Brice van Eeckhout3, Cary Oberije1, Philippe Lambin1

1The D-Lab, Department of Precision Medicine, GROW- School for Oncology, Maastricht University, Maastricht, The Netherlands
2Health Innovation Ventures, Maastricht, The Netherlands
3Medical Cloud Company, Belgium
*Correspondence: philippe.lambin@maastrichtuniversity.nl

Abstract

Objective
The current pandemic has led to a proliferation of predictive models being developed to address various aspects of COVID-19 patient care. We aimed to develop an online platform that would serve as an open source repository for a curated subset of such models, and provide a simple interface for included models to allow for online calculation. This platform would support doctors during decision-making regarding diagnoses, prognoses, and follow-up of COVID-19 patients, expediting the models’ transition from research to clinical practice.

Methods
In this proof-of-principle study, we performed a literature search in PubMed and WHO database to find suitable models for implementation on our platform.
All selected models were publicly available (peer-reviewed publications or open source repositories) and had been validated (TRIPOD type 3 or 2b). We created a method for obtaining the regression coefficients if only the nomogram was available in the original publication. All predictive models were transcribed on a practical graphical user interface using PHP 8.0.0, and published online together with supporting documentation and links to the associated articles.

Results
The open source website https://covid19risk.ai/ currently incorporates nine models from six different research groups, evaluated on datasets from different countries. The website will continue to be populated with other models related to COVID-19 prediction as these become available. This dynamic platform allows COVID-19 researchers to contact us to have their model curated and included on our website, thereby increasing the reach and real-world impact of their work.

Conclusion
We have successfully demonstrated in this proof-of-principle study that our website provides an inclusive platform for predictive models related to COVID-19. It enables doctors to supplement their judgment with patient-specific predictions from externally-validated models in a user-friendly format. Additionally, this platform supports researchers in showcasing their work, which will increase the visibility and use of their models.

Keywords: Covid-19, predictive models, diagnosis, prognosis, nomogram, machine learning.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.05.425384; this version posted January 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).
Introduction
The recent COVID-19 pandemic, at its start, emphasized several key unmet needs in terms of patient stratification using quantifiable metrics [1]. These include (a) identifying, in the uninfected population, at-risk persons who should be subjected to stricter restrictions than the general population [2], and (b) in the infected population, improving the detection of high-risk patients by utilizing all available patient data (e.g., clinical, laboratory, genetic, and radiological features) so as to improve quality of care and use of hospital resources [3][4]. Now, with several vaccines emerging, there is another compelling reason for identifying those who are most at risk and should therefore receive the vaccines first [5–7].

Ideally, one should address the above needs using quantitative tools that (a) help people at home decide (in consultation with their doctor) whether their health status warrants self-quarantine, and whether their symptoms (if present) indicate the need for visiting the hospital, and (b) help doctors during triage decide if a patient should be sent home, hospitalized in a ward, or admitted to intensive care [8]. Quantifying these probabilities can be done by using predictive machine learning models.

Currently, COVID-19 publications regarding such models are booming. Numerous studies are being published from multiple countries, all using different inclusion criteria and outcome measures [4]. This heavily complicates the selection of the optimal model for a specific patient [9]. In addition, the quality of the research is sometimes suboptimal, as a recent review paper has shown [4].

We, as researchers working on COVID-19 models, saw an urgent need for a web-based platform that would serve as an open source repository for validated models. Such a platform would allow the user to have a quick overview of the strengths and weaknesses of the curated models that passed our quality checks.
The platform would also allow the user to calculate the output of such models by simply providing the inputs in a user-friendly format, rather than creating their own implementation or conducting their own search to find a suitable implementation.

Our aim for this platform is to include validated prediction models (TRIPOD type 2b and 3) [10], acquired from institutions all over the world, related to all aspects of the disease, including risk assessment of being infected, triage at hospital admission, prediction of recovery process during follow-up, and patient inclusion and stratification in clinical trials. We aim to be inclusive, and thus models that are outside the scope of risk assessment and patient stratification are still within the purview of the platform, e.g., diagnostic models. We believe it will be of interest to doctors who want to leverage the results of all the great research that is taking place, and it will also benefit researchers in dissemination of their own work and in learning about the findings of other groups.

The proof-of-concept of such a platform forms the basis of this paper. We intend to maintain this platform as a public service, and increase the number of curated models by encouraging other researchers to share their work through our platform.
The benefits to them include (a) helping the researchers to generalize their models by allowing the models to be tested by research groups that are different from the ones that created the model (TRIPOD 4), and (b) improved visibility of their model, which should stimulate usage and citations [11].

Methods
We reviewed the PubMed database of the National Center for Biotechnology Information (NCBI) and the World Health Organization (WHO) database for COVID-19 publications from December 2019 to June 2020. To find publications relevant to our focus, we used the following search terms: "COVID 2019 prognostic models", "novel coronavirus 2019 diagnostic tools", "COVID-19 predictive models", and "machine-learning COVID 19 models". The steps that we followed from the literature search until the final stage of publishing online are shown in Figure 1.

Figure 1 - The workflow from selecting suitable models to the final phase of publishing them online.

In order to assess the reporting quality of the models from the studies, we tested each paper for its compliance with the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) reporting guideline as shown in Figure 2 [10,12].

Figure 2 - TRIPOD types classifications [10].

Getting model coefficients from a Nomogram
In order to improve readability and interpretability by medical specialists, regression models are often published as nomograms, without the model coefficients.
To publish the models in a consistent manner on our platform, we used a simple method to extract the coefficients from nomograms. This method is explained using an example taken from one of the implemented models [3], and shown below in Figure 3. The first step was to determine the relationship between the parameter and the nomogram score, which was done by reading the nomogram, as shown in Table 1.

Figure 3 - Nomogram published in https://doi.org/10.1101/2020.04.03.20052068

Table 1 - Point reading for the nomogram

| Parameter (Unit) | Equation Value | Points |
|---|---|---|
| Epidemiological history (yes/no, Boolean) | 1 | 9.32 |
| | 0 | 0.00 |
| Wedge/fan-shaped lesion (yes/no, Boolean) | 1 | 10.00 |
| | 0 | 0.00 |
| Bilateral lower lobes (yes/no, Boolean) | 1 | 8.82 |
| | 0 | 0.00 |
| Ground glass opacities (yes/no, Boolean) | 1 | 3.04 |
| | 0 | 0.00 |
| Crazy paving pattern (yes/no, Boolean) | 1 | 2.10 |
| | 0 | 0.00 |
| WBC (<4*10^9/L, Boolean) | 1 | 0.63 |
| | 0 | 0.00 |

The relationship between the parameters and the nomogram score $P_{total}$ is described by the following equation:

$$P_{total} = x_1 \cdot 9.32 + x_2 \cdot 10 + x_3 \cdot 8.82 + x_4 \cdot 3.04 + x_5 \cdot 2.10 + x_6 \cdot 0.63$$

The next step is to determine the relationship between the nomogram score and the probability through the regression equation. A logistic regression model follows the equation:

$$\mathrm{logit}(p) = \beta_0 + \sum_{n} \beta_n \cdot x_n$$

The logit of the probability and the nomogram score should have a linear relationship. The slope of this line was used to determine the values of the coefficients, and its intercept gave the intercept of the model (Figure 4).

Figure 4 - Logit(P) plotted against the nomogram score

For this example the regression coefficients are shown in Table 2.

Table 2 - Coefficients and intercept extracted from nomogram.

| Parameter | Coefficient |
|---|---|
| Epidemiological history | 0.93 |
| Wedge/fan-shaped lesion | 0.64 |
| Bilateral lower lobes | 0.19 |
| Ground glass opacities | 0.93 |
| Crazy paving pattern | 0.64 |
| WBC | 0.19 |
| Intercept | -4.23 |

All the models are written in PHP 8.0.0. For regression models we set the coefficients and variables in the PHP syntax, so that all models operate identically. For the frontend we used HTML, CSS, and, for some specific functionalities, JavaScript. The backend of this platform is PHP based and the database is MySQL.

Results
We have created an open source website (https://covid19risk.ai/) to serve as an archive for published AI prediction models related to all aspects of COVID-19, including diagnosis, theragnosis (how to treat the patient, risk stratification), and follow-up (treatment response and complication). Currently there are nine models implemented and published as illustrated in Table 3.
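To make the nomogram procedure from the Methods concrete, a minimal sketch is given here. It is written in Python purely for illustration (the platform itself is implemented in PHP) and is not the authors' code. It relies on the two relationships stated in the Methods: since logit(p) is linear in the total score, logit(p) = β0 + s · P_total, and since P_total is the sum of the points of the predictors, each coefficient is β_n = s · points_n. The points are taken from Table 1. The axis readings, the helper names `logit` and `fit_line`, and all resulting numbers are hypothetical and do not reproduce Table 2.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Points awarded when each Boolean predictor equals 1 (Table 1).
points = {
    "epidemiological_history": 9.32,
    "wedge_or_fan_shaped_lesion": 10.00,
    "bilateral_lower_lobes": 8.82,
    "ground_glass_opacities": 3.04,
    "crazy_paving_pattern": 2.10,
    "wbc_below_4e9_per_l": 0.63,
}

# HYPOTHETICAL (total points, probability) pairs, as one might read them off
# the two bottom axes of a nomogram. Not taken from the cited publication.
axis_readings = [(5.3, 0.05), (12.0, 0.20), (18.0, 0.50), (24.0, 0.80), (30.7, 0.95)]

totals = [total for total, _ in axis_readings]
logits = [logit(p) for _, p in axis_readings]

# Fit logit(p) = intercept + slope * P_total, then rescale the points.
slope, intercept = fit_line(totals, logits)
coefficients = {name: slope * pts for name, pts in points.items()}

print(f"slope per point: {slope:.4f}, intercept: {intercept:.4f}")
for name, beta in coefficients.items():
    print(f"{name}: {beta:.3f}")
```

In this sketch the fitted intercept can be read as the model intercept only because every predictor contributes 0 points at its reference level. A nomogram whose axes start at non-zero points would need an additional offset.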
Every showcased model includes a description of the methodology and clinical datasets used for model development and validation, and the limitations of each model are stated explicitly.

| Model Nr | Input features | Output | Cohort type | TRIPOD type |
|---|---|---|---|---|
| Model 1 | Age, hospital staff (Yes/No) | Probability of severe illness [1] | Asymptomatic COVID positive patients | Type 2b |
| Model 2 | Age, hospital staff (Yes/No), body temperature, days since onset of symptoms | Probability of severe illness [1] | Symptomatic COVID positive patients | Type 2b |
| Model 3 | Age, CT lesion score (0 = no lung parenchyma involved, 1 = up to 5% of lung parenchyma involved, 2 = 5-25%, 3 = 26-50%, 4 = 51-75%, 5 = 76-100% of lung parenchyma involved; final CT score is a total score from five lobes) | Probability of severe illness [1] | COVID positive patients with semantic CT features | Type 2b |
| Model 4 | Age, Lymphocyte, C-reactive protein, Lactate dehydrogenase, Creatine kinase, Urea, Calcium | Probability of severe illness [1] | COVID-19 positive patients with blood test results | Type 3 |
| Model 5 | Signs of pneumonia on CT, History of close contact with Covid-19 confirmed patient (Yes/No), Fever, Age, Gender, Max temperature, Respiratory symptoms, Neutrophil-to-lymphocyte ratio | Probability of severe illness [2] | COVID-19 positive patients | Type 2b |
| Model 6 | Age, Direct bilirubin, Red blood cell distribution width, Blood urea nitrogen, C-reactive protein, Lactate dehydrogenase, Albumin | Probability of severe illness [13] | COVID-19 positive patients | Type 2b |
| Model 7 | Age, Sex, Diabetes, COPD or emphysema, or chronic Bronchitis, Asthma, Cystic fibrosis, Hypertension, Had a heart attack, Had a stroke, Coronary atherosclerosis or other heart disease, Congestive heart failure, Rheumatic kidney disease, Chronic kidney disease, Liver disease, Cancer, Neurocognitive conditions, Sickle cell anemia, HIV infection (Yes/No). Health history: Organ transplant, Hemodialysis treatment, pneumonia, acute bronchitis, influenza or other acute respiratory infection, pregnant last 2 weeks, hospital admission/emergency last year, height, weight. Symptoms: fever, shortness of breath, cough, fatigue, body aches, headache, diarrhea, sore throat, decreased smell and taste (Yes/No) | Predicts vulnerability score to serious illness from COVID-19 [14] | COVID-19 positive patients (vulnerable to developing serious complications) | Type 3 |
| Model 8 | Age, cardiovascular disease, diabetes, chronic respiratory disease, hypertension, cancer, prior stroke, heart disease, chronic kidney disease | Estimate mortality rates in patients with COVID-19 [15] | COVID-19 positive patients | Type 3 |
| Model 9 | Epidemiological history, wedge-shaped or fan-shaped lesion parallel or near to the pleura, bilateral lower lobes, ground glass opacities, crazy paving pattern, WBC | Probability of severe illness [3] | Suspected COVID-19 pneumonia patients | Type 2b |

Table 3 - For every model: input features, output, cohort type and TRIPOD type.

For each online model, doctors can find: a) the intended use (predicted outcome) of the model, b) the patients to whom the tool applies (particularly among individuals with preexisting medical conditions), c) the information and the parameters that need to be entered by the doctor, and d) how the tool was developed. Doctors can visit the website, choose an applicable model, and fill in the requested variables in order to generate a probability. The COVID-19 predictive models on the website use the same calculations as the models described in the scientific publications on which they are based.

The main result of our work is a broadly applicable platform, which includes validated models regarding different stages, symptoms and outcomes of COVID-19. This repository of COVID-19 predictive models will serve as a decision aid for doctors.
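The calculation behind a regression-based calculator of this kind can be illustrated with a short sketch. The Python function below (for illustration only; the platform itself is implemented in PHP, and this is not its code) evaluates the logistic regression equation logit(p) = β0 + Σ β_n · x_n for a set of user inputs and converts the result to a probability. The function name `predict_probability`, the predictor names, and all numeric values are hypothetical and do not correspond to any model on the website.

```python
import math


def predict_probability(intercept, coefficients, inputs):
    """Return p = 1 / (1 + exp(-(b0 + sum(b_n * x_n)))) for one patient."""
    missing = sorted(set(coefficients) - set(inputs))
    if missing:
        # A calculator should refuse to run rather than guess absent inputs.
        raise ValueError(f"missing inputs: {', '.join(missing)}")
    linear_predictor = intercept + sum(
        beta * float(inputs[name]) for name, beta in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# HYPOTHETICAL model, chosen so that the arithmetic is easy to follow.
example_intercept = -2.0
example_coefficients = {"predictor_a": 1.5, "predictor_b": 0.5}

# Both Boolean predictors present: -2.0 + 1.5 + 0.5 = 0.0, so p = 0.5.
example_inputs = {"predictor_a": 1, "predictor_b": 1}
example_probability = predict_probability(
    example_intercept, example_coefficients, example_inputs
)
print(example_probability)  # → 0.5
```

Keeping the coefficients in a plain mapping, separate from the evaluation function, is one way to let every regression model share the same calculation, in the spirit of the statement in the Methods that all models operate identically.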
Discussion This platform can be viewed as a “model zoo” aimed at researchers and clinicians and with adequate grasp of the medical complexities associated with COVID-19. The aim for all showcased models is to stimulate research and supplement clinical judgment, not substitute it. The open source website is not intended for unaided use by laypeople (e.g., patients). We re-emphasize that this manuscript and the website in its current form are only a proof-of- principle. We do not claim that all models that would pass our selection criteria have been included. Similarly, any model not currently included on the platform should not be seen as problematic. Our inclusion period ranged from December 2019 till June 2020. As many models were published since then, an update of the search and the website needs to be and will be done in the near future. This paper should be seen by researchers from outside our collaboration as an invitation to participate on this platform, with the option of keeping the code hidden from the end user while still offering full functionality. We will assist external researchers for the successful incorporation of their models on our platform. This will create synergies that are bound to accelerate AI research on COVID-19. It will also ensure that models get the recognition they deserve and are used widely, instead of gathering dust as often happens when there are many publications on the same broad theme during a short period (a certainty in the context of COVID-19, given its world-changing nature). .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 5, 2021. 
; https://doi.org/10.1101/2021.01.05.425384doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425384 http://creativecommons.org/licenses/by/4.0/ 11 The method we used for retrieving coefficients of a regression model from a nomogram has certain limitations. For one, the accuracy is highly dependent on the resolution of the published model. Another limitation is that though the coefficients of the model are retrieved, the standard error for the coefficients of the parameters cannot be obtained from a nomogram alone. However, the method can be applied to any nomogram, making it a tool that can be broadly used, not restricted to COVID-19. Conclusions Our platform (https://covid19risk.ai/), at the current proof-of-principle stage, includes nine validated machine-learning models to serve as decision aids to doctors for various aspects of COVID-19 patient care. Our method for obtaining regression coefficients from a nomogram can be used by other researchers, including in non-COVID contexts. Our platform will be maintained and regularly updated for at least three years, since we have secured funding for this period (DRAGON grant). Therefore, we are encouraging research groups to collaborate with us to share their models with the world. Acknowledgments Authors acknowledge financial support from the European Commission’s Horizon 2020 research and innovation programme under grant agreement MSCA-ITN-PREDICT n° 766276, and the IMI2 Joint undertaking program under grant agreement DRAGON n° 101005122. Disclosure Dr Philippe Lambin reports, within and outside the submitted work, grants/sponsored research agreements from Varian medical, Oncoradiomics, ptTheragnostic/DNAmito, Health .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 5, 2021. 
Innovation Ventures. He received an advisor/presenter fee and/or reimbursement of travel costs/external grant writing fee and/or in-kind manpower contribution from Oncoradiomics, BHV, Merck, Varian, Elekta, ptTheragnostic and Convert pharmaceuticals. Dr Lambin has shares in the companies Oncoradiomics SA, Convert pharmaceuticals SA and The Medical Cloud Company SPRL, and is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Oncoradiomics, one issued patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito, three non-patented inventions (software) licensed to ptTheragnostic/DNAmito, Oncoradiomics and Health Innovation Ventures, and three non-issued, non-licensed patents on Deep Learning-Radiomics and LSRT (N2024482, N2024889, N2024889). He confirms that none of the above entities or funding was involved in the preparation of this paper.

References

1. Wu G, Yang P, Xie Y, Woodruff HC, Rao X, Guiot J, et al. Development of a Clinical Decision Support System for Severity Risk Prediction and Triage of COVID-19 Patients at Hospital Admission: an International Multicenter Study. European Respiratory Journal. 2020. p. 2001104. doi:10.1183/13993003.01104-2020
2. Song C-Y, Xu J, He J-Q, Lu Y-Q. COVID-19 early warning score: a multi-parameter screening tool to identify highly suspected patients. doi:10.1101/2020.03.05.20031906
3. Wang Z, Weng J, Li Z, Hou R, Zhou L, Ye H, et al. Development and Validation of a Diagnostic Nomogram to Predict COVID-19 Pneumonia. doi:10.1101/2020.04.03.20052068
4. Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020. p. m1328. doi:10.1136/bmj.m1328
5. Reiter PL, Pennell ML, Katz ML. Acceptability of a COVID-19 vaccine among adults in the United States: How many people would get vaccinated? Vaccine. 2020;38: 6500–6507.
6. Williams L, Gallant AJ, Rasmussen S, Brown Nicholls LA, Cogan N, Deakin K, et al. Towards intervention development to increase the uptake of COVID-19 vaccination among those at high risk: Outlining evidence-based and theoretically informed future intervention content. British Journal of Health Psychology. 2020. pp. 1039–1054.
doi:10.1111/bjhp.12468
7. Persad G, Peek ME, Emanuel EJ. Fairly Prioritizing Groups for Access to COVID-19 Vaccines. JAMA. 2020. doi:10.1001/jama.2020.18513
8. Helms J, CRICS TRIGGERSEP Group (Clinical Research in Intensive Care and Sepsis Trial Group for Global Evaluation and Research in Sepsis), Tacquard C, Severac F, Leonard-Lorant I, Ohana M, et al. High risk of thrombosis in patients with severe SARS-CoV-2 infection: a multicenter prospective cohort study. Intensive Care Medicine. 2020. pp. 1089–1098. doi:10.1007/s00134-020-06062-x
9. Jehi L, Ji X, Milinovich A, Erzurum S, Merlino A, Gordon S, et al. Development and validation of a model for individualized prediction of hospitalization risk in 4,536 patients with COVID-19. PLOS ONE. 2020. p. e0237419. doi:10.1371/journal.pone.0237419
10. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): The TRIPOD Statement. European Urology. 2015. pp. 1142–1151. doi:10.1016/j.eururo.2014.11.025
11. Naseem M, Akhund R, Arshad H, Ibrahim MT. Exploring the Potential of Artificial Intelligence and Machine Learning to Combat COVID-19 and Existing Opportunities for LMIC: A Scoping Review. J Prim Care Community Health. 2020;11: 2150132720963634.
12. van Wijk Y, Halilaj I, van Limbergen E, Walsh S, Lutgens L, Lambin P, et al. Decision Support Systems in Prostate Cancer Treatment: An Overview. Biomed Res Int. 2019;2019: 4961768.
13. Gong J, Ou J, Qiu X, Jie Y, Chen Y, Yuan L, et al. A Tool to Early Predict Severe Corona Virus Disease 2019 (COVID-19): A Multicenter Study using the Risk Nomogram in Wuhan and Guangdong, China. doi:10.1101/2020.03.17.20037515
14. COVID-19 Vulnerability Index (cv19index) - ClosedLoop.ai. [cited 15 Dec 2020]. Available: https://closedloop.ai/c19index/
15. COVID-19 Prognostic Tool. [cited 15 Dec 2020]. Available: https://qxmd.com/calculate
; https://doi.org/10.1101/2021.01.05.425384doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425384 http://creativecommons.org/licenses/by/4.0/ .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.05.425384doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425384 http://creativecommons.org/licenses/by/4.0/ .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.05.425384doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425384 http://creativecommons.org/licenses/by/4.0/ .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.05.425384doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425384 http://creativecommons.org/licenses/by/4.0/ Covid19Risk.ai: An open source repository and online calculator of prediction models for early diagnosis and prognosis of Covid-19 Abstract Introduction Methods Discussion Conclusions Acknowledgments Disclosure References 10_1101-2021_01_05_425409 ---- HLA-SPREAD: A comprehensive resource for HLA associated diseases, drug reactions and SNPs across populations HLA-SPREAD: A comprehensive resource for HLA associated diseases, drug reactions and SNPs across populations Dhwani Dholakia1,2*#, Ankit Kalra3#, Uma Kanga4, Mitali Mukerji1,2* 1. 
Institute of Genomics and Integrative Biology-Council of Scientific and Industrial Research, New Delhi-110025, India. 2. Academy of Scientific and Innovative Research, Ghaziabad-201002, India. 3. Netaji Subhas University of Technology, New Delhi-110078, India. 4. All India Institute of Medical Sciences, New Delhi-110029, India.

* Correspondence: Mitali Mukerji; Email: mitali@igib.res.in Dhwani Dholakia; Email: dhwani.dholakia@igib.in
# Equal Contribution

Keywords: HLA associations, Natural Language Processing, Adverse Drug Reactions, HLA Biomarker, Transplantation, HLA alleles

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.05.425409; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

ABSTRACT

Extreme complexity in the HLA system and its nomenclature make it difficult to interpret and integrate relevant information on HLA associations with diseases, adverse drug reactions (ADRs) and transplantation. A PubMed search returns ~110,000 studies on Human Leukocyte Antigens (HLA), reported from diverse locations and on multiple populations, and the IPD-IMGT/HLA database houses data on 28,320 HLA alleles to date. We developed an automated pipeline with a unified graphical user interface, HLA-SPREAD, that provides structured information on SNPs, Populations, REsources, ADRs and Diseases. Information on HLA was extracted from ~24 million PubMed abstracts using Natural Language Processing (NLP). Python scripts were used to mine and curate information on diseases, filter false positives and categorize diseases into 24 hierarchical tree groups, and Named Entity Recognition (NER) algorithms and semantic analysis were used to infer HLA association(s).
This resource, drawing on 116 countries and 47 ethnic groups, provides insights on markers associated with allelic/haplotypic associations in autoimmune, cancer, viral and skin diseases, transplantation outcome and ADRs for hypersensitivity. Summary information on clinically relevant biomarkers related to HLA disease associations, with mapped susceptible/risk alleles, is readily retrievable from HLA-SPREAD. This resource is the first of its kind and can help uncover novel patterns in HLA gene-disease associations.

INTRODUCTION

The Human Leukocyte Antigen (HLA) locus consists of six classical genes (HLA-A, -B, -C, -DP, -DQ and -DR) that play an important role in eliciting the immune response against pathogens (1) and three non-classical genes (HLA-E, -F and -G) that interact with Natural Killer cells to regulate virus-infected and malignant cells (2). HLA genes harbour a large number of mutations. As of September 2020, 28,320 HLA alleles are reported in the IPD-IMGT/HLA database. These variations mostly arise to generate defensive mechanisms against pathogens. However, some variations also confer risk of autoimmune diseases such as rheumatoid arthritis, multiple sclerosis, Type 1 diabetes and Graves’ disease. More than 100 different autoimmune diseases, infectious diseases and adverse drug reactions have been reported to be associated with HLA genes (3–5). These alleles have clinical utility as diagnostic markers, for example in rheumatoid arthritis and ankylosing spondylitis (6–8). They are also used in genetic screening, e.g.
HLA-B*57:01 in the Caucasian population for abacavir hypersensitivity, and HLA-B*15:02 in Chinese and other Asian populations for carbamazepine-induced life-threatening conditions such as Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN), and also for SJS due to carbamazepine and other drug combinations (9, 10). In the context of transplantation, mismatch of HLA alleles between donor and recipient impacts the outcomes of solid organ and hematopoietic stem cell transplantation (11). In addition, mismatching for certain HLA loci is also reported to provide benefit in terms of the Graft versus Leukemia effect (12). Each of the reported studies is unique in itself, as they describe the molecular basis of disease associations, HLA matching and anti-HLA antibody formation that are relevant for transplantation. Besides, studies also report relevant associated clinical information, e.g. different HLA-B27 subtypes are reported to be associated with clinical categories under spondyloarthropathies (13). Other studies implicate HLA allele association with the composition of the gut microbiome and diseases (14–16). The expanse of this information is immense, as there is wide genetic variability and heterogeneity among populations (17). Although advancements in HLA typing technologies have been beneficial in identifying novel HLA sequences (18), they have also led to the same HLA allelic variant being reported under different HLA nomenclature. With the rapid increase in biomedical data, HLA alleles and their associations in multiple diseases, it becomes imperative to create a platform with structured information from which relevant information can be queried and retrieved. Current knowledge about HLA is limited to individual papers that can be searched through PubMed, or to reviews in which a subset of studies has been summarised. Hitherto, there exists no database that compiles the existing HLA related information in an organised framework.
In the absence of such a repository, and with gaps in meta-information, resource sharing among researchers and clinicians becomes a big challenge. The integration of computer sciences with biomedical research has accelerated progress, both in terms of novel discoveries and data structuring. Natural Language Processing (NLP) is a method to extract relevant information from unstructured data (19). A simple NLP pipeline contains four components: data assembly, pre-processing and normalization, Named Entity Recognition (NER) and Relation Extraction (RE). The output of NLP algorithms, i.e. a structured dataset, can be used to generate insights via direct interpretation or through downstream analyses. In recent times, NLP methods have started gaining popularity in biological sciences. For instance, Rakhi et al. (20) reported a text mining pipeline to study spice-disease associations and link phytochemicals from different spices/herbs to diseases. Another report, by Lee et al., highlights BioBERT, a pre-trained biomedical language representation model that can be used for various text mining tasks such as Named Entity Recognition (NER), Relation Extraction (RE) and question answering, specifically on biomedical datasets. Similarly, PubTator Central (21) is an open access tool available via NCBI that uses text mining algorithms for assisted bio-curation of entities in literature. The tool uses NER to identify and highlight six bio-entities, viz. Gene, Disease, Chemical, Mutation, Cell Line and Species, in abstracts and open access articles available on PubMed.
Another interesting report, by Kuleshov et al. (22), presents a machine-compiled database for studying genotype-phenotype associations, generated by applying text mining to genome-wide association studies (GWAS). All these resources work on similar text mining algorithms, but each has a different set of applications and tasks to perform. Used as such for HLA research, these resources often overlook the extent of variability of the HLA complex and the parameters involved in this domain. For instance, PubTator Central is able to mine gene names from literature, but would not pick up HLA allele information, e.g. HLA-DRB1*01:01, when HLA-DRB1 is the search query. Conventional processes to individually mine the large amount of unstructured literature available on HLA research require both manpower and resources. To understand and integrate the observations from HLA studies, we require knowledge of genomic datasets, i.e. diseases, SNPs, drugs, populations and ethnic groups, along with an understanding of the relationships between them. NLP based text mining is an ideal approach to handle the complexity of this process and to create structured information. We provide HLA-SPREAD (Figure 1) as a platform for integrated HLA resources that has been developed using NLP to understand the complexity of this locus. The resource provides a platform to summarize HLA related genomics knowledge as well as to design and develop new hypotheses. In this study, we used ~24 million publicly available peer-reviewed abstracts. We extracted biomedical entities including HLA alleles, diseases, SNPs, drugs and geographical locations. We also attempted to assign positive and negative relationships between diseases and alleles. This HLA connectivity was then used to address biologically and clinically relevant objectives such as HLA biomarkers and risk and protective alleles for various diseases.

MATERIAL AND METHODS

Data Retrieval

MEDLINE was used as a source of biomedical literature; it comprises more than 24 million peer-reviewed articles from over 5600 scholarly journals. Bulk data was downloaded from the FTP server in XML format. HLA alleles with nomenclature were downloaded from the IPD-IMGT/HLA database (23). To maintain uniformity in disease names and their IDs, we used MeSH keywords from UMLS (Unified Medical Language System). Drugs associated with side effects were obtained from SIDER 4.1 and the Allele Frequency Net Database (AFND) (24, 25). Allele frequencies of HLA alleles were also taken from AFND. Extensive pre-processing was done on all the datasets before they were used in the pipeline.

Pre-processing and Keywords Dictionary

PubMed parsing: A modified version of PubMed parser was used to extract PMID, title, abstract, publication date, journal, article type and authors’ information from the MEDLINE biomedical literature dataset (26). Only records with the above information were considered for further analysis and stored in a tabular format. All subheadings in the abstract, viz. background, introduction, objective, method, experimental design, result, discussion, importance, setting, design, study objective, patients, participants and conclusion, were removed.

Disease Dictionary: Mentions of disease keywords were identified using a dictionary created from UMLS 2019 MRCONSO.RRF (27). UMLS is a set of biomedical vocabularies that includes data from OMIM, Gene Ontology, clinical repositories, Medical Subject Headings (MeSH) and NCBI taxonomy. In this study, we used MeSH descriptors including Entry Term (ET), Main Heading (MH), Preferred Entry Term (PEP), Descriptor Sort Version (DSV) and Machine Permutation (PM).
Descriptor Entry Version (DEV) was excluded because keywords belonging to this category were incomplete, e.g. abdominal injury was reported as abdominal inj. These descriptors are assigned a unique MeSH ID, which is stored in a hierarchical format with 24 head categories along with a unique Descriptor ID. We termed the root form of the disease level-zero and top-level diseases level-one for our analysis. Multiple forms of a disease, such as diabetes insipidus, diabetes mellitus, type 1 diabetes, juvenile-onset diabetes and others, are assigned the same MeSH ID. This dataset was also supplemented with keyword variants such as plural and lemmatised forms to increase the search space.

HLA Dictionary: Keywords for HLA alleles and their nomenclature were fetched from the centralized repository of the international ImMunoGeneTics project (IMGT) database. IMGT is updated quarterly with submission or deletion of alleles and their nomenclature, and currently houses 28,320 alleles. Many reports do not follow the conventional HLA allele nomenclature, which makes mapping a strenuous task. To capture as many HLA alleles as possible, we created a dataset comprising all possible keywords, including forms with special characters removed where required. We also attempted to map all the old nomenclature to the current allele names. This dictionary also includes a few generic HLA keywords such as HLA class I, HLA class II, HLA linked and HLA associated. A few alleles based on old nomenclature belong to more than one antigenic group; hence they were put under the “broad antigen” category.
A few haplotypes that were a combination of more than one HLA allele were grouped in the “haplotype” category.

Named Entity Recognition

Keyword Matching across Abstracts

A Python-based NER pipeline was implemented to filter abstracts based on a dictionary matching approach using parallel multiprocessing. Disease and HLA allele keyword dictionaries were used for initial screening. Abstracts were converted to lower case with special characters removed, and if a match was found in either title or text, the abstract was split into sentences using the sentence tokenizer of the Python Natural Language Toolkit (NLTK). We encountered a great extent of variability in the names of disease keywords. Most of it came from special characters such as (-) and (‘) in the keyword, or from plural and singular forms. To deal with the former, we also kept versions of the sentences in which special characters were not removed; this increased the search space and enabled capturing of keywords such as Stevens-Johnson syndrome and Graves' disease (Graves disease). Our disease dictionary was already enriched with plural and lemmatized forms of keywords to tackle the latter. For HLA allele keywords, word boundary-based regex matching was implemented to search for alleles in the sentences. Sentences with at least a single mention of both HLA allele and disease keywords were considered for further steps.

Identification of Tags: Populations, Drugs and SNPs

Populations: The filtered abstracts were processed using the spaCy NLP tagging algorithm (model: en_core_web_md) to search for mentions of populations in the text. Of the two output tags, i.e. GPE (Geo-Political Entities) and NORP (Nationalities Or Religious Groups), we selected keywords carrying the latter, because the GPE tag often reported scientific names of organisms as populations when applied to biomedical data; e.g. scientific names such as Chlamydia spp. and Chlamydomonas spp. were reported under GPE tags.
The output was classified into countries and ethnic groups for further analysis with the help of an expert anthropologist. Manual curation of the obtained list was also done to remove plural and inappropriate entries.

Drugs: The information on drugs with side effects was taken from the SIDER database (SIDER 4.1). We also added 16 drugs from AFND whose information was missing in SIDER. The list of drugs was mapped across the dataset to check for occurrences in the selected HLA related abstracts. There were many instances where drug names were a subpart of disease keywords, e.g. “insulin” was obtained as a false match wherever it was present as a part of the disease name “insulin dependent diabetes mellitus”. A small Python snippet was written to remove such false positives.

SNPs: SNP IDs were mapped across abstracts of the HLA dataset using the regular expression module of Python. The algorithm iteratively searched for all instances of RSIDs using the regular expression “[rR][sS][0-9]{2,}”. All the tags captured in the sentences of abstracts were stored as lists of strings along with their respective PMIDs to facilitate future access.

Semantic Assessment

N-GRAM Evaluation and Manual Labelling

An N-gram is a contiguous sequence of n items (syllables, letters or words) in a text, used to determine the context of those items in a sentence or paragraph. We used the NLTK functions WordNetLemmatizer, WordPunctTokenizer and CollocationFinder to create a corpus of N-grams (n=1, 2 and 3) from the abstract dataset.
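As an illustration of the N-gram corpus described above, the following is a minimal sketch that counts 1-, 2- and 3-grams over a handful of sentences. It uses only the Python standard library rather than the NLTK functions named in the text, it does not lemmatise, and the function names and example sentences are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(sentences, max_n=3):
    """Count every 1- to max_n-gram across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts

# Hypothetical example sentences, not taken from the dataset.
example = [
    "HLA-B27 is strongly associated with ankylosing spondylitis",
    "HLA-DRB1 was strongly associated with rheumatoid arthritis",
]
counts = ngram_counts(example)
```

Here `counts` maps each N-gram tuple to its frequency, so recurring word combinations such as ("strongly", "associated", "with") can be ranked and inspected.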
After removal of stop words, which do not add significant meaning to the context, a subset consisting of all reported verb/adverb (n=1) and adverb-verb (n=2, 3) combinations above a frequency cut-off was selected using the Part of Speech (POS) tags of the tokenised words. We observed that N-grams for negative labels often gave misleading information, e.g. “HLA-B27 negative” refers to the absence of the allele rather than a negative association between entities. Hence, we used very stringent criteria for choosing negative labels. Manual annotation of positive and negative labels was then carried out on this dataset, and a total of 1128 labels (Supplementary Table 1) were categorised (1108 positive and 20 negative) for labelling the sentences. We assign a positive label where the HLA allele is positively associated with the disease, so that its presence makes individuals susceptible to the disease, whereas in negative statements the HLA allele is negatively associated with the disease and hence protective. We also considered negation words such as “not, none, no” which, if present, can reverse the actual meaning of a sentence. Instances of the above three keyword sets (positive, negative and negation) were iteratively searched in all the sentences. Further, a coding scheme was constructed using a binary layout to label sentences as positive, negative, complex or ambiguous. Sentences having no match from any of the categories were labelled as others.

Root-Verb and Associated Adverbs using Dependency Parsing

Dependency parsing refers to the formation of a tree layout based on the grammatical structure of a sentence, where the root node is represented by a verb that relates the different entities of that sentence. The allele and disease keywords present in each sentence were replaced with @GENE and @DISEASE tags, and a parse tree was generated using the StanfordCoreNLP Python module (Stanford-corenlp-full-2018-10-05 package).
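The entity-masking step that precedes parsing can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and keyword lists are hypothetical, and the parse itself (done with StanfordCoreNLP in the text) is not shown.

```python
import re

def _keyword_pattern(keywords):
    """Compile one case-insensitive pattern that matches any whole keyword."""
    # Longest keywords first, so that 'HLA-B*27:05' wins over 'HLA-B'.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # Lookarounds act as word boundaries that still work next to '*' or ':'.
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )

def mask_entities(sentence, allele_keywords, disease_keywords):
    """Replace allele mentions with @GENE and disease mentions with @DISEASE."""
    masked = sentence
    if allele_keywords:
        masked = _keyword_pattern(allele_keywords).sub("@GENE", masked)
    if disease_keywords:
        masked = _keyword_pattern(disease_keywords).sub("@DISEASE", masked)
    return masked

# Hypothetical dictionaries and sentence for illustration only.
masked = mask_entities(
    "HLA-B*27:05 is associated with ankylosing spondylitis.",
    ["HLA-B", "HLA-B*27:05"],
    ["ankylosing spondylitis"],
)
```

Masking the entities first keeps multi-token names from being split by the parser, so that the root verb linking the two placeholders can be read off the tree directly.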
The list of verbs obtained from the root nodes of all the sentences in the dataset was manually curated under positive and negative labels. We also added a category “Studied/Investigatory” for sentences that do not convey any positive or negative context but mention both entities together, e.g. “To investigate the association of HLA-A, B, and DRB1 alleles with leukaemia in the Han population in Hunan province”.

Sentence Annotation

We termed our approach for labelling sentences a “hybrid approach”, in which annotation used both the N-gram labels and the type of root verb. If a sentence had a positive N-gram label and a positive root verb implying that the entities are associated or linked, the sentence was labelled as positive. The same approach was used for negative labelling. Finally, the sentence labels were grouped into five categories: 1) Positive, 2) Negative, 3) both positive and negative, referred to as Complex sentences, 4) positive plus negation, referred to as the Ambiguous group, and 5) Investigatory.

Database and web server

The HLA-SPREAD database is built for quick and easy retrieval of information related to HLA genes. The web interface was coded in HTML5, CSS3, Bootstrap and ES6. We used D3.js for data visualization and jQuery DataTables for table integration. The server was hosted using Apache HTTP Server. The database uses a flat-file system with data stored in an Excel file. JavaScript handles the search queries and data visualizations.
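The hybrid labelling rule described under Sentence Annotation can be sketched as a small decision function. This is one possible reading of the categories listed in the text, not the authors' code: the order in which the rules are checked, the encoding of the root verb type and the handling of cases the text does not specify (for example a negative cue together with a negation word) are all assumptions.

```python
def label_sentence(has_positive_cue, has_negative_cue, has_negation, root_verb_type):
    """Assign a sentence category from N-gram cues and the root verb type.

    root_verb_type is assumed to be one of "positive", "negative",
    "investigatory" or "other" (a hypothetical encoding).
    """
    if has_positive_cue and has_negative_cue:
        return "Complex"        # both positive and negative cues present
    if has_positive_cue and has_negation:
        return "Ambiguous"      # positive cue combined with not/none/no
    if has_positive_cue and root_verb_type == "positive":
        return "Positive"       # N-gram label and root verb agree
    if has_negative_cue and root_verb_type == "negative":
        return "Negative"       # N-gram label and root verb agree
    if root_verb_type == "investigatory":
        return "Investigatory"  # entities co-mentioned without a direction
    return "Others"             # no usable cue found

# Hypothetical example: a positive cue, no negation, positive root verb.
example_label = label_sentence(True, False, False, "positive")
```

Requiring agreement between the N-gram cue and the root verb trades recall for precision, which matches the stringent stance the text takes on negative labels.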
RESULTS

Mining Medline literature for HLA association

NLP based text mining of 24 million publicly available biomedical abstracts yielded 41,845 abstracts with one or more sentences that describe the relationship between HLA alleles and diseases. To understand the distribution of article types among the filtered abstracts, we studied the article type per year trend from 1975 to 2019 (Figure 2). Research journal articles, comparative studies and review articles were the most numerous every year. In addition, there were papers corresponding to phase I, II, III and IV clinical trials and observational studies, highlighting the importance of this locus in translational studies.

HLA genes, alleles and its distribution

There are 28,320 alleles, and we hypothesize that not all of them would be associated with a disease or pathological condition. While collating and analysing data on HLA alleles, we observed great variability in how allele names were written within articles. For example, HLA-B*13:01, a risk factor for dapsone hypersensitivity syndrome in multiple populations, was written as HLA-B*13:01, HLA-B*1301, B*1301, B(*)1301 and B1301 in different papers. In such instances, a user searching for an allele and its related information must be aware of all possible formats of writing the allele, encompassing its current and previous nomenclature. We therefore converted all existing HLA keywords to a standard allele name. We identified only ~1% of the total alleles as associated with conditions such as diseases, graft survival or drug reactions. To represent these alleles in the form of a graph, we collapsed the nomenclature to the two-digit level (Figure 3). The majority of the studies concerned the HLA-DRB1 locus, followed by HLA-B and HLA-A, while fewer studies were on the HLA-C locus.
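The conversion of variant spellings to a standard allele name, and the collapse to the two-digit level, can be sketched as below. This is a minimal illustration covering only the five spellings quoted above for four-digit names; the locus list, the regular expression and the function names are assumptions, and old serological nomenclature is not handled.

```python
import re
from collections import Counter

# Assumed subset of loci; alternation keeps 'B1301' from being read as locus 'B1'.
_ALLELE_RE = re.compile(
    r"^(?:HLA-)?(A|B|C|E|F|G|DRB[1-5]|DQA1|DQB1|DPA1|DPB1)"
    r"\(?\*?\)?(\d{2}):?(\d{2})$"
)

def normalize_allele(mention):
    """Return 'HLA-<locus>*NN:NN' for a recognised spelling, else None."""
    match = _ALLELE_RE.match(mention.strip().upper())
    if match is None:
        return None
    locus, field1, field2 = match.groups()
    return "HLA-%s*%s:%s" % (locus, field1, field2)

def two_digit(allele):
    """Collapse a standard allele name to its first field, e.g. HLA-B*13."""
    return allele.split(":")[0]

variants = ["HLA-B*13:01", "HLA-B*1301", "B*1301", "B(*)1301", "B1301"]
standard = [normalize_allele(v) for v in variants]

# Count normalised mentions per two-digit group (hypothetical mentions).
mentions = variants + ["DRB1*0101", "HLA-DRB1*01:02"]
group_counts = Counter(two_digit(normalize_allele(m)) for m in mentions)
```

Once every mention carries one standard name, counting per two-digit group or per locus reduces to a simple tally, which is the kind of summary shown in Figure 3.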
Each HLA allele, collapsed to its two-digit form, is linked to the AFND server, which reports its allele frequency. The focus of our present study was also to understand the semantics between alleles and diseases, wherein we noted that some alleles were reported as protective and some as risk alleles; e.g. some reports indicated that HLA-DRB1*15 was protective for HIV and diabetes, whereas other studies reported it as a risk allele for pulmonary tuberculosis. We were also interested in exploring the effects of multiple alleles individually on a single disease. To address this, we listed 45 articles (Supplementary Table 2) highlighting the fact that, for a single disease, different alleles can have contrasting effects, e.g. HLA-DQA1*02:01 and HLA-DQB1*06:02 can be protective in Artemisia pollen-induced allergic rhinitis, while HLA-DQA1*03:02 can be a risk factor (28).

Exploring diseases, their associated categories and other relevant information

The HLA studies were divided into four broad categories, Diseases, Transplantations, Sign and Symptoms, and Therapeutics/ADRs, to study the information systematically. This grouping was done based on the MeSH keywords identified in the abstracts. There is a total of 24 categories for diseases in MeSH, ranging from C1 to C26, and transplantation procedures are listed under E04. Keywords falling under C23 were grouped as “Sign and Symptoms”, and C20.452 (GVHD) and E04 were grouped as “Transplantations”. For “Therapeutics/ADRs”, we selected only those sentences that mentioned drug keywords, allele names and disease names together.
We then retained them only if they satisfied any of three conditions: 1) they belonged to the drug adverse reactions category, 2) the sentence mentioned keywords such as “reactions” or “-induced” (e.g. carbamazepine-induced), or 3) the disease keyword contained “-induced” (e.g. drug-induced liver injury). The remaining sentences were grouped as “Diseases”. Table 1 shows the number of articles under each category. To study the association with diseases, we analysed data from both the “Diseases” and “Transplantation” categories. Inconsistency in writing disease names increases the effort needed to search for a specific query. To reduce this variability, the MeSH ID was used to summarise the obtained information, e.g. diseases like tumour, cancer, malignancy and neoplasm (malignant and benign) were mapped to a single entity, malignancy (D009369). Collapsing a large number of similar keywords to a single ID reduces the complexity of searching for articles related to particular diseases. We observed a total of 3615 different disease terms mapping to 1869 unique MeSH IDs. Figure 4 represents a snapshot of common HLA associated diseases. To examine the disease associations, we mapped them to level-one (level-zero) terms. Diabetes Mellitus Type 1, Rheumatoid Arthritis, Multiple Sclerosis (Autoimmune Disease), Melanoma and Leukemia (Neoplasms by Histologic Type), Psoriasis (Skin disease) and Celiac Disease (Metabolic) were the topmost HLA associated diseases. In the analysed abstracts, the list of HLA associated diseases/conditions indicates that some diseases were very frequently reported, whereas others, such as Down syndrome, Guillain-Barre Syndrome and Polymyalgia Rheumatica, were infrequently or rarely reported. Supplementary Table 3 represents the distribution of both common and less explored HLA associated diseases. To get an overall perspective of genes and diseases, we considered the diseases at level-one along with the HLA gene.
We observed the majority of reported associations with HLA-DRB1, followed by HLA-B and HLA-A (Figure 5). We also listed details of individual allele-disease pairs for more information (Supplementary Table 4). HLA-DRB1 was reported to be linked with disease conditions like rheumatoid arthritis, type 1 diabetes, multiple sclerosis, melanoma and 1184 other diseases. HLA-B was reported to be associated with spondylitis, infections, hypersensitivities, psoriasis, drug allergies and 928 other diseases, and HLA-A with melanoma, leukemia, influenza, haemochromatosis, and 778 other diseases. The analysis also takes into consideration the diseases that require transplantation and includes the associated complications, both pre- and post-transplantation. As anticipated, we observed that individuals suffering from beta thalassemia and sickle cell anaemia (genetic and congenital disorders), multiple myeloma (an immunoproliferative disorder) and liver injury underwent transplantations of bone marrow, hematopoietic stem cells and renal tissue. However, additional details were included with the transplantation data, such as the disease history of patients before undergoing transplantation (e.g. psoriasis, Graves’ disease, diabetic neuropathy) and post-transplantation complications (e.g. ischemia, necrosis, fibrosis, haemorrhage). Such information, collated under one platform, may be of interest to a clinician designing therapy modules. Supplementary Table 5 presents details of transplantation related studies.
SNPs and HLA diseases

HLA loci have a repertoire of genetic variations, a large number of which have been linked to multiple diseases via genome-wide association studies (GWAS). Though GWAS list information about SNPs in or associated with HLA genes, a number of genetic variation studies go unnoticed, either because they are small cohort analyses or because they are not compiled in a single resource for systematic study. Thus, to include the overlooked studies and missing information, this analysis reports information from all kinds of studies and includes abstracts mainly from journal articles, reviews, meta-analyses, letters, and clinical trials. To acquire robust data, we retained only those HLA variations that are present in sentences along with the disease and allele keywords. We identified 313 unique SNP mentions, and their details are compiled in Supplementary Table 6. The majority of SNPs mapped to intronic variants, followed by missense and intergenic variants. Figure 6 represents the genomic distribution of mapped SNPs. A substantial number of variations also mapped to genes other than HLA, indicating that they may be in Linkage Disequilibrium (LD) or frequently occur in conditions such as transplantation success or ADRs. We observed top hits of SNPs mapping to infectious diseases like HIV and hepatitis, inflammatory conditions like psoriasis, complex diseases like asthma and diabetes, and hypersensitivity largely attributed to ADRs. SNP association studies are also based on a proxy SNP, which can be in LD with the causal variant, and the LD values vary from one population to another. To address this, we also added population information of the studies whenever it was available in the abstract. The most studied SNP, rs9277535, associated with hepatitis B virus, has been studied across a large number of populations from Asian and central Asian countries like China, Japan, Asia, Turkey, Korea, and Indonesia.
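The sentence-level co-occurrence filter described in this section (keep a SNP only when its sentence also names a disease and an HLA allele or gene) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the regular expressions, the function name `snps_with_context`, the disease term list and the example sentences are all assumptions made for the sketch.

```python
import re

# Assumed patterns: dbSNP rsIDs and HLA gene/allele mentions such as
# "HLA-B", "HLA-DRB1" or "HLA-B*15:02".
RSID = re.compile(r"\brs\d+\b", re.IGNORECASE)
HLA_ALLELE = re.compile(r"\bHLA-[A-Z]+\d*(?:\*\d+(?::\d+)*)?")


def snps_with_context(sentence, disease_terms):
    """Return rsIDs from a sentence that also names an HLA allele and a disease.

    An empty list is returned when any of the three keyword types is missing.
    """
    rsids = RSID.findall(sentence)
    if not rsids:
        return []
    if not HLA_ALLELE.search(sentence):
        return []
    lowered = sentence.lower()
    if not any(term.lower() in lowered for term in disease_terms):
        return []
    # Normalise case and drop duplicates within the sentence.
    return sorted({rsid.lower() for rsid in rsids})


# Hypothetical disease dictionary and invented example sentences.
disease_terms = ["hepatitis B", "psoriasis"]
sentences = [
    "rs9277535 near HLA-DPB1 was associated with hepatitis B in this cohort.",
    "rs9277535 was genotyped in all samples.",
    "HLA-B*15:02 carriers were screened for psoriasis.",
]

kept = [snps_with_context(s, disease_terms) for s in sentences]
print(kept)
```

Only the first sentence passes, because the second lacks allele and disease keywords and the third contains no rsID. Matching disease terms by substring is a simplification; a dictionary lookup against MeSH synonyms would be closer to the approach the text describes.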
Geographical Spread of HLA literature across various ethnic groups and populations

Genetic differences in HLA genes across populations and their link with biological conditions make it imperative to consider geographical information while studying HLA association with a particular condition. We assumed that the names of populations or ethnic groups might not be present in the same sentences that mention HLA and disease, so we used a flexible approach here and fetched the names of geographical locations present anywhere in the abstracts. In total, we reported 7696 NORP tags, mapping to 174 unique geographical entities. These unique tags were binned into 112 country-based populations and 62 ethnic groups. Figure 7 represents the frequency distribution of these matched populations belonging to the countries and ethnic groups. Japan, China, USA, India and Italy are the major countries where HLA gene-disease association studies have been reported with disease groups, as shown in Supplementary Table 7. Along with this, the European subcontinent has been extensively studied (1102 unique reports) as a major ethnic group. Apart from frequently studied areas, we also observed locations like New Zealand, Armenia and Sri Lanka that have a low number of reported studies. This type of analysis can help researchers understand not only the extent of allele-disease associations among populations in the context of these immune players but also the scope of research in their selected geographical location while planning their hypothesis.
Response to therapeutics

HLA genes are known to be associated with various hypersensitivities and drug reactions, a few of which, like Stevens-Johnson syndrome, can be life-threatening. Because alleles differ at the individual and population levels, these hypersensitivities vary, and studying these pharmacogenetic markers together with population information becomes important. For instance, we observed from our data that HLA-A*31:01 is associated with carbamazepine-induced Stevens-Johnson syndrome in the European population, while HLA-B*15:02 is associated with it in Chinese and Indian populations. A meta resource like HLA-SPREAD can help understand such population-wise differences, which obstruct the design of therapy modules for ADRs/hypersensitivities. To be more specific, this analysis focuses on drugs that are present in sentences along with the disease and allele keywords. We observed a total of 1755 abstracts mentioning 252 unique drugs, of which 78 mapped to the ADR category. Details of drugs and related information are listed in Supplementary Table 8. We also validated our results against AFND, a manually curated database that has information about ADRs. Of the 42 drugs present there, we found 33 in common. One of the drugs mentioned in AFND, “Valporic acid”, was not present in the actual cited article. The remaining drugs could not be captured because of the stringent criterion for drug mapping, i.e. the drug name should be present in the sentence along with the disease and allele keywords. Figure 8 lists the frequency-based distribution of the top 20 drugs fetched from our analysis. Interestingly, we also observed 19 drugs that are not mentioned in the AFND database, e.g. the HLA-B*38:02:01 allele was found to predict carbimazole/methimazole-induced agranulocytosis, and HLA-DRB1 was associated with azathioprine-induced pancreatitis in IBD patients. This analysis highlights how manual curation, besides being time and manpower intensive, can miss information.
Insights from HLA-SPREAD: Biomarker analysis

We demonstrate the usability of the database to address clinically relevant queries. Multiple questions on the identification of HLA alleles and diseases linked with hypersensitivity, allergy, genetic markers, prognosis and diagnosis can be addressed using HLA-SPREAD. As an example, we present an analysis to identify biomarkers in HLA studies. To address this question, we used an n-gram based approach to identify the keywords most frequently occurring with “marker” in the sentences. Supplementary Table 9 lists the most common keywords identified. We checked the details of such sentences and compiled the information (Supplementary Table 10). A few of them, like abacavir hypersensitivity and SJS, were present in multiple papers. HLA-G and HLA-E were also reported to be markers for conditions like tumour, transplantation and heart diseases.

Discussion

HLA alleles are known to be associated with a large number of diseases. There is no existing repository that summarises this information in a systematic manner. Manual curation is a cumbersome process, and one might also miss a lot of important information. The need for such a user-friendly platform increases significantly since HLA alleles have been found to be clinically associated with a large number of conditions. NLP based text mining offers a way to fetch this information pragmatically. NLP is instrumental in extracting information from unstructured data. This method has started assuming immense importance in the biomedical domain.
A few papers like GWASkb and SNP literature have used it for extracting information such as SNPs and related knowledge from biomedical data, whereas the Monarch Initiative has used it for studying phenotype information (29). Extracting information from HLA related literature is very difficult owing to the large number of studies and complex nomenclature. This project is an attempt to consolidate all the HLA relevant information, such as SNPs, populations studied, ADRs and associated diseases, into a structured database. This resource is also handy for user-specific advanced HLA searches, like looking for biomarkers for toxicity-based studies and disease progression. There were a few drawbacks of this analysis worth highlighting, primarily arising from the different formats of various journals. The initial tokenised data used in the analysis was based on English stop words. However, we observed that in a small set of papers the authors omitted full stops or spaces, which led to the fusion of two sentences. The subheadings were present in different cases and were often followed by different special characters, which complicated their removal. Also, prefixes of keywords like SETTINGS, STUDY DESIGN, etc. were observed in a few sentences, as those papers did not follow standard headlines. Apart from these, a few other factors, like abbreviations at the end of sentences, the presence of roman letters in sentences, and different bracket and quote styles in titles, caused errors during the tokenisation process. Similarly, we observed that when abstracts were updated in new releases, the previous incorrect entries were not removed, which led to duplication of information. Since HLA-SPREAD has catalogued information from diverse resources, in many instances it provides pieces of information that would be more informative and exhaustive.
For instance, besides information retrieved from databases like DisGeNET and OMIM (Mendelian), which report information on a few diseases, we also used MeSH, which is more comprehensive as it houses 139264 variant disease terms mapping to 4674 diseases. We also reduced the high variability in the way disease names are mentioned in various articles. On average, a disease has around 30 names with one ID, showing the wide spectrum of the disease dictionary required to capture all possible disease terms. In order to capture HLA and ADRs, we selected a list of drugs from SIDER4.1. However, not all drugs present in a side effect database will be associated with ADRs. To get a more specific answer, we selected drugs from categories such as adverse drug reactions, hypersensitivity and toxicity. We were able to fetch a large number of studies and observed that the AFND database has missed quite a few drugs in the ADR analysis. We thus added information from both AFND and SIDER to get heuristic information for a set of different drugs. There were a few unique aspects that we could capture because of our approach. For instance, in transplantation studies, in addition to just listing different kinds of transplantations, we also observed the most common diseases which required transplantation and the drugs given during the process with few side effects. Another unique aspect we added was a category called signs and symptoms, to simplify user searches. For instance, some users may also be interested in knowing the context of HLA alleles with conditions like inflammation, relapse, hypoxia, septic shock, diarrhoea, etc.
We aim to add a few features in future updates, for example mapping the variants reported in dbSNP, OMIM and ClinVar to the HLA alleles. This would help in seamless integration of high-throughput variation data with the wealth of HLA information in the literature and the HLA alleles reported in the IMGT database. To summarise, this is a one-of-its-kind effort to integrate the diversity of HLA information into a structured format for ease of query and analysis. This could also provide an informative resource for non-HLA specialists initiating new studies in populations and diseases.

Acknowledgements

The authors acknowledge the COE M/o AYUSH grant MLP-901 to MM and DD, the SRF fellowship to DD from the Department of Biotechnology (DBT), and Dr. Yatender Kumar (NSIT) for permitting AK to work on this project. We also acknowledge Mr Praveen Sinha for designing and developing the webpage of HLA-SPREAD, Dr. Debasis Dash, CSIR-IGIB, for critical review of the work, Dr. Ganesh Bagler and Rudransh Tunwani from IIITD for NLP discussion, Dr. Ganganath Jha from Hazaribagh University for QC of population curation, and Malika Seth for QC of semantic annotations. The authors would also like to acknowledge Mr. Raghunandanan MV and Mr. Amit Khulve at CSIR-IGIB for IT support.

Authors Contributions

MM and DD designed the study and co-wrote the manuscript. DD and AK executed the entire work. UK helped in HLA analysis, interpretation and manuscript writing.

List of Figures

Figure 1. Workflow of HLA-SPREAD: An automated pipeline developed to extract information from ~110,000 HLA-related studies retrieved from over 24 million abstracts. Structured information from these abstracts was created using Natural Language Processing methods and developed into the HLA-SPREAD database. The various resources used at each step are indicated.

Figure 2. Nature and trends of HLA related publications in PubMed annually from 1975 onwards: Stacked bar plot shows the distribution of PubMed articles in different categories. a) Diverse studies including clinical trials are reported, with maximum numbers represented in the “journal article” category. b) A subplot of (a) after removing the most frequent “Journal article” type to visualise the trends in other categories.

Figure 3. The topmost reported HLA alleles associated with diseases: All the HLA alleles indicated have been grouped to their second digit and represented in the pie chart. HLA-A, HLA-B and HLA-DRB1 are the most studied amongst the HLA genes.

Figure 4. Diseases/conditions associated with HLA genes: Graph represents a three-level hierarchy of diseases. Each colour represents a level. There are 24 major categories, represented in green, which are further divided into subcategories. Each disease name is matched to its MeSH ID and a normalised MeSH keyword. Autoimmune, Neoplasms and Joint disease are the topmost associated diseases. As anticipated, significant numbers of studies related to transplantation are also observed.

Figure 5. Heatmap of HLA Disease associations: The gradient heat map represents the number of diseases associated with HLA genes. The first column represents generic “HLA” studies where specific gene information is not mentioned. A large number of associations were also observed with non-classical (HLA-E, F, G) genes.

Figure 6. Genomic distribution of SNPs: Pie chart representing the number of variations in genic regions, with the majority of them mapping to introns.

Figure 7. Geographical Spread of HLA studies: Identified geographical locations are binned to the nearest a) Country b) Ethnic group. The colour gradient represents the count of various HLA alleles with respect to disease or ADR studies. China, Japan and the USA report the maximum number of studies, and European, Asian and African are the most studied ethnic groups.

Figure 8. Statistics of drug related HLA studies: This bar plot includes the top 20 most common drugs associated with ADRs identified using HLA-SPREAD.
List of Tables

Table 1: Number of articles in broad categories

Categories            Number of PubMed abstracts
Diseases              29713
Transplantation       9258
Signs and Symptoms    6050
ADR                   317

Supplementary tables: https://doi.org/10.5281/zenodo.4276878

References:
1. Mosaad,Y.M. (2015) Clinical Role of Human Leukocyte Antigen in Health and Disease. Scand J Immunol, 82, 283–306.
2. Niehrs,A. and Altfeld,M. (2020) Regulation of NK-Cell Function by HLA Class II. Front. Cell. Infect. Microbiol., 10, 55.
3. Shiina,T., Hosomichi,K., Inoko,H. and Kulski,J.K. (2009) The HLA genomic loci map: expression, interaction, diversity and disease. J Hum Genet, 54, 15–39.
4. Blackwell,J.M., Jamieson,S.E. and Burgner,D. (2009) HLA and Infectious Diseases. CMR, 22, 370–385.
5. Fricke-Galindo,I., LLerena,A. and López-López,M. (2017) An update on HLA alleles associated with adverse drug reactions. Drug Metabolism and Personalized Therapy, 32.
6. Klimenta,B., Nefic,H., Prodanovic,N., Jadric,R. and Hukic,F. (2019) Association of biomarkers of inflammation and HLA-DRB1 gene locus with risk of developing rheumatoid arthritis in females. Rheumatol Int, 39, 2147–2157.
7. Khan,M.A., Mathieu,A., Sorrentino,R. and Akkoc,N. (2007) The pathogenetic role of HLA-B27 and its subtypes. Autoimmunity Reviews, 6, 183–189.
8. Khan,M.A. (2008) HLA-B27 and Its Pathogenic Role: JCR: Journal of Clinical Rheumatology, 14, 50–52.
9. Ferrell,P.B. and McLeod,H.L. (2008) Carbamazepine, HLA-B*1502 and risk of Stevens–Johnson syndrome and toxic epidermal necrolysis: US FDA recommendations. Pharmacogenomics, 9, 1543–1546.
10. Sawal,N., Kanga,U., Shukla,G., Goyal,V. and Srivastava,A.K. (2020) Stevens-Johnson syndrome triggered by Levetiracetam—Caution for use with Carbamazepine. Seizure, 80, 63–64.
11. Ayuk,F., Beelen,D.W., Bornhäuser,M., Stelljes,M., Zabelina,T., Finke,J., Kobbe,G., Wolff,D., Wagner,E.-M., Christopeit,M., et al. (2018) Relative Impact of HLA Matching and Non-HLA Donor Characteristics on Outcomes of Allogeneic Stem Cell Transplantation for Acute Myeloid Leukemia and Myelodysplastic Syndrome. Biology of Blood and Marrow Transplantation, 24, 2558–2567.
12. Petersdorf,E.W. (2017) Which factors influence the development of GVHD in HLA-matched or mismatched transplants? Best Practice & Research Clinical Haematology, 30, 333–335.
13. Kanga,U., Mehra,N.K., Larrea,C.L., Lardy,N.M., Kumar,A. and Feltkamp,T.E.W. (1996) Seronegative Spondyloarthropathies and HLA-B27 Subtypes: A Study in Asian Indians. Clin Rheumatol, 15, 13–18.
14. Xu,H. and Yin,J. (2019) HLA risk alleles and gut microbiome in ankylosing spondylitis and rheumatoid arthritis. Best Practice & Research Clinical Rheumatology, 33, 101499.
15. Andeweg,S.P., Keşmir,C. and Dutilh,B.E. (2020) Quantifying the impact of Human Leukocyte Antigen on the human gut microbiome Bioinformatics.
16. Gomez,A., Luckey,D., Yeoman,C.J., Marietta,E.V., Berg Miller,M.E., Murray,J.A., White,B.A. and Taneja,V. (2012) Loss of Sex and Age Driven Differences in the Gut Microbiome Characterize Arthritis-Susceptible *0401 Mice but Not Arthritis-Resistant *0402 Mice. PLoS ONE, 7, e36095.
17. Buhler,S. and Sanchez-Mazas,A. (2011) HLA DNA Sequence Variation among Human Populations: Molecular Signatures of Demographic and Selective Events. PLoS ONE, 6, e14643.
18. Saxena,A., Suzuki,S., Mourya,M., Shiina,T. and Kanga,U. (2020) Novel and extended HLA class I and II alleles encountered in Kashmiri Brahmin population from North India. HLA, 96, 487–489.
19. Sfakianaki,P., Koumakis,L., Sfakianakis,S., Iatraki,G., Zacharioudakis,G., Graf,N., Marias,K. and Tsiknakis,M. (2015) Semantic biomedical resource discovery: a Natural Language Processing framework. BMC Med Inform Decis Mak, 15, 77.
20. Rakhi,N.K., Tuwani,R., Mukherjee,J. and Bagler,G. (2018) Data-driven analysis of biomedical literature suggests broad-spectrum benefits of culinary herbs and spices. PLoS ONE, 13, e0198030.
21. Wei,C.-H., Allot,A., Leaman,R. and Lu,Z. (2019) PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, 47, W587–W593.
22. Kuleshov,V., Ding,J., Vo,C., Hancock,B., Ratner,A., Li,Y., Ré,C., Batzoglou,S. and Snyder,M. (2019) A machine-compiled database of genome-wide association studies. Nat Commun, 10, 3341.
23. Giudicelli,V., Chaume,D., Bodmer,J., Muller,W., Busin,C., Marsh,S., Bontrop,R., Marc,L., Malik,A. and Lefranc,M.-P. (1997) IMGT, the international ImMunoGeneTics database. Nucleic Acids Research, 25, 206–211.
24. Kuhn,M., Letunic,I., Jensen,L.J. and Bork,P. (2016) The SIDER database of drugs and side effects. Nucleic Acids Res, 44, D1075–D1079.
25. Ghattaoraya,G.S., Dundar,Y., González-Galarza,F.F., Maia,M.H.T., Santos,E.J.M., da Silva,A.L.S., McCabe,A., Middleton,D., Alfirevic,A., Dickson,R., et al. (2016) A web resource for mining HLA associations with adverse drug reactions: HLA-ADR. Database, 2016, baw069.
26. Achakulvisut,T., Acuna,D. and Kording,K. (2020) Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. JOSS, 5, 1979.
27. Bodenreider,O. (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32, 267D–270.
28. Wang,M., Xing,Z.-M., Yu,D.-L., Yan,Z. and Yu,L.-S. (2004) Association between HLA class II locus and the susceptibility to Artemisia pollen-induced allergic rhinitis in Chinese population. Otolaryngol Head Neck Surg, 130, 192–196.
29. Shefchek,K.A., Harris,N.L., Gargano,M., Matentzoglu,N., Unni,D., Brush,M., Keith,D., Conlin,T., Vasilevsky,N., Zhang,X.A., et al. (2020) The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research, 48, D704–D715.

10_1101-2021_01_05_425417 ---- 14989606

TITLE
Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships

AUTHORS
Andrzej Zielezinski1,*, Jakub Barylski2, Wojciech M. Karlowski1

AUTHOR AFFILIATIONS:
1 Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznanskiego 6, 61-614, Poznan, Poland
2 Molecular Virology Research Unit, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznanskiego 6, 61-614, Poznan, Poland

* Address correspondence to:
Andrzej Zielezinski: andrzejz@amu.edu.pl

ABSTRACT
Motivation: Similar regions in virus and host genomes provide strong evidence for phage-host interaction, and BLAST is one of the leading tools to predict hosts from phage sequences.
However, BLAST-based host prediction has three limitations: (i) top-scoring prokaryotic sequences do not always point to the actual host, (ii) mosaic phage genomes may produce matches to many, typically related, bacteria, and (iii) phage and host sequences may diverge beyond the point where their relationship can be detected by a BLAST alignment.
Results: We created an extension to BLAST, named Phirbo, that improves host prediction quality beyond what is obtainable from standard BLAST searches. The tool harnesses information concerning sequence similarity and bacteria relatedness to predict phage-host interactions. Phirbo was evaluated on two benchmark sets of known phage-host pairs, and it improved precision and recall by 25 percentage points, as well as the discriminatory power for the recognition of phage-host relationships by 10 percentage points (Area Under the Curve = 0.95). Phirbo also yielded a mean host prediction accuracy of 60% and 70% at the genus and family levels, respectively, representing a 5% improvement over BLAST. When using only a fraction of phage genome sequences (3 kb), the prediction accuracy of Phirbo was 5-11% higher than BLAST at all taxonomic levels.
Conclusion: Our results suggest that Phirbo is an effective, unsupervised tool for predicting phage-host relationships.
Availability: Phirbo is available at https://github.com/aziele/phirbo.

KEYWORDS
phage-host prediction, phage, prokaryote, bacteria, virus, genome sequence

bioRxiv preprint, https://doi.org/10.1101/2021.01.05.425417; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

INTRODUCTION
Prokaryotic viruses (phages) are the most abundant entities across all habitats and represent a vast reservoir of genetic diversity [1]. Phages mediate horizontal gene transfer and constitute a major selection pressure that shapes the evolution of bacteria [2]. Prokaryotic viruses also affect biogeochemical cycles and ecosystem dynamics by controlling microbial growth rates and releasing the contents of microbial cells into the environment [2,3]. Moreover, phages play a key role in shaping the composition and function of the human microbiome in health and disease [4–6]. Recently, there has been renewed interest in phage therapy and phage-based biocontrol of harmful bacteria [7,8] in medical treatment [9,10] and the food industry [11,12]. Hence, characterizing phage–host interactions is critical to understanding the factors that govern phage infection dynamics and their subsequent ecological consequences [13].

The scope of phage-host interactions is poorly understood, although it has been hypothesized that all prokaryotic organisms fall prey to viral attacks [1]. Methods for studying phage-host interactions primarily rely on cultured virus-host systems; however, recent in silico approaches suggest a much broader range of hosts may be susceptible to viral infections [14]. These methods predict prokaryotic hosts based on sequence composition [15,16], direct sequence similarity between phages and hosts [14], analysis of CRISPR spacers or tRNAs [13,17], as well as supervised approaches that integrate several sequence-based methods [18,19].
57 58 Despite significant progress in phage-host predictions, the classic BLAST [20] algorithm is 59 currently the most effective, unsupervised method for identifying phage-host interactions [14,15]. 60 Depending on the dataset, the tool finds the correct genus level host for 40-60% of phages [14,15]. 61 The task of finding a host for a given phage using BLAST is conceptualized as obtaining the host 62 sequence with the highest similarity to the query phage sequence. However, restricting host 63 predictions to the first top-scored prokaryotic sequence has three limitations. First, the true host 64 may not be the top-scoring match in the BLAST results. Second, selecting a prokaryotic host based 65 on the first sequence assumes that a phage infects a single host. Although phages are generally 66 host-specific, some may infect multiple host species [21,22]. Finally, many distantly-related 67 prokaryotic species may obtain a comparable BLAST score for a query phage due to spurious 68 alignments. These ambiguous host predictions require further manual curation of the taxonomic 69 or phylogenetic relationship between the top-scored prokaryotic species to select the true host(s). 70 71 We have addressed these issues by developing a simple extension to BLAST, named Phirbo, that 72 exploits the information contained in the full BLAST results, rather than its top-ranking matches. 73 Phirbo improved the accuracy of finding hosts, beyond what is found from the best BLAST match, 74 by relating phage and host sequences through intermediate, common reference sequences that are 75 potentially homologous to both phage and host queries. Subsequent quantification of the 76 overlapping signals allows for the reliable prediction of phage-host interactions without the need 77 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. 
for direct comparisons between the phage and host sequences and without any prior knowledge of their phylogenetic or taxonomic context.

RESULTS

Phirbo algorithm overview
This algorithm is based on the assumption that the degree of similarity between phage and host sequences is proportional to the overlap between ranked similarity matches of each sequence to the same reference data set of prokaryotic sequences. Specifically, to compare a pair of phage (P) and host (H) sequences, we first perform two independent BLAST searches against the reference database of prokaryotic genomes (D): one BLAST search for the phage query and the other for the host query (Fig. 1a). The two lists of BLAST results (Fig. 1b), P → D and H → D, contain prokaryotic genomes ordered by decreasing sequence similarity (i.e., bit-score). To avoid a taxonomic bias due to multiple genomes of the same prokaryote species, we rank prokaryotic species according to their first appearance in the BLAST list (Fig. 1c). In this way, both lists represent phage and host profiles consisting of the ranks of top-scoring prokaryotic species.

The properties of these lists (Fig.
1c) closely resemble the outcome of an Internet search and can be characterized by four features: (i) species listed at the top of each ranking are more important (similar) to the query than those listed at the bottom; (ii) the lists may not be conjoint (some species may appear in one ranking but not in the other); (iii) the ranking lists may vary in length (BLAST may return few prokaryotic matches in response to virus sequences in contrast to thousands of matches in cases of multiple-species prokaryotic families); (iv) two or more species from the database may achieve the same BLAST score and, therefore, occupy the same position on the ranking list (Fig. 1c). A recently introduced similarity measure used for comparing the rankings of Web search engine results [23], the Rank-Biased Overlap (RBO), accommodates these four features. The RBO algorithm starts by scoring the overlap between the sub-lists containing the single top-ranked item of each list. It then proceeds by scoring the overlaps between sub-lists formed by the incremental addition of items further down the original lists. Each consecutive iteration has less impact on the final RBO score because the weights follow a geometric progression, which places heavier weights on higher-ranking items and down-weights the contribution of overlaps at lower ranks (see ‘Methods’). An overall RBO score falls between 0 and 1, where 0 signifies that the lists are disjoint (have no items in common) and 1 means the lists are identical in content and order. Our results indicate that the extent of the phage-host relationship can be estimated by the application of an RBO measurement to the ranking lists generated from BLAST results (Fig. 1d).

Phirbo differentiates between interacting and non-interacting phage-host pairs
To assess the discriminatory power of Phirbo to recognize phage-host interactions, we used two published reference data sets: Edwards et al.
(2016) [14], which contains 2,699 complete bacterial genomes and 820 phages with reported hosts, and Galiez et al. (2017) [16], which contains 3,780 complete prokaryotic genomes and 1,420 phage genomes. For each data set, we compared the distribution of Phirbo scores between all known phage-host interaction pairs and the same number of randomly selected non-interacting phage-prokaryote pairs (Fig. 2). The scores obtained by Phirbo in both data sets separated the interacting from non-interacting phage-host pairs more clearly than the BLAST scores. The median Phirbo score across interacting phage-host pairs was nearly 1,500 times greater than for non-interacting pairs, while the median BLAST score was three times higher for interacting pairs than non-interacting pairs (Supplementary Table 1). Both methods, however, differentiated between interacting and non-interacting phage-host pairs with higher accuracy than WIsH, the state-of-the-art alignment-free host prediction tool [16].

To further examine the discriminatory power of Phirbo across all possible phage-prokaryote pairs, we used receiver operating characteristic (ROC) curves (Fig. 2a,b). The area under the ROC curve (AUC), which measures the ability to discriminate between interacting and non-interacting phage-host pairs, was higher for Phirbo (AUC = 0.95) in the Edwards et al. and Galiez et al. data sets than for BLAST (AUC = 0.86) and WIsH (AUC = 0.78-0.79).
An additional advantage of Phirbo was its capacity to score phage-host pairs whose sequence similarity could not be established by a direct BLAST comparison but, instead, through other, ‘intermediate’ prokaryotic sequences that were detectably similar to both phage and host query sequences. For example, BLAST did not provide scores for 20% of the interacting phage-host pairs in the Edwards et al. and Galiez et al. data sets due to alignment score thresholds (Supplementary Table 2). Using the same BLAST lists, Phirbo evaluated 99% of the interacting phage-host pairs. This high coverage indicated that nearly every pair of phage-prokaryote sequences could be related by at least one common prokaryotic sequence detectably similar to both the phage and host sequences.

Phirbo has the highest host prediction performance
To evaluate host prediction performance, we used precision-recall (PR) curves, which provide more reliable information than ROC curves when benchmarking imbalanced data sets in which the non-interacting pairs vastly outnumber the interacting pairs [24,25]. Accordingly, we plotted PR curves for Phirbo, BLAST, and WIsH predictions obtained from the Edwards et al. (Fig. 3a) and Galiez et al. (Fig. 3b) data sets. Overall, Phirbo performed better at host prediction at the species level than BLAST and WIsH, regardless of the data set. The area under the PR curve (AUPR), which summarized overall performance, was higher in Phirbo by 25 percentage points (AUPR = 0.56-0.65) than in BLAST (AUPR = 0.33-0.41). Phirbo also reported the highest F1 score (the harmonic mean of precision and recall [see ‘Methods’]) in the Edwards et al. and Galiez et al. data sets (Fig. 3). Specifically, the precision and recall of Phirbo were 59-65% and 57-64%, respectively, while BLAST had precision and recall in the range of 28-43% (Fig. 3).
Furthermore, Phirbo yielded slightly higher specificity (99.7-99.8%) and accuracy (99.5-99.6%) than BLAST or WIsH.

Phirbo preserves BLAST top-ranked host predictions
We further evaluated the host prediction accuracy of Phirbo by selecting a top-scored prokaryotic sequence for each phage [14–16,18]. Briefly, host prediction accuracy is calculated as the percentage of phages whose predicted hosts have the same taxonomic affiliation as their respective known hosts (if multiple top-scoring hosts are present, the prediction is scored as correct if the true host is among the predicted hosts). Phirbo recovered all hosts predicted by BLAST in the data sets by Edwards et al. and Galiez et al., achieving the same prediction accuracy as BLAST across all taxonomic levels (Table 1). Of note, BLAST found multiple different host species with equal scores for 14 phage genomes. This was observed in phages infecting bacteria from the Enterobacteriaceae family and the Rhodococcus and Bacillus genera. However, Phirbo assigned the highest score to the correct host species (Supplementary Table 3). Additionally, it refined the host prediction for the Cronobacter phage ENT39118 sequence, which BLAST assigned to the Escherichia coli genome. Phirbo revealed Cronobacter sakazakii as the primary host species, as the BLAST list of the Cronobacter phage is more similar in content and order to the BLAST list of C. sakazakii (Phirbo score = 0.50) than to that of E. coli (Phirbo score = 0.48) (Figure S1).
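
The top-score evaluation described above can be illustrated with a minimal Python sketch. It is not taken from the Phirbo code base: the function names `predict_hosts` and `prediction_accuracy`, the dictionary-based inputs, and the genus lookup table are assumptions made purely for illustration. The two scores are the Phirbo scores quoted above for Cronobacter phage ENT39118, and the known host is assumed to be C. sakazakii, as the text implies.

```python
def predict_hosts(scores):
    """Return every host tied for the highest score (empty set if no scores)."""
    if not scores:
        return set()
    top = max(scores.values())
    return {host for host, score in scores.items() if score == top}


def prediction_accuracy(scores_by_phage, known_hosts, taxon_of):
    """Percentage of phages with a correct prediction at one taxonomic level.

    scores_by_phage: {phage: {host: score}}   (hypothetical input layout)
    known_hosts:     {phage: set of known host names}
    taxon_of:        {host: taxon label at the level being evaluated}
    """
    if not scores_by_phage:
        return 0.0
    correct = 0
    for phage, scores in scores_by_phage.items():
        predicted = {taxon_of[h] for h in predict_hosts(scores)}
        known = {taxon_of[h] for h in known_hosts[phage]}
        # With tied top hosts, one matching taxon is enough to count as correct.
        if predicted & known:
            correct += 1
    return 100.0 * correct / len(scores_by_phage)


# Phirbo scores quoted in the text; the known-host set is an assumption.
scores_by_phage = {
    "ENT39118": {"Cronobacter sakazakii": 0.50, "Escherichia coli": 0.48}
}
known_hosts = {"ENT39118": {"Cronobacter sakazakii"}}
genus_of = {
    "Cronobacter sakazakii": "Cronobacter",
    "Escherichia coli": "Escherichia",
}
print(prediction_accuracy(scores_by_phage, known_hosts, genus_of))
```

Passing a species-, genus-, or family-level lookup as `taxon_of` reuses the same function at each taxonomic level.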

As Phirbo links phage to host through common sequences, the content of the sequence database was the main factor defining host prediction quality. Since the similarity between viruses may indicate a common host [18,26], we expanded the two BLAST databases of prokaryotic sequences obtained from Edwards et al. and Galiez et al. with phage sequences (n = 820 and n = 1,420, respectively), and recalculated Phirbo scores between every phage-prokaryote pair. The phage-host linkage through homologous prokaryotic and phage sequences increased the host prediction accuracy of Phirbo at all taxonomic levels, allowing correct identification of hosts at the genus level for 56-63% of phages (Table 1). Specifically, Phirbo corrected BLAST mis-predictions for 55 phage genomes whose sequences showed low similarity to the sequences of their host species. The direct BLAST alignments of these phage sequences with the sequences of their corresponding hosts obtained significantly lower scores than the alignments of the other known phage-host pairs (P = 1.9 × 10^-45, Mann–Whitney U test). Notably, Phirbo also assigned correct host species for 18 phages whose hosts were not reported in the BLAST results, mainly Chlamydia species, Vibrio cholerae, and the opportunistic pathogen Acinetobacter baumannii.

Phirbo is suitable for incomplete phage sequences
We tested the robustness of our host prediction algorithm to fragmentation of the phage sequence. Following earlier studies [15,16,18], phage genomes from the Edwards et al. and Galiez et al. data sets were randomly subsampled to generate contigs of different lengths (20 kb, 10 kb, 5 kb, 3 kb, and 1 kb) with 10 replicates. Host prediction accuracy was calculated as the mean percentage of phages whose predicted hosts had the same taxonomic affiliation as their respective known hosts (Fig. 4).
Although Phirbo achieved the same host prediction accuracy as BLAST across all contig lengths, it had substantially higher overall performance in terms of AUC and AUPR (Figure S2; P < 10^-5, Wilcoxon signed-rank test). Surprisingly, BLAST-based methods obtained higher host prediction accuracy across all contig lengths compared to WIsH, a tool designed to predict the hosts of short viral contigs (Fig. 4).

The host prediction accuracy of Phirbo was also examined using the expanded BLAST database of both prokaryotic and phage full-length sequences. To ensure fairness, for each tested phage contig we removed its corresponding full-length sequence from the BLAST database and recalculated Phirbo scores between the phage contig and every prokaryotic sequence. This approach outperformed BLAST at every contig length across all taxonomic levels in both data sets (Fig. 4). Generally, the host prediction accuracy of Phirbo improved by 5-11 percentage points compared to the BLAST results. For example, when the contig length was 3 kb, the prediction accuracy of Phirbo was 8-11% higher than BLAST at the family level, and 8-17% higher than WIsH (Fig. 4; Supplementary Table 4). Phirbo also achieved the highest AUC and AUPR scores when discriminating between interacting and non-interacting phage-host pairs (Figure S2).

Phirbo uses multiple protein and non-coding RNA signals for host prediction
We investigated the sequence information used by BLAST and Phirbo for host prediction.
For each phage that was correctly assigned to the host species by both tools (n = 485), we calculated the fraction of the phage genome that was included in the segments aligned with prokaryotic sequences (sequence coverage). This analysis revealed that our tool used three times more phage sequence (median sequence coverage: 35%) than BLAST (12%) (Figure S3; P < 10^-15, Wilcoxon signed-rank test). This increased sequence coverage indicates that different genome regions of the phages map to the genomes of prokaryotic species other than the host species. For 214 of the 485 phages, more than half of their genomes were aligned to genomes of their host species (Supplementary Table 5). Such large regions of homology are likely prophages or phage debris left by large-scale recombination events during phage replication. The observed high sequence coverage points to virus taxa known for their temperate lifestyle and frequent recombination with host genomes (i.e., the Siphoviridae family as well as the Peduovirinae and Sepvirinae subfamilies).

To further examine the properties of sequences that may be exchanged between a phage and its host, we selected a population of phages with sequence coverage below 50% (n = 271). These phages, which are less likely to represent complete prophages, belong to 16 viral families (Supplementary Table 6). Next, we re-annotated the genomic sequences of the phages to find putative protein and non-coding RNA (ncRNA) genes. Phage sequence regions used by Phirbo for host predictions were significantly enriched (P < 10^-5) in more than a hundred protein families of known or probable function. In contrast, only half of the protein families were used in BLAST-based host predictions (Supplementary Table 7).
The protein families used by Phirbo covered most of the processes of the viral life cycle including DNA replication, cell lysis, recombination, and packaging of the phage genome (Fig. 5). In contrast to BLAST, Phirbo also exploited the information contained in phage ncRNAs while assigning phages to host genomes. The vast majority of these ncRNAs (>90%) were tRNAs, which showed significant overrepresentation in the phage sequence fragments used by Phirbo (P = 6 × 10^-12) (Supplementary Table 8). The remaining ncRNAs belonged to group I introns (3%), RNAs associated with genes associated with twister and hammerhead ribozymes (1%), skipping-rope RNA motifs (1%), and 12 less abundant RNA families.

Implementation and availability
Predicting hosts from phage sequences using BLAST is accomplished by querying phage sequences against a database of candidate hosts. However, Phirbo also uses information about sequence relatedness among prokaryotic genomes. Therefore, it requires ranked lists of prokaryote species generated by BLAST for the phage and host genomes. The computational cost of querying every host sequence against the database of all candidate hosts using BLAST may still be a limiting factor. However, for mass host searches, the computational cost of all-versus-all host comparisons becomes marginal, as it must be done only once.
After the relatedness among host genomes is established, the time required for Phirbo host predictions is negligibly higher than the time for typical BLAST-based host predictions. For example, running Phirbo between ranked lists of host species for 1,420 phages and 3,860 candidate hosts from Galiez et al. (resulting in ~5.5 million phage-host comparisons) took 8 minutes on a 16-core 2.60 GHz Intel Xeon.

As Phirbo operates on rankings, BLAST can be replaced by an alternative sequence similarity search tool to reduce the time to estimate homologous relationships between host genomes. For instance, Mash [27] computed host relationships in 5 minutes for the Edwards et al. and Galiez et al. data sets (see ‘Methods’). The host prediction performance of Phirbo using BLAST-based rankings for phages and Mash-based rankings for host genomes is high compared to the performance of Phirbo predictions using BLAST rankings for both phage and host genomes (Supplementary Table 9).

We envisage Phirbo as a natural extension to standard BLAST-based host predictions. The Phirbo tool is written in Python and freely available at https://github.com/aziele/phirbo/.

DISCUSSION
The identification of similar sequence regions between host and phage genomes using BLAST has been a baseline for the identification of putative virus-host connections in numerous metagenomic projects [13,28,29]. However, a BLAST search requires regions with significant similarity between the query phage and host [14–16]. Yet, many phage and host sequences lack sufficient similarity and escape detection with standard BLAST searches. To tackle this issue, alignment-free tools have been developed to predict hosts from phage sequences [14–16,30]. The rationale behind these tools is based on the observation that viruses tend to share similar patterns in codon usage or short sequence fragments with their hosts [14–16].
As virus replication is dependent on the translational machinery of its host, some phages adapt their codon usage to match the availability of tRNAs during viral replication in the host cell [31–33]. Similar oligonucleotide frequency use may be driven by evolutionary pressure on the virus to avoid recognition by host restriction enzymes and CRISPR/Cas defense systems [32,34]. Although state-of-the-art alignment-free tools (i.e., WIsH [16] and VirusHostMatcher [15]) can rapidly assess sequence similarity between any pair of phage and prokaryote sequences, they are less accurate for host prediction than BLAST [14,15]. The relatively high accuracy of BLAST suggests that localized similarities of genetic material may be a stronger indication of phage-host interactions than global convergence of their genomic composition. This evidence comes in the form of protein-coding DNA fragments and non-coding RNAs. The latter group is dominated by tRNA genes, which are strongly over-represented in direct BLAST alignments between phages and their hosts, and are even more prevalent among indirect connections used by Phirbo. This may be important, as previous studies have shown that not all phage tRNA genes come directly from their hosts. Some appear to be derived from genomes of other, often distantly related, bacteria and may be the result of earlier evolutionary events [35]. For protein-coding genes, a more diverse picture emerges.
Proteins rich in phage-host BLAST alignments can be assigned into different functional categories including phage virion components, replication-related proteins, regulatory factors, and proteins involved in the metabolism of the host. The transfer of some over-represented families in phages and/or prophages has been previously reported (e.g., lytic proteins, DNA replication and recombination proteins, and enzymes involved in nucleotide and energy metabolisms [36]) and some of these genes are connected with the phage-host range [37,38]. However, no clear pattern emerges after analyzing the functions of the remaining, over-represented proteins.

In this study, we attempted to expand the information content of a single local alignment of phage and host sequences by incorporating the results of multiple local alignments between a phage sequence and different prokaryotic genomes. This approach may more closely resemble a manual assignment of phage-host pairs, where an expert analyst not only considers a top-ranked matching prokaryote in the BLAST results, but also uses the information contained in other, less significant, matches and their sequence and taxonomic similarity. Through a taxonomically-aware stratification scheme, this approach tracks the multilateral dynamics of horizontal gene transfer. Therefore, we propose to relate phage and host sequences through multiple intermediate sequences that are detectably similar to both the phage and host sequences. By linking phage and host sequences through similar sequences, Phirbo achieved a more comprehensive list of phage-host interactions than BLAST. Simultaneously, Phirbo was capable of assessing almost all phage-host pairs, bringing the method closer to alignment-free tools, which compute scores between all possible phage and host pairs.
Thus, our approach can be directly applied to different phage and prokaryote data sets without training or optimizing the underlying RBO algorithm. We intentionally avoided machine learning components in Phirbo to ensure the general applicability of the approach and avoid possible overfitting.

Our results show that expanding the information obtained from plain similarity comparisons by incorporating taxonomically-grounded measurements of phage-host similarity leads to improved accuracy of phage-host predictions. The Phirbo method provides the phage research community with an easy-to-use tool for predicting the host genus and species of query phages, which can be used when searching for phages with appropriate host specificity and for correlating phages and hosts in ecological and metagenomic studies.

METHODS

Virus and prokaryotic host data sets
The data sets analyzed in this study were retrieved from two previously published phage-host studies [14,16]. The first set (Edwards et al. 2016 [14]) contained 2,699 complete bacterial genomes obtained from NCBI RefSeq and 820 RefSeq genomes of phages for which the host was reported. The data set encompassed 16,757 known virus-host interaction pairs and 2,196,424 pairs for which interaction was not reported (non-interacting phage-host pairs). The second data set (Galiez et al.
2017 [16]) contained 3,780 complete prokaryotic genomes of the KEGG database and 1,420 phages for which host species were reported in the RefSeq Virus database. The data set consisted of 26,024 interacting and 5,341,576 non-interacting virus-host pairs.

Phirbo score
The interaction score for a given phage-host pair was calculated using the RBO metric. RBO [23] is a measurement of rank similarity that compares two lists of different lengths (giving more attention to high ranks on the lists). RBO ranges from 0 to 1, where a greater value indicates greater similarity between lists. Equation 1 was used for the calculation of the RBO value between two ranking lists, S and T.

\[
\mathrm{RBO}(S, T, p) = (1 - p) \sum_{d=1}^{n} p^{d-1} A(S, T, d) \tag{1}
\]

where the parameter p (0 < p < 1) determines how steeply the weight declines (the smaller the p, the more heavily the top results are weighted). As p approaches 0, only the top-ranked item is considered, and the RBO score is either zero or one. In this study, we set p to 0.75, which assigned ~98% of the weight to the first 10 hosts. A(S, T, d) is the value of overlap between the two ranking lists, S and T, up to rank d, calculated by Eq. 2. n is the number of distinct ranks on the ranking list.

\[
A(S, T, d) = \frac{|S_{:d} \cap T_{:d}|}{|S_{:d} \cup T_{:d}|} \tag{2}
\]

where S_{:d} and T_{:d} represent the elements present in the first d ranks of lists S and T, respectively.
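
Equations 1 and 2 can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not the released Phirbo implementation. The function names `rank_species` and `rbo`, the representation of a ranking as a list of sets (one set per rank, so that tied species share a rank), and the choice of n as the larger number of ranks of the two lists are all assumptions. Because the sum is truncated at n, two identical lists with n ranks score 1 - p^n rather than exactly 1 in this sketch.

```python
def rank_species(hits):
    """Convert BLAST hits into a ranking of species with ties.

    hits: iterable of (species, bit_score) pairs (hypothetical input layout).
    Each species keeps its best bit-score; species with equal best scores
    share one rank. Returns a list of sets, best rank first.
    """
    best = {}
    for species, bits in hits:
        if species not in best or bits > best[species]:
            best[species] = bits
    by_score = {}
    for species, bits in best.items():
        by_score.setdefault(bits, set()).add(species)
    return [by_score[b] for b in sorted(by_score, reverse=True)]


def rbo(s, t, p=0.75):
    """Truncated rank-biased overlap (Eq. 1) between two tied rankings."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    n = max(len(s), len(t))  # assumption: iterate over the longer list
    seen_s, seen_t = set(), set()
    total = 0.0
    for d in range(1, n + 1):
        if d <= len(s):
            seen_s |= s[d - 1]
        if d <= len(t):
            seen_t |= t[d - 1]
        union = seen_s | seen_t
        # Eq. 2: shared species divided by all species seen up to rank d.
        agreement = len(seen_s & seen_t) / len(union) if union else 0.0
        total += p ** (d - 1) * agreement
    return (1 - p) * total


# Toy rankings; the bit-scores are made up for illustration.
phage = rank_species([("Y. rohdei", 100), ("Y. ruckeri", 100), ("E. coli", 90)])
host = rank_species([("E. coli", 500), ("S. dysenteriae", 300)])
print(rbo(phage, host))
```

Representing each rank as a set keeps the handling of ties explicit: tied species enter the prefixes S_{:d} and T_{:d} at the same depth d.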

Host prediction tools
The host prediction tools BLAST [20], WIsH [16], and Phirbo were run separately on the Edwards et al. and Galiez et al. data sets. For each tool, sequence similarity scores were calculated across all combinations of phage-host pairs. BLAST 2.7.1+ [39] was run with default parameters (task: blastn, e-value threshold = 10) to query each phage sequence against a database of candidate host genomes. For every phage-host pair, the highest bit-score among its BLAST alignments was reported (for phage-host pairs that were absent in the BLAST results, a bit-score of 0 was assigned). For RBO host prediction, an additional BLAST search was performed to establish ranked lists of genetically similar host genomes. Specifically, a nucleotide BLAST was run with default parameters to query each host sequence against a database of candidate host genomes. As an alternative to BLAST, Mash 2.1 [27] was used with default parameters (k-mer size = 21, sketch size = 1,000) to establish ranked lists for each host by comparing its sequence against the database of candidate host genomes. RBO scores were calculated between all pairwise combinations of phage and host ranking lists. WIsH 1.0 [16] was used with default parameters to calculate log-likelihood scores between all pairwise combinations of phage-host sequences.

Evaluation metrics
The metrics of host prediction performance were calculated using sklearn (i.e., AUC, AUPR, recall, precision, specificity, and accuracy) [40]. The score thresholds used to calculate recall, precision, specificity, and accuracy were chosen to maximize the F1 score, an accuracy metric defined as the harmonic mean of precision and recall.
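
The F1-based threshold selection can be sketched as follows. The study computed its metrics with sklearn; the version below uses only the Python standard library so that it is self-contained. The function name `best_f1_threshold` and the rule that a pair is predicted as interacting when its score is greater than or equal to the threshold are assumptions made for illustration.

```python
def best_f1_threshold(labels, scores):
    """Return (threshold, f1) for the observed score that maximizes F1.

    labels: 1 for interacting pairs, 0 for non-interacting pairs.
    scores: one prediction score per pair (e.g., an RBO or bit-score value).
    A pair is called positive when score >= threshold (assumed convention).
    """
    total_pos = sum(labels)
    best_thr, best_f1 = None, 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        # thr is an observed score, so at least one pair is called positive.
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


# Toy example with made-up scores for four phage-host pairs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(best_f1_threshold(labels, scores))
```

This loop is quadratic in the number of pairs and is meant only to make the definition concrete. For data sets with millions of pairs, a vectorized routine such as sklearn's `precision_recall_curve` is the practical choice.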
Host prediction accuracy was evaluated analogously to previous studies [14,16,18]. Specifically, for each query phage, the host with the highest score to the query virus was selected as the predicted host. In cases where multiple hosts were predicted, the prediction was scored as correct if the correct host was among the predictions. The prediction accuracy was calculated at each taxonomic level as the percentage of viruses whose predicted hosts shared a taxonomic affiliation with known hosts.

Phage genome annotation
To define phage genes potentially exchanged between phage and host genomes, we re-annotated 485 phage genomes that were correctly assigned to host species by both Phirbo and BLAST. The genes were classified into predefined pVOGs (prokaryotic Virus Orthologous Groups) [41] and RNA families [42]. Briefly, open reading frames (ORFs) in the analyzed 485 phage genomes were identified using Transeq from EMBOSS [43]. The ORFs were then assigned to the respective orthologue group by HMMsearch (e-value < 10^-5) against the database of Hidden Markov Models (HMMs) created for each of the 9,518 pVOG alignments using HMMbuild of HMMER v3.3.1 [44]. Non-coding RNAs (ncRNAs) were predicted in the phage genomes (e-value < 10^-5) using Rfam covariance models v14.3 [42] and the Infernal tool v1.1.3 [45]. We counted the number of times each pVOG and Rfam term was present in phage sequences used by BLAST and Phirbo during host prediction. To determine whether the observed level of pVOG/Rfam counts was significant
within the context of all the terms within the phage genome, we calculated the p-value using the hypergeometric distribution implemented in SciPy [46].

ACKNOWLEDGMENTS
We thank Bas Dutilh, Rob Edwards, Clovis Galiez, and Johannes Söding for providing us with the benchmark data sets used in their studies. We likewise acknowledge William Webber for assistance with modifying the RBO formula to account for tied ranks. The computations were performed at the Poznan Supercomputing and Networking Center.

AUTHOR CONTRIBUTIONS
AZ conceived the project and designed the experiments. AZ and JB wrote Phirbo and tested its performance. WMK provided the conceptual framework for sequence comparisons through intermediate sequences and reviewed the software and manuscript. AZ and JB analyzed the results and wrote the paper. All authors read and approved the final manuscript.

FIGURE LEGENDS

Figure 1. Calculation of the interaction score between phage and host sequences. a. The BLAST search of phage and prokaryote sequences against a reference dataset results in b. two BLAST lists containing prokaryote matches ordered by decreasing similarity (i.e., bit-score). c. BLAST lists were converted into rankings of prokaryote species. The ranked lists differ in content: Yersinia rohdei and Y.
ruckeri are present in the first ranking list but absent in the second list, while Shigella dysenteriae and Erwinia toletana are only present in the second list. Two species, Y. rohdei and Y. ruckeri, from the first BLAST search have the same scores and are consequently tied for the same rank. d. An interaction score was calculated between two ranking lists using rank-biased overlap.

Figure 2. Discriminatory power of Phirbo, BLAST, and WIsH scores to differentiate between interacting and non-interacting phage-host pairs. Phage-host pairs were obtained from a. Edwards et al. and b. Galiez et al. data sets. Box plots show the distribution of scores for all interacting phage-host pairs (n = 16,757 and n = 26,024 in Edwards et al. and Galiez et al., respectively) and the same number of randomly selected, non-interacting phage-host pairs. The horizontal line in each box displays the median; boxes display the first and third quartiles; whiskers depict lowest and highest non-outlier scores (details of distributions including outliers are provided in Supplementary Table 1). Receiver operating characteristic curves and the corresponding area under the curve (AUC) display the classification accuracy of phage-host predictions across all possible phage-host pairs. Dashed lines represent the levels of discrimination expected by chance.

Figure 3. Host prediction performance of Phirbo, BLAST, and WIsH. The performance is provided by Precision-Recall (PR) curves and statistical measures (i.e., F1 score, precision, recall, specificity, and accuracy) separately for a. Edwards et al. and b. Galiez et al. data sets. Dashed lines in the PR-curve plots represent the levels of discrimination expected by chance. Score cut-offs for each tool were set to ensure the highest F1 score.

Figure 4. Host prediction accuracy over phage contig length. Prediction accuracy is provided separately for a. Edwards et al. and b. Galiez et al. data sets. Each complete virus genome was randomly subsampled 10 times for different sequence lengths (i.e., 20 kb, 10 kb, 5 kb, 3 kb, and 1 kb). Hosts were predicted on each subsampling replicate by selecting a prokaryotic sequence with the highest similarity to the query viral sequence. Points indicate the average of the resulting accuracies for all the viruses at a given subsampling length and host taxonomic level (i.e., species, genus, and family). An extended version of this figure containing host prediction accuracy values is provided in Supplementary Table 4.

Figure 5. Functional classification of phage coding sequences used by Phirbo for host prediction. Protein families (pVOGs) were classified into 15 functions related to phage-cycle (e.g., DNA replication, transcription). Numbers in the dark circles indicate the number of different pVOGs related to a given function. An extended version of this figure containing the list of pVOGs is provided in Supplementary Table 7.

TABLES

Table 1.
Host prediction accuracies (%) for phage and host genomes from the data sets by Edwards et al. [14] and Galiez et al. [16].

Dataset                Method             Species  Genus  Family  Order  Class  Phylum
Edwards et al. (2016)  WIsH               28       44     50      53     62     70
                       BLAST              43       59     71      78     87     96
                       Phirbo*            43       59     71      78     87     95
                       Phirbo (+phages)†  48       63     75      82     90     97
Galiez et al. (2017)   WIsH               21       44     48      53     68     77
                       BLAST              31       53     62      68     88     95
                       Phirbo*            31       53     62      68     88     95
                       Phirbo (+phages)†  35       56     65      72     90     96

The highest accuracies among the methods for each taxonomic level are in bold.
* Interaction scores were calculated using rank-biased overlap (RBO) between BLAST lists containing prokaryotic sequences. Specifically, the BLAST database contained 2,699 sequences of bacterial genomes in the Edwards et al. data set, and 3,780 sequences of bacterial and archaeal genomes in the Galiez et al. data set.
† Interaction scores were calculated using RBO between BLAST lists containing both prokaryotic and phage sequences.

SUPPLEMENTARY FIGURES

Supplementary Figure 1. Host predictions for Cronobacter phage ENT39118 (RefSeq accession: NC_019934) using a. BLAST and b. Phirbo. Querying the Cronobacter phage sequence with a BLAST search against the host database returned the genomic sequence of Escherichia coli (NC_017641) as the best match (bit-score = 14,588), and Cronobacter sakazakii (NC_009778) as the second-best match (bit-score = 14,020).
Phirbo predicted Cronobacter sakazakii as the top-scoring host for the Cronobacter phage due to the highest extent of overlap between the top-ranking BLAST matches of each sequence (NC_019934 and NC_009778) of the same database. For clarity, only the first ten BLAST matches are shown.

Supplementary Figure 2. Host prediction performance of Phirbo, BLAST and WIsH over phage contig length in terms of a. Area under the curve (AUC) and b. Area under the precision-recall curve (AUPR). Bars indicate the AUC or AUPR averaged across 10 replicates at a given subsampling length of phage sequence.

Supplementary Figure 3. Scatter plot of the phage sequence coverage used in host predictions of Phirbo versus that of BLAST. Each dot represents a phage genome.

SUPPLEMENTARY TABLES

Supplementary Table 1. Distribution of Phirbo, BLAST and WIsH scores among interacting and non-interacting phage-host pairs obtained from Edwards et al. and Galiez et al. data sets. Score ranges were summarized separately for 16,757 interacting and non-interacting phage-host pairs from Edwards et al., and 26,024 interacting and non-interacting phage-host pairs from Galiez et al.

Supplementary Table 2. Number of phage-host pairs evaluated by Phirbo, BLAST, and WIsH in Edwards et al. and Galiez et al. data sets.

Supplementary Table 3. Phages assigned by BLAST to multiple, equally-scored host species. Phirbo differentiated between host species and provided the highest score to primary host species.

Supplementary Table 4. Host prediction accuracy of Phirbo, BLAST, and WIsH over phage contig length.

Supplementary Table 5. Phage sequence coverage of 485 phages correctly assigned by BLAST and Phirbo to their host species. Sequence coverage was calculated for each phage as the sum of the lengths of its non-overlapping high scoring pairs to the genome of the correct host species, divided by the size of the query-phage genome. Prophages were assumed to have sequence coverage greater than or equal to 50%.

Supplementary Table 6. Summary of taxonomic affiliations of 271 phages that had sequence coverage < 50% with the host species genomes.

Supplementary Table 7. Protein families present in sequence regions of 271 phage genomes that were used by BLAST and/or Phirbo in host prediction. The table provides information on each protein family (prokaryotic Virus Orthologous Group (pVOG)) used by BLAST and Phirbo, including: (i) pVOG description and functional assignment (manually curated), (ii) pVOG count (number of times a given pVOG was present in the phage genome, as well as in sequences used by BLAST or Phirbo), (iii) pVOG percentage (pVOG count divided by pVOG count in the genome), and (iv) P-value of pVOG enrichment.

Supplementary Table 8. RNA families present in sequence regions of 271 phage genomes that were used by BLAST and Phirbo in host prediction. The table provides information on each Rfam family used by BLAST and Phirbo.

Supplementary Table 9. Comparison of Phirbo’s host prediction performance between BLAST-based and Mash-based rankings of prokaryotic species.

REFERENCES

1. Suttle CA. Marine viruses--major players in the global ecosystem. Nat Rev Microbiol. 2007;5: 801–812.
2. Breitbart M, Bonnain C, Malki K, Sawaya NA. Phage puppet masters of the marine microbial realm. Nat Microbiol. 2018;3: 754–766.
3. Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature. 2016;537: 689–693.
4. Norman JM, Handley SA, Baldridge MT, Droit L, Liu CY, Keller BC, et al. Disease-specific alterations in the enteric virome in inflammatory bowel disease. Cell. 2015;160: 447–460.
5. Manrique P, Bolduc B, Walk ST, van der Oost J, de Vos WM, Young MJ. Healthy human gut phageome. Proc Natl Acad Sci U S A. 2016;113: 10400–10405.
6. Meyer JR. Sticky bacteriophage protect animal cells. Proc Natl Acad Sci U S A. 2013. pp. 10475–10476.
7. Reardon S. Phage therapy gets revitalized. Nature. 2014;510: 15–16.
8. Salmond GPC, Fineran PC. A century of the phage: past, present and future. Nat Rev Microbiol. 2015;13: 777–786.
9. Svoboda E. Bacteria-eating viruses could provide a route to stability in cystic fibrosis. Nature. 2020;583: S8–S9.
10. Dedrick RM, Guerrero-Bustamante CA, Garlena RA, Russell DA, Ford K, Harris K, et al. Engineered bacteriophages for treatment of a patient with a disseminated drug-resistant Mycobacterium abscessus. Nat Med. 2019;25: 730–733.
11. Samson JE, Moineau S. Bacteriophages in food fermentations: new frontiers in a continuous arms race. Annu Rev Food Sci Technol. 2013;4: 347–368.
12. Sulakvelidze A. Using lytic bacteriophages to eliminate or significantly reduce contamination of food by foodborne bacterial pathogens. J Sci Food Agric. 2013;93: 3137–3146.
13. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. Uncovering earth’s virome. Nature. 2016;536: 425–430.
14. Edwards RA, McNair K, Faust K, Raes J, Dutilh BE. Computational approaches to predict bacteriophage–host relationships. FEMS Microbiol Rev. 2016;40: 258–272.
15. Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. Alignment-free d_2^* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 2017;45: 39–53.
16. Galiez C, Siebert M, Enault F, Vincent J, Söding J. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics. 2017;33: 3113–3114.
17. Andersson AF, Banfield JF. Virus population dynamics and acquired virus resistance in natural microbial communities. Science. 2008;320: 1047–1050.
18. Wang W, Ren J, Tang K, Dart E, Ignacio-Espinoza JC, Fuhrman JA, et al. A network-based integrated framework for predicting virus-prokaryote interactions. NAR Genom Bioinform. 2020;2: lqaa044.
19. Zhang M, Yang L, Ren J, Ahlgren NA, Fuhrman JA, Sun F. Prediction of virus-host infectious association by supervised learning methods. BMC Bioinformatics. 2017;18: 60.
20. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25: 3389–3402.
21. Lima-Mendez G, Faust K, Henry N, Decelle J, Colin S, Carcillo F, et al. Ocean plankton. Determinants of community structure in the global plankton interactome. Science. 2015;348: 1262073.
22. Flores CO, Meyer JR, Valverde S, Farr L, Weitz JS. Statistical structure of host-phage interactions. Proc Natl Acad Sci U S A. 2011;108: E288-97.
23. Webber W, Moffat A, Zobel J. A similarity measure for indefinite rankings. ACM Trans Inf Syst. 2010;28: 1–38.
24. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: e0118432.
25. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning - ICML ’06. New York, New York, USA: ACM Press; 2006. doi:10.1145/1143844.1143874
26. Villarroel J, Kleinheinz KA, Jurtz VI, Zschach H, Lund O, Nielsen M, et al. HostPhinder: A phage host prediction tool. Viruses. 2016;8. doi:10.3390/v8050116
27. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17. doi:10.1186/s13059-016-0997-x
28. Gao NL, Zhang C, Zhang Z, Hu S, Lercher MJ, Zhao X-M, et al. MVP: a microbe–phage interaction database. Nucleic Acids Res. 2018;46: D700–D707.
29. Paez-Espino D, Roux S, Chen I-MA, Palaniappan K, Ratner A, Chu K, et al. IMG/VR v.2.0: an integrated data management and analysis system for cultivated and environmental viral genomes. Nucleic Acids Res. 2019;47: D678–D686.
30. Roux S, Hallam SJ, Woyke T, Sullivan MB. Viral dark matter and virus-host interactions resolved from publicly available microbial genomes. Elife. 2015;4. doi:10.7554/eLife.08490
31. Lawrence JG, Ochman H. Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol. 1997;44: 383–397.
32. Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics. 2006;7: 8.
33. Carbone A. Codon bias is a major factor explaining phage evolution in translationally biased hosts. J Mol Evol. 2008;66: 210–223.
34. Sharp PM, Rogers MS, McConnell DJ. Selection pressures on codon usage in the complete genome of bacteriophage T7. J Mol Evol. 1984;21: 150–160.
35. Morgado S, Vicente AC. Global in-silico scenario of tRNA genes and their organization in virus genomes. Viruses. 2019;11: 180.
36. Sousa JAM de, Pfeifer E, Touchon M, Rocha EPC. Genome diversification via genetic exchanges between temperate and virulent bacteriophages. bioRxiv. 2020. doi:10.1101/2020.04.14.041137
37. Shapiro JW, Putonti C. Gene co-occurrence networks reflect bacteriophage ecology and evolution. MBio. 2018;9. doi:10.1128/mbio.01870-17
38. Hernandes Coutinho F, Zaragosa-Solas A, López-Pérez M, Barylski J, Zielezinski A, Dutilh BE, et al. RaFAH: A superior method for virus-host prediction. bioRxiv. 2020. doi:10.1101/2020.09.25.313155
39. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421.
40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.
41. Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res. 2017;45: D491–D498.
42. Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 2020. doi:10.1093/nar/gkaa1047
43. Rice P, Longden I, Bleasby A. EMBOSS: The European molecular biology open software suite. Trends Genet. 2000;16: 276–277.
44. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39: W29-37.
45. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29: 2933–2935.
46. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17: 261–272.

[Figure 1. Panels a to d: a phage DNA sequence (P) and a host DNA sequence (H) are each searched with BLAST against a reference prokaryote DNA database (D); the scored matches are ranked by species, and the two rankings are compared, giving Rank-Biased Overlap (RBO) = 0.76 in the example shown.]
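
The interaction score illustrated in Figure 1 is a rank-biased overlap (RBO) between two rankings of prokaryote species. The sketch below shows only the basic truncated RBO sum described by Webber et al. [23]; it is not the Phirbo implementation. It assumes rankings without ties or repeated items (the authors use a modified formula for tied ranks), the function name `rbo_truncated` is hypothetical, and the parameter value p = 0.5 in the usage lines is purely illustrative.

```python
def rbo_truncated(ranking_a, ranking_b, p):
    """Truncated rank-biased overlap of two rankings without ties.

    Computes (1 - p) * sum_{d=1..k} p**(d - 1) * A_d, where A_d is the
    number of items shared by the two top-d prefixes divided by d, and k
    is the length of the longer ranking. Because the infinite sum is cut
    off at depth k, the result is a lower bound on the full RBO.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    depth = max(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        # Extend each prefix by one item while the ranking still has items.
        if d <= len(ranking_a):
            seen_a.add(ranking_a[d - 1])
        if d <= len(ranking_b):
            seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += p ** (d - 1) * agreement
    return (1 - p) * score


# Illustrative rankings (hypothetical labels, not data from the study).
phage_ranking = ["sp_A", "sp_B", "sp_C"]
host_ranking = ["sp_B", "sp_A", "sp_D"]
print(rbo_truncated(phage_ranking, host_ranking, p=0.5))
```

Smaller values of p concentrate the weight on the top of the rankings, so disagreement near the top lowers the score more than disagreement further down.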

[Figure 2. Panels a and b: box plots of similarity score for interacting and non-interacting phage-host pairs (Phirbo, BLAST, WIsH), and ROC curves (True Positive Rate against False Positive Rate) annotated with AUC values.]

[Figure 3. Panels a and b: precision-recall curves (Precision against Recall) annotated with AUPR values for Phirbo, BLAST, and WIsH, each with a table of F1 score, recall, precision, specificity, accuracy, and score cut-off per tool.]

[Figure 4. Panels a and b: prediction accuracy (%) against sequence length (1, 3, 5, 10, and 20 kb) at the species, genus, and family levels for Phirbo (+phages), BLAST / Phirbo, and WIsH.]

[Figure 5. Schematic of a phage with numbered circles giving pVOG counts; labels: capsid head, collar, tail, baseplate, fiber, spike, full phage assembly, amino acid metabolism, DNA replication, genome packaging, transcription, cell lysis, host defence systems, energy metabolism, nucleotide metabolism, bacterial chromosome, integration / recombination, antibiotic resistance, other functions.]

10_1101-2021_01_05_425508 ---- Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes

Yuki Iwasaki1, Takashi Abe2, Toshimichi Ikemura1
1. Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Shiga, Japan
2. Graduate School of Science and Technology, Niigata University, Niigata, Japan

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.05.425508; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract
Background
When a virus that has grown in a nonhuman host starts an epidemic in the human population, human cells may not provide growth conditions ideal for the virus. Therefore, the invasion of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which is usually prevalent in the bat population, into the human population is thought to have necessitated changes in the viral genome for efficient growth in the new environment. In the present study, to understand host-dependent changes in coronavirus genomes, we focused on the mono- and oligonucleotide compositions of SARS-CoV-2 genomes and investigated how these compositions changed time-dependently in the human cellular environment. We also compared the oligonucleotide compositions of SARS-CoV-2 and other coronaviruses prevalent in humans or bats to investigate the causes of changes in the host environment.
Results
Time-series analyses of changes in the nucleotide compositions of SARS-CoV-2 genomes revealed a group of mono- and oligonucleotides whose compositions changed in a common direction for all clades, even though viruses belonging to different clades should evolve independently. Interestingly, the compositions of these oligonucleotides changed towards those of coronaviruses that have been prevalent in humans for a long period and away from those of bat coronaviruses.
Conclusions
Clade-independent, time-dependent changes are thought to have biological significance and should relate to viral adaptation to a new host environment, providing important clues for understanding viral host adaptation mechanisms.

Keywords
“COVID-19”, “SARS-CoV-2”, “Oligonucleotide composition”, “Time-series analysis”, “Big data”, “Zoonotic virus”, “RNA virus”, “Viral adaptation”, “Coronavirus”

Background
Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), an RNA virus belonging to the betacoronavirus genus, began to spread in the human population in 2019. This viral strain is believed to have been originally prevalent in bats and transferred to the human population through intermediate hosts [1]. Viral growth requires a wide variety of host factors (nucleotide pools, proteins, RNA, etc.), and the virus must evade the diverse antiviral mechanisms of host cells (antibodies, killer T cells, interferon, RNA interference, etc.) [2-4]. Since ancestral SARS-CoV-2 strains are thought to be endemic in bats, they should be well adapted to their host environment; when the virus invades the human population, human cells may not provide growth conditions ideal for the virus. For efficient growth and rapid spread of the infection, changes in the viral genome should be required. Analyses of time-dependent changes in SARS-CoV-2 in the human population can be used to characterize how and why viral genomes change to adapt to a new host environment.

Due to the great threat of COVID-19 and the remarkable development of sequencing technology, a massive number of SARS-CoV-2 genome sequences are available in databases, even though the epidemic has lasted for only approximately 10 months. These sequence data have provided a wide range of insights into SARS-CoV-2 [5,6]. Phylogenetic methods based on sequence alignment have been widely used in molecular evolution studies [7,8], and these methods are well refined and essential for studying phylogenetic relationships between different viral species and variations in the same viral species at the single-nucleotide level. However, when dealing with a massive number of genome sequences, methods based on sequence alignment become problematic because they require a large amount of computational resources.

We have continued to develop sequence alignment-free methods focused on the oligonucleotide compositions of genome sequences [9-12]. Notably, oligonucleotide composition varies widely among species, including viruses, and is designated a genome signature [13]. These compositions can be treated as numerical data, and a massive amount of sequence data can easily be subjected to various statistical analyses. Furthermore, even genomic fragments without orthologous and/or paralogous pairs can be compared [11,12,14-17]. Specifically, our previous work on influenza A-type virus genomes found that the oligonucleotide compositions of the viral genomes differed between hosts (e.g., humans and birds), even for viruses within the same subtype (e.g., H1N1 and H3N2 of type A) [11,12,14]; we also examined changes in the oligonucleotide compositions of influenza H1N1/09, which has been epidemic in humans since 2009, and found that their compositions changed to approach those of the seasonal flu strains H1N1 and H3N2 [11]. Furthermore, although epidemics of the H1N1 and H3N2 strains began several decades apart, these strains showed highly similar chronological changes from the start of these epidemics. These evolutionary yet reproducible changes suggest that mutations to adapt to a new host environment inevitably accumulate when the host species of a virus changes, and these changes can be efficiently detected by analyzing oligonucleotide compositions.

Several groups, including ours, have examined changes in SARS-CoV-2 genomes during the early stages of the SARS-CoV-2 epidemic and found clear directional changes in a group of mono- and oligonucleotides detectable on even a monthly basis [15,18,19]. These directional changes will allow us to predict changes in the near future. Notably, near-future prediction and verification should be the most direct ways to test the reliability of the obtained results, models and ideas (e.g., those discovered for influenza viruses), providing a new paradigm for molecular evolutionary studies. In this context, the present study analyzed the genome sequences of over seventy thousand SARS-CoV-2 strains isolated from December 2019 to September 2020.

Results
Directional changes in the mononucleotide compositions (%) of SARS-CoV-2
For fast-evolving RNA viruses, diversity within the viral population arises rapidly as the epidemic progresses and subpopulation structure forms; the GISAID consortium has defined at least seven main clades (G, GH, GR, L, V, S and Others). Notably, the elementary processes of molecular evolution are based on random mutations, and strains belonging to different clades are thought to have evolved independently.
Therefore, the observation of highly similar time-dependent changes independent of clade is likely to be biologically meaningful, and such changes may be inevitable for efficient growth in human cells. From this perspective, we first examined time-dependent changes in the mononucleotide compositions (%) of SARS-CoV-2 strains isolated from December 2019 to September 2020.

Among the seven clades (G, GH, GR, L, V, S and Others) reported by the GISAID consortium, we used six clades (G, GH, GR, L, V and S), excluding Others, in the analysis. For the time-series analysis, we calculated the average mononucleotide compositions (%) of the genomes in each clade collected monthly; in Fig. 1A, the mononucleotide composition of each clade is shown as a colored line, while that for the monthly collected genomes belonging to all clades is shown as a dashed line.

Regardless of clade, the composition of C decreased, while that of U increased in a time-dependent manner, but the changes in A and G composition were less clear (Fig. 1A). Correlation coefficients between the mononucleotide composition and the number of months elapsed since the start of the epidemic showed a high negative correlation for C and a high positive correlation for U for all clades, but there was no clear directionality for A and G (Fig. 1A and Tables 1, 2).
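The calculation described above can be illustrated with a short Python sketch that computes the percentage composition of a mono- or oligonucleotide in each genome, averages it per month, and correlates the monthly averages with the elapsed month. This is a minimal sketch and not the pipeline used in this study: the function names (`composition`, `monthly_means`, `pearson`), the use of overlapping windows, the handling of ambiguous bases and the toy sequences are all assumptions made for illustration; only the threshold of 10 strains per month is taken from the Methods.

```python
from collections import Counter
from math import sqrt


def composition(seq, k=1):
    """Return the percentage (%) of each k-mer among all overlapping k-mers.

    Assumption: DNA-style T is read as U, and windows containing
    ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in windows if set(w) <= set("ACGU"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: 100.0 * n / total for w, n in counts.items()}


def monthly_means(genomes_by_month, kmer, min_strains=10):
    """Average composition (%) of `kmer` per elapsed month.

    Months with fewer than `min_strains` genomes are skipped,
    following the rule stated in the Methods.
    """
    months, means = [], []
    for month in sorted(genomes_by_month):
        genomes = genomes_by_month[month]
        if len(genomes) < min_strains:
            continue
        values = [composition(g, len(kmer)).get(kmer, 0.0) for g in genomes]
        months.append(month)
        means.append(sum(values) / len(values))
    return months, means


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for constant data
    return cov / (sx * sy)


# Toy data (hypothetical, not real genomes): C falls and U rises over time.
toy_genomes = {
    0: ["ACGUACGUCC"],
    1: ["ACGUACGUCU"],
    2: ["ACGUACGUUU"],
}

for base in "ACGU":
    months, means = monthly_means(toy_genomes, base, min_strains=1)
    print(base, means, pearson(months, means))
```

In practice, the same three functions would be applied separately to the genomes of each clade, with `kmer` set to a di- or trinucleotide for the oligonucleotide analysis.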
These results indicate that the mononucleotide composition of this virus may be prone to biased mutations that reduce C and increase U, or that the mutated strains tend to be more favorable for growth in human cells.

Directional changes in short oligonucleotide compositions

Oligonucleotides are known to act as functional motifs, such as binding sites for a wide variety of proteins and target sites for RNA modifications. Therefore, directional changes in some oligonucleotides independent of clade may relate to certain processes for adaptation to the new host environment. Our previous work on influenza A viruses found that their oligonucleotide compositions varied among prevalent hosts [11,12]; notably, although influenza viruses isolated from humans tended to prefer A and U (but not G and C) more than viruses isolated from birds, the human viruses showed a preference for GGCG and GGGG, which are G- or C-rich. Importantly, there are various examples of oligonucleotides whose changes in composition cannot be explained by changes in mononucleotide composition alone, and these changes may relate to the molecular mechanisms of viral adaptation to a new host.

From this perspective, we next analyzed time-dependent changes in di- and trinucleotide compositions and found that a group of di- and trinucleotides showed a highly positive or negative correlation (Figs. 1B, S1 and Tables 1, 2). Interestingly, a group of A- or G-rich oligonucleotides, such as GAG and GGA, showed a negative correlation in all six clades (Table 2), which was not expected from the changes in mononucleotide compositions alone. To confirm the extent of these changes, we also calculated the fold change in composition between the first month of isolation and the last month examined (Fig. 2) and found clear increases and decreases in mono- and oligonucleotide compositions common among the six clades, which supports the results presented in Fig. 1 and Tables 1 and 2.

Changes towards the sequences of other coronaviruses prevalent in humans

In a previous study of SARS-CoV-2 [16], we analyzed mono- and dinucleotide compositions for the first four epidemic months without separating the sequences by clade. Notably, the directional changes shown in Figs. 1 and 2 and Tables 1 and 2 were fully consistent with the previous results, even when the six clades were analyzed separately. In the previous study, time-series analysis of ebolavirus at the beginning of the epidemic in West Africa in 2014 also showed directional changes in a group of mono- and dinucleotide compositions, but these directional increases/decreases tended to slow approximately 10 months after the start of the epidemic. The increase/decrease trend for SARS-CoV-2 is far from slowing after 10 months, and the next important questions are how long these directional changes in this virus will last and whether there are possible goals for these changes.

To conduct this near-future prediction, the following information concerning influenza viruses should be useful.
As mentioned before, mono- and oligonucleotide compositions in influenza H1N1/09 changed towards those of seasonal influenza strains such as the H1N1 and H3N2 subtypes [11]. Furthermore, all the human subtypes showed directional changes away from the compositions of all avian influenza A subtypes and closer to those of the human influenza B type, which has been prevalent only in humans [14]. If we assume that changes similar to those in the influenza virus will occur, the mono- and oligonucleotide compositions of interest for SARS-CoV-2 are expected to change towards those of other coronaviruses that have been prevalent in humans and away from those of coronaviruses prevalent in bats. To test this hypothesis, we analyzed the following coronaviruses: 238 human-CoV strains (alphacoronaviruses 229E and NL63; betacoronaviruses HKU1 and OC43) and 166 bat-CoV strains (alphacoronaviruses and betacoronaviruses, including the SARS virus).

As shown in Fig. 3A, we compared the mononucleotide compositions of SARS-CoV-2 with those of the human- and bat-CoV strains; the data for bat SARS among the bat-CoV strains, which is thought to be the original strain that caused the current COVID-19 pandemic, are marked in pink. Interestingly, for the human- and bat-CoV strains, differences in mononucleotide composition were more pronounced between hosts than between the alpha and beta lineages, and the levels for all six clades of SARS-CoV-2 were between those for the two hosts. Fig. 3B shows the results for di- and trinucleotides for which the directional, time-dependent changes were primarily common among the six clades. The increases and decreases in nucleotide composition observed for SARS-CoV-2 in Figs. 1 and 2 are indicated by hollow up and down arrows, respectively. Interestingly, all changes of interest tended to move away from the compositions of bat SARS and approach those of human-CoV, supporting the view that the directional changes of interest have biological significance and are possibly inevitable, as observed for influenza viruses. Assuming that approaching the levels in human-CoV strains is the hypothetical goal of the directional change of SARS-CoV-2, the current compositions are far from this hypothetical goal (Fig. 3); therefore, we predict that the directional changes of interest will continue in the near future.

Then, assuming that the average value for all human-CoV strains is a hypothetical goal, we investigated how SARS-CoV-2 has approached this possible goal. Specifically, we calculated the square of the difference between the composition of each nucleotide in SARS-CoV-2 and the average value for human-CoV strains and plotted this squared difference against the elapsed month for each nucleotide. Changes in the compositions of both C and U clearly reduced this difference, as the compositions of these nucleotides approached the hypothetical goal (Fig. 4A); their linear reduction supports the prediction that directional changes in the composition of C and U will continue for the foreseeable future. In contrast, A and G did not show directional changes in composition, most likely because there are no clear differences in A and G composition between human- and bat-CoV, i.e., there is no possible target (Fig. 3A). Fig. 4B shows examples of di- and trinucleotides whose compositions have moved towards the hypothetical goal, whereas Fig. 4C shows a few exceptional oligonucleotides whose compositions have not changed towards the hypothetical goal but have changed with a common directionality among the six clades. In Fig. 4D, correlation coefficients between the above difference and the elapsed month are presented. Most nucleotides of interest showed a negative coefficient (i.e., a directional change towards human-CoV), but three oligonucleotides, GG, AGC and CAU, showed positive coefficients, indicating an increase in the difference (i.e., moving away from the human-CoV level). For these opposing directional changes, certain causes specific to SARS-CoV-2 may be assumed.

Motifs for RNA-binding proteins

Next, we considered the mechanisms that move oligonucleotide compositions away from those of bat coronaviruses and closer to those of human coronaviruses. Certain human cellular factors involved in viral growth may be candidates in such mechanisms. When considering possible protein factors, oligonucleotides longer than trinucleotides should be a focus. As a first attempt, we focused here on host RNA-binding proteins because their binding to hepatitis C virus is known to be involved in the growth of this RNA virus [20].
We thus searched for motifs for human RNA-binding proteins in coronavirus genomes (see Methods section) and found multiple loci with binding motifs for each protein. Table 3 (and Table S10) lists the motifs for which a directional time-dependent change was primarily common among the six clades. Table 3 and Fig. 5A show that only ELAVL1 showed a positive correlation, whereas the other nine proteins in Table 3 showed a negative correlation for almost all clades; the results for other motifs are presented in Table S12.

We next compared the numbers of these motifs in SARS-CoV-2 with the numbers of human- and bat-CoV motifs (Fig. 5B). Of the ten proteins shown in Table 3, the only elevated motif, that for ELAVL1 binding, was found in a significantly higher number of loci in human-CoV than in bat-CoV, whereas motifs for PCBP2 and SRSF1 binding, which tended to decrease (Table 3), were found in significantly fewer loci in human-CoV. These observations appear to be consistent with the features found in the mono-, di- and trinucleotide compositions of interest. However, unlike these changes, there was significant diversity within even a single clade, which appears to be greater than the differences between hosts, with the possible exception of ELAVL1.
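The motif counts discussed above can be illustrated with a small Python sketch that counts, for each RNA-binding protein, the loci in a genome that match any of its registered motifs. This is only one plausible reading of "number of loci": the function name `count_motif_loci`, the decision to count overlapping matches, the identification of a locus by its start position and length, and the placeholder protein name and motif strings are assumptions for illustration and are not taken from the ATtRACT database or from the authors' implementation.

```python
def count_motif_loci(genome, motifs):
    """Count loci in `genome` matching any motif in `motifs`.

    Assumptions: matches may overlap, T is read as U, and a locus is
    identified by (start position, motif length), so a duplicated
    motif entry is not counted twice.
    """
    genome = genome.upper().replace("T", "U")
    loci = set()
    for motif in motifs:
        motif = motif.upper().replace("T", "U")
        if not motif:
            continue
        start = genome.find(motif)
        while start != -1:
            loci.add((start, len(motif)))
            start = genome.find(motif, start + 1)  # allow overlapping hits
    return len(loci)


# Hypothetical placeholder: NOT a real ATtRACT entry.
toy_motifs = {"TOY_RBP": ["AUUUA", "UUU"]}
toy_genome = "AUUUAUUUA"

loci_per_protein = {
    protein: count_motif_loci(toy_genome, motif_list)
    for protein, motif_list in toy_motifs.items()
}
print(loci_per_protein)
```

Averaging such counts over the genomes collected in each month would give per-genome values of the kind plotted against the elapsed month in Fig. 5A.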
Long oligonucleotides are expected to carry out a variety of functions, and mutations that accumulate in their functional motifs may have complex effects on the presence of functional motif sequences, so an analysis from a new perspective is likely to become important.

Discussion

We first discuss possible molecular mechanisms related to time-dependent directional changes in mononucleotide composition. Fig. 1A shows that the frequency of C tended to decrease in SARS-CoV-2, while that of U tended to increase. Since a similar change was previously found for MERS and all A-type influenza subtypes [12,14], these changes may have biological significance for a wide range of RNA viruses that invade from nonhuman hosts. One possible mechanism is the host RNA-editing function; Simmonds (2020) proposed that the C→U hypermutation in SARS-CoV-2 may be due to the influence of APOBEC family proteins in humans [19]. APOBEC is an antiviral protein in various animal species, including humans, that can convert C to U by the deamination of C [21-23]. Such RNA editing is also known to act as a defense mechanism against various viruses, including retroviruses [24]. The APOBEC gene family has generated various paralogs during mammalian evolution, with seven known APOBEC genes in humans and ten in bat families [25-27].
The prevalence of C→U changes in SARS-CoV-2 upon transfer of its host environment from bats to humans suggests that these changes may be due to human-specific APOBEC genes.

We next discuss changes in short oligonucleotides. Directional changes in some oligonucleotides, such as GAG and GGA, cannot be explained by APOBEC-induced C→U mutations alone. Although the evidence is weak, these oligonucleotides are part of the binding motifs of several RNA-binding proteins, such as SRSF1 and PCBP2 (Table S9); the number of loci for these motifs has decreased independently of clade. In contrast, among the ten proteins listed in Table 3, only ELAVL1 has shown an increase in the number of motif loci independently of clade. ELAVL1 is an RNA-binding protein that binds A- or U-rich elements, and its binding to mRNA is known to contribute to RNA stability [28,29]; SARS-CoV-2 and human-CoV, which are prevalent in humans, may contain increased numbers of binding motifs for ELAVL1 for efficient growth in the human cellular environment. However, for further analysis, information on RNA-binding proteins in bat cells is needed.

Conclusions

In the present study, we found that the compositions of a group of mono- and oligonucleotides in SARS-CoV-2 genomes have changed in a host cell-dependent manner. This is fully consistent with our previous findings for influenza A and B viruses [11,12,14], supporting the previous prediction that host-dependent directional changes of various mono- and oligonucleotides should inevitably occur in zoonotic RNA viruses that have invaded from nonhuman hosts. Phylogenetic methods based on sequence alignment [7,8] are well refined and undoubtedly essential for studying the phylogenetic relationships between viruses. The present alignment-free method for analyzing mono- and oligonucleotide compositions can also serve as a powerful tool for molecular evolutionary studies of viruses, revealing directional changes in viruses and predicting the possible goals of these changes.

Methods

SARS-CoV-2 genome sequences

Human SARS-CoV-2 genome sequences were downloaded from the GISAID database (https://www.gisaid.org/); sequences that were complete, showed high coverage and had been isolated from humans were downloaded on Sep 17, 2020. Among the acquired sequences, strains with an unknown isolation month were excluded from the analysis, and the polyA tail was removed. A list of all 72,314 strains used is provided in Table S1.

Genome sequences of coronaviruses prevalent in humans or bats

The complete sequences of two types of human coronavirus (human-CoV) strains, alphacoronaviruses (27 229E and 55 NL63 strains) and betacoronaviruses (18 HKU1 and 138 OC43 strains), were obtained from the NCBI virus database (https://www.ncbi.nlm.nih.gov/labs/virus/). The complete genome sequences of two types of bat coronavirus (bat-CoV) strains, alphacoronaviruses (87 strains) and betacoronaviruses (79 strains, including 34 SARS-CoV), isolated from three types of bats (Chiroptera, Vespertilionidae and Rhinolophidae) were obtained from the NCBI virus database (https://www.ncbi.nlm.nih.gov/labs/virus/), and the polyA tail of each sequence was removed. The strains are listed in Table S2.
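The two preprocessing steps described above (excluding strains with an unknown isolation month and removing the polyA tail) might be sketched in Python as follows. The text does not specify how either step was implemented, so everything here is an assumption for illustration: the function names, the `min_tail` parameter, the expectation that collection dates are given as `YYYY-MM-DD` or `YYYY-MM`, and the choice of December 2019 as elapsed month 0.

```python
import re


def strip_poly_a(seq, min_tail=1):
    """Remove a trailing run of at least `min_tail` A residues.

    `min_tail` is a hypothetical parameter; the source does not state
    a minimum tail length.
    """
    return re.sub(rf"A{{{min_tail},}}$", "", seq.upper())


def elapsed_month(date_str, start=(2019, 12)):
    """Months elapsed since `start`, or None if the month is unknown.

    Assumes dates of the form YYYY-MM-DD or YYYY-MM; a year-only date
    such as "2020" has no usable month and yields None.
    """
    match = re.fullmatch(r"(\d{4})-(\d{2})(?:-\d{2})?", date_str.strip())
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return (year - start[0]) * 12 + (month - start[1])


def preprocess(records):
    """Group tail-trimmed sequences by elapsed month.

    `records` is an iterable of (collection_date, sequence) pairs;
    records without a known isolation month are dropped.
    """
    by_month = {}
    for date_str, seq in records:
        month = elapsed_month(date_str)
        if month is None:
            continue
        by_month.setdefault(month, []).append(strip_poly_a(seq))
    return by_month


# Toy records (hypothetical, not GISAID data).
toy_records = [
    ("2019-12-30", "ACGTACGTAAAAAA"),
    ("2020-03", "ACGTTGCAAAA"),
    ("2020", "ACGTACGT"),  # month unknown, excluded
]
print(preprocess(toy_records))
```

Trimming every trailing A can also remove genuine 3'-terminal A residues of the genome, which is why a minimum tail length is exposed as a parameter in this sketch.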

Time-series analysis of changes in oligonucleotide compositions

In the time-series analysis, the average mono- and oligonucleotide compositions (%) of viruses collected in each month were calculated for each clade. To avoid statistical fluctuations due to small sample sizes, months in which fewer than 10 strains had been collected were excluded from the monthly analysis.

RNA-binding motif analysis

RNA-binding motifs were obtained from the ATtRACT database [30]. In this database, multiple binding motifs are registered for each RNA-binding protein; we calculated the total number of loci containing the binding motifs for each protein in the viral genomes.

List of abbreviations

SARS-CoV-2: Severe acute respiratory syndrome coronavirus-2
human-CoV: human coronavirus
bat-CoV: bat coronavirus

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and materials

The sequence datasets analyzed in this study are stored in GISAID.
Other data are available from YI.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by JSPS KAKENHI Grant Number 18K07151, by AMED under Grant Number JP20he0622033 and by the COVID-19 Counterplan Research Project (supervised by Prof. Tatsumi Hirata, NIG) from the Research Organization of Information and Systems (ROIS).

Authors' contributions

YI conceived the approach and conducted the analysis. TA developed the algorithm. TI supervised the study.

Acknowledgements

We gratefully acknowledge the authors who submitted their sequences to the GISAID database and the valuable comments of Dr. Yashushi Hiromi of the National Institute of Genetics (Mishima). We thank Springer Nature Author Services for editing this manuscript for English language.

Figure legends

Fig. 1. Time-dependent directional changes in nucleotide compositions. (A) Average mononucleotide compositions (%) in the SARS-CoV-2 genomes of each clade isolated in each month are plotted against the elapsed month. To compare the four mononucleotides, the scale widths on the vertical axis are set to the same values. The colored lines distinguishing the clades (G, GH, GR, L, V and S) are shown at the bottom of the figure. The dashed line shows the averaged compositions for all strains isolated in each month.
(B) The average di- and trinucleotide compositions that primarily undergo common directional changes among the six clades are plotted against the elapsed month.

Fig. 2. Fold changes in nucleotide composition between the epidemic start and the last month of analysis. A bar plot shows the fold change in composition of each mono- or oligonucleotide; this value was calculated by dividing the nucleotide composition in the last month of analysis by that at the start of the epidemic. Each bar is colored to indicate the clade, as described in Fig. 1. Since we analyzed strains belonging to different clades separately, the first and last months with data differed among clades; see also the Methods section.

Fig. 3. Nucleotide compositions of human and bat coronavirus sequences. A boxplot shows the nucleotide compositions in human-CoV (alpha 229E, alpha NL63, beta HKU1 and beta OC43), bat-CoV (bat SARS, alphacoronavirus and betacoronavirus) and SARS-CoV-2 strains. Bat SARS strains are marked in pink. A hollow arrow indicates the direction of change in oligonucleotide composition observed for SARS-CoV-2 in Figs. 1 and 2. (A) Mononucleotides. To compare the four mononucleotides, the scale widths on the vertical axis are set to the same values. (B) Di- and trinucleotides.

Fig. 4. Differences in nucleotide composition between SARS-CoV-2 and human-CoV. (A) Values for the square of the difference in mononucleotide composition between SARS-CoV-2 isolated in each month and human-CoV are plotted against the elapsed month. The data are presented as colored or dashed lines, as described in Fig. 1. (B and C) Oligonucleotide compositions that approach (B) and move away from (C) those of human-CoV are presented. (D) The correlation coefficients between the elapsed month from the start of the epidemic and the above differences are presented for mono- and oligonucleotides whose directionality of change is common among the six clades. The results for A and G mononucleotides, which show nondirectional change, are also presented.

Fig. 5. Time-dependent changes in the numbers of RNA-binding motif loci. (A) The numbers of loci containing RNA-binding motifs per genome are plotted against the elapsed month. Here, we selected RNA-binding proteins for which the number of motif loci increased or decreased by at least one for all six clades from the epidemic start. The data are presented as colored or dashed lines, as described in Fig. 1A. (B) A boxplot shows the number of loci containing RNA-binding motifs in human-CoV (alpha 229E and NL63; beta HKU1 and OC43), bat-CoV (bat SARS, alphacoronavirus and betacoronavirus) and SARS-CoV-2 strains. Bat SARS strains are marked in pink. A hollow arrow indicates the direction of change observed for SARS-CoV-2 in Fig. 5A.

Table 1. Correlation coefficients for time-dependent changes in mono- and oligonucleotide compositions in SARS-CoV-2 that have increased.

Table 2. Correlation coefficients for time-dependent changes in mono- and oligonucleotide compositions in SARS-CoV-2 that have decreased.

Table 3. The number of motif-containing loci for RNA-binding proteins whose occurrences have increased or decreased between strains of the first and last month of the analysis.

Additional file 1

Fig. S1: Average di- and trinucleotide compositions (A and B) for SARS-CoV-2 strains collected in each elapsed month.
Fig. S2: Oligonucleotide compositions of human and bat coronavirus sequences.
Fig. S3: Differences in oligonucleotide composition between SARS-CoV-2 and human-CoV.

Additional file 2

Table S1: List of SARS-CoV-2 strains used in the analysis.
Table S2: List of human- and bat-CoV strains used in the analysis.
Table S3: Number of SARS-CoV-2 strains in each clade isolated in each elapsed month.
Table S4: Average oligonucleotide compositions for SARS-CoV-2 strains in each clade isolated in each elapsed month.
Table S5: Correlation coefficients for time-dependent changes in oligonucleotide compositions of SARS-CoV-2.
; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 17 Table S6: Fold change in compositions between strains of the first and last month of the 429 analysis. 430 Table S7: Distance between the oligonucleotide composition of SARS-CoV-2 isolated 431 in each elapsed month and that of human-CoV. 432 433 Table S8: Correlation coefficients for time-series changes in the distance between 434 oligonucleotide compositions of SARS-CoV-2 and human-CoV. 435 436 Table S9: List of RNA-binding motifs. 437 438 Table S10: Numbers of motif-containing loci for RNA-binding proteins whose 439 abundance increases or decreases between strains of the first and last month of the 440 analysis. 441 442 Table S11: P-value from t-test to analyze the number of RNA-binding motif loci whose 443 abundance increases or decreases between strains of the first and last month of the 444 analysis. 445 446 Table S12: Correlation coefficients for time-dependent changes in the number of loci 447 containing RNA-binding motifs. 448 449 450 Reference 451 1. Singhal T: A review of coronavirus disease-2019 (COVID-19). Indian J Pediatr. 452 2020; 87:281-86. 453 2. García-Sastre A: Inhibition of interferon-mediated antiviral responses by influenza 454 A viruses and other negative-strand RNA viruses. Virology. 2001;279: 375–84. 455 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 18 3. Voinnet O: Induction and suppression of RNA silencing: insights from viral 456 infections. Nat. Rev. Genet. 2005;6:206–20. 457 4. 
Randall RE, Goodbourn S: Interferons and viruses: an interplay between induction, 458 signalling, antiviral responses and virus countermeasures. J. Gen. Virol. 2008;89:1–459 47. 460 5. Konno Y, Kimura I, Uriu K, et al: SARS-CoV-2 ORF3b is a potent interferon 461 antagonist whose activity is increased by a naturally occurring elongation variant. 462 Cell Rep. 2020;32:108185. 463 6. Zhou et al: A novel bat coronavirus closely related to SARS-CoV-2 contains natural 464 insertions at the S1/S2 cleavage site of the spike protein. Curr Biol. 2020;30:2196-465 203. 466 7. Nei M: Molecular evolutionary genetics. Columbia University Press: New York. 467 1987. 468 8. Kumar S, Nei M, Dudley J, Tamura K: MEGA: a biologist-centric software for 469 evolutionary analysis of DNA and protein sequences, Brief Bioinform. 2008;9:299–470 306. 471 9. Abe T, Kanaya S, Kinouchi M, et al: Informatics for unveiling hidden genome 472 signatures, Genome Res. 2003;13:693–702. 473 10. Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T: Novel phylogenetic 474 studies of genomic sequence fragments derived from uncultured microbe mixtures 475 in environmental and clinical samples, DNA Res. 2005;12:281–90. 476 11. Iwasaki Y, Abe T, Wada K, Itoh M, Ikemura T,: Prediction of directional changes of 477 influenza A virus genome sequences with emphasis on pandemic H1N1/09 as a 478 model case. DNA res 2011;18:125-36 479 12. Iwasaki Y, Abe T, Wada Y, Wada K, Ikemura T: Novel bioinformatics strategies for 480 prediction of directional sequence changes in influenza virus genomes and for 481 surveillance of potentially hazardous strains. BMC Infect Dis. 2013;13:386- 482 .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 6, 2021. 
; https://doi.org/10.1101/2021.01.05.425508doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425508 http://creativecommons.org/licenses/by-nc/4.0/ 19 13. Karlin S, Campbell AM, Mrazek J: Comparative DNA analysis across diverse 483 genomes. Annu. Rev. Genet. 1998;32:185–225. 484 14. Wada Y, Wada K, Iwasaki Y, Kanaya S, Ikemura T: Directional and reoccurring 485 sequence change in zoonotic RNA virus genomes visualized by time-series word 486 count. Sci Rep. 2016;6:36197. 487 15. Wada K, Wada Y, Iwasaki Y, Ikemura T: Time-series oligonucleotide count to 488 assign antiviral siRNAs with long utility fit in the big data era. Gene Ther. 489 2017;24:668–73. 490 16. Wada K, Wada Y, Ikemura T: Time-series analyses of directional sequence changes 491 in SARS-CoV-2 genomes and an efficient search method for candidates for 492 advantageous mutations for growth in human cells. Gene. 2020;5:100038. 493 17. Qiu Y, Abe T, Nakao R, Satoh K, Sugimoto C: Viral population analysis of the 494 taiga tick, Ixodes persulcatus, by using Batch Learning Self-Organizing Maps and 495 BLAST search. Journal of Veterinary Medical Science, 2019;81(3):401-10. 496 18. Mercatelli D, Giorgi FM: Geographic and genomic distribution of SARS-CoV-2 497 mutations. Front Microbiol. 2020;22:11:1800. 498 19. Simmonds P: Rampant C→U hypermutation in the genomes of SARS-CoV-2 and 499 other coronaviruses: causes and consequences for their short- and long-term 500 evolutionary trajectories. mSphere. 2020;24:e00408-20. 501 20. Paek KY, Kim CS, Park SM, Kim JH, Jang SK: RNA-binding protein hnRNP D 502 modulates internal ribosome entry site-dependent translation of hepatitis C virus 503 RNA. J Virol. 2008;82:12082-93. 504 21. Harris RS, Bishop KN, Sheehy AM, Craig HM, Petersen-Mahrt SK, Watt IN, 505 Neuberger MS, Malim MH: DNA deamination mediates innate immunity to 506 retroviral infection. Cell. 2003;113:803–809. 
22. Mangeat B, Turelli P, Caron G, Friedli M, Perrin L, Trono D: Broad antiretroviral defence by human APOBEC3G through lethal editing of nascent reverse transcripts. Nature. 2003;424:99–103.
23. Zhang H, Yang B, Pomerantz RJ, Zhang C, Arunachalam SC, Gao L: The cytidine deaminase CEM15 induces hypermutation in newly synthesized HIV-1 DNA. Nature. 2003;424:94–98. https://doi.org/10.1038/nature01707.
24. Harris RS, Dudley JP: APOBECs and virus restriction. Virology. 2015;479–480:131–45.
25. Sawyer SL, Emerman M, Malik HS: Ancient adaptive evolution of the primate antiviral DNA-editing enzyme APOBEC3G. PLoS Biol. 2004;2:E275.
26. Münk C, Willemsen A, Bravo IG: An ancient history of gene duplications, fusions and losses in the evolution of APOBEC3 mutators in mammals. BMC Evol Biol. 2012;12:71.
27. Henry M, Terzian C, Peeters M, Wain-Hobson S, Vartanian JP: Evolution of the primate APOBEC3A cytidine deaminase gene and identification of related coding regions. PLoS One. 2012;7:e30036.
28. Wang W, Caldwell MC, Lin S, Furneaux H, Gorospe M: HuR regulates cyclin A and cyclin B1 mRNA stability during cell proliferation. EMBO J. 2000;19(10):2340–50.
29. Lal A, Mazan-Mamczarz K, Kawai T, Yang X, Martindale JL, Gorospe M: Concurrent versus individual binding of HuR and AUF1 to common labile target mRNAs. EMBO J. 2004;23(15):3092–102.
30. Giudice G, Sánchez-Cabo F, Torroja C, Lara-Pezzi E: ATtRACT-a database of RNA-binding proteins and associated motifs.
Database (Oxford). 2016;7:baw035.

Table 1

| | Clade G | Clade GH | Clade GR | Clade L | Clade V | Clade S |
|---|---|---|---|---|---|---|
| U | 0.97 | 0.92 | 0.91 | 0.72 | 0.66 | 0.73 |
| UA | 0.89 | 0.80 | 0.92 | 0.62 | 0.30 | 0.50 |
| AUU | 0.81 | 0.78 | 0.92 | 0.90 | 0.27 | 0.05 |
| CAU | 0.56 | 0.59 | 0.54 | 0.51 | 0.55 | 0.17 |
| UGU | 0.68 | 0.82 | 0.52 | 0.59 | 0.88 | 0.50 |
| UUA | 0.96 | 0.72 | 0.93 | 0.80 | 0.19 | 0.34 |
| UUG | 0.67 | 0.78 | 0.20 | 0.92 | 0.69 | 0.05 |
| UUU | 0.94 | 0.82 | 0.89 | 0.80 | 0.41 | 0.93 |

Table 2

| | Clade G | Clade GH | Clade GR | Clade L | Clade V | Clade S |
|---|---|---|---|---|---|---|
| C | -0.95 | -0.83 | -0.95 | -0.98 | -0.97 | -0.35 |
| AG | -0.77 | -0.71 | -0.27 | -0.57 | -0.67 | -0.44 |
| CA | -0.73 | -0.93 | -0.85 | -0.16 | -0.45 | -0.82 |
| CC | -0.93 | -0.87 | -0.75 | -0.90 | -0.90 | -0.09 |
| CU | -0.81 | -0.40 | -0.79 | -0.52 | -0.15 | -0.10 |
| GA | -0.28 | -0.90 | -0.80 | -0.62 | -0.73 | -0.29 |
| GG | -0.60 | -0.79 | -0.65 | -0.03 | -0.10 | -0.23 |
| UC | -0.33 | -0.23 | -0.10 | -0.58 | -0.93 | -0.41 |
| AGC | -0.57 | -0.87 | -0.41 | -0.80 | -0.78 | -0.12 |
| CCC | -0.69 | -0.81 | -0.67 | -0.94 | -0.81 | -0.50 |
| GAC | -0.73 | -0.91 | -0.65 | -0.35 | -0.18 | -0.55 |
| GAG | -0.11 | -0.58 | -0.64 | -0.50 | -0.81 | -0.02 |
| GGA | -0.73 | -0.84 | -0.75 | -0.65 | -0.81 | -0.52 |
Table 3

| | Clade G | Clade GH | Clade GR | Clade L | Clade V | Clade S |
|---|---|---|---|---|---|---|
| PTBP1 | -2.81 | -3.46 | -4.92 | -13.50 | -11.66 | 3.95 |
| HNRNPL | -3.03 | -1.78 | -0.52 | -3.62 | -6.48 | 0.71 |
| NOVA1 | -0.04 | -2.77 | -0.70 | -3.37 | -5.04 | 1.09 |
| SRSF2 | -1.17 | -3.02 | -1.09 | -3.10 | -3.10 | 0.78 |
| ZFP36 | 1.67 | -3.50 | -1.98 | -3.48 | -4.78 | 1.86 |
| HNRNPA1 | -0.16 | -2.50 | -0.50 | -2.64 | -3.93 | 0.44 |
| ELAVL1 | 2.82 | 0.47 | 2.87 | 0.59 | 0.03 | 2.39 |
| TIA1 | -0.82 | -2.05 | -1.20 | -1.72 | -4.41 | 1.67 |
| PCBP2 | -0.37 | -2.50 | -1.03 | -2.28 | -2.35 | 0.11 |
| SRSF1 | -0.63 | -2.29 | -1.26 | -2.55 | -1.84 | 0.36 |
bioRxiv preprint, https://doi.org/10.1101/2021.01.05.425508; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).
10_1101-2021_01_04_425335 ---- 64487503

Learning association for single-cell transcriptomics by integrating profiling of gene expression and alternative polyadenylation

Guoli Ji 1,4, Wujing Xuan 1,4, Yibo Zhuang 2, Lishan Ye 3,4, Sheng Zhu 1,4, Wenbin Ye 1,4, Xi Wang 1,4 and Xiaohui Wu 1,4*

1 Department of Automation, Xiamen University, Xiamen 361005, China
2 Xiamen YLZ Yihui Technology Co., Ltd, Xiamen, Fujian 361008, China
3 Xiamen Health and Medical Big Data Center, Xiamen, Fujian 361008, China
4 National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China

Keywords: cell type clustering; alternative polyadenylation; single-cell RNA-seq; integrative analysis; software

Guoli Ji is a professor with the Department of Automation in Xiamen University. His research interests include bioinformatics, advanced control, data mining and information system.

Wujing Xuan is a graduate student with the Department of Automation in Xiamen University. His research interests are bioinformatics and data mining.

Yibo Zhuang is an employee in Xiamen YLZ Yihui Technology company. His research interests are software design, cloud computing and big data.

Lishan Ye is the director of Xiamen Health and Medical Big Data Center. Her research interests are cloud computing and healthcare big data.

Sheng Zhu is a Ph.D. candidate with the Department of Automation in Xiamen University. His research interests are bioinformatics and healthcare big data.

Wenbin Ye is a Ph.D. candidate with the Department of Automation in Xiamen University. Her research interests are bioinformatics and mRNA processing.

Xi Wang is a graduate student with the Department of Automation in Xiamen University. Her research interests are bioinformatics and data mining.
Xiaohui Wu is an associate professor with the Department of Automation in Xiamen University. Her research interests are mRNA processing, bioinformatics, and data mining.

* Corresponding author. E-mail: xhuister@xmu.edu.cn, Tel: +86 13459029440

bioRxiv preprint, https://doi.org/10.1101/2021.01.04.425335; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract

Single-cell RNA-sequencing (scRNA-seq) has enabled transcriptome-wide profiling of gene expression in individual cells. A myriad of computational methods have been proposed to learn cell-cell similarities and/or cluster cells; however, the high variability and dropout rate inherent in scRNA-seq confound reliable quantification of cell-cell associations based on the gene expression profile alone. Recently, bioinformatics studies have emerged that capture key transcriptome information on alternative polyadenylation (APA) from standard scRNA-seq and reveal APA dynamics among cell types, suggesting the possibility of discerning cell identities with the APA profile. Complementary information at both layers of APA isoforms and genes creates great potential to develop cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scRNA-seq data without changing experimental technologies. We proposed a toolkit called scLAPA for learning association for single-cell transcriptomics by combining single-cell profiling of gene expression and alternative polyadenylation derived from the same scRNA-seq data. We compared scLAPA with seven similarity metrics and five clustering methods using diverse scRNA-seq datasets. Comparative results showed that scLAPA is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods.
Moreover, with scLAPA we found two hidden subpopulations of peripheral blood mononuclear cells that were undetectable using the gene expression data alone. As a comprehensive toolkit, scLAPA provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmentation of gene expression profiles with polyadenylation information, which can be incorporated in most existing scRNA-seq pipelines. scLAPA is available at https://github.com/BMILAB/scLAPA.

Introduction

Single-cell RNA-sequencing (scRNA-seq) has enabled transcriptome-wide profiling of gene expression in individual cells, which has great potential to reveal cellular composition of tissues, transcriptional heterogeneity among cells and structure of cell types [1]. Cell-type identification is a critical step in most scRNA-seq data analyses, and a myriad of computational methods have emerged to detect novel cell types, previously unappreciated sub-types of cells and rare cells [2]. Fundamentally, these numerous clustering methods rely on cell-cell associations (or similarities) for categorizing individual cells into different clusters [3].

A wide range of computational tools have been proposed to cluster cells, which implicitly or explicitly rely on a similarity concept [2]. SIMLR (Single-cell Interpretation via Multikernel Learning) adapts k-means by simultaneously training a similarity measure based on multiple kernel learning [4]. RaceID extends k-means with outlier detection to discover rare cell types [5]. SC3 (Single-Cell Consensus Clustering) utilizes a consensus approach to combine multiple clustering solutions [6].
PhenoGraph combines shared nearest-neighbour graphs and Louvain community detection to rapidly identify cell clusters [7]. Despite considerable progress, there is no strong consensus on which clustering approach is best for defining cell types in all situations [2, 8, 9]. In particular, the high variability and dropout rate inherent in scRNA-seq confound the reliable quantification of lowly and/or moderately expressed genes [10, 11], resulting in an extremely sparse gene-cell count matrix. Consequently, there might be little satisfactory overlap of observed genes among cells, hindering reliable quantification of cell-cell similarities based on the gene expression profile alone.

Recently, multi-omics methods that leverage additional aspects of the cell, such as the DNA methylome, open chromatin or proteome, are beginning to appear [12]. Seurat v3 [13] harmonizes scRNA-seq and scATAC-seq data from a similar tissue to identify subpopulations of cells that are indistinguishable using the scATAC-seq data alone. LIGER [14], a method based on integrative non-negative matrix factorization (iNMF), was proposed to classify cortical cells profiled from single-cell bisulfite sequencing by integrating scRNA-seq data. Additional modalities of individual cells provide valuable information about the phenotype and genetic cellular state not manifested by the transcriptome. However, not all scRNA-seq data are accompanied by data from other modalities. Even though multimodal omics data are gradually becoming available, integrative multimodal analysis is still in its infancy [12].
It remains a challenge to reconcile the heterogeneity across modalities, as different modalities are normally profiled from cells sampled from the same tissue rather than from the same cells.

Although most scRNA-seq studies focus on gene expression profiling, key information on transcript isoforms, e.g., alternative splicing (AS) and/or alternative polyadenylation (APA), can be obtained, enabling multiple aspects of transcriptome information to be derived from standard scRNA-seq without changing experimental technologies [15-20]. Lately, several computational methods, such as scAPAtrap [15], Sierra [16] and scAPA [18], have been proposed to identify APA sites in single cells from diverse 3′ tag-based scRNA-seq protocols, e.g., Drop-seq [21], CEL-seq [22] and 10x Genomics [23]. Cell-to-cell heterogeneity in APA site usage was also observed [15-18]. In particular, the previous study [15] revealed that the APA profile, even that from non-differentially expressed genes, can distinguish mouse cells in different stages during sperm cell differentiation, suggesting the possibility of discerning cell identities with APA usage independent of gene expression. Recent efforts have pioneered methods to identify APA sites or explore APA dynamics across different cell types [16-18, 24-26]; however, most studies profiled APA among cells with predefined cell type labels rather than discerning cell types in an unsupervised manner.
Complementary information at both layers of APA isoforms and genes can be refined from the same cells [15-20], which creates great potential to develop more sophisticated and cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scRNA-seq experiments.

Here we proposed a toolkit called scLAPA for learning association for single-cell transcriptomics by combining single-cell profiling of gene expression and alternative polyadenylation. scLAPA leverages the resolution and huge abundance of scRNA-seq, boosting the gene-level analysis with an additional layer of APA information directly derived from the same scRNA-seq data. By employing the strategy of similarity network fusion, scLAPA effectively learns highly informative cell-cell associations from expression profiles of both genes and APA isoforms. We compared scLAPA with seven similarity metrics and five clustering methods, using diverse scRNA-seq data from different experimental technologies and species. Comparative results showed that scLAPA is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. Moreover, with scLAPA we found two hidden subpopulations of cells in peripheral blood mononuclear cells (PBMCs) that were undetectable using the gene expression data alone. As a comprehensive toolkit, scLAPA provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmentation of gene expression profiles with polyadenylation information, which can be incorporated in many other standard scRNA-seq pipelines for single-cell analyses.
Materials and methods

scRNA-seq datasets

We used five publicly available scRNA-seq datasets from animals and plants generated by 3′ tag-based scRNA-seq protocols (Table S1), spanning a wide spectrum of tissues, cell types and species. Raw data except for the PBMC data were downloaded from NCBI GEO (Gene Expression Omnibus). Cell types and cell labels of the Amygdala, Mammary and Root data were obtained from the corresponding studies; cell labels of the Hypothalamus data were obtained from PanglaoDB [27]. The PBMC 4k dataset was downloaded from the 10x Genomics website (https://www.10xgenomics.com/). For cell type annotation of PBMCs, we followed the tutorial of Seurat v3 [13] to cluster cells on the basis of the gene-cell expression matrix. Specifically, cells with total read counts less than 300 were discarded. The LogNormalize method was adopted for normalization. The top 2000 highly variable features were selected by the vst method. PCA (Principal Component Analysis) was used for dimensionality reduction and the top 20 principal components were retained. Finally, cells were clustered by Seurat’s FindClusters with argument ‘resolution=0.9’. For cell type annotation of cell clusters, known marker genes of PBMCs were compiled from relevant studies (Table S2). Differentially expressed (DE) genes for each cell group were calculated with Seurat’s FindAllMarkers. We also calculated, for each cell cluster, the number of cells where a DE gene is expressed and the mean expression level of a DE gene. The cell type was carefully assigned to a cell cluster according to the presence and expression level of marker gene(s).
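The cell filtering and normalization steps described above can be illustrated with a minimal sketch in plain Python. It is not Seurat's code and not part of scLAPA. It assumes that LogNormalize divides each count by the cell's total count, multiplies by a scale factor of 10,000 (Seurat's documented default) and applies a natural-log log1p transform. The function name `filter_and_log_normalize` and the toy matrix are hypothetical.

```python
import math


def filter_and_log_normalize(counts, min_total=300, scale_factor=10_000):
    """Drop low-count cells, then log-normalize the remaining ones.

    counts: dict mapping a cell barcode to a list of per-gene read counts.
    Cells whose total count is below min_total are discarded (threshold
    taken from the text); the scale factor is an assumed Seurat default.
    """
    normalized = {}
    for cell, values in counts.items():
        total = sum(values)
        if total < min_total:
            continue  # cell discarded: too few reads
        normalized[cell] = [
            math.log1p(value / total * scale_factor) for value in values
        ]
    return normalized


# Hypothetical toy data: one cell passes the filter, one does not.
toy_counts = {
    "cell_a": [150, 150, 0],
    "cell_b": [10, 5, 1],
}
result = filter_and_log_normalize(toy_counts)
```

In this toy input only `cell_a` survives, because its total count is exactly 300; genes with zero counts stay at zero after the transform.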
Overview of scLAPA

scLAPA mainly consists of four modules (Figure S1): (i) the input module, (ii) cell-cell distance, (iii) distance fusion and (iv) cell type clustering. The input module prepares the input for scLAPA, including a poly(A) site expression matrix (hereinafter referred to as PA-matrix) and a gene expression matrix (hereinafter referred to as GE-matrix). The PA-matrix is generated from raw scRNA-seq with scAPAtrap [15] and stores expression levels of poly(A) sites, with each row denoting a poly(A) site and each column denoting a cell. The GE-matrix can be obtained from websites like NCBI GEO and 10x Genomics, or generated by various routine scRNA-seq analysis tools like Cell Ranger. In the module of cell-cell distance, a cell-cell distance matrix is learned for PA-matrix (called PA-dist) and GE-matrix (called GE-dist), respectively. The module of distance fusion employs similarity network fusion (SNF) [28] to integrate the two distance matrices (PA-dist and GE-dist) into one cell-cell distance matrix. The cell type clustering module clusters cells based on the fused distance matrix with various clustering methods. scLAPA was implemented as an open source R package and is available at https://github.com/BMILAB/scLAPA. Scripts and data used in this study are also available at the GitHub website.

Identification of poly(A) sites from scRNA-seq

We followed the tutorial provided at the scAPAtrap website (https://github.com/BMILAB/scAPAtrap) to identify poly(A) sites with scAPAtrap [15]. It should be noted that alternative tools, such as Sierra [16] and scAPA [18], can also be used. Briefly, raw FASTQ reads were mapped with Cell Ranger 2.1.0 (https://www.10xgenomics.com/) and then uniquely mapped reads were obtained with SAMtools
(http://samtools.sourceforge.net/). Then UMI-tools [29] was employed to remove polymerase chain reaction (PCR) duplicates and extract unique molecular identifiers (UMIs). The findTails function in the scAPAtrap package was used to determine exact locations of poly(A) sites from reads with A/T stretches and the findPeaks function was adopted to identify all potential peaks of poly(A) sites at the whole genome level. Finally, consensus poly(A) sites supported by both the peak and the tail evidence were used. The featureCounts function in the Subread toolkit [30] was adopted to quantify the expression level for each poly(A) site. Poly(A) site annotation was performed with the movAPA package [31], using the latest genome annotation of the respective species: TAIR10 for Arabidopsis, mm10 for mouse and GRCh38 for human. Briefly, poly(A) sites identified from scAPAtrap were annotated with rich information, such as genomic regions (i.e., 5′ UTR, 3′ UTR, coding sequence (CDS), intron, exon and intergenic) and gene id. Similar to previous studies [32-35], annotated 3′ UTRs were extended by a length of 1000 bp to recruit intergenic sites that may originate from authentic 3′ UTRs.

Calculation of cell-cell distance

scLAPA learns a cell-cell distance matrix for PA-matrix and GE-matrix, respectively. Various distance metrics can be chosen, including Euclidean distance, Pearson correlation, two metrics of proportionality ($\rho_p$ and $\phi_s$) [3], RAFSIL (RAndom Forest based SImilarity Learning) [36] and SIMLR [4]. Euclidean distance and Pearson correlation are widely used in either single-cell or bulk transcriptomics.
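As an illustration of the two classical metrics named above, the following minimal sketch computes a cell-cell distance matrix from a toy expression matrix in plain Python. It is not the scLAPA implementation, which relies on R functions. Converting the Pearson correlation $r$ into a distance as $1 - r$ is an assumption made here for illustration, since the text does not state the conversion; all function names and the toy data are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation (assumed conversion to a distance).

    Raises ZeroDivisionError if either vector has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def distance_matrix(cells, metric):
    """cells: dict mapping cell name -> expression vector (one matrix column)."""
    names = list(cells)
    return [[metric(cells[a], cells[b]) for b in names] for a in names]


# Hypothetical toy data: three cells measured over three features.
toy_cells = {
    "c1": [1.0, 2.0, 3.0],
    "c2": [2.0, 4.0, 6.0],
    "c3": [3.0, 2.0, 1.0],
}
euclid_dist = distance_matrix(toy_cells, euclidean)
pearson_dist = distance_matrix(toy_cells, pearson_distance)
```

Here `c1` and `c2` have proportional profiles, so their correlation-based distance is close to zero even though their Euclidean distance is not, whereas `c1` and `c3` are anti-correlated. Applying the same function to a PA-matrix and to a GE-matrix would give the two matrices the text calls PA-dist and GE-dist.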
The two measures of proportionality were found to have strong performance according to a comprehensive benchmarking analysis of a large single-cell transcriptome compendium [3]. RAFSIL is a random forest based approach that learns cell-cell similarities from scRNA-seq data, including two variations, RAFSIL1 and RAFSIL2. SIMLR learns a distance metric that fits the structure of the scRNA-seq data by combining multiple kernels corresponding to different informative representations of the data. Euclidean distance and Pearson correlation were calculated by the dist and cor functions in the R package stats, respectively; the SIMLR metric was calculated by the SIMLR R package with argument ‘cores.ratio=0’; the RAFSIL metric was calculated by the RAFSIL R package with arguments ‘nrep=50, gene_filter=FALSE’; $\rho_p$ and $\phi_s$ were calculated by the perb and phis functions in the R package propr, respectively. For each distance metric, cell-cell distance matrices, PA-dist and GE-dist, can be learned for PA-matrix and GE-matrix, respectively. PA-dist represents the cell-cell similarity network learned from the APA isoform layer, whereas GE-dist reflects the network learned from the gene layer, each of which encapsulates complementary information about cell-cell associations absent in the other genomic layer.

Distance fusion

After learning PA-dist and GE-dist, similarity network fusion (SNF) [28] is utilized to flexibly integrate the two layers of cell-cell similarities into one similarity matrix. First, PA-dist and GE-dist were iteratively and gradually fused to a consensus network, utilizing the non-linear method of message passing theory [37].
Then weak similarities representing potential noise were discarded, and strong similarities were retained. By generating coherent cell-cell similarities from both APA isoform and gene layers, SNF profiles a more comprehensive biological relationship among cells, beyond the scope of methods solely based on GE-matrix.

Given a PA-matrix storing expression levels of $m$ poly(A) sites in $n$ cells or a GE-matrix recording expression levels of $m$ genes in $n$ cells, the corresponding cell-cell distance matrix (PA-dist or GE-dist) can be obtained using a selected distance metric. The distance matrix can also be denoted as a graph $G = \langle V, E, W \rangle$, with vertices $V = \{c_1, \ldots, c_n\}$ corresponding to cells, edges $E$ representing cell-cell links and edge weights $W_{[n \times n]}$ denoting the kernel representation of cell-cell similarities. The weight of an edge linking cells $c_i$ and $c_j$ is determined using a scaled exponential similarity kernel:

$$W_{ij} = \exp\left(-\frac{d_{ij}^{2}}{\mu \beta_{ij}}\right) \tag{1}$$

Here $d_{ij}$ represents the distance between cells $c_i$ and $c_j$ measured by a distance metric (e.g. Pearson correlation). $\mu$ is an empirical hyperparameter with a recommended value in the range of [0.3, 0.8] [28]. $\beta_{ij}$ is a scaling factor defined as follows:

$$\beta_{ij} = \frac{\bar{d}(c_i, N_i) + \bar{d}(c_j, N_j) + d_{ij}}{3} \tag{2}$$

where $N_i$ are the neighboring cells of $c_i$ and $\bar{d}(c_i, N_i)$ is the average distance of $c_i$ to its neighbors.

To obtain a fused network from PA-dist and GE-dist, a full and a sparse kernel on the vertex set $V$ are derived from the weight matrix $W$. The full kernel is a normalized weight matrix $\widetilde{W}_{[n \times n]}$, which stores the full information of cell-cell similarities.
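Equations (1) and (2) can be sketched in a few lines of plain Python. This is a hedged illustration and not the SNFtool or scLAPA implementation: it assumes that the neighbors $N_i$ of a cell are its `k` nearest cells by distance, excluding the cell itself, and the names `mean_neighbor_distance`, `affinity_matrix` and the parameter `k` are hypothetical.

```python
import math


def mean_neighbor_distance(dist, i, k):
    """Average distance from cell i to its k nearest other cells (assumed N_i)."""
    others = sorted(d for j, d in enumerate(dist[i]) if j != i)
    nearest = others[:k]
    return sum(nearest) / len(nearest)


def affinity_matrix(dist, k=2, mu=0.5):
    """Scaled exponential similarity kernel of equations (1) and (2).

    dist: symmetric n x n distance matrix given as a list of lists.
    mu:   hyperparameter, recommended range [0.3, 0.8] in the text.
    """
    n = len(dist)
    means = [mean_neighbor_distance(dist, i, k) for i in range(n)]
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            beta = (means[i] + means[j] + dist[i][j]) / 3.0  # equation (2)
            if beta > 0:
                weights[i][j] = math.exp(-dist[i][j] ** 2 / (mu * beta))  # equation (1)
            else:
                weights[i][j] = 1.0  # all distances zero: treat as identical cells
    return weights


# Hypothetical toy distance matrix for three cells.
toy_dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
toy_weights = affinity_matrix(toy_dist, k=1, mu=0.5)
```

Because $\beta_{ij}$ grows with the local neighborhood distances, the same raw distance yields a larger weight in a sparse region than in a dense one, which is the point of scaling the kernel.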
The normalized weight between $c_i$ and $c_j$ is defined as:

$$\widetilde{W}_{ij} = \begin{cases} \dfrac{W_{ij}}{2 \sum_{k \neq i} W_{ik}} & \text{when } i \neq j \\ 0.5 & \text{when } i = j \end{cases} \tag{3}$$

Another matrix $A_{[n \times n]}$ encodes the local affinity that measures similarities of a cell to its $K$ most similar cells:

$$A_{ij} = \begin{cases} \dfrac{W_{ij}}{\sum_{k \neq i} W_{ik}} & \text{when } j \in N_i \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Here $N_i$ is the set of cell $c_i$ and its neighbors in the graph $G$. The network fusion initiates from $\widetilde{W}$, using $A$ as the kernel matrix to capture local structure of the graph.

To fuse the two distance matrices (PA-dist and GE-dist), first $W^{PA}$ and $W^{GE}$ were computed, respectively. Then the corresponding initial state matrices $\widetilde{W}^{PA}$ and $\widetilde{W}^{GE}$ were derived from the two similarity matrices, and the kernel matrices $A^{PA}$ and $A^{GE}$ were also computed. Given the initial two status matrices at $t = 0$, $\widetilde{W}^{PA}_{t=0}$ and $\widetilde{W}^{GE}_{t=0}$, the fusion process iteratively updates the respective similarity matrix:

$$\widetilde{W}^{PA}_{t+1} = A^{PA} \times \widetilde{W}^{PA}_{t} \times (A^{PA})^{T}, \qquad \widetilde{W}^{GE}_{t+1} = A^{GE} \times \widetilde{W}^{GE}_{t} \times (A^{GE})^{T} \tag{5}$$

Then after $t$ iterations, the final status matrix is obtained:

$$\overline{W} = \frac{\widetilde{W}^{PA}_{t} + \widetilde{W}^{GE}_{t}}{2} \tag{6}$$

$\overline{W}$ is the fused cell-cell distance network obtained by incorporating cells’ APA isoform and gene expression profiles. The corresponding cell-cell similarity matrix is $1 - \overline{W}$. The distance or similarity matrix can be used for downstream cell type clustering.

Single cell clustering

Four widely-used clustering methods were provided in scLAPA to cluster cells on the basis of the fused cell-cell similarity matrix, including Louvain clustering [38], hierarchical clustering (HC) [39], spectral clustering (SC) [40] and k-means. The Louvain clustering was implemented by the
cluster_louvain function in the R package igraph, with arguments ‘mode=undirected, weighted=TRUE, diag=TRUE’. The spectral clustering was implemented by the spectralClustering function in the R package SNFtool with default settings [28]. The hierarchical clustering [39] was performed by the flashClust function in the R package flashClust with default settings [41]. The k-means clustering was implemented by the kmeans function of the R package stats with arguments ‘iter.max=1e+9, nstart=1000’.

Performance evaluation

We distinguished two scenarios, similarity learning and clustering, to evaluate our approach. For each scenario, we applied scLAPA to four scRNA-seq datasets with pre-annotated cell labels, and compared results with other competing approaches. For the scenario of similarity learning, we compared scLAPA with seven similarity measures, including three measures designed for scRNA-seq (RAFSIL1/2 and SIMLR), two measures of proportionality ($\rho_p$ and $\phi_s$) and two traditional similarity measures (Euclidean distance and Pearson correlation). Each of these measures was applied to a given GE-matrix to learn a cell-cell similarity matrix. For scLAPA, we applied each measure to learn two cell-cell similarity matrices from PA-matrix and GE-matrix and fused them into one matrix. We also applied different clustering methods, including Louvain, HC, SC and k-means, on the similarity matrix learned from each similarity measure to assess different similarity measures in the context of clustering.

For the scenario of clustering, we compared scLAPA with five state-of-the-art clustering methods for scRNA-seq data, including SC3 [6], Seurat v3 [13], SINCERA [42], SNN-Cliq [43] and dynamic tree
cut method (dynamicTreeCut) [44]. None of these approaches provides an explicit similarity learning procedure; instead, they provide cell labels by unsupervised learning on the GE-matrix. Each approach was applied to a given GE-matrix for cell clustering and class labels of cells were obtained. For scLAPA, we applied each of the four methods (Louvain, HC, SC and k-means) on the fused similarity matrix to obtain clustering results.

Two internal validation metrics, Dunn index [45] and Connectivity [45], were employed for the first scenario to quantitatively assess the goodness of a clustering structure without relying on any clustering methods or knowing external information about class labels. The Dunn index [45] evaluates non-linear combinations of the between-group separation and the within-group compactness. The Connectivity reflects the extent to which observations are placed in the same group as their neighbors in the data space. The original value of Connectivity ranges from zero to infinity, with a smaller value denoting higher performance. Here we used a transform, 1/log10(Connectivity + 1), to make Connectivity consistent with Dunn. The larger the score of Connectivity or Dunn, the better the separation is. The R package clValid [45] was adopted to calculate the Connectivity and Dunn index.

Additionally, we used three popular metrics to evaluate the performance of scLAPA in the context of clustering, including the ARI (Adjusted Rand Index), Jaccard and NMI (Normalized Mutual Information). The value of ARI ranges from -1 to 1, and values of NMI and Jaccard range from 0 to 1, with a higher value indicating better performance. ARI is a widely-used metric for measuring the concordance between two clustering results.
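The Connectivity transform and the ARI described above can be sketched in plain Python as follows. This is an illustrative sketch, not the clValid or clues implementation used in the study. The ARI is computed with the standard pair-counting formula from the contingency table of two labelings, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def transformed_connectivity(connectivity):
    """1 / log10(Connectivity + 1), so that larger values mean better separation.

    Undefined (division by zero) when Connectivity is exactly 0.
    """
    return 1.0 / math.log10(connectivity + 1.0)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same cells."""
    n = len(labels_a)
    # Number of co-clustered pairs in each contingency cell and each margin.
    sum_cells = sum(math.comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one single cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy labels for four cells.
truth = ["hair", "hair", "nonhair", "nonhair"]
same_partition = [1, 1, 2, 2]
crossed_partition = [1, 2, 1, 2]

ari_same = adjusted_rand_index(truth, same_partition)
ari_crossed = adjusted_rand_index(truth, crossed_partition)
score = transformed_connectivity(9.0)
```

The ARI depends only on which cells are grouped together, not on the label names, so `same_partition` scores as a perfect match while `crossed_partition` scores below zero, that is, worse than expected by chance.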
The Jaccard index quantifies the similarity between two sets of cluster assignments. NMI is a normalized variant of mutual information for evaluating clustering results, which scales the mutual information to the range 0 to 1. ARI and Jaccard were calculated using the adjustedRand function in the R package clues [46]; NMI was obtained by the compare function in the R package igraph (https://igraph.org/r/).

Bioinformatics analyses

UMAP [47], a non-linear dimensionality reduction technique that groups similar cells in a low-dimensional space, was adopted for visualizing the distribution of single cells. UMAP was implemented by the calculateUMAP function in the scater R package [48]. For the analysis of the Arabidopsis root data, DESeq2 [49] was adopted to identify DE genes and DE poly(A) sites. First, the GE-matrix and PA-matrix were normalized by the median ratio method provided in DESeq2. Then the DESeq function was applied for DE detection. Genes or poly(A) sites with log2 fold change >= 0.8 and adjusted P-value <= 0.1 were considered DE.

Results

Single-cell polyadenylation profile distinguishes cells

Recently, scRNA-seq has emerged as a unique tool to explore cell-specific gene or isoform expression in plants [50-54]. A previous study [51] utilized root-hair and nonhair cell types as models and revealed the potential of using scRNA-seq data for inferring specific cells during the process of cell-type differentiation. Here we focused on the epidermal tissue and analyzed differential expression on both gene and APA levels between root-hair and nonhair cells.
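The two numerical steps of this DE analysis, median ratio normalization and thresholding, can be sketched in Python as follows. This is a minimal illustration and not DESeq2 itself: dispersion estimation and the statistical test are not reproduced, and the function names are hypothetical. Whether the fold-change cutoff applies to the absolute value is not stated in the text, so the two_sided switch is an explicit assumption, as is treating a missing adjusted P-value as not DE.

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a features x samples table of counts."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # a zero count makes the geometric mean zero, so skip the feature
        log_geo_mean = sum(log(c) for c in row) / n_samples
        for s, c in enumerate(row):
            log_ratios[s].append(log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no feature has non-zero counts in every sample")
    # the size factor of a sample is the median ratio to the per-feature geometric mean
    return [exp(median(r)) for r in log_ratios]


def is_de(log2_fold_change, padj, lfc_cutoff=0.8, padj_cutoff=0.1, two_sided=True):
    """Apply the cutoffs stated in the text to one gene or poly(A) site."""
    if padj is None:
        return False  # assumption: no adjusted P-value means the feature is not called DE
    # assumption: with two_sided=True the cutoff is applied to the absolute fold change
    effect = abs(log2_fold_change) if two_sided else log2_fold_change
    return effect >= lfc_cutoff and padj <= padj_cutoff
```

Dividing each sample's counts by its size factor gives the normalized table. In the example tested here, a sample with twice the depth of another receives a size factor twice as large.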
A total of 294 root-hair cells and 195 nonhair cells were defined by the previous study [51]. Although both the GE-matrix and the PA-matrix were obtained from the same scRNA-seq data, we still found four genes exclusively present in the PA-matrix (Figure 1A). For example, AT1G64140, a WRKY transcription factor gene, was absent from the single-cell GE-matrix, while it has one poly(A) site (coord: 23803757) with a much higher expression level in nonhair than in hair cells according to the PA-matrix. Interestingly, this poly(A) site is an annotated poly(A) site in the extended 3′ UTR, which was supported by bulk 3′-seq data according to PlantAPAdb [55]. Similarly, AT3G2522, a hypothetical protein coding gene, is missing from the GE-matrix, while its one poly(A) site (coord: 9184927) is expressed at a much higher level in nonhair cells than in hair cells. This poly(A) site was also annotated as a 3′ UTR site in PlantAPAdb. Moreover, 1422 genes possess at least one differentially used poly(A) site, among which 171 genes were not DE genes (Table S3). For example, AT1G59725 is a DNAJ heat shock family protein expressed in root. Although both AT1G59725 and its one poly(A) site are expressed at a higher level in root hair cells than in nonhair cells, the difference between the two cell types characterized by the poly(A) profile is much more pronounced than that characterized by the gene profile (Figure 1B). Further, using only the GE-matrix, a subset of cells is indistinguishable between hair and nonhair cell types (Figure 1C). In contrast, cells from the two cell types were clearly separated on the basis of the PA-matrix and two potential subpopulations of nonhair cells were observed (Figure 1C).
Therefore, we anticipate that the poly(A) site expression profile may encode complementary information that is absent or insignificant in the gene expression profile, which could be useful to distinguish cell types. There is great potential to develop integrative approaches for discerning cell identities that can properly incorporate single-cell profiling of both gene expression and polyadenylation information.

Learning cell-cell similarities with scLAPA

We proposed the scLAPA toolkit, which can learn cell-cell similarities by taking advantage of the complementarities between the two layers of APA isoforms and genes. Here we compared the performance of the similarity metric learned from scLAPA with seven other similarity metrics by analyzing four scRNA-seq datasets. Two metrics, Dunn and Connectivity, were adopted to quantitatively measure cell separation independent of clustering methods. Generally, scLAPA provides performance higher than or comparable to that of other metrics across all four datasets, whereas Pearson correlation or Euclidean distance has a consistently lower performance (Figures 2A and S2). In terms of both Dunn and Connectivity, scLAPA and SIMLR perform significantly better than other three metrics. In particular, SIMLR outperforms scLAPA on the Hypothalamus data whereas scLAPA outperforms SIMLR on the Mammary data. Overall, scLAPA performs better than at least six out of the seven metrics in all four datasets, never being the worst in any case. According to the Dunn index (Figure 2A), even for datasets where the performance of scLAPA is not the best, scLAPA is always a close match to the best.
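For readers who wish to reproduce this kind of internal validation, the Dunn index can be computed directly from a cell-cell distance matrix and a set of cell labels. The Python sketch below assumes the common definition, the smallest between-cluster distance divided by the largest within-cluster distance; it is not the clValid code used in this study, and the function name is hypothetical. It also assumes that distances are already available; how a learned similarity matrix would be converted to distances is not covered here.

```python
def dunn_index(dist, labels):
    """Smallest between-cluster distance divided by the largest within-cluster distance."""
    n = len(labels)
    min_between = float("inf")
    max_within = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                max_within = max(max_within, dist[i][j])    # cluster diameter candidate
            else:
                min_between = min(min_between, dist[i][j])  # cluster separation candidate
    if min_between == float("inf"):
        raise ValueError("at least two clusters are required")
    if max_within == 0.0:
        return float("inf")  # every cluster is a single cell or a set of identical cells
    return min_between / max_within
```

Compact, well separated groups give a large value, while labels that mix distant cells into one group push the value toward zero.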
For example, the Dunn score from scLAPA on the Hypothalamus data is 0.94, which is very close to the best score (0.985 from SIMLR). Next we used a radar chart to compare the performance of these similarity metrics more intuitively. Evidently, scLAPA and SIMLR stand out as universally better than the others, and discrepancies in the performance of the other six metrics across different datasets were observed (Figure 2B). For example, the overall similarity based on the RAFSIL1/2 metric is much higher on the Mammary and Hypothalamus data than on the other two datasets, revealing the instability of the performance of RAFSIL across different datasets. For all four datasets, both Euclidean distance and Pearson correlation emerge as the worst similarity metrics. In contrast, scLAPA provides a more robust result regardless of the dataset.

scLAPA is integrative and flexible in that different distance metrics can be chosen to learn cell-cell similarities for distance fusion. Next we examined the effect of using different distance metrics in scLAPA. The performance of scLAPA according to the Dunn index is highly robust across all datasets regardless of the distance metrics used in scLAPA (Figure 2C). It is widely accepted that it is highly challenging to determine an optimal distance metric for profiling true cell-cell relationships from the complex and heterogeneous scRNA-seq data [3].
However, the integrative framework of scLAPA provides an effective solution of distance fusion by assembling results from multiple data layers into one ensemble result, which can mitigate limitations in individual similarity metrics or data layers and facilitate generalization and adaptation to different scRNA-seq datasets. Take the Hypothalamus data as an example. The matrix with block structures obtained from scLAPA showed higher consistency with the true labels than did those from other similarity metrics (Figure S3). Block structures learned by SIMLR are indistinguishable from background signatures; block structures learned by Pearson correlation, Euclidean distance or the two measures of proportionality are also mixed with background signatures; block structures learned from RAFSIL are generally consistent with the true structures, except that cell types with a small number of cells are less distinguishable. Overall, scLAPA provides more divergent clusters with higher distinction, and individual clusters obtained by scLAPA are more compact than those obtained by other similarity metrics. These results demonstrate the ability and robustness of scLAPA in improving cell separation across numerous scRNA-seq datasets.

Cell type clustering with scLAPA

Cell-cell similarities learned by different similarity metrics can be adapted to other clustering methods that take similarities as inputs. Here we performed extensive comparisons of scLAPA with seven other similarity metrics by applying different clustering methods for cell clustering. First we applied Louvain [2], a graph-based method for community detection, to different similarity metrics for clustering.
According to the ARI score, similarities learned by scLAPA and SIMLR significantly outperform similarities obtained from Euclidean distance, Pearson correlation or RAFSIL1/2 (Figure 3A). Overall, SIMLR shows similar performance to scLAPA, whereas scLAPA outperforms SIMLR in three out of the four datasets. In particular, Euclidean distance and Pearson correlation present the worst performance in two datasets, Mammary and Root. Similar results were obtained in terms of the two other indices, NMI and Jaccard (Figure S4). In addition to Louvain clustering, we also investigated three other popular clustering methods, including hierarchical clustering [39], spectral clustering [40] and k-means [56], to evaluate the robustness of results by applying different clustering methods on the same similarity metric (Figures S5-7). In particular, the performance of scLAPA and RAFSIL1/2 is robust regardless of the clustering method used, whereas scLAPA consistently outperforms RAFSIL. In contrast, SIMLR, Euclidean distance and Pearson correlation are very sensitive to the clustering method applied (Figure 3B). Surprisingly, although SIMLR achieves performance comparable to scLAPA based on Louvain clustering (Figure 3A), its performance is the worst using k-means or spectral clustering (Figure 3B). Taking the Mammary data as an example, the ARI score of SIMLR drops from 0.769 when using Louvain clustering to an extremely low median value of 0.026 when using k-means. Moreover, we noted that ARI scores from individual runs of k-means clustering on SIMLR similarities varied greatly, revealing the relatively poor robustness of SIMLR with k-means clustering (Figure S5).
These results demonstrate that the cell-cell similarity matrix learned from scLAPA is more effective and robust than competing similarity metrics in clustering cell subpopulations.

During the preparation of this manuscript, we noticed another method, scDaPars [57], which quantifies and recovers APA events in single cells using standard scRNA-seq data. The authors also integrated APA information identified by scDaPars with imputed gene expression by similarity network fusion to reveal novel cell subpopulations during human embryonic development. Unlike scDaPars, which employs the (imputed) percentage of distal poly(A) site usage index (PDUI) to measure APA usage, scLAPA directly utilizes the raw poly(A) expression profile. Here we compared the performance of scLAPA and scDaPars by applying them to the four scRNA-seq datasets in our benchmarking analysis. Following the process in Gao et al. [57], we calculated PDUI based on the PA-matrix and imputed APA profiles using scDaPars. Then we applied five similarity metrics to the scDaPars-imputed APA profile and the GE-matrix to generate scDaPars-dist and GE-dist, respectively. After fusing the two distance matrices with SNF, we applied Louvain clustering to the fused cell-cell similarities to cluster cells. According to the ARI score (Figure 4), scLAPA significantly outperforms scDaPars on all four datasets. In particular, ARI scores of scDaPars with different similarity metrics varied greatly, whereas the performance of scLAPA is robust with different similarity metrics (Figure 4 vs. Figure 2C), revealing that the poly(A) expression profile used in scLAPA is more efficient and robust than the PDUI profile used in scDaPars for clustering cells.
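To illustrate the difference between a raw poly(A) expression profile and a PDUI profile, the following Python sketch derives a PDUI-like value for one gene in one cell from per-site counts. It is a simplification under stated assumptions and not the scDaPars estimator: PDUI is taken here as the count of the most distal site divided by the total count over all sites of the gene, sites are assumed to be ordered from proximal to distal, and the function name and the example counts are hypothetical.

```python
def pdui(site_counts):
    """PDUI-like ratio for one gene in one cell; sites are ordered proximal -> distal."""
    if len(site_counts) < 2:
        return None  # a single-site gene carries no relative-usage information
    total = sum(site_counts)
    if total == 0:
        return None  # gene not detected in this cell, so the ratio is undefined
    return site_counts[-1] / total


# Hypothetical per-site counts of three genes in one cell.
cell = {"geneA": [3, 9], "geneB": [5], "geneC": [0, 0]}
profile = {gene: pdui(counts) for gene, counts in cell.items()}
```

In this toy cell only geneA yields a value (0.75). The single-site gene and the undetected gene are left undefined, although the single-site gene still contributes a count of 5 to a raw poly(A) expression profile.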
Next we expanded the benchmarking analysis by comparing the clustering results of scLAPA with those of other single-cell clustering methods that directly take the gene-cell expression matrix as input without an explicit procedure of similarity learning. Specifically, we included five popular tools for comparison, including SC3 [6], Seurat v3 [13], SINCERA [42], SNN-Cliq [43] and dynamicTreeCut [44]. According to the ARI score, scLAPA achieves performance that is generally higher than or comparable to that of other methods, whereas dynamicTreeCut provides a consistently lower performance (Figure 5). Similar results were observed using the Jaccard or NMI indices (Figure S8). Specifically, scLAPA provides the best ARI score in three out of the four datasets (Figure 5). For the Hypothalamus data, where SC3 performs the best, scLAPA presents an ARI score very close to that of SC3 (scLAPA=0.985; SC3=0.99). Notably, for three datasets (Mammary, Hypothalamus and Root), ARI scores of individual SC3 runs varied greatly, reflecting that the performance of SC3 may be unstable on some kinds of datasets. Overall, the performance of scLAPA is robust and consistently high across diverse scRNA-seq datasets.

scLAPA identifies hidden subpopulations of cells

We next applied scLAPA to the human PBMC 4k dataset from 10x Genomics for cell type clustering. First we examined the cell type composition of the PBMCs by applying Seurat to the gene-cell expression matrix (GE-matrix). Seurat yielded ten distinct cell clusters (Figure 6A). Based on the expression of known markers (Table S2), nine clusters were annotated. Up to 13,512 poly(A) sites from 9601 genes were identified from the raw RNA-seq data with scAPAtrap. We learned cell-cell similarities with scLAPA by jointly considering expression profiles of APA isoforms and genes. After
applying Louvain clustering on the cell-cell similarity matrix, 14 cell clusters were obtained and 11 clusters were successfully annotated. These 11 clusters covered the nine clusters identified by Seurat and contained two new small clusters (Figure 6B). Both subclusters were supported by the expression patterns of markers, suggesting that they represented distinct cell types. One subcluster was annotated as regulatory T cells on the basis of elevated expression of three markers, CCR10, FOXP3 and IL2RA (Figure S9). Depending only on the gene expression profile, regulatory T cells were not well resolved and are indistinguishable from other T cells (Figure 6A). Although the gene expression of the marker CCR10 is sparse and weak among T cells, we could still clearly distinguish regulatory T cells from other T cell types according to the UMAP visualization of the gene expression profile (Figure 6C). Notably, CCR10 has four annotated poly(A) sites according to APASdb [58], whereas only one poly(A) site was identified from the scRNA-seq data. This is not unexpected, as the bulk 3′-seq data contain more diverse tissue samples than the PBMC data and scRNA-seq data are generally too sparse to identify all poly(A) sites. However, we have shown that even a single poly(A) site can encapsulate useful information beyond the gene expression profile (Figure 1). The other subcluster, where cell markers such as PPBP and PF4 are expressed, was annotated as megakaryocyte progenitors (Figures 6D and S10). According to the PA-matrix, PPBP carries three poly(A) sites, and five poly(A) sites of PPBP were annotated in APASdb. These three poly(A) sites were all highly expressed in megakaryocyte progenitors (cluster 10) (Figure 6E).
These results demonstrate that scLAPA facilitates the capture and identification of hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone.

Discussion

scLAPA is an integrative framework for learning association for single-cell transcriptomics by leveraging expression profiles of genes and APA isoforms in individual cells, which highlights the inclusion of polyadenylation signatures for improving cell type clustering and discovering new cell types. The effectiveness of scLAPA for cell-cell similarity learning and cell type clustering is evidenced by comparisons with various similarity metrics and single-cell clustering methods on several scRNA-seq datasets. scLAPA has a number of desirable features. First, scLAPA incorporates existing tools to extract and quantify poly(A) sites directly from scRNA-seq, which augments the gene-level analysis with an additional layer of APA information without altering the scRNA-seq protocol or performing additional sequencing experiments. Second, by employing the strategy of similarity network fusion, scLAPA jointly considers expression profiles at both levels of APA isoforms and genes for learning highly informative cell-cell similarities. Third, in contrast to many other methods that cluster cells without an explicit similarity learning step, scLAPA provides two independent but connected modules for similarity learning and cell clustering, each with various methods for users’ choice. Accordingly, users can freely combine different similarity metrics and clustering methods in scLAPA to evaluate the clustering results for any given dataset.
Fourth, the framework of scLAPA is highly flexible and can be seamlessly embedded into most existing scRNA-seq pipelines or tools for downstream analyses, such as dimension reduction, cell type clustering and differential expression analysis. Accordingly, existing tools, such as those designed for dropout imputation, normalization and similarity learning, can also be easily incorporated into scLAPA.

With scLAPA, distinct cell-cell similarity networks can be effectively learned from profiles of gene expression and polyadenylation separately by various similarity metrics. scLAPA then employs the strategy of similarity network fusion for scalable and robust integration of similarity networks learned from different data layers. This strategy has the advantage of exploiting complementarities in distinct data layers for fully profiling the spectrum of underlying data. Moreover, the consensus set of cell-cell interactions and associations from the APA layer and the gene layer can be learned from the given data, mitigating noise and dropouts in the conventional gene-cell expression profile and thus enhancing accuracy for downstream analyses. By combining expression profiles of APA and genes through similarity network fusion, we found two hidden subpopulations of PBMCs that were undetectable using only gene expression data (Figure 6). Moreover, the augmentation of gene expression profiles with polyadenylation information enhances single-cell clustering results and generates more discriminative cell types (Figures 2-5).
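The cross-diffusion idea behind similarity network fusion [28] can be sketched in a few lines of Python for two layers. This is a simplified illustration and not the SNFtool implementation used by scLAPA: it omits the per-iteration renormalization, the neighborhood size k and the number of iterations are illustrative assumptions, all similarities are assumed to be positive, and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def full_kernel(w):
    """Row-normalise a similarity matrix, then symmetrise it."""
    p = [[v / sum(row) for v in row] for row in w]
    n = len(p)
    return [[(p[i][j] + p[j][i]) / 2 for j in range(n)] for i in range(n)]


def local_kernel(w, k):
    """Keep the k largest similarities in each row (normally including the cell itself)."""
    n = len(w)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: w[i][j], reverse=True)[:k]
        total = sum(w[i][j] for j in nearest)
        for j in nearest:
            s[i][j] = w[i][j] / total
    return s


def fuse_two_layers(w_ge, w_pa, k=2, iterations=10):
    """Fuse a gene-layer and a poly(A)-layer similarity matrix by cross-diffusion."""
    p_ge, p_pa = full_kernel(w_ge), full_kernel(w_pa)
    s_ge, s_pa = local_kernel(w_ge, k), local_kernel(w_pa, k)
    for _ in range(iterations):
        # each layer is updated through its own local kernel using the other layer's
        # status from the previous iteration (the tuple assignment keeps them in step)
        p_ge, p_pa = (
            matmul(matmul(s_ge, p_pa), transpose(s_ge)),
            matmul(matmul(s_pa, p_ge), transpose(s_pa)),
        )
    n = len(p_ge)
    return [[(p_ge[i][j] + p_pa[i][j]) / 2 for j in range(n)] for i in range(n)]
```

Restricting each update to local neighborhoods means that strong similarities supported by both layers are reinforced, whereas weak similarities are not amplified. Applied to two small block-structured matrices, the fused matrix keeps within-group similarities well above between-group ones.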
As a comprehensive toolkit, scLAPA provides a unique strategy to improve cell type clustering and discover novel cell types, by combining gene expression with polyadenylation information at single-cell resolution.

scLAPA consists of three core function modules, including learning cell-cell similarities, distance fusion and clustering. Currently, numerous methods are available to learn cell-cell similarities or cluster cells with reasonable accuracy [3]. However, each method has its own strengths and limitations, and it is extremely challenging, if not impossible, to determine an optimal method for all kinds of datasets, as different methods may exploit different characteristics in the data [59]. Moreover, some similarity metrics may be overly dependent on downstream clustering methods, exacerbating difficulties in choosing a universally applicable combination of similarity and clustering methods. For example, based on the GE-matrix alone, similarities learned from SIMLR provide an overall high performance across datasets in terms of internal validation indexes (Figure 2A). However, SIMLR is highly dependent on downstream clustering methods for single-cell clustering; it achieves high performance with Louvain clustering (Figure 3A), whereas its performance drops sharply with k-means or spectral clustering (Figure 3B). In contrast, our benchmarking analyses showed that the performance of scLAPA is robust and consistently high across diverse datasets regardless of the distance metrics or clustering methods selected in scLAPA (Figures 2-5).
The unique strength of scLAPA may stem from its efficient fusion of the rich structures stored in the GE-matrix and the accompanying PA-matrix, which can amplify biological signals and augment cell-cell relationships.

scLAPA is an easy-to-use and highly flexible framework. The input of scLAPA is the GE- and PA-matrix, without using any a priori biological information. Even with raw scRNA-seq data, it is easy to obtain the prerequisite GE-matrix and/or PA-matrix using various tools, e.g. Cell Ranger for the GE-matrix, and scAPAtrap and Sierra for the PA-matrix. Recently, another tool, scDaPars [57], was proposed to quantify and recover APA usage from scRNA-seq data; it uses the relative usage of the distal poly(A) site, called PDUI, to measure a gene’s APA usage. With scDaPars, Gao et al. [57] analyzed cell-type-specific APA regulation and discovered hidden cell subpopulations from cancer and human endoderm differentiation scRNA-seq data. In scLAPA the input PA-matrix can be replaced with any other gene-cell-like matrix, so the scDaPars-imputed PDUI matrix can be used readily in scLAPA for downstream cell type clustering. However, although the scDaPars-imputed PDUI profile seems to be effective in revealing APA dynamics among cell types in the previous study [57], we found that, for cell type clustering, the performance with the PDUI-matrix is much lower and less robust than that with scLAPA’s PA-matrix (Figure 4). This may be due to several reasons. First, only genes with at least two 3′ UTR poly(A) sites can be used for scDaPars’ PDUI calculation; consequently, the PDUI-matrix is much sparser than the PA-matrix and information encoded in genes with a single poly(A) site is lost.
Second, although the PDUI profile can be imputed with scDaPars, limited information in the highly sparse PDUI-matrix confounds reliable imputation and may lead to propagation of errors or noise during the imputation process. Third, unlike scLAPA, which is specifically designed for learning cell-cell similarities and cell type clustering, the main function of scDaPars is to analyze cell-type-specific APA dynamics and identify novel APA-related cell types. We anticipate that the PA-matrix used in scLAPA may contain more comprehensive and reliable information than the PDUI-matrix or the imputed PDUI-matrix, which can significantly enhance the accuracy of cell type clustering. Overall, the PA-matrix is simple but effective and can be easily obtained from scRNA-seq data by various tools, making it more convenient to use scLAPA for scRNA-seq analyses.

For practical application, the current version of scLAPA implements seven similarity metrics and four clustering methods for users’ choice, which allows users to evaluate the effect of different combinations of distance metrics and clustering methods. Moreover, scLAPA is easily expandable in that additional distance metrics or clustering methods can be readily incorporated. Meanwhile, scRNA-seq preprocessing steps, such as dropout imputation and normalization, can also be easily applied before similarity learning. scLAPA can also be used as a plug-in for most existing scRNA-seq pipelines for similarity learning and cell clustering.

Supplementary Data

The file of supplemental materials contains all the Supplementary Figures and Tables.
Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61871463 to X.W. and 61573296 to G.J.) and Xiamen YLZ Yihui Technology Co., Ltd (XDHT2020131A).

Key Points

• We proposed a computational toolkit called scLAPA for learning association for single-cell transcriptomics from scRNA-seq data.
• scLAPA improves cell-cell similarity learning and cell type clustering by integrating single-cell profiling of gene expression and alternative polyadenylation.
• Objective benchmarking analyses using diverse scRNA-seq datasets demonstrate higher performance and robustness of scLAPA than competing methods in cell-cell similarity learning and cell type clustering.
• scLAPA discovers hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone.

References

1. Ziegenhain C, Vieth B, Parekh S et al. Comparative Analysis of Single-Cell RNA Sequencing Methods, Mol Cell 2017;65:631-643.e634.
2. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data, Nature Reviews Genetics 2019.
3. Skinnider MA, Squair JW, Foster LJ. Evaluating measures of association for single-cell transcriptomics, Nature Methods 2019;16:381-386.
4. Wang B, Zhu J, Pierson E et al. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning, Nature Methods 2017;14:414.
5. Grun D, Lyubimova A, Kester L et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types, Nature 2015;525:251-255.
6. Kiselev VY, Kirschner K, Schaub MT et al. SC3: consensus clustering of single-cell RNA-seq data, Nature Methods 2017;14:483.
7. Levine JH, Simonds EF, Bendall SC et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis, Cell 2015;162:184-197.
8. Qi R, Ma A, Ma Q et al. Clustering and classification methods for single-cell RNA-sequencing data, Briefings in Bioinformatics 2020;21:1196-1208.
9. Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data, Briefings in Bioinformatics 2020;21:1209-1223.
10. Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis, Nature Methods 2014;11:740.
11. Grun D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics, Nat Methods 2014;11:637-640.
12. Stuart T, Satija R. Integrative single-cell analysis, Nature Reviews Genetics 2019;20:257-272.
13. Stuart T, Butler A, Hoffman P et al. Comprehensive Integration of Single-Cell Data, Cell 2019;177:1888-1902.e1821.
14. Welch JD, Kozareva V, Ferreira A et al. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity, Cell 2019;177:1873-1887.e1817.
15. Wu X, Liu T, Ye C et al. scAPAtrap: identification and quantification of alternative polyadenylation sites from single-cell RNA-seq data, Briefings in Bioinformatics 2020.
16. Patrick R, Humphreys DT, Janbandhu V et al. Sierra: discovery of differential transcript usage from polyA-captured single-cell RNA-seq data, Genome Biol 2020;21:167.
17. Levin M, Zalts H, Mostov N et al. Gene expression dynamics are a proxy for selective pressures on alternatively polyadenylated isoforms, Nucleic Acids Res 2020;48:5926-5938.
18. Shulman ED, Elkon R. Cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data, Nucleic Acids Res 2019;47:10027-10039.
19. Arzalluz-Luque A, Conesa A. Single-cell RNAseq for the study of isoforms-how is that possible?, Genome Biology 2018;19:110.
20.
Song Y, Botvinnik OB, Lovci MT et al. Single-Cell Alternative Splicing Analysis with Expedition Reveals Splicing Dynamics during Neuron Differentiation, Molecular Cell 2017;67:148-161.e145.
21. Macosko EZ, Basu A, Satija R et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets, Cell 2015;161:1202-1214.
22. Hashimshony T, Wagner F, Sher N et al. CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification, Cell Rep 2012;2:666-673.
23. Zheng GX, Terry JM, Belgrader P et al. Massively parallel digital transcriptional profiling of single cells, Nat Commun 2017;8:14049.
24. Ye C, Zhou Q, Hong Y et al. Role of alternative polyadenylation dynamics in acute myeloid leukaemia at single-cell resolution, Rna Biology 2019;16:785-797.
25. Kim N, Chung W, Eum HH et al. Alternative polyadenylation of single cells delineates cell types and serves as a prognostic marker in early stage breast cancer, PloS one 2019;14:e0217196.
26. Velten L, Anders S, Pekowska A et al. Single-cell polyadenylation site mapping reveals 3' isoform choice variability, Molecular Systems Biology 2015;11:812-812.
27. Franzén O, Gan L-M, Björkegren JLM. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data, Database 2019;2019.
28. Wang B, Mezlini AM, Demir F et al. Similarity network fusion for aggregating data types on a genomic scale, Nature Methods 2014;11:333.
29. Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy, Genome Research 2017;27:491-499.
30. Liao Y, Smyth GK, Shi W.
featureCounts: an efficient general purpose program for assigning sequence reads to genomic features, Bioinformatics 2013;30:923-930. 31. Ye W, Liu T, Fu H et al. movAPA: Modeling and visualization of dynamics of alternative polyadenylation across biological samples, Bioinformatics 2020. 32. Shen Y, Ji G, Haas BJ et al. Genome level analysis of rice mRNA 3'-end processing signals and alternative polyadenylation, Nucleic Acids Research 2008;36:3150-3161. 33. Wu X, Liu M, Downie B et al. Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation, Proceedings of the National Academy of Sciences, USA 2011;108:12533-12538. 34. Zhao Z, Wu X, Raj Kumar PK et al. Bioinformatics Analysis of Alternative Polyadenylation in Green Alga Chlamydomonas reinhardtii Using Transcriptome Sequences from Three Different Sequencing Platforms, G3: Genes|Genomes|Genetics 2014;4:871-883. 35. Wu X, Gaffney B, Hunt A et al. Genome-wide determination of poly(A) sites in Medicago truncatula: evolutionary conservation of alternative poly(A) site choice, BMC Genomics 2014;15:615. 36. Pouyan MB, Kostka D. Random forest based similarity learning for single cell RNA sequencing data, Bioinformatics 2018;34:i79-i88. 37. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. 38. Blondel VD, Guillaume J-L, Lambiotte R et al. Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008;2008:P10008. 39. Eisen MB, Spellman PT, Brown PO et al. Cluster analysis and display of genome-wide expression patterns, Proc Natl Acad Sci U S A 1998;95:14863-14868. 40. Ng AY, Jordan M, Weiss Y. On Spectral Clustering: Analysis and an Algorithm. Advances in neural information processing systems. 2001, 849–856. 41. Langfelder P, Horvath S. Fast R Functions for Robust Correlations and Hierarchical Clustering, Journal of Statistical Software 2012;46:1-17. 42. 
Guo M, Wang H, Potter SS et al. SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis, PLoS Comput Biol 2015;11:e1004575-e1004575. 43. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method, Bioinformatics 2015;31:1974-1980. 44. Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R, Bioinformatics 2007;24:719-720. 45. Guy Brock, Vasyl Pihur, Susmita Datta et al. clValid, an R package for cluster validation, Journal of Statistical Software 2011;25:1-22. 46. Chang F, Qiu W, Zamar RH et al. clues: An R Package for Nonparametric Clustering Based on Local Shrinking, Journal of Statistical Software 2010;33:16. 47. McInnes L, Healy J, Saul N et al. UMAP: Uniform Manifold Approximation and Projection, Journal of Open Source Software 2018;3:861. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 48. McCarthy DJ, Campbell KR, Lun AT et al. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R, Bioinformatics 2017;33:1179-1186. 49. Love M, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2, Genome Biology 2014;15:550. 50. Jean-Baptiste K, McFaline-Figueroa JL, Alexandre CM et al. Dynamics of Gene Expression in Single Root Cells of Arabidopsis thaliana, The Plant Cell 2019;31:993-1011. 51. Ryu KH, Huang L, Kang HM et al. Single-Cell RNA Sequencing Resolves Molecular Relationships Among Individual Plant Cells, Plant Physiology 2019;179:1444-1456. 52. Shahan R, Hsu C-W, Nolan TM et al. A single cell Arabidopsis root atlas reveals developmental trajectories in wild type and cell identity mutants. 2020. 53. 
Shulse CN, Cole BJ, Ciobanu D et al. High-Throughput Single-Cell Transcriptome Profiling of Plant Cell Types, Cell Reports 2019;27. 54. Zhang T-Q, Xu Z-G, Shang G-D et al. A Single-Cell RNA Sequencing Profiles the Developmental Landscape of Arabidopsis Root, Molecular Plant 2019;12:648-660. 55. Zhu S, Ye W, Ye L et al. PlantAPAdb: A Comprehensive Database for Alternative Polyadenylation Sites in Plants, Plant Physiology 2020;182:228-242. 56. Kaufmann L, Rousseeuw P. Clustering by means of medoids. In: Dodge Y. (ed) Statistical data analysis based on the L1-norm and related methods. Amsterdam: North-Holland, 1987, 405–416. 57. Gao Y, Li L, Amos CI et al. Dynamic Analysis of Alternative Polyadenylation from Single-Cell RNA-Seq(scDaPars) Reveals Cell Subpopulations Invisible to Gene Expression Analysis, bioRxiv 2020:2020.2009.2023.310649. 58. You L, Wu J, Feng Y et al. APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals, Nucleic Acids Research 2014;43:D59-D67. 59. Shirkhorshidi AS, Aghabozorgi S, Wah TY. A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data, PLoS One 2015;10:e0144059. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Figure legends Figure 1. Single-cell poly(A) profile in root hair and nonhair cells. (A) Genes exclusively present in the PA-matrix. Four genes (AT3G22970, AT1G64140, AT4G09200 and AT3G25221) were not present in the GE-matrix, whereas they had at least one poly(A) site according to the PA-matrix. For each gene, the violin plot shows expression levels of its poly(A) site in hair and nonhair cells and the UMAP visualization shows the 2D embeddings of poly(A) profile. 
(B) Two example genes (AT1G59725 and AT4G18940) that are not differentially expressed (DE) but possess at least one DE poly(A) site. The upper panel shows the violin plot and UMAP visualization of the poly(A) profile of the respective gene in hair and nonhair cells. The lower panel shows the gene profile. (C) Single-cell poly(A) profile distinguishes root hair and root nonhair cells. The left plot is the UMAP representation on the basis of 171 genes that are not DE but have at least one DE poly(A) site; the right plot is the UMAP representation on the basis of the poly(A) profile of the 171 genes.
Figure 2. Benchmarking of similarity learning with scLAPA on four published scRNA-seq datasets. (A) The Dunn index, an internal validation metric, was employed to measure cell separation. (B) Radar chart showing the performance of different similarity metrics across datasets. Dataset names are shown near the vertices of the plot. Each vertex denotes the Dunn score of a metric on the respective dataset. The larger the area of a polygon in a radar chart, the higher the overall performance. (C) Radar chart showing the performance of scLAPA with different distance metrics for distance fusion. Each vertex denotes the Dunn score of using different distance metrics on the respective dataset.
Figure 3. Benchmarking of similarity learning with scLAPA in the context of clustering on four published scRNA-seq datasets. (A) ARI was employed to measure the concordance between inferred and true cluster labels. Louvain clustering was applied on the similarity matrices obtained from different methods. (B) Radar charts showing ARI scores obtained by applying different clustering methods on cell-cell similarities learned by each similarity metric. Each plot represents results of one dataset. Clustering methods are shown near the vertices of the plot. Each vertex of a plot denotes the ARI score of applying a clustering method on different metrics.
The larger the area of a polygon in a radar chart, the higher the overall performance. HC, hierarchical clustering; SC, spectral clustering.
Figure 4. Comparison of performance between scLAPA and scDaPars across four scRNA-seq datasets. Five similarity metrics were applied on the scDaPars-imputed PDUI profile and the GE-matrix to generate scDaPars-dist and GE-dist, respectively. After fusing the two distance matrices with SNF, Louvain clustering was applied on the fused cell-cell similarities to cluster cells. We did not include RAFSIL in this experiment due to its slow calculation speed. For scLAPA, Pearson correlation was used for similarity learning and Louvain was used for clustering.
Figure 5. ARI scores from six clustering methods across four scRNA-seq datasets. For scLAPA, Pearson correlation was used for similarity learning and Louvain was used for clustering.
Figure 6. scLAPA identifies hidden subpopulations of cells from human PBMCs. (A) UMAP representation of Seurat’s clustering results on the basis of GE-matrix. Ten clusters were obtained and nine were annotated with known cell types: Naive T cell (1), CD14+ Monocytes (2), CD8+ T cell (3), B cell (4), CD4+ Memory T (5), NK cell (6), CD16+ Monocytes (7), Monocyte Derived Dendritic (8, 10) and Plasmacytoid Dendritic (9). (B) UMAP representation of scLAPA’s clustering results on the basis of GE-matrix and PA-matrix.
Fourteen clusters were obtained and 11 clusters were annotated with known cell types: Regulatory T cell (1), Naive T cell (2, 3), Plasmacytoid Dendritic (4), CD4+ Memory T (5), CD8+ T cell (6), CD14+ Monocytes (7), Monocyte Derived Dendritic (8, 11, 14), CD16+ Monocytes (9), Megakaryocyte Progenitors (10), B cell (12) and NK cell (13). The two arrows mark two new subpopulations of cells identified by scLAPA. (C) Gene expression of CCR10 distinguishes regulatory T cells from other T cell types according to the UMAP visualization of the gene expression profile. The details in the dashed line box are shown in the solid line box. (D) Gene expression of PPBP distinguishes megakaryocyte progenitors from other cell types. (E) Three poly(A) sites of PPBP are all highly expressed in megakaryocyte progenitors.
Figure 1 (plot residue): panels A-C; violin plots of expression level (0 to 3) in Hair and Nonhair cells and UMAP embeddings (UMAP 1 vs UMAP 2) for AT4G09200 (PA coord: 5861865), AT3G22970 (PA coord: 8154110), AT1G64140 (PA coord: 23803757), AT3G25221 (PA coord: 9184927), AT1G59725, AT4G18940 and poly(A) sites PA7640 and PA30792.
Figure 2 (plot residue): panels A-C; Dunn score (0 to 1) on the Amygdala, Hypothalamus, Mammary and Root datasets for Euclidean, Pearson, SIMLR, RAFSIL1, RAFSIL2 and scLAPA, and radar charts for fused distance metrics (Euclidean+Euclidean, Pearson+Pearson, SIMLR+SIMLR, RAFSIL1+RAFSIL1, RAFSIL2+RAFSIL2).
Figure 3 (plot residue): panels A-B; ARI (0 to 1) on the Amygdala, Hypothalamus, Mammary and Root datasets for Euclidean, Pearson, SIMLR, RAFSIL1, RAFSIL2 and scLAPA, and radar charts over the HC, SC, K-means and Louvain clustering methods.
Figure 4 (plot residue): ARI (0 to 1) on the Amygdala, Hypothalamus, Mammary and Root datasets for scDaPars combined with Euclidean, Pearson and SIMLR, and for scLAPA.
Figure 5 (plot residue): ARI (0 to 1) on the Amygdala, Hypothalamus, Mammary and Root datasets for SC3, SINCERA, SNNclip, Seurat, dynamicTreeCut and scLAPA.
Figure 6 (plot residue): panels A-E; UMAP embeddings (UMAP 1 vs UMAP 2) of clusters 1-14 with Regulatory T cell and Megakaryocyte Progenitors marked, and expression level by cluster identity for poly(A) sites PA11107 (coord: 73986076), PA11108 (coord: 73986478) and PA11109 (coord: 73987038).
10_1101-2021_01_06_425494 ---- bioRxiv.org - the preprint server for Biology
10_1101-2021_01_05_425414 ---- LiquidCNA: tracking subclonal evolution from longitudinal liquid biopsies using somatic copy number alterations
Eszter Lakatos1*, Helen Hockings2,3,
Maximilian Mossner1, Weini Huang4, Michelle Lockley2,5, Trevor A. Graham1*
1 Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
2 Centre for Cancer Cell and Molecular Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
3 Barts Health NHS Trust, St Bartholomew’s Hospital, West Smithfield, London, UK
4 School of Mathematical Sciences, Queen Mary University of London, London, UK
5 Department of Gynaecological Oncology, Cancer Services, University College London Hospital, London, UK
* Correspondence: e.lakatos@qmul.ac.uk; t.graham@qmul.ac.uk
Abstract
Cell-free DNA (cfDNA) measured via liquid biopsies provides a way for minimally-invasive monitoring of tumour evolutionary dynamics during therapy. Here we present liquidCNA, a method to track subclonal evolution from longitudinally collected cfDNA samples based on somatic copy number alterations (SCNAs). LiquidCNA utilises SCNA profiles derived through cost-effective low-pass whole genome sequencing to automatically and simultaneously genotype and quantify the size of the dominant subclone without requiring prior knowledge of the genetic identity of the emerging clone. We demonstrate the accuracy of liquidCNA in synthetically generated sample sets and in vitro and in silico mixtures of cancer cell lines. Application in vivo in patients with metastatic lung cancer reveals the progressive emergence of a novel tumour sub-population. LiquidCNA is straightforward to use, computationally inexpensive and enables continuous monitoring of subclonal evolution to understand and control therapy-induced resistance.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted January 7, 2021.
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.05.425414; license: http://creativecommons.org/licenses/by-nc-nd/4.0/
Introduction
Liquid biopsies, primarily the analysis of cell-free DNA (cfDNA) present in blood samples, offer the potential for regular longitudinal and minimally invasive monitoring of cancer dynamics [1, 2, 3, 4, 5, 6, 7]. Circulating cfDNA is released into the blood via apoptosis or necrosis of cells. Tumour-derived cfDNA in the blood is detectable from tumours as small as 50 million cells [8], it shows correlation with disease stage [9, 10], and offers the same diagnostic potential as tissue-based biopsies [7]. cfDNA is an aggregate of DNA shed from multiple locations and multiple malignant cells across the body and hence a single sample can provide a comprehensive overview of systemic disease. Consequently, cfDNA is an exceptional resource for non-invasive tracking of tumour composition and for monitoring response to therapy or clinical relapse.
Typically, cfDNA analysis has focused on the detection of driver gene single nucleotide variants (SNVs), with the size of mutation-bearing clones inferred from the relative sequencing read count at the mutation site. For instance, in high-grade serous ovarian cancer (HGSOC) the frequency of TP53 mutation in cfDNA is a measure of tumour burden and is predictive of treatment response [11]. In colorectal cancer, KRAS mutation frequency in cfDNA is predictive of response to anti-EGFR therapy [4].
Somatic copy number alterations (SCNAs) are widespread in cancers [12, 13, 14], and have been used extensively to track tumour composition and dynamics over time [15, 16, 17, 18]. SCNAs can be detected in cfDNA without prior knowledge of the tumour SCNA profile, through measurement of the relative number of reads mapping within ‘bins’ spaced across the genome [19].
Relative differences in read count between bins can be sensitively detected even when the total read count is low [20, 21, 22], meaning that SCNAs can be detected with a fraction of the sequencing depth required for SNV detection. Therefore SCNA profiling offers a high-throughput and cost-effective means to evaluate cfDNA samples [23, 24, 25, 26, 27, 28].
Whilst measuring clone sizes based on the frequency of SNVs is straightforward, deriving quantitative information on the proportion of tumour population that carries a particular SCNA is challenging. Tumour cells are not the only contributors to the cfDNA pool, and an SCNA can in theory change the copy number to any non-negative integer value. Thus total read count per bin is a noisy compound function of the relative tumour cell contribution to the total cfDNA pool, and the specific copy number of the alteration.
Here we present a new method to identify and track tumour subclonal evolution based solely on measurement of SCNAs from longitudinal cfDNA samples. Our algorithm, named liquidCNA, firstly determines the contribution of tumour DNA to the total cfDNA pool (i.e. cellularity/purity) and then uses SCNA data to characterise and quantify the size of the most pervasive (putatively resistant) subclone emerging or contracting over time. The efficacy of the method is demonstrated using synthetic datasets, in vitro cell line mixtures, and in vivo via longitudinal analysis of cfDNA from lung cancer patients undergoing targeted treatment.
Results
Emergent subclone tracking from copy number information
First, we derive a mathematical definition of the problem of tracking an emergent (putatively resistant) tumour subclone from longitudinal cfDNA samples, typically taken throughout the course of treatment. We consider a tumour cell population undergoing continuous evolution characterised by two cell types, ancestral tumour cells (A) and an emerging subclone (S). We assume that liquid biopsies contain DNA originating from ancestral and subclonal tumour cells, as well as contaminating DNA from normal cells (N). The proportion of DNA arising from cells of the emergent subclone within the tumour is expressed by the subclonal-ratio, $r_i$, while the overall proportion of tumour-originating DNA is termed the purity or tumour fraction of the sample, denoted by $p_i$.
We consider that the copy number (CN) profile of each sample has been measured, for example using low-pass whole genome sequencing (lpWGS), and so the genome can be divided into segments, contiguous regions of constant CN. Each measured segment CN in sample $i$ ($C_i^j$) is the combination of each cell population’s CN at the $j$th genomic location (2 for normal cells and $C(A)$ and $C(S)$ for ancestral and subclonal tumour cells, respectively), weighted by the proportions of the three populations (Fig. 1):
$$C_i^j = 2 + p_i \left( (1 - r_i)\, C(A)^j + r_i\, C(S)^j - 2 \right). \qquad (1)$$
We assume that each segment can fall into one of three categories depending on its CN in ancestral and subclonal tumour cells. Clonal alterations (and unaltered segments) are at the same CN in both tumour populations, and their measured CN is only affected by the purity of a sample. Subclonal segments represent SCNAs that are unique to the emerging subclone.
Their measured CN is influenced by the subclonal-ratio of a sample, as well as sample purity. Finally, segments that do not follow either of these patterns, due to uncertain measurements or ongoing instability, are termed unstable. Our aim is to estimate the underlying purity and subclonal-ratio, $p_i$ and $r_i$, from longitudinal CN measurements of clonal and subclonal segments (Fig. 1).
Estimation of subclonal-ratio
Estimation is carried out in three steps (Fig. 2a and Methods). First, the purity of each sample is assessed using the distribution of segment CN values. We assume that the majority of segments have integer CN in all tumour cells, hence the distribution is expected to have distinct peaks at regular intervals of $p_i$, corresponding to clonal segments with CN of 1, 2, 3, etc. (Fig. 2b). We derive the purity estimate as the value that minimises the squared error between observed and expected peaks (Fig. 2c). The inferred purity values are used to correct the segment CN values, thus estimating the tumour-specific CN of each segment.
LiquidCNA does not require a mainly diploid tumour genome (i.e. major peak at CN = 2) to derive correct estimates, but will derive erroneous conclusions if the CN values, as measured by the CN quantification software (e.g. QDNAseq [19]), are incorrectly centred (e.g. the major peak is defined as copy number 2, but the true value is copy number 3). To control for this, we recommend an initial manual check of the CN profile prior to applying liquidCNA, with renormalisation to the correct ploidy if required.
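The purity step can be illustrated with a small Python sketch. It is a toy version under stated assumptions, not the liquidCNA implementation. Clonal segments (same CN in both tumour populations) are expected at $2 + p(k - 2)$ for integer tumour CN $k$, following Eq. (1), and a grid of candidate purities is scored by the squared distance of each observed segment CN to the nearest expected peak. Three things are assumptions made here: the function names, the candidate grid, and the division of the error by $p$, which stops very small purities from fitting trivially because their peaks are densely spaced.

```python
def mixed_cn(p, r, cn_ancestral, cn_subclonal):
    """Eq. (1): measured segment CN for purity p and subclonal-ratio r."""
    tumour_cn = (1 - r) * cn_ancestral + r * cn_subclonal
    return 2 + p * (tumour_cn - 2)


def peak_error(segment_cns, p, max_cn=8):
    """Mean squared distance to the nearest expected clonal peak, in units of p.

    Scaling by p is an assumption of this sketch (see lead-in), as is max_cn.
    """
    peaks = [2 + p * (k - 2) for k in range(max_cn + 1)]
    total = 0.0
    for x in segment_cns:
        nearest = min(abs(x - peak) for peak in peaks)
        total += (nearest / p) ** 2
    return total / len(segment_cns)


def estimate_purity(segment_cns, grid=None):
    """Return the candidate purity with the smallest peak error."""
    grid = grid or [i / 100 for i in range(5, 101)]  # hypothetical grid: 0.05..1.00
    return min(grid, key=lambda p: peak_error(segment_cns, p))


# Toy data: clonal segments at integer tumour CN, purity 0.6, small fixed noise.
true_p = 0.6
tumour_cns = [1, 2, 2, 3, 4, 2, 3, 1]
noise = [0.01, -0.01, 0.005, -0.005, 0.0, 0.01, -0.01, 0.005]
observed = [mixed_cn(true_p, 0.0, k, k) + e for k, e in zip(tumour_cns, noise)]
estimated_p = estimate_purity(observed)
print(estimated_p)
```

The sketch assumes the CN values are already correctly centred. If the major peak is assigned the wrong copy number, this scoring goes wrong in the same way the text warns about.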
Next, for every segment we compute the change in CN, ΔCN, between each sample and a baseline sample that is assumed to have negligible proportions of the emerging (putatively resistant) subclone, for example a sample taken upon diagnosis or before start of therapy. ΔCN values naturally highlight subclone-associated segments altered in non-baseline samples, as these segments display markedly positive (CN gain compared to baseline) or negative (CN loss) values (Fig. 2d). From these ΔCNs we then establish the set of segments that are subclonal and the sample ordering that reflects increasing subclonal proportions. To do this, we examine each possible order of samples, classifying each segment as clonal (if the variance of its ΔCNs across samples is below a pre-defined threshold), subclonal (if it shows monotone change in ΔCN value along the order of the samples, i.e. if the ΔCNs are consistent with an emerging subclone) or unstable (if it does not correlate with sample order) according to that order (Fig. 2e). The order with the highest proportion of segments classified as subclonal is selected, and these subclonal segments are used for downstream computation of tumour composition (Fig. 2f). The methodology ensures that the dominant subclone associated with the most pervasive SCNAs is evaluated and that subclonal-ratio inference is robust to segments with unstable CN.
Finally, we compute the relative and absolute subclonal-ratio of each sample using the identified set of subclonal segments. Relative subclonal-ratios are defined as the median ratio of segment ΔCNs compared to the sample with the maximum subclonal proportion (Fig. 2g). The absolute subclonal-ratio is computed based on the assumption that subclonal segment CN values correspond to distinct SCNAs that differ between ancestral and subclonal cells.
The subclonal-ratio of sample $i$ is therefore derived as the shared mean ($r_i$) of a mixture of Gaussian distributions with constrained means $-r_i$, $+r_i$, etc., fitting the ΔCN distribution of subclonal segments (Fig. 2h). We also provide the 95% confidence interval of the absolute subclonal-ratio estimate based on the shared variance of the fitted Gaussians (Fig. 2i).
LiquidCNA outputs both relative and absolute subclonal-ratio measures, since for most applications the relative value holds sufficient information on how the subclonal (putatively resistant) population changes between time-points. Relative proportions are also less susceptible to the measurement noise in the measured segment CNs, while a combination of low subclonal proportion and high sequencing noise can cause the fitting of absolute subclonal-ratio estimates to fail to converge.
Synthetic mixed populations
We first evaluated the performance of liquidCNA using synthetic datasets where input values of subclonal proportion and purity were known. We generated synthetic datasets with characteristics matching typical longitudinal measurements of patients. In order to simulate imperfect measurements, we added varying levels of normally distributed measurement noise (defined by the dimensionless parameter σ) to bin-wise CN values (Fig. 3a-c and Methods).
We evaluated the accuracy of the purity estimation on 250 synthetic samples (Fig. 3d), and found that purity $p$ could be estimated within 2% of the true tumour fraction in 90% of samples at noise levels σ ≤ 1.
The error on the purity estimation was greater when the noise was increased (Fig. 3e), and was most pronounced in samples with high noise and low tumour fraction. Consequently, we restricted our subsequent analysis to only cases of higher purity ($p_i \geq 0.1$).
Next, we derived subclonal-ratios using purity-corrected CN profiles on the higher purity subset of synthetic mixtures. We set a threshold to filter out clonal segments (see Fig. 2e) such that at least 10 segments were retained and the proportion of retained segments classified as subclonal was maximal following segment classification. Fig. 3f shows the true and estimated subclonal-ratios for 50 synthetic experiments. Overall, we found that subclonal-ratio was estimated with ~5% error, and the accuracy was influenced by measurement noise (Fig. 3g). Relative subclonal-ratios (calculated compared to the sample with highest subclonal proportion) were estimated with higher accuracy (error ~3%, Fig. S1a-b). We found that computing absolute subclonal-ratios in a two-step process from these values yielded similar results to direct estimation by fitting a Gaussian mixture model, and provided an estimate even in cases where the direct estimation did not converge (Fig. S1c and Methods). The proportion of unstable segments, unlike noise, had little effect on the estimation accuracy (Fig. S2).
Mixtures of ovarian cancer cell lines
Next, we evaluated liquidCNA on real data derived from in vitro mixtures of two paired high grade serous ovarian cancer (HGSOC) cell lines [29] (see Methods and Table S1). HGSOC cells were ideally suited for this evaluation as high levels of chromosomal instability are a hallmark of the disease [30, 31]. We anticipate that liquidCNA will be most applicable for the tracking of subclonal evolution in malignancies with high CNA burden [26].
We divided a population of OVCAR4 cells into two aliquots, and the first aliquot was untreated and classified as ‘sensitive’.
In a process described in detail by Hoare et al. [29], cells from the second aliquot were cultured so that they evolved resistance to platinum-containing chemotherapy and thus were termed ‘resistant’. In addition to the high SCNA burden inherited from the ancestral sensitive cell line, resistant cells acquired new SCNAs during the in vitro evolution of resistance (Figure 4a).
We then mixed, in varying known proportions, the genomic DNA extracted from the two cell lines, with sensitive cells representing the ancestral and resistant cells the emerging subclonal population. The mixtures were further diluted with DNA from blood samples of healthy volunteers assumed to have a diploid genome; this modelled the effect of normal contamination in patient samples (Table S1). These DNA mixtures were sequenced to mean depth 1.3x and composite SCNA profiles were generated (see Methods). In addition, we generated further in silico mixtures by sampling and mixing genome-aligned reads from sequencing data from each of the three cell types sequenced individually. In these mixtures, we controlled the total number of reads per sample to study the effect of variable read depth and associated measurement noise.
First, we used liquidCNA to estimate the purity of in vitro mixed samples (samples S0-S5). The purity of each sample was estimated to be lower than the theoretical mixing proportion (Fig. 4b). In the in silico mixed samples, we found that there was a strong linear relationship between estimated and true purity (Fig. 4c).
The underestimation of purity in the samples might be explained by our definition of theoretical purity in the in vitro and in silico mixing procedures (defined as the proportion of DNA weight and the proportion of read counts, respectively). A highly aneuploid genome will likely have a higher weight than a diploid genome; therefore, mixing equal weights results in a higher proportion of normal genomes than expected. Our purity estimates were in agreement with observed peaks of the CN distribution (Fig. S3a), further confirming that there was no bias in the estimation. By fitting a linear model to the estimates, the theoretical tumour fraction could be fully recovered, as illustrated by the ‘corrected’ estimates of samples S0-S5 (Fig. 4b). The number of reads (sequencing depth) did not systematically influence the accuracy of estimating tumour fraction, but purity estimates of samples with low tumour fraction were noisier at low read depth (Fig. 4c). In summary, liquidCNA provided an accurate estimate for purity values when true purity was above 10%. Decreased measurement accuracy below 10% purity is consistent with our observations on synthetic data and is similar to reported limitations of other methods quantifying tumour fraction from lpWGS cfDNA [24, 20, 22]. Therefore, we advise discarding samples below 10% predicted purity from downstream analysis, although low-purity samples may be usable if a very accurate purity estimate can be derived by other means.

Next, we inferred the subclonal-ratio for cell line mixtures using purity-corrected ΔCN values, with sample S0 used as the baseline sample for both in vitro and in silico sample sets. We could correctly order cell line mixtures according to subclonal-ratios without any a priori information (Fig. S3b), and absolute subclonal-ratios and relative subclonal changes were estimated on average within 2% and 3% of the true subclonal percentage, respectively (Fig. 4d,f).
In particular, we noted that samples S4 and S3 were accurately estimated as having an equal subclonal-ratio, despite originating from different biological replicates with different tumour purity, which was reflected in the small confidence intervals of their estimates. We also note that even though there were no truly unstable segments in this dataset, as measurements were not taken over time, three non-clonal segments were classified as such, probably due to higher noise in their measured CN value.

Using datasets of randomly selected in silico samples with 50 million reads, we confirmed that our algorithm could accurately infer the subclonal-ratio of samples, in particular when considering relative proportions (Fig. 4e,g). Although the estimation quality decreased with lower read counts (Fig. S4), in most cases the estimated absolute and relative subclonal-ratios were within 15% and 10% of the true subclonal proportion, respectively. Furthermore, we found that cases with high estimation error were typically caused by low-purity samples, which could be easily identified and removed without a priori information, as demonstrated in Fig. S5.

Using the known theoretical mixing values of tumour-DNA content, instead of data-derived estimates, to derive purity-corrected CN values increased the estimation error, especially in low read count samples (Fig. S6).
This finding emphasises that non-diploid genomes might bias alternative measurement methods, and that internal consistency in the method of deriving sample characteristics (purity and subclonal-ratio) is crucial when assessing the dynamics of the subclonal population.

Subclonal analysis of patient samples

We used liquidCNA to analyse emergent subclones in longitudinal cfDNA samples from patients with non-small cell lung cancer (NSCLC) undergoing therapy, as previously reported by Chen and colleagues [24]. The liquid biopsies were collected as part of the FIGARO study (GO27912, NCT01493843), a randomised phase II trial designed to evaluate the efficacy of pictilisib, a selective inhibitor of phosphatidylinositol 3 kinase [32]. Pictilisib or placebo was given in combination with a standard chemotherapy regimen, which was determined based on the subtype of NSCLC. Blood samples were taken at baseline (day 1 of the first treatment cycle) and at 6-week intervals up to the end of treatment (EOT). DNA was isolated from the plasma of liquid biopsies and sequenced using lpWGS to an average depth of 0.5x, as described in detail in [24].

Chen et al. [24] identified several SCNAs in EOT samples that were absent at baseline and described several genes within these regions that might be associated with resistance. We sought to apply liquidCNA to these cases to corroborate their observations, and further to quantify the size of emergent subclones over time in these patients. We obtained the lpWGS data (fastq files) and performed CN profiling (see Methods) on patients with cfDNA samples from ≥3 time-points (n = 32). We identified three patients (1306, 2760 and 3209) whose sample series fulfilled the following criteria: (i) a cfDNA sample was taken on the first day of therapy with purity above ∼20%; and (ii) at least two non-baseline samples had purity above ∼20%.
Patients 1306 and 3209 were in the experimental arm of the study, while patient 2760 was assigned to the control arm; all three patients progressed during the course of the trial.

We ran liquidCNA on data from the three selected patients (discarding samples with purity below 10% (Fig. S7)) and examined the genomic segments that liquidCNA identified as subclonal relative to baseline samples (Fig. 5). While we observed a good overlap with the CNs previously reported to be associated with subclonal evolution through therapy (Figures 5 and S8 of [24]), we also found a few segments that were missed or additionally identified by liquidCNA. The original study focused on the comparison of pre- and post-treatment samples and highlighted SCNAs occurring between the first and last time-points. As our analysis put equal focus on all time-points, it classified some of the previously identified segments as unstable if the CN progression was not consistent with subclone evolution. Furthermore, some segments were too small to pass our initial filtering. On the other hand, liquidCNA was able to identify subclonal segments which were at an abnormal CN in the baseline sample, and subsequently showed diploid CN or a further gain/loss in subclonal tumour cells. For example, in the samples from Patient 1306, liquidCNA identified subclonal SCNAs on chromosomes 2, 6 and 8 that overlapped with the findings of the original study, and also detected additional subclonal changes on chromosomes 17 and 18.
However, we did not observe the previously described focal loss on chromosome 7 (harbouring the gene MLL3), probably due to its small size. Overall, we identified 10, 12 and 17 subclone-associated SCNAs in patients 1306, 2760 and 3209, respectively. A further 6 segments in patient 2760 were classified as non-clonal but ‘unstable’, as their CN over time was not consistent with the pattern defined by the emerging subclone. As samples from patient 2760 had lower purity, these inconsistent CN changes might have resulted from measurement noise.

We found that the emerging subclone accounted for 10 to 30% of the tumour-derived DNA in the cfDNA of the three patients evaluated. Patient 2760 showed evidence of a subclonal proportion consistently around 30%, which could be explained by the samples from this patient having been taken at later time-points. Samples from patient 3209 obtained at 18 weeks and at the end of therapy contained below 20% DNA derived from subclonal tumour cells (Fig. 5). Patient 1306, on the other hand, showed a contracting subclone that reduced in proportion from 20% at week 18 to <10% at the end of therapy. If the total population size were known (which might be accessible from additional measurements of the tumour-associated cfDNA pool), the tumour subclone fractions established here could also be converted into growth rates to enable future predictions of the tumour dynamics.

Discussion

We present liquidCNA, a computational algorithm to infer longitudinal subclonal dynamics using copy number measurements.
Our algorithm performs simultaneous analysis of several longitudinal samples to identify sample purity, subclonal SCNAs and the abundance of an emerging subclone. LiquidCNA distinguishes between SCNAs that are associated with the emerging subclone and those showing unstable behaviour, and consequently is not confounded by uncertain CN measurements.

We validate our method on synthetic SCNA datasets and on in vitro and in silico mixtures of two ovarian cancer cell lines. We successfully infer the proportion of the dominant subclone in all of the above datasets, with good accuracy across a range of sample qualities defined by the noise level or the number of sequenced reads. In patients with lung cancer, liquidCNA applied to lpWGS data derived from longitudinal liquid biopsies (cfDNA) shows the emergence of subclones during therapy and identifies genomic regions associated with the emergent tumour cells. We demonstrate that liquidCNA can identify and quantify emerging subclones from cfDNA samples, thereby enabling tracking of tumour subclone evolution through the course of therapy.

Deciphering the evolutionary trajectory of cancer can aid prognostic and therapeutic decision-making and further our understanding of therapy-induced drug resistance [33]. Measuring the dynamics of tumour composition is particularly crucial for prospective monitoring during an adaptive therapy regime aiming to control resistant subclones [34, 35, 36]. Furthermore, the proportion of cfDNA that is tumour-derived (what we term ‘purity’) is in itself a promising biomarker for determining initial therapy response and prognosis [5, 37], as well as for tracking tumour progression during and after therapy [2, 10, 6, 24].

We note that our liquidCNA method has limitations. Since our inference relies on heterogeneous copy number profiles and subclone-specific SCNAs, we cannot analyse cancer (sub)types with very low chromosomal instability, for example microsatellite-unstable tumours.
Conversely, extremely high levels of ongoing instability might bias our analysis due to the lack of a stable subclone-associated SCNA profile, and therefore liquidCNA is not suitable for oligo-metastatic disease if spatially separate metastases carry distinct karyotypes. Furthermore, the accuracy of our estimation decreases at low purity (below 10%). However, tumour fractions above this regime have been observed in a substantial number of patients, especially in late-stage disease where liquidCNA can offer the largest benefit [24, 20, 6, 9, 22, 38]. In addition, recent studies have shown that the unique fragment length of tumour-derived cfDNA can be utilised to enrich for tumour purity either experimentally or bioinformatically [39, 40, 41]. Finally, liquidCNA tracks a single dominant subclone associated with the largest set of subclone-specific SCNAs, and if there are multiple smaller subclones (with fewer or no associated SCNAs), these will be ignored by the algorithm.

In summary, we provide a robust tool to derive quantitative information about dynamic changes in clonal composition from SCNA measurements derived from cfDNA. LiquidCNA enables real-time non-invasive tracking of subclonal tumour evolution, which can provide new insights into the evolution of SCNAs and the dynamical emergence of therapy-associated resistance.

Acknowledgements

We thank Ann-Marie Baker for reviewing the clarity of the text, and Steve Gendreau and Craig Cummings from Genentech, Inc.
for providing access to patient cfDNA sequencing results and for their critical comments on the presentation of the data. This work was supported by the Wellcome Trust (grant 202778/Z/16/Z to T.A.G.) and Cancer Research UK (grant A19771 to T.A.G. supporting E.L.; Advanced Clinician Scientist Fellowship C41405/A19694 to M.L.; Clinical Research Training Fellowship to H.H.). M.L. also received support from a Barts and The London Charity Strategic Research Grant (467/2244). T.A.G. also received funding from the National Institutes of Health, National Cancer Institute (grant U54 CA217376).

Author contributions

E.L., W.H., M.L. and T.A.G. conceived and designed the study. M.L. and T.A.G. acquired funding for the study. E.L. developed the inference method and performed bioinformatic analysis. H.H. and M.M. performed in vivo experiments and sequencing. E.L. and T.A.G. wrote the original draft, and all authors reviewed and approved the manuscript.

Competing interests

The authors declare no competing interest.

[Figure 1 image. Panels: tumour cell mixture with normal contamination (Samples 1-3); tumour copy number and measured copy number per genomic segment. Legend: clonal, subclonal and unstable segments; ancestral/sensitive and subclonal/resistant tumour cells; purity ($p_i$) and subclonal-ratio ($r_i$).]

Figure 1: Schematic of copy number measurements.
The first panel shows the SCNA profile of ancestral (in yellow) and subclonal (in red) tumour cells. At different sampling time-points, the overall tumour SCNA profile is a mixture of these profiles (second panel), influenced by the composition of tumour-derived DNA depicted in the pie-charts. Clonal, subclonal and unstable segments are indicated in yellow, red and blue, respectively. Note that the CN of clonal segments remains the same. In the liquid biopsies taken at each time-point, contamination from normal cells leads to ‘flattened’ measured SCNA profiles (last panel) due to normal cells having a neutral karyotype. This contamination affects the CN of each segment. Our aim is to estimate purity ($p_i$) and subclonal-ratio ($r_i$) based on clonal and subclonal SCNAs.

[Figure 2 image. Panel (a): workflow from the segment CN distribution to purity ($p_1, p_2, \dots$), purity-corrected segment CNs, ΔCN compared to the baseline sample, sample order and segment classification (clonal/normal, unstable, subclonal), relative subclonal-ratio ($r_{1,N}, r_{2,N}, \dots$) and subclonal-ratio ($r_1, r_2, \dots$). Axes in panels (b)-(i): segment CN vs density; purity estimate vs error of fit; segment CN by ordered sample; ΔCN vs number of segments; subclonal-ratio by sample.]

Figure 2: Illustration of the estimation algorithm. (a) Outline of the steps of the estimation algorithm. (b) Purity estimation based on the peaks of the distribution of segment CNs. Green lines show the peaks expected at an example purity of 0.21. (c) The error of a range of purity estimates, computed from the distance between the observed and estimated peaks in (b). Each line corresponds to a smoothing kernel applied to the raw segment CN distribution. The optimal purity is indicated with an arrow. (d) Change in segment CN values (ΔCNs) plotted according to an example sample order. The number of subclonal segments computed in (e) is indicated below. (e) Classification of segments based on the sample order in (d). Segments with low variance are classified as clonal (in grey).
Non-clonal segments are evaluated for whether they follow a quasi-monotone pattern (indicated by the shaded regions) and classified as unstable (outside the shaded region, in blue) or subclonal (in red). (f) ΔCN values plotted according to the optimal sample order maximising subclonal segments. Line colours indicate the class of each segment as in (e). (g) Relative subclonal-ratio estimation compared to the maximal subclonal-ratio sample (rightmost in (f)). Points show individual segment-wise estimates, with an example segment highlighted in black. The black line shows the median. (h-i) Subclonal-ratios and confidence intervals inferred by fitting a Gaussian mixture model to the ΔCN distribution of subclonal segments. The components of the best fit with means $-r$ and $r$ are shown in green and magenta in (h).

[Figure 3 image. Panels (a)-(g). Axes: ancestral CN vs subclonal CN (number of segments); true purity vs estimated purity; noise level (sigma) vs error in purity estimation; true subclonal-ratio vs estimated subclonal-ratio; noise level (sigma) vs error in subclonal-ratio estimation. Noise levels shown: sigma = 0.5, 1, 2, 4.]

Figure 3: Estimation of mixtures of synthetic cell populations.
(a) Parameters used to randomly sample synthetic datasets, including simulated measurement noise. The font size of copy number states indicates their probability. (b) A randomly generated sample. The heatmap depicts the distribution of segment CNs in ancestral and subclonal cells, and the proportion of cell populations is shown in the pie-chart (red: subclonal, yellow: ancestral, grey: normal). (c) Copy number profile of the sample in (b), with raw bin-wise and segmented copy number values shown in black and red, respectively. (d) Estimated purity of 1,000 synthetic samples with varying levels of noise ($\sigma$), plotted against the true theoretical purity. The y = x line is indicated with dashes. (e) Error of purity estimation (absolute difference to true purity) for samples with noise level indicated on the x axis. (f) True and estimated subclonal-ratio of 200 synthetic datasets (1,000 samples) with varying levels of noise ($\sigma$). (g) Error in subclonal-ratio estimation for datasets with increasing noise level. Box-plot elements in (e) and (g) stand for: centre line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.

[Figure 4 image. Panels (a)-(g). Panel labels: ancestral/sensitive cell line (B0), subclonal/resistant cell line (B1). Axes: sample vs subclonal ratio; true purity vs estimated purity; true subclonal-ratio vs estimated subclonal-ratio.]

Figure 4: Estimation of mixtures of high grade serous ovarian cancer cell lines.
(a) Copy number profile of the ancestral/sensitive and subclonal/resistant HGSOC cell lines. Raw bin-wise and segmented copy number values are shown in black and red, respectively. Resistant-specific subclonal SCNAs are highlighted. (b) Purity estimates of samples S0-S5. Corrected values are computed using the linear fit in (c). Theoretical purity values are indicated by maroon diamonds. (c) True (theoretical) and estimated tumour purity of 120 in silico HGSOC cell line mixtures. y = x and the linear fit of the estimates (y = 0.81x) are shown with dashed and solid lines, respectively. Point shape and shade indicate the total number of reads per sample. (d) Subclonal-ratio estimates for samples S1-S5. Shaded and empty bars indicate estimates derived using direct (Gaussian fit) and two-step (from relative ratios in (f)) methods, respectively. Error bars show the 95% confidence interval of the direct estimate; maroon diamonds indicate theoretical values. (e) True and estimated subclonal-ratio of 50 in silico datasets constructed of samples from (c) with 50 million reads. (f) Relative subclonal-ratio estimates for samples S1-S4, compared to S5. Estimates from each subclonal segment are shown with dots, the median estimates are indicated by black lines, and true values with maroon diamonds. (g) True and estimated relative subclonal-ratio in the 50 datasets shown in (e).

[Figure 5 image. Panels (a)-(c): Patient 1306, Patient 2760, Patient 3209. Axes: chromosome vs copy number; % subclonal cells. Legend: baseline SCNA, subclone-associated SCNA. Time-point labels: Baseline, Week 18, Week 24, Week 30, End of therapy.]

Figure 5: Estimation in cfDNA samples from patient data. Subclone-specific copy number changes and subclonal-ratio in lung cancer patients (a) 1306, (b) 2760, and (c) 3209 from [24]. Left: purity-corrected SCNA profiles. Yellow bars show the CN of each segment in the baseline sample, and red bars indicate subclonal deviations from this value in non-baseline samples. Regions of subclone-specific CNAs are also indicated by darker shades. Right: estimated resistant proportion of each sample with 95% confidence intervals. Note that only samples with >10% purity were analysed (cf. Fig. S7). A bar of CN=6 on chromosome 3 (indicated by an asterisk) has been omitted from (c) for better visualisation.

Methods

Formal definition of the problem

Copy number measurements

We consider a tumour that consists of two distinct cell populations, ancestral (A) and subclonal (S) tumour cells, and continuously sheds cell-free DNA (cfDNA) into the blood circulation. A typical scenario would be ancestral cells representing drug-sensitive tumour cells present before cancer therapy, and subclonal cells denoting the emerging subclone with resistance to therapy. The proportion of DNA originating from these two cell types changes over time as we take measurements via blood samples (Fig. 1). Since cell-free DNA found in blood can also originate from normal (non-tumour) cells of the body, the measured DNA is contributed by a mixture of the two tumour cell populations (A and S) and normal cells (N). At each time-point $i$, the proportion of these three populations in the measured sample, $s_i$, depends on the proportion of all tumour-derived DNA (the purity of the sample, $p_i$) and the proportion of subclone-derived DNA from the tumour (the subclonal-ratio, $r_i$):

$$N_i = 1 - p_i; \quad A_i = p_i \cdot (1 - r_i); \quad S_i = p_i \cdot r_i. \tag{2}$$

Our aim is to track the dynamics of the subclonal (putatively resistant) population by determining the subclonal-ratio for each time-point, $r_i$, or the change in subclonal-ratio between time-points, $r_i / r_k = r_{ik}$. To this end, we use the copy number values as typically measured by lpWGS of the sequential cfDNA samples.

Let us consider distinct genomic regions with a homogeneous copy number state, termed segments. We assume that the copy number (CN) state of most segments stays constant over time in a particular population. Therefore the $j$th segment is characterised by a set of three time-independent absolute CN states, $C(N)^j$, $C(A)^j$, $C(S)^j$, corresponding to the local CN in normal, ancestral and subclonal cells, respectively.
The copy number of segment $j$ as measured in the $i$th sample, $C_i^j$, is the combination of these three absolute CNs, weighted by the proportions of DNA derived from the three cell populations at that time-point ($N_i$, $A_i$, $S_i$). We know that normal cells are in a diploid state, hence $C(N)^j = 2$ for all $j$. Therefore, using the purity and subclonal-ratio defined in Eq. (2),

$$C_i^j = 2 + p_i \left( (1 - r_i)\, C(A)^j + r_i\, C(S)^j - 2 \right). \tag{3}$$

Since all cells in a cell population share the absolute CN for a given segment, the values $C(S)^j$ and $C(A)^j$ are always integers. Therefore, in theory, measured CNs from a given sample should be limited to a discrete set of values defined by these integer states, making it possible to solve the set of equations formed by Eq. (3) for $p_i$ and $r_i$ using linear algebra. However, we have to take into account that all real sequencing measurements have a level of imprecision, introducing variation on top of this relationship. Using the term $\sigma_{ij}$ to represent the noise in the $i$th measurement of segment $j$, Eq. (3) becomes

$$C_i^j = 2 + p_i \left( (1 - r_i)\, C(A)^j + r_i\, C(S)^j - 2 \right) + \sigma_{ij}, \tag{4}$$

with the magnitude and family of this noise depending on the specifics of the technology used for CN measurement, especially the sequencing depth [19]. This measurement noise, associated with a continuous distribution, broadens the set of $C_i^j$ values, rendering a linear algebra solution impossible. Hence, our aim becomes to infer $p_i$ and $r_i$ despite this unknown noise, $\sigma_{ij}$.
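The forward model in Eqs. (2)-(4) can be illustrated with a minimal Python sketch. The function names `mixture_proportions` and `measured_cn` are hypothetical and are not part of liquidCNA. The Gaussian noise term is an assumption made only for illustration, since the text states that the noise family depends on the measurement technology.

```python
import random


def mixture_proportions(purity, subclonal_ratio):
    """Eq. (2): DNA fractions of normal (N), ancestral (A) and subclonal (S) cells."""
    normal = 1.0 - purity
    ancestral = purity * (1.0 - subclonal_ratio)
    subclonal = purity * subclonal_ratio
    return normal, ancestral, subclonal


def measured_cn(purity, subclonal_ratio, cn_ancestral, cn_subclonal,
                noise_sd=0.0, rng=None):
    """Eqs. (3)-(4): measured CN of one segment in one sample.

    Normal cells are assumed diploid (CN = 2). The noise is drawn from a
    Gaussian with standard deviation noise_sd; this is an illustrative
    assumption, not the noise model of any particular sequencing technology.
    """
    # Tumour-wide CN: mixture of ancestral and subclonal integer states.
    tumour_cn = (1.0 - subclonal_ratio) * cn_ancestral + subclonal_ratio * cn_subclonal
    # Dilution by normal DNA pulls the measured value towards 2.
    value = 2.0 + purity * (tumour_cn - 2.0)
    if noise_sd > 0.0:
        rng = rng or random.Random()
        value += rng.gauss(0.0, noise_sd)
    return value
```

With these definitions, a noise-free call such as `measured_cn(0.5, 0.5, 3, 4)` evaluates to 2.75, which equals the weighted sum $2 N_i + 3 A_i + 4 S_i$ of the three populations.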
Segment classification

Each segment can fall into one of three categories depending on its copy number states in the two types of tumour cells.

(i) Clonal segments have the same absolute CN in ancestral and subclonal tumour cells, $C(A)^j = C(S)^j$. A special case of clonal segments are segments of neutral CN, where $C(A)^j = C(S)^j = 2$.

(ii) Subclonal segments have different absolute CNs in the ancestral and subclonal tumour population, $C(A)^j \neq C(S)^j$. These segments represent SCNAs that distinguish the subclone from its ancestor, even though they are not necessarily associated with a selective/phenotypic difference (e.g. drug-resistance) directly.

(iii) Unstable segments are neither clonal nor associated with the emergent subclone, and therefore are best described by a time-dependent tumour-wide CN value, $\zeta(T)_i^j$, that does not depend on $r_i$. These segments can arise if a genomic region cannot be measured reliably or if on-going genomic instability introduces novel SCNAs during the time tracked by our samples. We can assume that the number of such segments is small compared to the total number of measured segments.

Depending on whether segments are clonal, subclonal or unstable, their measured CN across samples will change according to the subclonal-ratio and purity of each sample. For simplicity, we omit the term $\sigma_{ij}$ and its derivatives, but the reader should keep in mind that all equations are subject to measurement noise:

$$C_i^j = 2 + p_i \left( C(A)^j - 2 \right), \quad \text{if the segment is clonal,} \tag{5}$$

$$C_i^j = 2 + p_i \left( C(A)^j - 2 + r_i \left( C(S)^j - C(A)^j \right) \right), \quad \text{if the segment is subclonal,} \tag{6}$$

$$C_i^j = 2 + p_i \left( \zeta(T)_i^j - 2 \right), \quad \text{if the segment is unstable.} \tag{7}$$

Figure 1 illustrates how the measured CN of segments depends on the parameters $r_i$ and $p_i$ highlighted above. In the following sections, we use Eqs. (5) & (6) to estimate the underlying parameters, $p_i$ and $r_i$, via three steps (Fig. 2).

Estimation algorithm

Purity estimation

Purity estimation is carried out based on clonal (including neutral) segments.
In general, we expect the majority of segments to fall into this category. Consequently, for the majority of segments the measured copy number follows Eq. (5). Since $C(A)^j$ can take only integer values, the distribution of segment CNs is expected to have distinct peaks at regular intervals of $p_i$.

Using a peak-finder algorithm on the smoothed distribution of measured CN values, we directly compare the peaks to the values expected at a given purity, $\{2 - p_i, 2, 2 + p_i, 2 + 2p_i, \dots\}$, as shown in Fig. 2b. The error of the fit to a purity, $p_i$, is evaluated as the summed squared distance between each expected peak and the closest observed peak,

$$\sum_{C(A)} \min \left( \left( \left( 2 + p_i (C(A) - 2) \right) - \text{peaks} \right)^2 \right). \tag{8}$$

As the detected peaks of the data depend on the smoothing kernel used on the distribution, we perform this computation for a wide range of smoothing bandwidths (0.5× to 2.5× the default value) and derive the purity estimate, $\hat{p}_i$, as the value that minimises the mean and/or median error across the range (Fig. 2c).

Then, we use the derived $\hat{p}_i$ to re-normalise the measured copy number values and thus eliminate normal contamination. We gain an estimate of the tumour-specific CN, $C(T)_i^j$, a mixture of ancestral and subclonal CNs:

$$\hat{C}(T)_i^j = \frac{1}{\hat{p}_i} \cdot \left( C_i^j - 2 \right) + 2 \approx C(A)^j + r_i \left( C(S)^j - C(A)^j \right). \tag{9}$$

Note that, due to the noise in measurements, peaks from close absolute CNs can become indistinguishable in low-purity samples.
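The peak-matching error of Eq. (8) and the purity correction of Eq. (9) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the liquidCNA implementation. It assumes the observed peak positions have already been extracted, skips the averaging over smoothing bandwidths, and uses a hypothetical candidate grid and a hypothetical set of integer CN states (1 to 4). All function names are illustrative.

```python
def fit_error(purity, observed_peaks, cn_states=(1, 2, 3, 4)):
    """Eq. (8): for each integer CN state, squared distance between the
    peak expected at this purity and the closest observed peak, summed."""
    total = 0.0
    for state in cn_states:
        expected = 2.0 + purity * (state - 2.0)
        total += min((expected - peak) ** 2 for peak in observed_peaks)
    return total


def estimate_purity(observed_peaks, candidates=None):
    """Grid search for the candidate purity with the smallest fit error.

    The default grid (0.05 to 1.00 in steps of 0.01) is an assumption; very
    small candidates are excluded because all expected peaks collapse onto 2.
    """
    if candidates is None:
        candidates = [k / 100.0 for k in range(5, 101)]
    return min(candidates, key=lambda p: fit_error(p, observed_peaks))


def purity_corrected_cn(measured, purity):
    """Eq. (9): remove normal contamination from a measured segment CN."""
    return (measured - 2.0) / purity + 2.0
```

For example, observed peaks at 1.7, 2.0, 2.3 and 2.6 are spaced 0.3 apart around the diploid peak, so `estimate_purity([1.7, 2.0, 2.3, 2.6])` selects the candidate 0.3 on this grid. Restricting the grid from below is one simple way to avoid the degenerate near-zero solution.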
Because such peaks merge, we expect purity values below 5% to be indistinguishable (unless high sequencing depth is available), and we advise discarding samples with low purity (typically $p_i < 0.1$), as erroneous purity estimates can bias downstream computation.
Identifying subclonal segments and sample order
Next, we aim to identify the subset of segments with subclone-specific SCNAs that reflect the changes in subclonal-ratio over time. To easily assess the change in segment CNs, we designate a sample as baseline, and compute the change in segment CN, $\Delta$CN, between each sample and this baseline sample. Typically, the sample taken upon diagnosis or before the start of therapy (usually the first time-point, $s_1$) can be used. We can assume that this sample has no or only a negligible population of the emerging subclone, and therefore represents a pure ancestral population: $r_1 \approx 0 \rightarrow C^{(T)j}_1 \approx C^{(A)j}$. Hence the change in CN of a subclonal segment compared to the baseline becomes
$$\Delta C^{(T)j}_i = C^{(T)j}_i - C^{(T)j}_1 = r_i\left(C^{(S)j} - C^{(A)j}\right). \tag{10}$$
Furthermore, Eq. (10) provides an informative quantity even if the baseline sample is not pure, as $\Delta C^{(T)j}_i$ nonetheless describes the change in subclone-specific SCNAs. In order to uncover which segments are truly subclonal, and how the subclonal-ratio changes over measurements, we need to identify a pervasive pattern across samples, and the subset of segments that consistently follows it. If the samples were taken so that
the subclonal population increases over time-points, this pattern would be a monotone increase or decrease for all segments with subclone-specific SCNAs. While we cannot assume that the samples are taken in order of increasing subclonal proportions (e.g. a change of treatment between sampling times might lead to fluctuating population size in a resistance-associated subclone), we can aim to re-arrange them to follow this rule. Consequently, we rephrase our aim as deriving (i) a set of subclonal segments that follow a monotone pattern across ordered samples; and (ii) an ordering of samples that is followed by the maximum number of (subclonal) segments. Formally, we are looking for a subset of segments, $\{j_1, j_2, \dots\}$, and a permutation of samples (starting from the designated baseline sample), $s_1, s_i, \dots, s_N$, where for every segment $j \in \{j_1, j_2, \dots\}$ either
$$\Delta C^{(T)j}_{i+1} - \Delta C^{(T)j}_i > -\epsilon \;\; \forall i \quad \text{or} \quad \Delta C^{(T)j}_{i+1} - \Delta C^{(T)j}_i < \epsilon \;\; \forall i \tag{11}$$
holds for a pre-defined accuracy level, $\epsilon$. We use an accuracy level $\epsilon > 0$ to allow for samples with near-equal subclonal-ratio measured with uncertainty. We find that, for typical lpWGS datasets, $\epsilon \approx 0.02$–$0.05$ works well to account for the underlying measurement noise. Figs. 2d-f illustrate the derivation of the optimal sample order and subclonal segment set. We first separate clonal segments: since these have relative CN values of 0, apart from some measurement noise, we filter out any segment whose standard deviation across samples is below a pre-defined threshold. We then evaluate Eq. (11) over all remaining segments and over all orderings of the samples. As we expect 4-6 time-points per dataset, an exhaustive search of all possible permutations is feasible. Given a permutation, each segment is classified according to whether it follows Eq.
(11): segments that do are candidate subclone-specific segments, and those that do not are treated as unstable (Fig. 2e). The optimal sample order is defined as the permutation that maximises the number of subclonal segments (Fig. 2f).
Subclonal-ratio estimation
Finally, we use the set of segments identified as subclonal, and compute the subclonal-ratio of each time point. We derive the (absolute) subclonal-ratio, $r_i$, for each sample using Eq. (10). As both $C^{(A)j}$ and $C^{(S)j}$ are assumed to be integers, and we know that $C^{(A)j} \neq C^{(S)j}$,
$$\Delta C^{(T)j}_i \in \{\dots, -2r_i, -r_i, r_i, 2r_i, \dots\}, \quad \forall j \in \{j_1, j_2, \dots\}. \tag{12}$$
To take into account that the measured $\Delta$CNs compared to the baseline, $\Delta\hat{C}^{(T)j}_i$, are influenced by noise, we fit these values with a mixture of Gaussian distributions where the means of the Gaussians follow Eq. (12), as illustrated in Fig. 2h. The subclonal-ratio of a sample is derived as the constrained mean parameter, $r_i$, of the Gaussian mixture optimising the fit (Fig. 2i). The 95% confidence interval of the inferred subclonal-ratio is computed based on the (shared) variance of the fitted constrained Gaussians. The measurement noise propagated from segment CNs can lead to a high spread in values, making estimates less robust and rendering the resolution of low subclonal-ratios ($r_i \leq 0.1$) challenging, occasionally causing the Gaussian-fitting step to fail. Therefore we also derive relative subclonal-ratios, which allow for a more general application not limited to good quality samples.
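The exhaustive search over sample orderings behind Eq. (11) can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the R code of liquidCNA: the input layout (a dictionary mapping hypothetical segment names to $\Delta$CN values per sample, with the baseline at index 0) and the function names are invented for the example, and clonal segments are assumed to have been filtered out beforehand.

```python
from itertools import permutations


def is_monotone(values, eps):
    """Eq. (11): consecutive differences are all > -eps or all < eps."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d > -eps for d in diffs) or all(d < eps for d in diffs)


def best_sample_order(delta_cn, eps=0.05):
    """Try every ordering that starts with the baseline (index 0) and keep
    the one followed by the largest number of segments.

    delta_cn: dict mapping segment name -> list of delta-CN values per sample.
    Returns (order, segments_following_that_order).
    """
    n_samples = len(next(iter(delta_cn.values())))
    best_order, best_segments, best_count = None, [], -1
    for tail in permutations(range(1, n_samples)):
        order = (0,) + tail
        following = [
            name
            for name, values in delta_cn.items()
            if is_monotone([values[i] for i in order], eps)
        ]
        if len(following) > best_count:
            best_order, best_segments, best_count = order, following, len(following)
    return best_order, best_segments


# Hypothetical data: subclonal-ratios 0.6, 0.2 and 0.4 in samples 1, 2 and 3
delta_cn = {
    "gain_1": [0.0, 0.6, 0.2, 0.4],      # C(S) - C(A) = +1
    "loss_1": [0.0, -0.6, -0.2, -0.4],   # C(S) - C(A) = -1
    "gain_2": [0.0, 1.2, 0.4, 0.8],      # C(S) - C(A) = +2
    "unstable": [0.0, 0.3, -0.4, 0.5],   # follows no ordering
}
order, subclonal = best_sample_order(delta_cn)
```

With 4 to 6 time-points there are at most 5! = 120 orderings to test once the baseline is fixed, which is why a brute-force search is affordable here.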
Relative subclonal-ratios are expressed relative to the sample with the maximal subclonal-ratio, $s_N$, since its subclonal-ratio is assumed to be the most robust against measurement noise. We compute the relative deviation of each normalised subclonal tumour segment CN,
$$\Delta C^j_{iN} = \frac{\Delta C^{(T)j}_i}{\Delta C^{(T)j}_N} = \frac{r_i\left(C^{(S)j} - C^{(A)j}\right)}{r_N\left(C^{(S)j} - C^{(A)j}\right)} = \frac{r_i}{r_N}, \tag{13}$$
giving rise to a distribution of relative subclonal-ratio estimates (Fig. 2g). We derive a point estimate for the relative $r_i$ of each sample as the median of this set,
$$\hat{r}_{iN} = \operatorname{median}\left(\Delta C^j_{iN}\right), \quad j \in \{j_1, j_2, \dots\}. \tag{14}$$
Absolute subclonal-ratio estimates can then be derived using these relative estimates in a two-step estimation process (as opposed to the direct estimation above): we derive $r_N$ based on Eq. (12), and subsequently compute $\hat{r}_{iN} \cdot r_N$ to retrieve $r_i$.
Generating synthetic datasets
We constructed synthetic datasets of 80 segments (of length varying between 120 and 800 bins) and 5 time-points as illustrated in Fig. 3a. For each segment, we generated ancestral (sensitive) segment copy number states ($C^{(A)j}$) by randomly sampling from $\{1, 2, 3, 4, 5\}$, with neutral and close-to-neutral states occurring with higher frequency. Subclone-specific absolute CNs ($C^{(S)j}$) were assigned by randomly sampling from $C^{(A)j} + \{-2, -1, 0, 1, 2\}$, with no change (giving rise to clonal segments) having a higher weight. For each sample, $s_i$, we assigned purity and subclonal-ratio randomly from the ranges $0.04 < p_i < 0.45$ and $0.05 < r_i < 0.8$, with the exception of the baseline samples, where $r_1 < 0.04$. We then recreated the measurement procedure of computing noise-ridden raw CN values in a given segment, $j$, by adding normally distributed noise. The magnitude (standard deviation) of the noise was controlled by the noise level parameter, $\sigma$ (representing differences arising from e.g. sequencing depth), and the CN of the segment (reflecting higher variance in higher CN states):
$$\mathrm{rawC}^{\mathrm{bin}}_i = 2 + p_i\left((1 - r_i)\,C^{(A)j} + r_i\,C^{(S)j} - 2\right) + \mathrm{Normal}\left(0, f\left(\sigma, C^j_i\right)\right).$$
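The raw-bin equation above can be simulated with the standard library alone. The sketch below is illustrative and rests on one loud assumption: the text does not specify the noise function $f(\sigma, C)$, so the code uses a hypothetical placeholder in which the standard deviation grows linearly with the expected CN and equals $\sigma$ at the neutral state. The function name and arguments are likewise invented for the example.

```python
import random
from statistics import fmean


def simulate_raw_bins(cn_ancestral, cn_subclonal, purity, ratio, n_bins, sigma, rng):
    """Draw noisy per-bin CN values for one segment in one sample."""
    # Noise-free expectation: mixture of normal, ancestral and subclonal cells
    expected = 2 + purity * ((1 - ratio) * cn_ancestral + ratio * cn_subclonal - 2)
    # ASSUMPTION: placeholder for f(sigma, C); sd equals sigma at neutral CN (2)
    # and increases with the expected CN. The source does not define f.
    noise_sd = sigma * expected / 2
    return [rng.gauss(expected, noise_sd) for _ in range(n_bins)]


rng = random.Random(0)  # fixed seed so the example is reproducible
bins = simulate_raw_bins(
    cn_ancestral=2, cn_subclonal=3, purity=0.3, ratio=0.5,
    n_bins=400, sigma=0.1, rng=rng,
)
segment_cn = fmean(bins)  # averaging the bins gives a segment-level CN
```

For this segment the noise-free value is 2 + 0.3 × (0.5 × 2 + 0.5 × 3 − 2) = 2.15, and the mean over 400 bins lands close to it; longer segments average out more of the bin-level noise.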
The final CN value of each segment, $\hat{C}^j_i$, was computed as the mean of all $\mathrm{rawC}^{\mathrm{bin}}_i$ values contained in the segment. In addition, we selected 2.5-15% of segments as unstable, and re-sampled their tumour-specific CN value to be independent of $r_i$. Fig. 3b-c show the parameters of a synthetic sample and its copy number profile.
Generating in vitro and in silico cell line mixtures
HGSOC cell line OVCAR4 was obtained from Prof Fran Balkwill (Barts Cancer Institute, UK) and grown in DMEM media containing 10% FBS and 1% penicillin/streptomycin. A resistant/subclonal HGSOC cell line (Ov4Cis) was generated by culturing an aliquot of the ancestral OVCAR4 cell line in increasing concentrations of cisplatin. For further details on cell culture and the cell lines, see [29]. We then extracted genomic DNA from both cell lines and from blood samples from healthy volunteers using the QIAamp DNA Micro Kit (Qiagen, Hilden, Germany). Genomic DNA from the three sources was mixed in varying proportions (Table S1), measured as the mass of DNA input from each source, to a total of 20 ng DNA per sample and subjected to sonication with the Covaris M220 system. Libraries were prepared using the NEBNext Ultra II kit (New England Biolabs, Hitchin, United Kingdom) with 4 cycles of PCR amplification, indexed with unique dual indexing primers and sequenced on an Illumina NovaSeq 6000 to a mean depth of 1.3x.
In silico mixtures were generated by bioinformatically mixing sequencing reads of DNA derived from the ancestral/sensitive and subclonal/resistant tumour cell lines and from healthy blood cells. Similarly to synthetic samples, for each in silico sample we randomly assigned purity, $0.1 < p_i < 0.45$, and subclonal-ratio, $0.05 < r_i < 0.8$. We then sampled reads (using samtools view -s) from aligned read (BAM) files of 'pure' ancestral, subclonal and normal samples (B0, B1 and N0) in proportions to match $p_i(1 - r_i)$, $p_i r_i$ and $1 - p_i$, respectively. We also varied the total number of reads per sample (as a proxy for sequencing depth and consequently measurement noise), and generated 30 samples each with 50, 20, 10 and 5 million total reads.
Processing lpWGS samples
FASTQ files derived from lpWGS samples (generated via sequencing cell line mixtures or obtained from [24]) were aligned to the human reference genome (version hg19, using bwa). We then processed BAM files using the QDNAseq R package [19], employing DNAcopy for segmentation [42]. QDNAseq produced two copy number values for each genomic bin: a raw pre-segmentation value and a segmented value grouping bins of equal CN together. The CN of bins on the pre-defined blacklist of QDNAseq and of those with <75% mappability was set to NA. Raw and segmented CN values for all cell line samples are available from https://github.com/elakatos/liquidCNA_data. Since QDNAseq returns normalised CN values (with the neutral state at 1), we multiplied all values by 2 before proceeding with the estimation algorithm and re-normalised segment CN values to be centred at exactly 2. We then re-defined segment boundaries using the ensemble of samples as regions of constant CN in all samples. This way, break-points present in only a subset of samples (such as a subclone-specific SCNA) gave rise to segments handled separately for all samples.
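The in silico mixing proportions described above, $p_i(1 - r_i)$, $p_i r_i$ and $1 - p_i$, translate into per-file subsampling fractions once the read counts of the pure BAM files are known. The following Python sketch shows that arithmetic only; the function name, the dictionary keys and the read counts are hypothetical, and the resulting fractions would still have to be passed to the subsampling tool (the text uses samtools view -s) in a separate step.

```python
def mixture_fractions(purity, ratio, total_reads, reads_available):
    """Fraction of each pure BAM file to subsample for one in silico mixture.

    reads_available: dict with the number of reads in the pure ancestral (B0),
    subclonal (B1) and normal (N0) files; the counts are assumed to be known.
    """
    shares = {
        "B0": purity * (1 - ratio),  # ancestral tumour cells
        "B1": purity * ratio,        # subclonal tumour cells
        "N0": 1 - purity,            # normal cells
    }
    fractions = {}
    for source, share in shares.items():
        fraction = share * total_reads / reads_available[source]
        if fraction > 1:
            raise ValueError(f"not enough reads in {source} for this mixture")
        fractions[source] = fraction
    return fractions


# Hypothetical example: 10 million reads drawn from three 100-million-read files
available = {"B0": 100_000_000, "B1": 100_000_000, "N0": 100_000_000}
fractions = mixture_fractions(
    purity=0.4, ratio=0.25, total_reads=10_000_000, reads_available=available
)
```

Because the three shares sum to one, the expected total number of sampled reads equals `total_reads`, so lowering `total_reads` mimics a shallower sequencing run without changing the mixture composition.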
Updated segments with length below 6 megabases (120 bins of 50 kb (cell line mixtures) or 12 bins of 500 kb (patient cfDNA samples)) were excluded from the downstream analysis to filter out short segments sensitive to localised measurement biases. Finally, we curated each segment CN by discarding bins with the most extreme 2.5% of raw segment values, and re-calculating the segment CN value as the mean of a normal distribution fitted to the remaining raw CNs. We found that this curation had a negligible effect for most segments, but successfully improved assigned segment CN values for more error-prone genomic regions.
Data availability
Aligned sequencing data from HGSOC cell lines and in vitro mixtures (listed in Table S1) are available from the European Nucleotide Archive (accession PRJEB42332). Raw and post-segmentation copy number values for these samples are available from https://github.com/elakatos/liquidCNA_data.
Code availability
Estimation functions of liquidCNA implemented in R (version 4.0.3), an illustrative example in a Jupyter notebook, and code generating and analysing synthetic and in silico data are available from https://github.com/elakatos/liquidCNA.
References
[1] Siravegna, G., Marsoni, S., Siena, S. & Bardelli, A. Integrating liquid biopsies into the management of cancer. Nature Reviews Clinical Oncology 14, 531–548 (2017). URL https://doi.org/10.1038/nrclinonc.2017.14.
[2] Ng, S. B. et al. Individualised multiplexed circulating tumour DNA assays for monitoring of tumour presence in patients after colorectal cancer surgery. Scientific Reports 7, 40737 (2017). URL https://pubmed.ncbi.nlm.nih.gov/28102343.
[3] Rothwell, D. G. et al. Utility of ctDNA to support patient selection for early phase clinical trials: the TARGET study. Nat Med 25, 738–743 (2019).
[4] Khan, K. H. et al. Longitudinal liquid biopsy and mathematical modeling of clonal evolution forecast time to treatment failure in the PROSPECT-C phase II colorectal cancer clinical trial. Cancer Discov 8, 1270–1285 (2018).
[5] Fernandez-Garcia, D. et al. Plasma cell-free DNA (cfDNA) as a predictive and prognostic marker in patients with metastatic breast cancer. Breast Cancer Research 21, 149 (2019). URL https://doi.org/10.1186/s13058-019-1235-8.
[6] Conteduca, V. et al. Plasma tumour DNA as an early indicator of treatment response in metastatic castration-resistant prostate cancer. British Journal of Cancer (2020). URL https://doi.org/10.1038/s41416-020-0969-5.
[7] Nakamura, Y. et al. Clinical utility of circulating tumor DNA sequencing in advanced gastrointestinal cancer: SCRUM-Japan GI-SCREEN and GOZILA studies. Nature Medicine (2020). URL https://doi.org/10.1038/s41591-020-1063-5.
[8] Diaz, L. A. J. et al. The molecular evolution of acquired resistance to targeted EGFR blockade in colorectal cancers. Nature 486, 537–540 (2012).
[9] Bettegowda, C. et al.
Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci Transl Med 6, 224ra24 (2014).
[10] Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nature Medicine 20, 548–554 (2014). URL https://doi.org/10.1038/nm.3519.
[11] Parkinson, C. A. et al. Exploratory analysis of TP53 mutations in circulating tumour DNA as biomarkers of treatment response for patients with relapsed high-grade serous ovarian carcinoma: A retrospective study. PLoS Medicine 13, e1002198 (2016). URL https://europepmc.org/articles/PMC5172526.
[12] Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010). URL https://doi.org/10.1038/nature08822.
[13] Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: The next generation. Cell 144, 646–674 (2011). URL https://doi.org/10.1016/j.cell.2011.02.013.
[14] Sansregret, L., Vanhaesebroeck, B. & Swanton, C. Determinants and clinical implications of chromosomal instability in cancer. Nature Reviews Clinical Oncology 15, 139–150 (2018). URL https://doi.org/10.1038/nrclinonc.2017.198.
[15] Li, X.
et al. Temporal and spatial evolution of somatic chromosomal alterations: a case-cohort study of Barrett's esophagus. Cancer Prev Res (Phila) 7, 114–127 (2014).
[16] Hieronymus, H. et al. Tumor copy number alteration burden is a pan-cancer prognostic factor associated with recurrence and death. Elife 7 (2018).
[17] Rubin, C. E. et al. DNA aneuploidy in colonic biopsies predicts future development of dysplasia in ulcerative colitis. Gastroenterology 103, 1611–1620 (1992).
[18] Zaccaria, S. & Raphael, B. J. Accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data. Nature Communications 11, 4301 (2020). URL https://doi.org/10.1038/s41467-020-17967-y.
[19] Scheinin, I. et al. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Res 24, 2022–2032 (2014).
[20] Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications 8, 1324 (2017). URL https://doi.org/10.1038/s41467-017-00965-y.
[21] Van Roy, N. et al. Shallow whole genome sequencing on circulating cell-free DNA allows reliable noninvasive copy-number profiling in neuroblastoma patients. Clin Cancer Res 23, 6305–6314 (2017).
[22] Hovelson, D. H. et al. Rapid, ultra low coverage copy number profiling of cell-free DNA as a precision oncology screening strategy. Oncotarget 8, 89848–89866 (2017).
[23] Chin, S.-F. et al. Shallow whole genome sequencing for robust copy number profiling of formalin-fixed paraffin-embedded breast cancers. Experimental and Molecular Pathology 104, 161–169 (2018). URL http://www.sciencedirect.com/science/article/pii/S0014480017306470.
[24] Chen, X. et al. Low-pass whole-genome sequencing of circulating cell-free DNA demonstrates dynamic changes in genomic copy number in a squamous lung cancer clinical cohort.
Clinical Cancer Research 25, 2254–2263 (2019). URL https://clincancerres.aacrjournals.org/content/25/7/2254. https://clincancerres.aacrjournals.org/content/25/7/2254.full.pdf.
[25] Belic, J. et al. mFAST-SeqS as a monitoring and pre-screening tool for tumor-specific aneuploidy in plasma DNA. Adv Exp Med Biol 924, 147–155 (2016).
[26] Vanderstichele, A. et al. Chromosomal instability in cell-free DNA as a highly specific biomarker for detection of ovarian cancer in women with adnexal masses. Clin Cancer Res 23, 2223–2231 (2017).
[27] Taylor, F., Bradford, J., Woll, P. J., Teare, D. & Cox, A. Unbiased detection of somatic copy number aberrations in cfDNA of lung cancer cases and high-risk controls with low coverage whole genome sequencing. Adv Exp Med Biol 924, 29–32 (2016).
[28] Wei, T. et al. Genome-wide profiling of circulating tumor DNA depicts landscape of copy number alterations in pancreatic cancer with liver metastasis. Mol Oncol 14, 1966–1977 (2020).
[29] Hoare, J. et al.
Platinum resistance induces diverse evolutionary trajectories in high grade serous ovarian cancer. bioRxiv (2020). URL https://www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378. https://www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378.full.pdf.
[30] Nelson, L. et al. A living biobank of ovarian cancer ex vivo models reveals profound mitotic heterogeneity. Nature Communications 11, 822 (2020). URL https://doi.org/10.1038/s41467-020-14551-2.
[31] Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011). URL https://pubmed.ncbi.nlm.nih.gov/21720365.
[32] Soria, J.-C. et al. A phase Ib dose-escalation study of the safety and pharmacokinetics of pictilisib in combination with either paclitaxel and carboplatin (with or without bevacizumab) or pemetrexed and cisplatin (with or without bevacizumab) in patients with advanced non–small cell lung cancer. European Journal of Cancer 86, 186–196 (2017). URL http://www.sciencedirect.com/science/article/pii/S0959804917312522.
[33] Housman, G. et al. Drug resistance in cancer: an overview. Cancers (Basel) 6, 1769–1792 (2014).
[34] Gatenby, R. A., Silva, A. S., Gillies, R. J. & Frieden, B. R. Adaptive therapy. Cancer Res 69, 4894–4903 (2009).
[35] Enriquez-Navas, P. M., Wojtkowiak, J. W. & Gatenby, R. A. Application of evolutionary principles to cancer therapy. Cancer Res 75, 4675–4680 (2015).
[36] Zhang, J., Cunningham, J. J., Brown, J. S. & Gatenby, R. A. Integrating evolutionary dynamics into treatment of metastatic castrate-resistant prostate cancer. Nature Communications 8, 1816 (2017). URL https://doi.org/10.1038/s41467-017-01968-5.
[37] Choudhury, A. D. et al. Tumor fraction in cell-free DNA as a biomarker in prostate cancer. JCI Insight 3 (2018). URL https://doi.org/10.1172/jci.insight.122109.
[38] Phallen, J. et al. Direct detection of early-stage cancers using circulating tumor DNA. Science Translational Medicine 9, eaan2415 (2017). URL https://pubmed.ncbi.nlm.nih.gov/28814544.
[39] Mouliere, F. et al. High fragmentation characterizes tumour-derived circulating DNA. PLOS ONE 6, 1–10 (2011). URL https://doi.org/10.1371/journal.pone.0023418.
[40] Underhill, H. R. et al. Fragment length of circulating tumor DNA. PLOS Genetics 12, 1–24 (2016). URL https://doi.org/10.1371/journal.pgen.1006162.
[41] Mouliere, F. et al. Enhanced detection of circulating tumor DNA by fragment size analysis. Science Translational Medicine 10 (2018). URL https://stm.sciencemag.org/content/10/466/eaat4921. https://stm.sciencemag.org/content/10/466/eaat4921.full.pdf.
[42] Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23, 657–663 (2007).
10_1101-2020_12_14_422697 ---- AdRoit: an accurate and robust method to infer complex transcriptome composition
AdRoit: an accurate and robust method to infer complex transcriptome composition
Tao Yang1, Nicole Alessandri-Haber1, Wen Fury1, Michael Schaner1, Robert Breese1, Michael LaCroix-Fralish2, Jinrang Kim1, Christina Adler1, Lynn E. Macdonald1, Gurinder S. Atwal1, Yu Bai1,*
Affiliations
1. Regeneron Pharmaceuticals, Inc., Tarrytown NY 10591
2. Cellular Longevity, Inc., San Francisco, CA 94103
*Corresponding author
Abstract
RNA sequencing technology promises an unprecedented opportunity in learning disease mechanisms and discovering new treatment targets. Recent spatial transcriptomics methods further enable the transcriptome profiling at spatially resolved spots in a tissue section. In controlled experiments, it is often of immense importance to know the cell composition in different samples.
Understanding the cell type content in each tissue spot is also crucial to the spatial transcriptome data interpretation. Though single cell RNA-seq has the power to reveal cell type composition and expression heterogeneity in different cells, it remains costly and sometimes infeasible when live cells cannot be obtained or sufficiently dissociated. To computationally resolve the cell composition in RNA-seq data of mixed cells, we present AdRoit, an accurate and robust method to infer transcriptome composition. The method estimates the proportions of each cell type in the compound RNA-seq data using known single cell data of relevant cell types. It uniquely uses an adaptive learning approach to correct the gene-wise bias due to the difference in sequencing techniques. AdRoit also utilizes cell type specific genes while controlling for their cross-sample variability. Our systematic benchmarking, spanning from simple to complex tissues, shows that AdRoit has superior sensitivity and specificity compared to other existing methods. Its performance holds for multiple single cell and compound RNA-seq platforms. In addition, AdRoit is computationally efficient and runs one to two orders of magnitude faster than some of the state-of-the-art methods.
bioRxiv preprint doi: https://doi.org/10.1101/2020.12.14.422697; this version posted January 4, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Introduction
RNA sequencing is a powerful tool to address the transcriptomic perturbations in disease tissues and help understand the underlying mechanism to develop treatments1. Due to the presence of heterogeneous cell populations, bulk tissue transcriptome only characterizes the averaged expression of genes over a mixture of different types of cells.
The identity of individual cell types and their prevalence remain unelucidated in the bulk data. However, knowledge of the cell type composition and gene expression perturbation at the cell type level is often critical to identifying disease-manifesting cells and designing targeted therapies. For instance, the constitution of stromal and immune cells sculpts the tumor microenvironment that is essential in cancer progression and control2–6. Excessive expression of cytokines in particular leukocyte types underlies the etiology of many chronic inflammatory diseases7–11. Such information cannot be directly read out from bulk RNA-seq data.
Recent breakthroughs in spatial transcriptomics methods enable characterizing transcriptome-wide gene expression at spatially resolved locations in a tissue section12. However, it remains challenging to reach single cell resolution while measuring tens of thousands of genes transcriptome-wide. Some widely used technologies can achieve a resolution of 50-100 μm, equivalent to 3–30 cells depending on the tissue type12,13. The transcripts therein may originate from one or more cell types. Unlike bulk RNA-seq, the profiling data at each spot contain substantial dropouts as merely a few cells are sequenced, imposing additional challenges to demystifying the cell type content. We refer to bulk RNA-seq and spatial transcriptomics data at the multi-cell resolution as compound RNA-seq data hereafter.
The rapid development of single-cell RNA-seq (scRNA-seq) technologies has allowed for cell-type specific transcriptome profiling14.
It provides the information missing from the compound RNA-seq data. Nevertheless, the technologies have low sensitivity and substantial noise due to the high dropout rate and the cell-to-cell variability. Consequently, scRNA-seq technologies require a large number of cells (thousands to tens of thousands) to ensure statistical significance in the results. In addition, the cells must remain viable during capture. These requirements render the scRNA-seq technologies costly, prohibiting their application in clinical studies that involve many subjects or cannot allow real time tissue dissociation and cell capture. Furthermore, scRNA-seq technologies may not be well suited to characterizing cell-type proportions in solid tissues because the dissociation and capture steps can be ineffective for certain cell types15–17.
As sequencing at the single cell level is not always feasible, in silico approaches have been developed to infer cell type proportions from compound RNA-seq data18–24. The most common strategy is to conduct a statistical inference through maximum likelihood estimation (MLE)25 or maximum a posteriori estimation (MAP)26 on a constrained linear regression framework, wherein the unobserved mixing proportions of a finite number of cell types are part of the latent variables to be optimized19,21–24. The deconvolution methods are often applied to dissect the immune cell compositions in blood samples27–31. However, their performance in more complex tissues, such as the nervous, ocular, respiratory and gastrointestinal organs, remains unclear.
These tissues often contain many cell types (10–10^2) and the differences among related cells can be subtle, rendering the deconvolution a challenging task. For example, a recent study on the mouse nervous system contains more than 200 cell clusters, many of which are highly similar neuronal subtypes [32].

Earlier works often utilized the transcriptome profiling of purified cell populations to estimate the gene expression per cell type (e.g., Cibersort) [19]. More recently, acquiring cell type specific expression from scRNA-seq data was shown to be an intriguing alternative [21–24]. Though it provides higher throughput by measuring multiple cell types in one experiment, profiling at the single cell level is substantially noisy. Deconvolution using scRNA-seq data as reference can be biased by noise unrelated to cell identities if not treated properly. Moreover, the platform difference between the compound data and the single cell data cannot be ignored.

To overcome these challenges, additional information from the data may be considered. A recent method that weighs genes according to their expression variances across samples greatly improved the accuracy [22], highlighting the importance of gene variability in inferring cell type composition. Some other methods and applications have pointed out the importance of cell type specific genes [24,28,31,33]. In these works, the cell type specific expression was only used to select the input genes (e.g., markers). Nonetheless, it measures how informative a gene is in distinguishing cell types and thus can be incorporated as a part of the model.
To address the platform difference between the compound data and the single cell data, it is usually assumed that there exists a single scaling factor or a linearly scaled bias for all genes that can be learned and corrected accordingly [22,23]. This assumption rarely holds because the platform difference affects each gene differently. Though learning a uniform scaling factor would correct the difference in the majority of genes, a few genes that remain significantly biased can easily confound the estimation, especially under a linear model framework. Thus, a gene-wise correction should be considered.

In this work, we present a new deconvolution method, AdRoit, a unified framework that jointly models the gene-wise technology bias, genes' cell type specificity and cross-sample variability. The method estimates the cell type constitution in compound RNA-seq samples using relevant single cell data as a training source. Genes used for deconvolution are automatically selected from the single cell data based on their information richness. Uniquely, it uses an adaptive learning approach to estimate gene-wise scaling factors, addressing the issue that different platforms impact genes differently. The model of AdRoit is further regularized to avoid collinearity among closely related cell subtypes that are common in complex tissues. Across comprehensive benchmarking data sets of varying cell composition complexity, AdRoit showed superior sensitivity and specificity to other existing methods.
Applications to real bulk RNA-seq data and spatial transcriptomics data revealed strong and expected biologically relevant information. We believe AdRoit offers an accurate and robust tool for cell type deconvolution and will promote the value of bulk RNA-seq and spatial transcriptomics profiling.

Results

Overview of the AdRoit framework

AdRoit estimates the proportions of cell types from compound transcriptome data including but not limited to bulk RNA-seq and spatial transcriptome data. It directly models the raw reads without normalization, preserving the difference in total amounts of RNA transcript in different cell types. The method utilizes as reference relevant pre-existing single cell RNA-seq data with cell identity annotation. It selects informative genes, estimates the mean and dispersion of the expression of selected genes per cell type, and constructs a weighted regularized linear model to infer percent combinations (Fig. 1a). Because sequencing platform bias impacts genes differently [15,34,35], a uniform scaling factor for all genes does not sufficiently eliminate such bias. A key innovation of AdRoit is that it adopts an adaptive learning approach, where the bias is first estimated for each gene and then adjusted such that a more biased gene is corrected with a larger scaling factor (Fig. 1b).

We also attribute the success of AdRoit to the consideration of a comprehensive set of other relevant factors including genes' cross-sample variability, cell type specificity and collinearity of expression profiles among closely related cell types.
The cross-sample variability of a gene confounds its biological expression variability due to the variety of cell types. The latter is referred to as the cell type specific expression, which helps identify the cell type. AdRoit down-weights genes with high cross-sample variability and up-weights those whose expression is highly specific to certain cell types. The definitions of cross-sample variability and cell type specificity also account for the overdispersion of count data. Lastly, AdRoit adopts a linear model to ensure the interpretability of the coefficients. At the same time, AdRoit includes a regularization term to minimize the impact of statistical collinearity. Each of these factors contributes an indispensable part to AdRoit, leading to an accurate and robust deconvolution method for inferring complex cell compositions.

To evaluate the performance, we compared AdRoit with MuSiC [22] and NNLS [18,36] for bulk data deconvolution, and Stereoscope [23] for spatial transcriptomics data deconvolution. When evaluating the algorithms, a common practice is to pool the single cell data to synthesize a “bulk” sample with known ground truth of the cell composition. We measured the performance by comparing the estimated cell proportions with the true proportions using four metrics: mean absolute difference (mAD), root mean squared deviation (RMSD) and two correlation statistics (i.e., Pearson and Spearman). Both correlations are included because Pearson reflects linearity, while Spearman avoids the artificially high scores driven by outliers when the majority of estimates are tiny. Good estimations feature low mAD and RMSD along with high correlations. When estimating cell proportions for a synthetic sample, cells from this sample are excluded from the input single cell reference (i.e., leave-one-out) to avoid overfitting. We further applied AdRoit to real bulk RNA-seq data and validated the results with available RNA fluorescence in-situ hybridization (RNA-FISH) data. The estimates were further confirmed by relevant biological knowledge of human pancreatic islets. We also used AdRoit to map cell types on spatial spots, and the accuracy was verified by in-situ hybridization (ISH) images from the Allen mouse brain atlas [37].

AdRoit excels in datasets with both simple and complex cell constitutions

We started with a simple human pancreatic islets dataset that contains 1492 cells and four distinct endocrine cell types (i.e., Alpha, Beta, Delta, and PP cells) [38] (Extended Data Fig. 1a; Supplementary Table 1). The synthesized bulk data were constructed by mixing the single cell data at known proportions. Though all three methods achieved satisfactory performance according to the evaluation metrics, AdRoit performed slightly better, as reflected by scatterplots of estimated vs. true proportions (Extended Data Fig. 1b, Supplementary Table 2). It has moderately lower mAD (0.029 vs. 0.031 for MuSiC and 0.066 for NNLS) and RMSD (0.039 vs. 0.046 for MuSiC and 0.095 for NNLS), and comparable correlations (Pearson: 0.99 vs. 0.98 for MuSiC and 0.93 for NNLS; Spearman: 0.97 vs. 0.98 for MuSiC and 0.91 for NNLS) (Extended Data Fig. 1c). This performance was expected because there were only four cell types with very distinct transcriptome profiles. Deconvoluting such data was a relatively easy task for all three methods.
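The four evaluation metrics used throughout the benchmarks (mAD, RMSD, Pearson and Spearman) can be computed in a few lines. The following is a minimal Python sketch and not the authors' evaluation code: the function names `average_ranks` and `deconvolution_metrics` are hypothetical, and the Spearman correlation is taken here as the Pearson correlation of average ranks.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def deconvolution_metrics(estimated, true):
    """Compare estimated and true cell type proportions (hypothetical helper)."""
    est = np.asarray(estimated, dtype=float).ravel()
    tru = np.asarray(true, dtype=float).ravel()
    diff = est - tru
    return {
        # mean absolute difference
        "mAD": float(np.mean(np.abs(diff))),
        # root mean squared deviation
        "RMSD": float(np.sqrt(np.mean(diff ** 2))),
        # linear agreement
        "pearson": float(np.corrcoef(est, tru)[0, 1]),
        # rank agreement, less sensitive to a few large proportions
        "spearman": float(
            np.corrcoef(average_ranks(est), average_ranks(tru))[0, 1]
        ),
    }
```

Applied to one synthetic bulk sample, `deconvolution_metrics(estimated, true)` returns a dictionary with the keys `mAD`, `RMSD`, `pearson` and `spearman`; lower values of the first two and higher values of the last two indicate better agreement.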
We then tested the methods on a couple of complex tissues that are more challenging to deconvolute. One is the human trabecular meshwork (TM) tissue. We acquired published single cell data that contain 8758 cells and 12 cell types from 8 donors [39]. The data include 3 similar types of endothelial cells, 2 types of Schwann cells and 2 types of TM cells (Supplementary Fig. 1; Supplementary Table 3). Cells from each donor were pooled as a synthetic bulk sample. The cell type proportions vary from <1% to 43%. These proportions were the ground truth cell composition and were compared head-to-head with the proportions estimated by AdRoit, MuSiC and NNLS. For each synthetic bulk sample, estimations were performed using a reference built from cells of the other donors (i.e., leave-one-out). In each of the 8 samples, the estimates made by AdRoit best approximated the true proportions. In particular, AdRoit had significantly lower mAD (0.016) and RMSD (0.025), and higher correlations (Pearson = 0.97; Spearman = 0.94), compared to MuSiC (mAD = 0.038; RMSD = 0.06; Pearson = 0.83; Spearman = 0.73) and NNLS (mAD = 0.06; RMSD = 0.088; Pearson = 0.69; Spearman = 0.63) (Fig. 2a). We further assessed the deviation of the estimates from the true proportions for each cell type. AdRoit consistently had the lowest deviations from the true proportions for all cell types, as well as the lowest variation among the 8 samples (Fig. 2b, blue dots), indicating higher robustness across cell types and samples.
Notably, AdRoit only missed one rare cell type (true proportion = 0.3%) out of 12 cell types in one sample, while MuSiC missed 1 to 5 cell types in 6 of the 8 samples, and NNLS missed 3 to 7 cell types in all 8 samples (Supplementary Fig. 2, Supplementary Table 4).

AdRoit has better sensitivity and specificity

We next systematically addressed the sensitivity and specificity of these algorithms. In the context of cell type deconvolution, a false negative occurs when the proportion of an existing cell type is predicted to be zero (or below a given threshold). Conversely, a non-zero prediction (or one above a given threshold) for an absent cell type results in a false positive. False negatives and false positives determine the sensitivity and specificity of a deconvolution algorithm, respectively. Both quantities are crucial to establishing the utility of the algorithm. In particular, in real-world applications it is often difficult to know a priori what cell types exist in a bulk sample, so users may ask the algorithm to consider more cell types than are actually in the sample. False positive predictions in this situation would make the algorithm unusable.

We designed a simulation to test the sensitivity and specificity. We selected 6 out of the 12 cell types, i.e., Schwann-cell like cell, TM1, smooth muscle cell, melanocyte, macrophage and pericyte, from each donor sample and pooled them within that sample to synthesize 8 new bulk samples. The unselected 6 cell types are considered absent in the bulk samples.
Some of the cell types present are highly similar to absent ones, challenging the programs to pinpoint the right cell type present in the bulk among similar candidates. We provided the full list of 12 single cell types as reference to the programs to estimate the cell type proportions. NNLS was excluded from this evaluation due to its low benchmarking performance observed earlier (Fig. 2a, b).

Consistently across the 8 samples, AdRoit had very accurate estimates for the 6 present cell types, and zero or close-to-zero estimated values for the cell types absent from the synthetic bulk data. MuSiC was notably less accurate on the 6 selected cell types, and it had many non-negligible values (>1% for 26 out of 48 estimates) for the 6 cell types excluded from the 8 synthetic samples (Fig. 2c, Supplementary Table 5). For example, smooth muscle cells accounted for ~14% in donor 4 but were largely missed (~0.03%) by MuSiC. We noted that TM2 had false non-zero estimates from both methods though it was not included. This is because TM2 is easily mistaken for TM1 due to their high similarity [39]. Nonetheless, AdRoit's estimates of TM2 were consistently small across samples (<1% for 44 out of 48 estimates), while MuSiC had significantly larger estimates of TM2 that occasionally even exceeded the TM1 estimates (donors 5 and 8 in Fig. 2c right). For a systematic comparison, we constructed the receiver operating characteristic (ROC) curve by varying the threshold of detection (i.e., a cutoff below which the cell type was deemed undetected) (Fig. 2d).
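The threshold sweep behind such an ROC curve can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' implementation: `roc_from_estimates` is a hypothetical helper, a cell type counts as detected when its estimated proportion is at or above the cutoff, both present and absent cell types are assumed to occur in the input, and the AUC is obtained with the trapezoidal rule.

```python
import numpy as np


def roc_from_estimates(estimates, present):
    """ROC curve for detecting present cell types from estimated proportions.

    estimates: estimated proportions, one per (sample, cell type) pair.
    present:   booleans, True where the cell type is truly in the sample.
    """
    est = np.asarray(estimates, dtype=float).ravel()
    pres = np.asarray(present, dtype=bool).ravel()
    # Sweep cutoffs from "nothing detected" down to "everything detected".
    cutoffs = np.concatenate(([np.inf], np.sort(np.unique(est))[::-1]))
    tpr, fpr = [], []
    for cutoff in cutoffs:
        detected = est >= cutoff
        tpr.append(np.sum(detected & pres) / pres.sum())      # sensitivity
        fpr.append(np.sum(detected & ~pres) / (~pres).sum())  # 1 - specificity
    tpr, fpr = np.array(tpr), np.array(fpr)
    # Trapezoidal area under the (fpr, tpr) curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

An AUC near 1 means that present cell types receive consistently higher estimates than absent ones, so some cutoff separates them with few false positives and few false negatives.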
AdRoit had a significantly higher area under the curve (AUC) than MuSiC (0.95 vs. 0.74), implying markedly better sensitivity and specificity.

AdRoit outperforms in deconvoluting closely related subtypes

To further evaluate AdRoit when multiple cell subtypes are present in a complex tissue, we performed an scRNA-seq experiment on mouse lumbar dorsal root ganglion (DRG) from five mice. Following the standard analysis pipeline (Methods), we obtained 3352 single cells after quality control procedures. After clustering and annotation, we discovered 14 cell types including multiple subtypes of neuronal cells (Fig. 3a, Supplementary Table 6). The heatmap of the top marker genes showed distinct patterns of the major cell types as well as similar patterns of the subtypes (Extended Data Fig. 2a), and the cell type proportions varied from 0.5% to 33.71% (Extended Data Fig. 2b). These 14 cell types include 3 subtypes of neurofilament containing neurons (i.e., NF_Calb1, NF_Pvalb, NF_Ntrk2.Necab2), 3 subtypes of non-peptidergic neurons (i.e., NP_Nts, NP_Mrgpra3, NP_Mrgprd), and 5 subtypes of peptidergic neurons (i.e., PEP1_Dcn, PEP1_S100a11.Tagln2, PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a, PEP3_Trpm8). Also discovered were tyrosine hydroxylase containing neurons (Th), satellite glia and endothelial cells. Such complex compositions formed a challenging testing ground for evaluating the ability to distinguish closely related cell types. We again performed leave-one-out deconvolution on five synthesized bulk samples.

AdRoit had highly accurate estimations on all cell types across samples (Fig. 3b).
It is worth mentioning that, for the rare cell types that account for less than 5%, AdRoit still produced estimates fairly close to the true proportions and never missed a single cell type, showing that AdRoit is very robust on rare cell types. For example, 0.51% endothelial cells were predicted to be 0.35%, and 1.05% NF2_Ntrk2.Necab2 cells were predicted to be 0.85% (Supplementary Fig. 3, Supplementary Table 7). In contrast, MuSiC and NNLS were notably less accurate, especially for the cell types below 5%, and missed multiple cell types including some large cell clusters accounting for ~10% (PEP1_Slc7a3.Sstr2 cells of Sample5). We further examined the variability of the estimates in each individual sample. We computed the 4 metrics to evaluate the performance on each of the 5 synthetic samples and compared them head-to-head among the algorithms. This fine-grained comparison showed that AdRoit significantly outperformed MuSiC and NNLS on every sample (Fig. 3c). Further, the performance metrics of AdRoit were highly consistent across samples with the lowest variability among the three methods.

AdRoit excels on simulated spatial transcriptomics data

Given the promising performance on complex tissues, we continued to test AdRoit's applicability to spatial transcriptomics data. Spatial transcriptomics data differ from bulk RNA-seq data in that each spot only contains transcripts from a handful of cells (3–30) [12].
Some of the spots contain multiple cells of the same type, while others may have mixtures of cell types at varying mixing percentages (e.g., spatial spots at the boundary of different cell types). Also, because the mixture is a pool of only a few cells, the variations across spatial spots are expected to be greater than in bulk samples. We simulated a large number of spatial spots (3200 in total) by sampling cells from the DRG single cell data above (Methods), then compared AdRoit with Stereoscope over a range of simulation scenarios.

We first tested whether the methods could correctly infer a single cell type when the spots contain cells of only that type. For each of the 14 cell types from DRG, we sampled 10 cells and pooled them to form a spatial spot. We repeated the simulation 100 times for robust testing, then used the full set of 14 cell types as reference to deconvolute the 1400 simulated spots. Both methods were able to identify the correct cell types with indistinguishable accuracy on the simulated cell types (i.e., estimates close to 1) and comparably low estimated values (i.e., estimates close to zero) for other cell types not included when simulating the spots (Extended Data Fig. 3).

We then turned to a more difficult scenario where we sampled cells from the 5 PEP subtypes and mixed them.
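One way to construct a simulated spot of this kind is to draw cells of each type according to target proportions and sum their raw counts per gene. The sketch below is a hedged illustration and not the authors' simulation code (their procedure is described in Methods): `simulate_spot` and its parameters are hypothetical, the number of cells per type is obtained by simple rounding, and cells are drawn without replacement whenever enough are available.

```python
import numpy as np


def simulate_spot(counts, labels, proportions, n_cells=10, rng=None):
    """Pool raw counts of sampled cells into one simulated spot.

    counts:      integer matrix, cells x genes (raw UMI counts).
    labels:      cell type label for each row of counts.
    proportions: dict mapping cell type -> target fraction of the spot.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts)
    labels = np.asarray(labels)
    spot = np.zeros(counts.shape[1], dtype=np.int64)
    for cell_type, fraction in proportions.items():
        # Assumption: round to the nearest whole number of cells.
        n_pick = int(round(fraction * n_cells))
        if n_pick == 0:
            continue
        candidates = np.flatnonzero(labels == cell_type)
        picked = rng.choice(
            candidates, size=n_pick, replace=len(candidates) < n_pick
        )
        # Add up raw counts per gene, without any normalization.
        spot = spot + counts[picked].sum(axis=0)
    return spot
```

Because the counts are summed without normalization, cell types with more total transcripts contribute more reads to the spot, which is the property a deconvolution method has to cope with on real spots.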
We created three simulation schemes for a comprehensive evaluation: 1) all 5 PEP subtypes had the same proportion of 0.2; 2) PEP1_Dcn was 0.1 and the other 4 were 0.225; 3) PEP1_S100a11.Tagln2 and PEP1_Dcn were 0.1, PEP2_Htr3a.Sema5a and PEP1_Slc7a3.Sstr2 were 0.2, and PEP3_Trpm8 was 0.4. Again, each simulation scheme was repeated 100 times. Under each scheme, the estimates by AdRoit consistently centered around the true proportions and the other cell types had very low estimated values (close to zero) (Fig. 4a, Supplementary Table 8). In comparison, though its estimates for the other cell types were also generally close to zero, the estimates of the PEP cells by Stereoscope systematically deviated from the true proportions for all three simulated schemes except for PEP1_S100a11.Tagln2.

We further expanded the simulated spatial spots to a mixture of 3 NP cell types and a mixture of 3 NF cell types. In addition, we sampled NP_Mrgpra3 cells and mixed them with other distinct cell types (i.e., Th, satellite glia and endothelial), as well as NF_Calb1 cells mixed with other distinct cell types, and PEP3_Trpm8 mixed with other distinct cell types. For all these simulated spatial spots, AdRoit's estimates were consistently centered at the true proportions, whereas Stereoscope's estimates deviated in almost all simulated schemes (Extended Data Fig. 4, Supplementary Table 8). We speculate that the main reason Stereoscope underperformed on these simulated spots is that it normalizes the total UMI counts to the same number for all cells. In the real world, a spatial spot is unlikely to be a pool of cells that have the same total RNA transcripts sampled, especially when a spot contains different cell types (e.g., immune cells have about 10-fold fewer total UMIs than neuronal cells or subtypes of neuronal cells). Our simulation pooled the sampled cells by adding up the raw UMI counts per gene, which we believe best mimics the real data.

Next, we asked how sensitive the methods are in detecting rare cell populations. We simulated mixtures of 3 PEP subtypes (i.e., PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a, PEP3_Trpm8) with a series of low proportions of PEP3_Trpm8 (from 0.01 to 0.1 in steps of 0.01), with the other two cell types sharing the remainder equally (Methods). At each given proportion, the simulation was repeated 100 times. We then checked how accurately the proportion of PEP3_Trpm8 cells was estimated. The medians of AdRoit's estimates were always close to the true proportions (Fig. 4b, red lines), whereas those of Stereoscope's estimates were largely lower than the true proportions. Stereoscope also missed the PEP3_Trpm8 cell type in the majority of cases when the simulated proportion was below 0.06. This comparison implied that AdRoit is more advantageous in detecting low-percentage cells. For a complete comparison, we also simulated 5 other types of cell mixtures in the same way. At each given low proportion, we computed how many times out of 100 the low-percentage cell component was detected (estimates > 0.005). AdRoit had systematically higher detection rates, as well as higher consistency across different cell mixtures (Fig. 4c, Supplementary Table 9). Notably, at a simulated proportion of 5%, AdRoit achieved a detection rate of >90%, making it a powerful tool for detecting rare cells.

Though MuSiC was not designed for deconvoluting spatial spots, in principle it can also be applied to spatial transcriptomics data. We thus also compared AdRoit to MuSiC on the same sets of simulation data above. We observed that AdRoit was also significantly more accurate over all simulation scenarios of spatial spots (Fig. 4a, Extended Data Fig. 3 and 4, Supplementary Fig. 4), and more sensitive when detecting low-percentage cells (Fig. 4b, c, Supplementary Fig. 5).

Application to real bulk RNA-seq data of human pancreatic islets

Though using synthetic bulk data based on mixing of single cells is a useful benchmarking strategy, bulk and single cell RNA-seq often use distinct RNA library preparation and sequencing protocols. The capability of a method to deconvolute real bulk samples must be assessed to ensure it is useful in real-world applications. We acquired 70 real human pancreatic islets bulk samples from published studies [38,40,41] (Supplementary Table 10) and used single cell data of the same tissue [38] as reference to infer the percentages of 4 endocrine cell types (i.e., Alpha, Beta, Delta, PP). The 70 bulk samples were collected from 39 distinct donors, including 26 healthy donors and 13 donors with type 2 diabetes (T2D). Each donor contributed 1 to 5 replicated bulk RNA samples.

Replicates from the same donor are expected to have similar compositions and thus were used to assess the reproducibility of the estimates from AdRoit. For all cell types, AdRoit had highly consistent estimates for the same donors (Fig. 5a, Supplementary Table 11). The average standard deviations did not exceed 1% for all 4 cell types (i.e., Alpha: 0.010; Beta: 0.008; Delta: 0.004; PP: 0.002). To seek an independent validation, we obtained cell sorting results by RNA-FISH for 4 of the 39 donors [38] (Supplementary Table 12).
The estimated cell proportions of the 4 donors were highly consistent with the percentages measured by RNA-FISH (Fig. 5b), and the consistency held for both the major cell types (Alpha and Beta) and the minor cell types (Delta and PP). Reproducibility and independent validation showed that AdRoit is reliable in deconvoluting real bulk RNA-seq data.

We then asked if AdRoit can detect known biological differences between healthy and T2D donors. Loss of functional insulin-producing Beta cells is a prominent characteristic of T2D [42–44], typically reflected by an elevated level of hemoglobin A1c (HbA1c) [45,46]. Among the healthy donors, the majority of Beta cell proportions estimated by AdRoit ranged from 50% to 75% (Fig. 5c), in agreement with the known range of Beta cell percentages in human islet tissue [47,48]. A significant decrease in the estimated Beta cell proportions was seen in T2D patients (P value = 4.1e-6). Further, a linear regression of estimated Beta cell proportions on HbA1c levels showed a statistically significant negative association (P value = 1.8e-6). AdRoit adequately reflected the cell composition difference between healthy donors and T2D patients.

Application to mouse brain spatial transcriptomics

We lastly demonstrated an application to real spatial transcriptomics data. Given that the molecular architecture of brain tissue has been well studied, we chose mouse brain spatial transcriptomics data generated by 10x Genomics, containing 2703 spatial spots (Methods). The reference single cell data were acquired from an independent study which contains a comprehensive set of nervous cell types in the brain [32].
We curated the cell types by merging highly similar clusters and arrived at a consolidated set of 46 distinct brain cell types (Methods, Supplementary Table 13).

The cell contents inferred by AdRoit per spot appear to accurately match the expected cell types at that location (Extended Data Fig. 5, Supplementary Table 14). For example, the three subtypes of cortex excitatory neurons each occupied a sub-area in the cerebral cortex region. As another example, the shape of the hippocampal region was delineated by the estimated percentages of dentate gyrus granule/excitatory neurons. For an independent validation, we checked the consistency between the estimated cell types and the in-situ hybridization (ISH) images from the Allen mouse brain atlas [49]. We chose 4 genes highly expressed in 4 brain regions respectively, i.e., Spink8 for hippocampal field CA1, C1ql2 for dentate gyrus, Clic6 for choroid plexus, and Synpo2 for thalamus [32]. The spots enriched with the 4 cell types (i.e., hippocampal CA1 excitatory neuron type 2, dentate gyrus granule neuron type 2, choroid plexus cell, thalamus excitatory neuron type 1), as mapped by AdRoit, precisely co-localized with the strong signals of the 4 marker genes on the ISH images respectively (Fig. 5d). This agreement confirmed that the spatial mapping of cell types by AdRoit is reliable.

Computational efficiency

Besides accuracy and robustness, another major advantage of AdRoit is its order-of-magnitude higher computational efficiency. AdRoit uses a two-step procedure to do the inference.
The first step prepares the reference on single cell data, where per-gene means and dispersions are estimated and cell type specificity is subsequently computed. The built reference can be saved and reused. We tested the running time of the reference building using the aforementioned mouse brain single cell dataset containing ~15,000 cells. It took about 4.5 minutes on a CPU that has 24 cores (23 used for parallel computing). The second step takes the built reference and the target compound data as input and does the estimation. Deconvoluting ~2700 compound RNA-seq samples took around 5 minutes. Therefore, AdRoit in total took less than 10 minutes and ~3 GB of memory on a regular CPU. As a comparison, MuSiC took about 1 hour and 37 minutes on the same data using the same CPU. Stereoscope ran for about 24 hours continuously with the published parameter setting (-scb 256 -sce 75000 -topn_genes 5000 -ste 75000 -lr 0.01 -stb 100 -scb 100) on a powerful V100 GPU with 80 cores and 16G memory, which is prohibitive for seeking a quick turnaround.

Discussion

In this work we have demonstrated that AdRoit is capable of deconvoluting the cell compositions from compound RNA-seq data with leading accuracy, measured by the consistency between the true and predicted cell proportions. Its advantage over the existing state-of-the-art methods was verified over a wide range of use cases. In particular, AdRoit excelled in complex tissues composed of more than ten different cell types with a wide range of cell proportions (e.g., trabecular meshwork, dorsal root ganglion).
In both cases, AdRoit performed significantly better than the comparators MuSiC and NNLS on deconvoluting bulk RNA-seq data. AdRoit is also more accurate and sensitive than Stereoscope in resolving spatial transcriptomics spots, especially in detecting low-percentage cells. Previous benchmarking often assumed that the cell types in the synthetic bulk data are exactly the cell types collected in the reference, so that the only unknown was the proportion of each cell type. This assumption may not hold. Missing existing cell types or falsely predicting non-existing ones can hinder the utility of an algorithm. Thus, besides the overall accuracy, we also examined the sensitivity and specificity of the algorithms. We observed superior sensitivity and specificity in AdRoit, an important advantage for practical use.

The reference single cell data used by AdRoit came from different platforms, such as the 10x Genomics Chromium Instrument (the mouse dorsal root ganglion) and the Fluidigm C1 system (the human pancreatic islets data). AdRoit consistently exhibited excellent performance across all benchmarking datasets, independent of their single cell sequencing technology platforms. More importantly, this holds not only for deconvoluting the synthesized bulk data, but also for the real bulk RNA-seq data. The latter typically does not apply unique molecular barcoding and requires a significantly different cDNA amplification procedure from the one used in single cell RNA-seq (Methods).
Besides, the sequencing depth, read mapping and gene expression quantification are dissimilar as well. The fact that AdRoit accurately dissected the cell compositions in the real bulk samples based on the single cell reference data further supports its cross-platform applicability.

We attribute the power of AdRoit to its comprehensive modeling of relevant factors. Firstly, we think a common rescaling factor is not sufficient to correct the platform difference between single cells and the compound data. Rather, the impact of the platform difference varies considerably from gene to gene and is hardly linear. Correcting such differences entails rescaling factors specifically tailored to each gene. AdRoit uses an adaptive learning approach to estimate such gene-wise correcting factors and performs the correction in a unified model. In addition, the contribution of a gene in a cell type to the loss function is jointly weighted by its specificity and variability in a cell type, where specificity and variability are defined in a way that accounts for the overdispersion of count data. Our observations over the multiple benchmarking datasets also show that the coexistence of similar cell types may have induced a collinearity condition that negatively impacted the regression-based methods developed by others. Being able to alleviate this problem gives AdRoit an edge. All these factors help AdRoit to distinguish similar cell clusters while remaining sensitive enough to separate rare cell types.
Technically, the input profiles of individual cell types to AdRoit do not necessarily have to come from single cell RNA-seq. Bulk RNA-seq profiles of individual isolated cell types can be used as well. Nevertheless, using single cell RNA-seq data as the reference has a few key advantages. It is a high-throughput approach wherein multiple cell types can be interrogated simultaneously. Prior knowledge of the cell types present and of their specific gene markers is not required, which allows novel cell types to be identified. Although detection of lowly expressed genes has been a challenge for single cell RNA-seq, significant enhancements have been demonstrated. For example, the number of detectable genes can currently reach the order of 10,000 per cell and keeps improving⁵⁰. As AdRoit focuses on the informative genes, whose expression is generally high, the detection limit of single cell RNA-seq does not impose a significant drawback. Indeed, given the single cell reference profiles, AdRoit successfully deconvoluted the real bulk RNA-seq data and spatial transcriptomics data. The results suggest that, besides enriching our understanding of bulk transcriptome data, AdRoit can also leverage the vast and continuously growing body of single cell data.

AdRoit is a reference-based deconvolution algorithm. A comprehensive collection of the possible cell components is important. However, completeness may not always be guaranteed.
Even with single cell acquisition, which is independent of prior knowledge, rare and/or fragile cell types may not survive the capture procedure and hence are excluded. It is also difficult to generate a solid reference profile for cells that vary from sample to sample (e.g., tumor cells). Currently AdRoit deals implicitly with components unknown to the reference. If an unknown cell type resembles one of the referenced ones, it may be considered part of the known cell type and their joint population is predicted. Such an outcome is acceptable, as treating two similar cell types as one is still biologically meaningful, although the resolution of the system may be compromised. If the unknown component is dissimilar to all the known ones, it will be ignored by AdRoit because its representative markers are unlikely to be among the top weighted genes associated with the known components. At the same time, the distinct component is expected to have a unique gene expression pattern and is thus unlikely to interfere significantly with the gene expression from the known cell types. Therefore, AdRoit essentially deconvolutes the relative populations among the known cell components. For example, AdRoit was able to correctly uncover the populations of 4 endocrine cell types from the human islet bulk data despite the absence of many other cell types, such as macrophages, Schwann cells and endothelial cells, in the input single cell reference²⁰. Although the absolute percentages of the cells remain obscure under such a circumstance, we expect that their relative proportions can still be studied and remain valuable. A future improvement is to explicitly
model the unknown cell types and estimate their percentages from the signals in the compound data that cannot be explained by the contribution of the known components.

Methods

Gene selection
AdRoit selects genes that contain information about cell type identity, excluding non-informative genes that potentially introduce noise. There are two ways of selecting such genes: 1) the union of the genes whose expression is enriched in one or more cell types in the single cell UMI count matrix, referred to as marker genes; 2) the union of the genes that vary the most across all the cells in the single cell UMI count matrix, referred to as highly variable genes. For marker genes, we recommend selecting the top ~200 genes (P value < 0.05), ranked by fold change, from each cell type for resolving complex compound transcriptome data. Because some genes may mark more than one cell type, we further require that a selected marker be present in no more than 5 cell types to ensure specificity. We also suggest selecting at least 1,000 unique genes in total for an accurate estimation. If this is not met, one may increase the number of top genes and/or loosen the P value cutoff.

AdRoit also offers the option to use highly variable genes. To prevent the selected highly variable genes from being dominated by large cell clusters while small clusters are underrepresented, AdRoit first balances the cell types in the single cell UMI count matrix by finding the median size among all cell clusters and then sampling cells from each cluster to match this size. Next, AdRoit computes the variance of each gene across the cells in the balanced single cell UMI matrix. Due
to the well-known dispersion effect in RNA-seq data, directly computing variances from the count matrix can result in overestimation. We thus compute variances on data normalized by variance-stabilizing transformation (VST)⁵¹. The 2,000 genes with the largest variances are then selected.

In both cases, mitochondrial genes were excluded, as their expression carries no information about cell identity. The results shown in the current paper were based on the marker genes as described above, but we also demonstrated that using the balanced highly variable genes yields comparably accurate estimations (Supplementary Fig. 6).

Estimate gene mean and dispersion per cell type
Modeling single cell RNA-seq data is challenging due to cellular heterogeneity, technical sensitivity, and noise. While the expression of some genes can go undetected by chance, other genes may be found to be highly dispersed. These factors can lead to excessive variability even within the same cell type. AdRoit combats high noise and computational complexity by building models with the estimated mean and dispersion per cell type. This strategy reduces the data complexity while preserving the cell type specific information.

Although typical analyses of RNA-seq data start with normalization, AdRoit does not normalize prior to the mean estimation. Performing a normalization across all cell types forces every cell type to have the same amount of RNA transcripts, measured by the total unique molecular identifier (UMI) counts per cell. However, different cell types can have
dramatically different amounts of transcripts. For example, the amount of RNA transcripts in neuronal cells is about 10-fold that in glial cells. Thus, normalization can falsely alter the relative abundance of cell types, misleading the estimation of cell type percentages. To avoid this problem, AdRoit models the means using the raw UMI counts.

Studies have shown that UMI counts follow a negative binomial distribution⁵²,⁵³; we therefore fit negative binomial distributions to the single cells of each cell type and build the model on the estimated means and dispersions of the selected genes. More specifically, let $X_{ik}$ be the set of single cell UMI counts of gene $i \in 1,\dots,I$ for all cells in cell type $k \in 1,\dots,K$, where $I$ is the number of selected genes and $K$ denotes the number of cell types in the single cell reference. $X_{ik}$ follows a negative binomial distribution,

$$X_{ik} \sim NB(\lambda_{ik}, p_{ik}), \tag{1}$$

where $\lambda_{ik}$ is the dispersion parameter of gene $i$ in cell type $k$, and $p_{ik}$ is the success probability, i.e., the probability of gene $i$ in cell type $k$ getting one UMI. The two parameters are estimated by maximum likelihood estimation (MLE). The likelihood function is

$$LH(\lambda_{ik}, p_{ik} \mid X_{ik}) = \prod_{n=1}^{n_k} f(X_{ik} \mid \lambda_{ik}, p_{ik}), \tag{2}$$

where $n_k$ is the number of cells in cell type $k$, and $f$ is the probability mass function of the negative binomial distribution. The MLE estimates are then given by

$$(\hat{\lambda}_{ik}, \hat{p}_{ik}) = \underset{\lambda_{ik},\, p_{ik}}{\arg\max}\; LH(\lambda_{ik}, p_{ik} \mid X_{ik}). \tag{3}$$

Once the success probability and dispersion are estimated, the mean estimates can be computed numerically according to the property of the negative binomial distribution,
$$\mu_{ik} = \frac{\hat{\lambda}_{ik} \cdot \hat{p}_{ik}}{1 - \hat{p}_{ik}}, \tag{4}$$

$$\sigma_{ik}^{2} = \frac{\hat{\lambda}_{ik} \cdot \hat{p}_{ik}}{(1 - \hat{p}_{ik})^{2}}. \tag{5}$$

MLE is readily available in many R packages. We chose the ‘fitdist’ function from the ‘fitdistrplus’ package⁵⁴ for its fast computation and its flexibility in selecting distributions. Estimation is done for each selected gene in each cell type, resulting in an $I \times K$ matrix of cell type means.

Cell type specificity of genes
Genes with cell type specific expression patterns better represent cell types and are thus more important when used for resolving cell type composition. In line with this property, AdRoit weights genes with high specificity more than less specific ones. Highly specific genes usually have consistently high expression and thus relatively low variance among cells within a cell type. To compute the cell type specificity of a gene, we first identify the cell type in which the gene has the highest expression (i.e., the most specifically expressed cell type), then define the specificity of this gene as the mean-to-variance ratio within that cell type. A high ratio gives the gene a high weight in the model. We use the estimated means and variances from the negative binomial fitting ($\mu_{ik}$ and $\sigma_{ik}^{2}$ in eq. 4 and 5). Let $\tilde{k}$ be the index of the cell type that has the highest mean expression of gene $i$,

$$\tilde{k} = \underset{k}{\arg\max}\, \{\mu_{ik} \mid k \in 1 \dots K\}, \tag{6}$$

then the cell type specificity weight for gene $i$, denoted $w_i^{S}$, is given by

$$w_i^{S} = \frac{\mu_{i\tilde{k}}}{\sigma_{i\tilde{k}}^{2}}, \tag{7}$$
and it is computed for each gene in the set of selected genes.

Cross-sample gene variability
The variability of a gene reflects how stable the gene is across samples. The idea of weighting genes based on variability across samples was first explored by Wang et al.²², where variability was defined as the cross-sample variance. By down-weighting the highly variable genes, the authors achieved a great advantage over the traditional unweighted method. Genes with low cross-sample variability better represent the population and are hence more trustworthy for learning the cell composition. AdRoit incorporates the same notion to weight the importance of genes, but defines the variability in a more sophisticated way. Similar to how we define the cell type specificity, AdRoit uses the mean and variance, and computes the variance-to-mean ratio (VMR) to represent cross-sample gene variability. Here, however, the mean and variance are computed across samples. The VMR is better scaled than the simple variance: it avoids underweighting genes that have low expression while also avoiding overweighting genes that are hugely dispersed.

In addition, AdRoit extends the method to the case where multiple samples are not available. We propose three ways to compute the VMR, depending on whether multi-sample data is available. Typically, the compound transcriptome data to be deconvolved have multiple samples. In bulk RNA-seq data, multiple samples are usually included to control for biological variability. In spatial transcriptome data, the spatial spots can be seen as multiple samples. Therefore, we first consider computing the cross-sample gene variability from compound
transcriptome data. If multiple samples are not available for the compound data, AdRoit uses the single cell reference and synthesizes compound samples by pooling all cells belonging to the same sample. If multiple samples are available for neither dataset, AdRoit subsamples single cells and pools them to make pseudo samples. Let $Y_{ij}$ denote the counts of sequences for gene $i$ in sample $j \in 1,\dots,J$, then

$$Y_{ij} \sim NB(\lambda_{ij}, p_{ij}), \tag{8}$$

where $\lambda_{ij}$ is the dispersion parameter of gene $i$ in sample $j$, and $p_{ij}$ is the success probability. Again, we use MLE to get the estimates $\hat{\lambda}_{ij}$ and $\hat{p}_{ij}$, from which the cross-sample mean and variance can be numerically computed:

$$\mu_i^{C} = \frac{\hat{\lambda}_{ij} \cdot \hat{p}_{ij}}{1 - \hat{p}_{ij}}, \tag{9}$$

$$(\sigma_i^{2})^{C} = \frac{\hat{\lambda}_{ij} \cdot \hat{p}_{ij}}{(1 - \hat{p}_{ij})^{2}}, \tag{10}$$

and the cross-sample variability for gene $i$ is then defined as

$$VMR_i = \frac{(\sigma_i^{2})^{C}}{\mu_i^{C}} = \frac{1}{w_i^{V}}, \tag{11}$$

where $w_i^{V}$ is later used in the model. The cross-sample variability weight is computed for each gene in the set of selected genes.

Gene-wise scaling factor to correct platform bias
When linking the compound data to the single cell data, a rescaling factor is often used to account for library size and platform differences. The existing methods adopt a single rescaling factor for each sample, i.e., all genes of a single sample are multiplied by the same factor²²,²³. This operation is based on a strong assumption that the impact of platform
difference on every gene is the same and scales linearly among different cell types, which is hardly true. In addition, because estimates in a linear model are easily affected by outliers, the estimation of cell proportions can be steered away from the truth by extremely highly expressed genes. Therefore, applying a uniform scaling factor to all genes is inappropriate.

To overcome this problem, AdRoit instead estimates gene-wise scaling factors via an adaptive learning strategy and rescales each gene with its respective scaling factor. To proceed, we first take as input the mean gene expression from the compound samples ($\mu_i^{C}$ in eq. 9) and the estimated means of each cell type from the single cell data ($\mu_{ik}$ in eq. 4), then apply a traditional non-negative least squares regression (NNLS) to get a rough estimate of the proportion of each cell type, denoted $\tau_k$. For each gene, a predicted mean expression ($\sum_k^K \hat{\tau}_k \mu_{ik}$ in eq. 13) is computed as the weighted sum of the means of each cell type, wherein the weights are the roughly estimated proportions. The regression equation is given by

$$\mu_i^{C} = A \cdot \Big(\sum_k^K \tau_k \mu_{ik} + \varepsilon\Big), \quad 0 < \tau_k, \quad \sum_k^K \tau_k = 1, \tag{12}$$

where $A$ is a constant ensuring that the $\tau_k$'s sum to 1 and $\varepsilon$ is the error term. We use the ‘nnls’ function in the ‘nnls’ package⁵⁵ to estimate the $\tau_k$'s. Next, we calculate the ratio between the mean expression from the compound samples and the predicted means, and define the gene-wise rescaling factor as the logarithm of the ratio plus 1,

$$r_i = \log\left(\frac{\mu_i^{C}}{\sum_k^K \hat{\tau}_k \mu_{ik}} + 1\right). \tag{13}$$

Given the dispersion property of count data, the logarithm of the ratio is a more appropriate statistic, as it results in relatively stable scaling factors. The addition of 1 avoids taking the logarithm of zero.
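As a rough illustration of eq. 13, the sketch below computes gene-wise rescaling factors in plain Python from three inputs: the compound-sample means, the reference cell type means, and rough proportions that are taken as given rather than estimated by NNLS. The function name `gene_scaling_factors`, the toy numbers, and the choice of the natural logarithm are assumptions made for illustration only; this is not AdRoit's R implementation.

```python
import math


def gene_scaling_factors(mu_compound, mu_ref, tau):
    """Return r_i = log(mu_compound[i] / sum_k tau[k] * mu_ref[i][k] + 1).

    mu_compound[i]: cross-sample mean of gene i in the compound data.
    mu_ref[i][k]:   reference mean of gene i in cell type k.
    tau[k]:         rough proportion of cell type k (assumed already estimated).
    """
    factors = []
    for mu_c, ref_row in zip(mu_compound, mu_ref):
        predicted = sum(t * m for t, m in zip(tau, ref_row))
        if predicted <= 0:
            raise ValueError("predicted mean must be positive")
        # Natural log is an assumption; the base is not stated in the text.
        factors.append(math.log(mu_c / predicted + 1.0))
    return factors


# Hypothetical toy data: 3 genes, 2 cell types.
mu_ref = [[10.0, 0.0], [0.0, 20.0], [5.0, 5.0]]
tau = [0.5, 0.5]
mu_compound = [5.0, 10.0, 50.0]  # the third gene is 10x above its prediction
factors = gene_scaling_factors(mu_compound, mu_ref, tau)
```

In this toy input the third gene is ten times higher in the compound data than predicted, yet its factor is only about 3.5 times that of the two well-predicted genes, which illustrates how the logarithm keeps the factors in a stable range.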
By multiplying by the flexible gene-wise rescaling factor, the “outlier” genes are pushed toward the true regression line, while the genes around the true regression line are less affected (Fig. 1b).

Weighted and regularized model
We next designed a model that incorporates all these factors to do the actual estimation of cell type proportions. AdRoit builds upon the non-negative least squares regression model. It gives high weights to the genes with high cell type specificity and low cross-sample variability. This is done by optimizing a weighted sum-of-squares loss function $L$, where the weights consist of two components ($w_i^{S}$ in eq. 7, $w_i^{V}$ in eq. 11). The gene-wise scaling factor tailored to each gene ($r_i$ in eq. 13) effectively corrects the bias due to the technology difference between compound samples and single cell data. In complex tissues (e.g., neural tissues) where many highly similar subtypes are common, closely related subtypes can have strong collinearity, leading to overestimation of some cell types while others are underestimated or missed. AdRoit handles this problem by including an L2 norm of the estimates as the regularization component. Denote $\beta_k$ as the unscaled coefficient for cell type $k$. For a compound transcriptome sample $j$, the loss function is given by

$$L_j(\beta_1,\dots,\beta_K \mid y_{ij}, w_i^{S}, w_i^{V}, r_i, \hat{\mu}_{ik}) = \sum_i^I w_i^{S} \cdot w_i^{V} \cdot \Big(y_{ij} - r_i \cdot \sum_k^K \beta_k \hat{\mu}_{ik}\Big)^{2} + \sum_k^K \beta_k^{2}. \tag{14}$$

Then the coefficients $\beta_k$ can be estimated by minimizing the loss function under the constraint $\beta_1,\dots,\beta_K > 0$,

$$\hat{\beta}_1,\dots,\hat{\beta}_K = \underset{\beta_1,\dots,\beta_K > 0}{\arg\min}\; L_j. \tag{15}$$
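A minimal sketch of this constrained minimization is given below, assuming the two weights have already been multiplied into one per-gene weight. It implements the loss of eq. 14 and its partial derivatives with respect to each coefficient, minimizes under the non-negativity constraint, and finally rescales the coefficients to sum to one. The solver is a plain fixed-step projected gradient descent, swapped in for brevity and not the optimizer used by AdRoit; the function names, the step-size rule, and the toy data are hypothetical.

```python
def loss(beta, y, mu, w, r):
    """Eq. 14: weighted squared error plus an L2 penalty on beta.

    y[i]: compound count of gene i; mu[i][k]: reference mean of gene i in type k;
    w[i]: combined per-gene weight (specificity weight times variability weight);
    r[i]: gene-wise rescaling factor.
    """
    total = sum(b * b for b in beta)
    for y_i, mu_i, w_i, r_i in zip(y, mu, w, r):
        fitted = r_i * sum(b * m for b, m in zip(beta, mu_i))
        total += w_i * (y_i - fitted) ** 2
    return total


def gradient(beta, y, mu, w, r):
    """Partial derivatives of the loss with respect to each beta[k]."""
    grad = [2.0 * b for b in beta]
    for y_i, mu_i, w_i, r_i in zip(y, mu, w, r):
        resid = y_i - r_i * sum(b * m for b, m in zip(beta, mu_i))
        for k, m in enumerate(mu_i):
            grad[k] -= 2.0 * r_i * m * w_i * resid
    return grad


def fit_proportions(y, mu, w, r, n_iter=2000):
    """Projected gradient descent: take a step, then clip beta at zero."""
    n_types = len(mu[0])
    # Step size 1/L, with L an upper bound on the curvature of the quadratic
    # loss (valid for non-negative weights).
    bound = 2.0 + 2.0 * sum(
        w_i * r_i ** 2 * sum(m * m for m in mu_i)
        for mu_i, w_i, r_i in zip(mu, w, r)
    )
    step = 1.0 / bound
    beta = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        grad = gradient(beta, y, mu, w, r)
        beta = [max(0.0, b - step * g) for b, g in zip(beta, grad)]
    total = sum(beta)
    if total == 0.0:
        raise ValueError("all coefficients are zero; proportions are undefined")
    theta = [b / total for b in beta]  # rescale coefficients to sum to one
    return beta, theta


# Hypothetical toy example: 3 genes, 2 cell types, true mixture 70% / 30%.
mu = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
y = [7.0, 3.0, 5.0]
w = [1.0, 1.0, 1.0]
r = [1.0, 1.0, 1.0]
beta_hat, theta = fit_proportions(y, mu, w, r)
```

Because of the L2 penalty the coefficients are shrunk slightly, so the recovered proportions in this toy case are close to, but not exactly, 0.7 and 0.3.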
The estimation is done by the gradient projection method of Byrd et al.⁵⁶. We derive the gradient function by taking the partial derivative of the loss function with respect to $\beta_k$,

$$G_k = \nabla_{\beta_k} L_j = -2 \sum_i^I r_i \cdot \hat{\mu}_{ik} \cdot w_i^{S} \cdot w_i^{V} \cdot \Big(y_{ij} - r_i \cdot \sum_k^K \beta_k \hat{\mu}_{ik}\Big) + 2\beta_k. \tag{16}$$

AdRoit uses the function ‘optim’ from the R package ‘stats’ to do the estimation⁵⁷, providing the loss function (eq. 14) and the gradient (eq. 16). To get the final estimates of cell type proportions, we rescale the coefficients $\hat{\beta}_k$ so that they sum to 1,

$$\theta_k = \frac{\hat{\beta}_k}{\sum_k^K \hat{\beta}_k}. \tag{17}$$

Each compound sample $j$ is estimated independently by the model described above.

Simulation of bulk RNA-seq and spatial transcriptomics data
Bulk RNA-seq data used for benchmarking are synthesized by adding up the raw UMI reads per gene from all single cells of a sample regardless of cell type. Denote $t_k$ as a cell in cell type $k$, with $t_k \in 1,\dots,T_k$, where $T_k$ is the number of cells in cell type $k$. Let $Y_{ij}^{b}$ be the read count of gene $i$ in a synthesized bulk sample $j$, and $X_{ijt_k}$ be the UMI count of the gene, then

$$Y_{ij}^{b} = \sum_k^K \sum_{t_k}^{T_k} X_{ijt_k}.$$

The true proportion of cell type $k$ is given by

$$\theta_k^{0} = \frac{T_k}{\sum_k^K T_k}.$$

To simulate spatial transcriptomic spots, we first sample 10 cells without replacement from each cell type and add them up, then mix them with designed proportions. For example, to simulate a spot with $p_k$ percent of cell type $k$, the read count $Y_{ij}^{s}$ of gene $i$ in a spatial spot $j$ is given by

$$Y_{ij}^{s} = \sum_k^K p_k \sum_{n=1}^{10} X_{ikn},$$
where $X_{ikn}$ is the UMI count of gene $i$ in a sampled cell $n$ of cell type $k$. For each mixing scheme, the simulation is repeated 100 times.

Evaluation statistics
We compared the estimated cell type proportions with the ground truth by calculating 4 statistics. The mAD and RMSD are given by

$$mAD = \frac{\sum_k^K |\theta_k - \theta_k^{0}|}{K},$$

$$RMSD = \sqrt{\frac{\sum_k^K (\theta_k - \theta_k^{0})^{2}}{K}}.$$

The Pearson correlation coefficient is computed as

$$\rho_p = \frac{\sum_k^K (\theta_k - \bar{\theta}_k)(\theta_k^{0} - \bar{\theta}_k^{0})}{\sqrt{\sum_k^K (\theta_k - \bar{\theta}_k)^{2}} \sqrt{\sum_k^K (\theta_k^{0} - \bar{\theta}_k^{0})^{2}}},$$

where $\bar{\theta}_k$ and $\bar{\theta}_k^{0}$ are the means of the estimated proportions and true proportions, respectively. The Spearman correlation coefficient is given by

$$\rho_s = \frac{\sum_k^K (r_k - \bar{r}_k)(r_k^{0} - \bar{r}_k^{0})}{\sqrt{\sum_k^K (r_k - \bar{r}_k)^{2}} \sqrt{\sum_k^K (r_k^{0} - \bar{r}_k^{0})^{2}}},$$

where $r_k$ is the rank of $\theta_k$.

Single cell RNA sequencing of mouse dorsal root ganglion
As described previously⁵⁸, lumbar DRGs were isolated from adult C57BL/6 mice and transferred to a dissociation buffer (Dulbecco's modified Eagle's medium supplemented with 10% heat-inactivated Fetal Calf Serum) (Gibco; cat # A38400-02). To generate a single cell suspension, DRGs were subjected to a 2-step enzymatic dissociation followed by a mechanical dissociation.
In brief, DRGs were first incubated with 0.125% collagenase P from Clostridium histolyticum (Roche Applied Science; cat # 11249002001) for 90 minutes in an Eppendorf Thermomixer C (37°C; intermittent 750 rpm shaking for about 10 sec every 2 minutes). Then, DRGs were transferred to Hank's Balanced Salt Solution (HBSS, Mg2+ and Ca2+ free; Invitrogen) supplemented with 0.25% Trypsin (Worthington Biochemical Corp.; cat # LS003707) and 0.0025% EDTA and incubated for 10 minutes at 37°C in the Eppendorf Thermomixer C. Trypsin was neutralized by the addition of 2.5 mg/ml MgSO4 (Sigma; cat # M-3937) and DRGs were triturated with Pasteur pipettes. The resulting cell suspension was passed through a 70 µm mesh filter to remove remaining chunks of tissue and centrifuged for 5 minutes at 2500 rpm at room temperature. The pellet was resuspended in HBSS (Ca2+, Mg2+ free; Invitrogen) and the cell suspension was run on a 30% Percoll Plus gradient (Sigma GE17-5445-02) to further remove debris. Finally, cells were resuspended in PBS supplemented with 0.04% BSA at a concentration of 200 cells/µl, and cell viability was determined using the automated cell analyzer NucleoCounter® NC-250™. The suspended single cells were loaded on a Chromium Single Cell Instrument (10X Genomics) with about 6000 cells per lane to minimize the presence of doublets. 2000-3000 cells per lane were recovered. RNA-seq libraries were constructed using the Chromium Single Cell 3’ Library, Gel Beads & Multiplex Kit (10X Genomics). Single end sequencing was performed on an Illumina NextSeq500. Read 1 starts with a 26-bp UMI and cell barcode, followed by an 8-bp i7 sample index. Read 2 contains a 55-bp transcript read.
Sample de-multiplexing, alignment, filtering, and UMI counting were conducted using the Cell Ranger Single-Cell Software Suite⁵⁹ (10X Genomics, v2.0.0). The mouse mm10 genome assembly and UCSC gene model were used for the alignment.

Data preprocessing

DRG single cell data
The UMI data output from the Cell Ranger Single-Cell Software Suite (10X Genomics, v2.0.0) was analyzed using the Seurat package⁶⁰ to assess the cell quality and identify cell types, similar to what was described previously³⁹. Cells with fewer than 500 or more than 15000 detected genes, or with a UMI ratio of mitochondria-encoded genes versus all genes over 0.1, were removed. The UMI data was normalized by the ‘NormalizeData’ method in Seurat with default settings. To avoid potential sample-to-sample variation caused by technical variation at various experimental steps, we employed the Seurat data integration method. The top 2000 variable genes of each of the 5 samples were identified using ‘FindVariableFeatures’ with selection.method=‘vst’. Based on the union of these variable genes, the anchor cells in each sample were identified by ‘FindIntegrationAnchors’. All the samples were then integrated by ‘IntegrateData’. We subsequently scaled the integrated data (‘ScaleData’) and performed dimension reduction (‘RunPCA’). Cells were then clustered based on the first 15 principal components by applying ‘FindNeighbors’ and ‘FindClusters’ (resolution=0.6, algorithm=1). Marker genes for each cluster were identified using ‘FindAllMarkers’.
Parameters were set such that these genes were expressed in at least 25% of the cells in the cluster, and on average 2-fold higher than in the rest of the cells, with a multiple-testing adjusted Wilcoxon test p value of less than 0.01. The specificity of the canonical cell type-specific genes or cell cluster-specific genes was further examined by visualization (Extended Data Fig. 2) and used to define the cell type of each cluster. At the end, the original UMI data from 17271 genes and 3352 cells that passed the quality control were organized into a matrix (genes as rows and cell identifiers as columns). This matrix, together with the cell type label of each cell therein, was loaded into AdRoit as reference profiles.

Mouse brain single cell data
The scRNA-seq reference data of the mouse brain were obtained from Zeisel et al.³². Among all the available data, we only retained the 96,572 cells that were acquired from the brain regions, had a cell type assigned by the authors, and had a minimal total UMI count of 1000. These cells corresponded to 183 clusters at the finest taxonomy level in the original study. As many of the clusters are highly similar, we decided to merge some of them to simplify the reference landscape. First, the top 50 cluster-enriched markers were derived using Scanpy⁶¹ via the ‘rank_genes_groups’ function (method=‘wilcoxon’), following normalization (‘normalize_per_cell’), log transformation (‘log1p’) and regressing out (‘regress_out’) the variances associated with the total UMI and the percentage of mitochondrial chromosome encoded genes per cell.
Then, the pair-wise overlapping p-values among the clusters were calculated using the top 50 marker genes assuming the hypergeometric null distribution. Last, clusters with overlapping p-values more significant than 1e-10 were merged, and new names were assigned by jointly considering the original annotation, the molecular features and the specificity to certain brain regions. A total of 46 cell types were determined that cover all the 12 brain regions and their important substructures37 (Supplementary Table 13). To make the reference dataset more manageable in size and more balanced in the representation of cell types, we down-sampled each cluster to no more than 360 cells. A final set of 14,666 cells over 46 cell types was used for the deconvolution of the mouse brain spatial transcriptome data.

Human Islets

We used the 1492 high-quality human islet single cells and their annotations from Xin et al.38. The RPKM expression table was directly downloaded and used as is. The RNA-FISH data were also from this study38. For the real bulk human pancreatic islets data38,40,41, the read count tables were deconvoluted. Only data from donors with HbA1C level available were included in the regression of Beta cell proportion on HbA1C level (Fig. 4c, Supplementary Table 10).

Trabecular Meshwork

We downloaded the raw sequence data and followed the same analysis procedure as in Patel et al.39 for quality control and cell type identification.
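The cluster-merging criterion described above for the mouse brain reference (a hypergeometric test on the overlap of two clusters' top marker genes, with a 1e-10 cutoff) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the use of an upper-tail probability, and the size of the background gene universe (`n_genes`) are assumptions, since the text does not state them.

```python
from math import comb


def overlap_pvalue(markers_a, markers_b, n_genes):
    """Upper-tail hypergeometric p-value for the overlap of two marker sets.

    Assumption: the background universe contains n_genes genes, and the
    p-value is P(X >= k), where k is the observed number of shared markers.
    """
    a, b = set(markers_a), set(markers_b)
    k = len(a & b)
    total = comb(n_genes, len(b))  # all ways to draw |b| genes from the universe
    upper = min(len(a), len(b))
    # Sum the probabilities of observing k or more shared genes.
    tail = sum(
        comb(len(a), i) * comb(n_genes - len(a), len(b) - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def should_merge(markers_a, markers_b, n_genes, cutoff=1e-10):
    """Hypothetical helper: flag a cluster pair whose overlap passes the cutoff."""
    return overlap_pvalue(markers_a, markers_b, n_genes) < cutoff


if __name__ == "__main__":
    # Toy example: 3 markers each, 1 shared, universe of 10 genes.
    print(overlap_pvalue([0, 1, 2], [2, 3, 4], 10))
```

How flagged pairs are grouped into final merged clusters is not specified in the text, so the sketch stops at the pairwise decision.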
Mouse Brain Spatial transcriptomics data by 10x Visium platform

The filtered cell matrix, tissue image and spatial coordinates of a coronal section of an adult C57BL/6 mouse brain were downloaded from 10x Genomics and used as is.

Mouse Brain ISH images

The ISH images were directly downloaded from the Allen Mouse Brain Atlas37 by searching the gene names. The images were used without further editing except for cropping.

Data availability

DRG single cell data are deposited at NCBI GEO (accession number: GSE163252). The bulk RNA-seq and RNA-FISH data for human pancreatic islets were initially published as aggregated data, with the data processing and experimental procedures described therein38,40,41. We acquired the individual sample data from the authors and released them along with the current study (Supplementary Table 10 and Supplementary Table 12). The other public data analyzed in this study are available from: GEO (human pancreatic islets single cell data: GSE81608); NCBI (human trabecular meshwork single cell data: PRJNA616025; mouse brain single cell data: SRP135960). Mouse brain spatial transcriptomic data was downloaded from the 10x Genomics website (https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Adult_Mouse_Brain_Coronal_Section).

Code availability

AdRoit’s source code is available on Github (https://github.com/TaoYang-dev/AdRoit).

Software

The statistical analyses were done with R statistical software (v3.6.0)57 and python (v3.7.2)62.
The packages used include Seurat (v3.0.1)60, scanpy (v1.6.0)61, dplyr (v0.8.0.1)63, doParallel (v1.0.14)64, data.table (v1.12.4)65, fitdistrplus (v1.1-1)54 and nnls (v1.4)55.

References

1. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics (2009) doi:10.1038/nrg2484.
2. Chu, G. C., Kimmelman, A. C., Hezel, A. F. & DePinho, R. A. Stromal biology of pancreatic cancer. Journal of Cellular Biochemistry (2007) doi:10.1002/jcb.21209.
3. Bussard, K. M., Mutkus, L., Stumpf, K., Gomez-Manzano, C. & Marini, F. C. Tumor-associated stromal cells as key contributors to the tumor microenvironment. Breast Cancer Research (2016) doi:10.1186/s13058-016-0740-2.
4. Munn, D. H. & Bronte, V. Immune suppressive mechanisms in the tumor microenvironment. Current Opinion in Immunology (2016) doi:10.1016/j.coi.2015.10.009.
5. Gonzalez, H., Hagerling, C. & Werb, Z. Roles of the immune system in cancer: From tumor initiation to metastatic progression. Genes and Development (2018) doi:10.1101/GAD.314617.118.
6. Garner, H. & de Visser, K. E. Immune crosstalk in cancer progression and metastatic spread: a complex conversation. Nature Reviews Immunology (2020) doi:10.1038/s41577-019-0271-z.
7. Singh, U. P. et al. Chemokine and cytokine levels in inflammatory bowel disease patients. Cytokine (2016) doi:10.1016/j.cyto.2015.10.008.
8. Van Lint, P. & Libert, C. Chemokine and cytokine processing by matrix metalloproteinases and its effect on leukocyte migration and inflammation. J. Leukoc. Biol. (2007) doi:10.1189/jlb.0607338.
9. Zelová, H.
& Hošek, J. TNF-α signalling and inflammation: Interactions between old acquaintances. Inflammation Research (2013) doi:10.1007/s00011-013-0633-0.
10. Koelman, L., Pivovarova-Ramich, O., Pfeiffer, A. F. H., Grune, T. & Aleksandrova, K. Cytokines for evaluation of chronic inflammatory status in ageing research: Reliability and phenotypic characterisation. Immun. Ageing (2019) doi:10.1186/s12979-019-0151-1.
11. Landskron, G., De La Fuente, M., Thuwajit, P., Thuwajit, C. & Hermoso, M. A. Chronic inflammation and cytokines in the tumor microenvironment. Journal of Immunology Research (2014) doi:10.1155/2014/149185.
12. Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science (2016) doi:10.1126/science.aaf2403.
13. Vickovic, S. et al. High-definition spatial transcriptomics for in situ tissue profiling. Nat. Methods (2019) doi:10.1038/s41592-019-0548-y.
14. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods (2009) doi:10.1038/nmeth.1315.
15. Denisenko, E. et al. Systematic assessment of tissue dissociation and storage biases in single-cell and single-nucleus RNA-seq workflows. Genome Biol. (2020) doi:10.1186/s13059-020-02048-6.
16. Nguyen, Q. H., Pervolarakis, N., Nee, K. & Kessenbrock, K. Experimental considerations for single-cell RNA sequencing approaches. Frontiers in Cell and Developmental Biology (2018) doi:10.3389/fcell.2018.00108.
17. Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. Nature (2017) doi:10.1038/nature21350.
18. Abbas, A.
R., Wolslegel, K., Seshasayee, D., Modrusan, Z. & Clark, H. F. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One (2009) doi:10.1371/journal.pone.0006098.
19. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods (2015) doi:10.1038/nmeth.3337.
20. Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst. (2016) doi:10.1016/j.cels.2016.08.011.
21. Tsoucas, D. et al. Accurate estimation of cell-type composition from gene expression data. Nat. Commun. (2019) doi:10.1038/s41467-019-10802-z.
22. Wang, X., Park, J., Susztak, K., Zhang, N. R. & Li, M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. (2019) doi:10.1038/s41467-018-08023-x.
23. Andersson, A. et al. Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography. Commun. Biol. 3, 565 (2020).
24. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. (2019) doi:10.1038/s41587-019-0114-2.
25. Myung, I. J. Tutorial on maximum likelihood estimation. J. Math. Psychol. (2003) doi:10.1016/S0022-2496(02)00028-7.
26. Bassett, R. & Deride, J. Maximum a posteriori estimators as a limit of Bayes estimators. Math. Program. (2019) doi:10.1007/s10107-018-1241-0.
27. Zhao, Y. & Simon, R. Gene expression deconvolution in clinical samples. Genome Medicine (2010) doi:10.1186/gm214.
28. Chiu, Y. J., Hsieh, Y. H. & Huang, Y. H. Improved cell composition deconvolution method of bulk gene expression profiles to quantify subsets of immune cells. BMC Med. Genomics (2019) doi:10.1186/s12920-019-0613-5.
29. Kang, K. et al. CDSeq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression data. PLoS Comput. Biol. (2019) doi:10.1371/journal.pcbi.1007510.
30. Qiao, W. et al. PERT: A Method for Expression Deconvolution of Human Blood Samples from Varied Microenvironmental and Developmental Conditions. PLoS Comput. Biol. (2012) doi:10.1371/journal.pcbi.1002838.
31. Zaitsev, K., Bambouskova, M., Swain, A. & Artyomov, M. N. Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nat. Commun. (2019) doi:10.1038/s41467-019-09990-5.
32. Zeisel, A. et al. Molecular Architecture of the Mouse Nervous System. Cell (2018) doi:10.1016/j.cell.2018.06.021.
33. Donovan, M. K. R., D’Antonio-Chronowska, A., D’Antonio, M. & Frazer, K. A. Cellular deconvolution of GTEx tissues powers discovery of disease and cell-type associated regulatory variants. Nat. Commun. (2020) doi:10.1038/s41467-020-14561-0.
34. Phipson, B., Zappia, L. & Oshlack, A. Gene length and detection bias in single cell RNA sequencing protocols. F1000Research (2017) doi:10.12688/f1000research.11290.1.
35. Chen, G., Ning, B. & Shi, T. Single-cell RNA-seq technologies and related computational data analysis. Frontiers in Genetics (2019) doi:10.3389/fgene.2019.00317.
36. Chen, D. & Plemmons, R. J. Nonnegativity constraints in numerical analysis.
in The Birth of Numerical Analysis (2009). doi:10.1142/9789812836267_0008.
37. Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature (2007) doi:10.1038/nature05453.
38. Xin, Y. et al. RNA Sequencing of Single Human Islet Cells Reveals Type 2 Diabetes Genes. Cell Metab. (2016) doi:10.1016/j.cmet.2016.08.018.
39. Patel, G. et al. Molecular taxonomy of human ocular outflow tissues defined by single-cell transcriptomics. Proc. Natl. Acad. Sci. 117, 12856–12867 (2020).
40. Xin, Y. et al. Pseudotime ordering of single human β-cells reveals states of insulin production and unfolded protein response. Diabetes (2018) doi:10.2337/db18-0365.
41. Gutierrez, G. D. et al. Gene signature of proliferating human pancreatic α cells. Endocrinology (2018) doi:10.1210/en.2018-00469.
42. Cerf, M. E. Beta cell dysfunction and insulin resistance. Frontiers in Endocrinology (2013) doi:10.3389/fendo.2013.00037.
43. Maedler, K. & Donath, M. Y. Beta-cells in type 2 diabetes: a loss of function and mass. Hormone research (2004).
44. Donath, M. Y. et al. Mechanisms of β-cell death in type 2 diabetes. Diabetes (2005) doi:10.2337/diabetes.54.suppl_2.S108.
45. Calanna, S. et al. Alpha- and beta-cell abnormalities in haemoglobin A1c-defined prediabetes and type 2 diabetes. Acta Diabetol. (2014) doi:10.1007/s00592-014-0555-5.
46. Kanat, M. et al. The Relationship Between β-Cell Function and Glycated Hemoglobin. Diabetes Care 34, 1006–1010 (2011).
47. Nepton, S. Beta-Cell Function and Failure. in Type 1 Diabetes (2013). doi:10.5772/52153.
48. Dolenšek, J., Rupnik, M. S. & Stožer, A. Structural similarities and differences between the human and the mouse pancreas. Islets (2015) doi:10.1080/19382014.2015.1024405.
49. Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176 (2007).
50. Vieth, B., Parekh, S., Ziegenhain, C., Enard, W. & Hellmann, I. A systematic evaluation of single cell RNA-seq analysis pipelines. Nat. Commun. (2019) doi:10.1038/s41467-019-12266-7.
51. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. (2010) doi:10.1186/gb-2010-11-10-r106.
52. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. (2019) doi:10.1186/s13059-019-1874-1.
53. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nature Biotechnology (2020) doi:10.1038/s41587-019-0379-5.
54. Delignette-Muller, M. L. & Dutang, C. fitdistrplus: An R package for fitting distributions. J. Stat. Softw. (2015) doi:10.18637/jss.v064.i04.
55. Mullen, K. M. & van Stokkum, I. H. M. nnls: The Lawson-Hanson algorithm for non-negative least squares (NNLS). R Packag. version 1.4 (2012).
56. Byrd, R. H., Lu, P., Nocedal, J. & Zhu, C. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. (1995) doi:10.1137/0916069.
57. The R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing (2019).
58. Alessandri-Haber, N. et al. Hypotonicity induces TRPV4-mediated nociception in rat. Neuron (2003) doi:10.1016/S0896-6273(03)00462-8.
59. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. (2017) doi:10.1038/ncomms14049.
60. Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell (2019) doi:10.1016/j.cell.2019.05.031.
61. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. (2018) doi:10.1186/s13059-017-1382-0.
62. van Rossum, G. & Drake, F. L. Python 3 Reference Manual. Scotts Valley, CA (2009).
63. Wickham, H. & Francois, R. dplyr: A Grammar of Data Manipulation. R Packag. version 0.4.2. (2015).
64. Weston, S., Calaway, R. & Tenenbaum, D. doParallel: Foreach Parallel Adaptor for the Parallel Package. Cran (2014).
65. Dowle, M. & Srinivasan, A. data.table: Extension of ‘data.frame’. R Package Version 1.12.8. Manual (2019).

Acknowledgements

We thank Yurong Xin for pointing us to the relevant public data resource. We also thank Gabor Halasz and Yuan Zhu for their advice on algorithm design.

Author contributions

T.Y., Y.B., W.F., N.A.-H., M.L.-F., L.E.M. and G.S.A. designed the research. T.Y., Y.B., and W.F. developed the algorithm. T.Y., Y.B., W.F. and J.K. participated in the data analysis. M.S. and R.B.
performed the DRG tissue collection. C.A. performed the single cell library preparation and sequencing experiment. T.Y., Y.B., N.A.-H. and G.S.A. wrote the manuscript.

Competing interests

T.Y., Y.B., W.F. and G.S.A. have filed a patent application relating to the AdRoit computational framework. M.L.-F. is an employee of Cellular Longevity. All other authors are employees and shareholders of Regeneron Pharmaceuticals, although the manuscript’s subject matter does not have any relationship to any products or services of this corporation.

Figure legends

Fig. 1: Schematic representation of the AdRoit computational framework. a, AdRoit inputs bulk or spatial RNA-seq data, single cell RNA-seq data and cell type annotations. It first selects informative genes and estimates their means and dispersions, based on which the cell type specificity of genes is computed. Depending on multi-sample availability, cross-sample gene variability is estimated from compound data, or single cell samples (dashed arrow). Lastly, the gene-wise scaling factors are estimated using both compound data and single cell data. These computed quantities are fed to a weighted regularized model to infer the transcriptome composition. b, A mock example to illustrate the role of the gene-wise scaling factor. Ideally, the estimated slope (i.e., cell proportion) would be that of the green line; however, direct fitting would result in the red line due to the impact of the outlier genes. Outlier genes can arise when platform differences affect genes differently.
AdRoit adopts an adaptive learning approach that first learns a rough estimate of the slope (red line), then moves the outlier genes toward it such that the more deviated genes are moved more toward the true line (i.e., longer arrows). After the adjustment, the newly estimated slope (blue line) is closer to the truth (green line) and is thus a more accurate estimate.

Fig. 2: Benchmark on simulated bulk data synthesized from trabecular meshwork (TM) single cell data. a, AdRoit’s estimates are closest to the true cell proportions compared to MuSiC and NNLS. Each dot is a cell type from one donor. b, For each cell type in TM, AdRoit has the smallest differences from the true cell type proportion and the smallest variance of estimates across the 8 donors. For each cell type, a dot on the graph denotes a donor, and the bars represent the 1.5 × interquartile ranges. Estimation was done by using the single cell data as reference, leaving out the donor used for synthesizing bulk. c, AdRoit’s estimates are more accurate and specific than MuSiC’s estimates on synthetic bulk that contains partial cell types. The synthetic bulk was simulated by using only 6 out of the 12 cell types per donor, then estimated with the reference of 12 cell types. AdRoit has notably fewer false positive estimates of the 6 cell types not included, and more accurate estimation of the 6 cell types used for synthesizing bulk. d, Receiver operating characteristic (ROC) curve shows AdRoit has a significantly higher AUC than MuSiC (0.95 vs 0.74), meaning better sensitivity and specificity.

Fig. 3: Benchmark on scRNA-seq data from dorsal root ganglion (DRG) where there exist many closely related subtypes of neuronal cells. a, 14 cell types were identified from scRNA-seq
samples of 5 mice, including multiple subtypes of neurofilaments (NF), peptidergic (PEP) and non-peptidergic (NP) neurons. b, Benchmarking with the synthetic data shows AdRoit’s estimation of cell type proportions is highly accurate. In particular, AdRoit achieves reasonably high accuracy when the cells are rare (e.g., < 5%). Each dot represents a cell type from one sample. c, For each individual sample, mAD, RMSD, Pearson and Spearman correlations were computed and compared across three methods. AdRoit has the lowest mAD and RMSD, and the highest Pearson and Spearman correlations. In addition, AdRoit’s estimation is also the most stable across samples. Each dot on the boxplot is a sample. Estimation was done by using the single cell reference, leaving out the sample used for synthesizing bulk.

Fig. 4: AdRoit is more accurate and sensitive than Stereoscope on spatial spots simulated from real DRG cells. a, AdRoit and Stereoscope estimations on simulated spatial spots that contain 5 PEP neuron subtypes. True mixing proportions are denoted by the red dashed lines. Three schemes were simulated: 1) the proportions of the 5 PEP cell types are the same and equal to 0.2; 2) PEP1_Dcn is 0.1 and the other 4 are 0.225; 3) PEP1_Dcn and PEP1_S100a11.Tagln2 are 0.1, PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a are 0.2, and PEP3_Trpm8 is 0.4. In all simulation schemes, AdRoit’s estimates are more consistently centered around the true proportions than Stereoscope’s estimates. b, AdRoit is more accurate in estimating rare cells in spatial spots.
The spots were simulated as mixtures of 3 PEP cell types (i.e., PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a and PEP3_Trpm8), with a series of low percentages of the PEP3_Trpm8 cell type from 1% to 10% and the other two cell types sharing the remaining proportion equally. AdRoit’s estimates are systematically closer to the true simulated proportions than Stereoscope’s estimates. c, AdRoit is consistently more sensitive than Stereoscope in detecting low percent cells (estimates > 0.5% deemed as detected) in simulated spots of 1) low percent of NF_Calb1 mixed with NF_Pvalb and NF2_Ntrk2.Necab2, 2) low percent of NP_Mrgpra3 mixed with NP_Mrgprd and NP_Nts, 3) low percent of PEP3_Trpm8 mixed with PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a, 4) low percent of NF_Calb1 mixed with Th, satellite glia and endothelial, 5) low percent of NP_Mrgpra3 mixed with Th, satellite glia and endothelial, and 6) low percent of PEP3_Trpm8 mixed with Th, satellite glia and endothelial.

Fig. 5: Applications to real bulk human islets RNA-seq data and mouse brain spatial transcriptome data. a, AdRoit’s estimates on real human islets bulk RNA-seq data were highly reproducible for the repeated samples from the same donor. b, AdRoit-estimated cell type proportions agreed with the RNA-FISH measurements. c, AdRoit-estimated Beta cell proportions in type 2 diabetes patients are significantly lower than those in healthy subjects. In addition, the estimated proportions have a significant negative linear association with donors’ HbA1C level.
d, The spatial mapping of 4 mouse brain cell types is consistent with the ISH images of the 4 respective marker genes from the Allen Mouse Brain Atlas37. The 4 genes, Spink8 (marker of hippocampal field CA1), C1ql2 (marker of Dentate Gyrus), Clic6 (marker of Choroid Plexus) and Synpo2 (marker of Thalamus), were identified as markers of the corresponding tissues by Zeisel et al.32.

Extended Data Fig. 1: Benchmark of three methods on human pancreatic islets data. a, Human islets single cell data contains 4 cell types from 18 subjects, including two major cell types, Alpha and Beta cells, and two minor cell types, PP and Delta cells38. The cell proportion varies across different subjects. b, c, AdRoit achieves leading accuracy when applied to the bulk data synthesized from the single cell data. Each dot on the scatterplot is a cell type from one subject. Estimation was done by using the single cell reference, leaving out the subject used to synthesize bulk.

Extended Data Fig. 2: Dorsal root ganglion single cell data shows 14 cell types including 3 subtypes of neurofilament, 3 subtypes of non-peptidergic neurons, and 5 subtypes of peptidergic neurons. a, Heatmap of top markers shows distinction between cell types as well as similarity between subtypes. b, The proportion of each cell type varies from 0.5% to 33.71% across different samples.

Extended Data Fig. 3: Comparison of performance on simulated spatial spots of each of the 14 pure cell types. a, Estimates by AdRoit and b, estimates by Stereoscope are comparably accurate.
Simulations were done by sampling cells from the same cell type and adding up the read counts per gene. For each of the 14 cell types of the DRG tissue, we repeated the simulation 100 times. The results shown are a summary of 100 simulations for each cell type. For both methods, the median estimates of the sampled cell type were close to 1 (red lines), whereas the cell types not sampled had zero or close-to-zero values.

Extended Data Fig. 4: The comparison of AdRoit and Stereoscope on the simulated spots of additional cell mixing schemes. 5 more types of mixed spatial spots were simulated: 1) mixture of 3 neurofilaments (NF); 2) mixture of 3 non-peptidergic (NP) cell types; 3) NF2_Ntrk2.Necab2 mixing with Th, satellite glia and endothelial; 4) NP_Nts mixing with Th, satellite glia and endothelial; and 5) PEP3_Trpm8 mixing with Th, satellite glia and endothelial. Each simulation was repeated 100 times. Consistently for all simulation schemes, AdRoit’s estimates were always closer to the true simulated proportions (red lines), whereas Stereoscope’s estimates largely deviated from the true proportions.

Extended Data Fig. 5: Spatial mapping of 46 cell types with AdRoit quantitatively depicts the content in each spot. Spatial transcriptomics data was downloaded from 10x Genomics (https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Adult_Mouse_Brain_Coronal_Section). The reference single cells were sampled from Zeisel et al.32 and curated into 46 cell types.
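The spot simulations referred to in the legends above (sampling cells of given types and adding up the read counts per gene) can be illustrated with a short Python sketch. It is not taken from the AdRoit code base: the function name, the number of cells per spot (`n_cells`), sampling with replacement, and rounding of per-type cell numbers are all assumptions made for illustration.

```python
import random


def simulate_spot(counts, cell_types, proportions, n_cells=10, seed=0):
    """Build one synthetic spot by summing gene counts of sampled cells.

    counts      -- list of per-cell count vectors (one list of ints per cell)
    cell_types  -- list of cell type labels, parallel to counts
    proportions -- dict mapping cell type label to its mixing proportion
    n_cells     -- assumed total number of cells per spot (not from the paper)
    """
    rng = random.Random(seed)
    n_genes = len(counts[0])
    spot = [0] * n_genes
    for label, prop in proportions.items():
        # Indices of all reference cells annotated with this cell type.
        pool = [i for i, t in enumerate(cell_types) if t == label]
        n_draw = round(prop * n_cells)
        # Assumption: cells are drawn with replacement.
        for i in rng.choices(pool, k=n_draw):
            for g in range(n_genes):
                spot[g] += counts[i][g]
    return spot


if __name__ == "__main__":
    toy_counts = [[1, 0], [2, 0], [0, 3], [0, 5]]
    toy_types = ["A", "A", "B", "B"]
    # A pure spot of type "A": the second gene stays at zero.
    print(simulate_spot(toy_counts, toy_types, {"A": 1.0}, n_cells=4))
```

A pure-cell-type spot corresponds to a single label with proportion 1, and the mixing schemes correspond to several labels whose proportions sum to 1.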
Figures

Fig. 1 to Fig. 5 and Extended Data Fig. 1 to Extended Data Fig. 5 (images not recoverable from the extracted text).
5 1212 1213 1214 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 10_1101-2021_01_06_425544 ---- Complex Systems Analysis Informs on the Spread of COVID-19 Complex Systems Analysis Informs on the Spread of COVID-19 Xia Wang1, Dorcas Washington2, Georg F. Weber3* 1 University of Cincinnati Department of Mathematical Sciences, Cincinnati, OH, USA 2 University of Cincinnati Health Science Library, Cincinnati, OH, USA 3 University of Cincinnati Academic Health Center, Cincinnati, OH, USA * send correspondence to: Georg F. Weber, James L. Winkle College of Pharmacy, University of Cincinnati, 231 Albert Sabin Way, OH 45267-0514. E-mail: georg.weber@uc.edu, phone 513- 558-0947. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425544doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425544 http://creativecommons.org/licenses/by/4.0/ Abstract The non-linear progression of new infection numbers in a pandemic poses challenges to the evaluation of its management. The tools of complex systems research may aid in attaining information that would be difficult to extract with other means. To study the COVID-19 pandemic, we utilize the reported new cases per day for the globe, nine countries and six US states through October 2020. Fourier and univariate wavelet analyses inform on periodicity and extent of change. 
Evaluating time-lagged data sets of various lag lengths, we find that the autocorrelation function, average mutual information and box counting dimension represent good quantitative readouts for the progression of new infections. Bivariate wavelet analysis and return plots give indications of containment versus exacerbation. Homogeneity or heterogeneity in the population response, uptick versus suppression, and worsening or improving trends are discernible, in part by plotting various time lags in three dimensions. The analysis of epidemic or pandemic progression with the techniques available for observed (noisy) complex data can aid decision making in the public health response.
Keywords: COVID-19, epidemiology, new infections, complex systems, autocorrelation, fractal dimension, average mutual information, wavelet analysis
Introduction
The spread of infectious diseases depends on pathogen factors (virulence), host factors (immunity), and, on the population level, on countermeasures taken by the community. Such measures cover a broad spectrum of possible engagements, and they may be highly consequential for the course of an epidemic or a pandemic [1]. The analysis of acute infectious progression in a society is critical for gauging the effectiveness of public health responses, but it is made difficult through the non-linear nature of the underlying process. Conventional approaches of reductionist research or common linearization techniques are not meaningfully applicable.
Various strategies have been employed to account for the complexity of infectious propagation. The spread of COVID-19 has been modeled with machine learning [2], networks of compartments [3] and cellular automata [4]. Power laws have been inferred [5]. Such investigations are of value, even though they are inevitably based on idealizing assumptions. In addition to modeling approaches, the analysis of actually observed data is of critical importance. The numbers in such data sets are noisy, and they are eminently non-linear (also described as “complex data” or “observed chaotic data” [6]). Complex systems research has made techniques and algorithms available to extract information from observed non-linear data series.
The manifestations of the COVID-19 pandemic have varied widely among geographic areas, when compared across countries [3,7,8] as well as across US states [9], depending on when the virus reached them, what the population characteristics were at the time of onset, and what actions were taken in response to the infectious spread.
Here, we set out to investigate underlying patterns. We apply basic tools of complex systems research to compare the spread of COVID-19 in distinct countries, characterized by their varying approaches to the pandemic, from its beginning stages through early or late October 2020. Further, we compare various regions within the USA, which has left major decisions to the individual states. Patterns are discernible in Fourier and wavelet analyses. Order can be detected in time-lagged plots. Therefrom, quantitative measurements are obtainable, including autocorrelation, average mutual information, fractal dimension, and embedding dimension, which inform on the pandemic progression.
Methods
Source data: Here we analyze the new infections per day, either as absolute numbers or as rates per 10,000 inhabitants. The source data utilized for the present analysis came from Bing COVID-19 Tracker (www.bing.com/covid).
Fourier spectrum and univariate wavelet analysis: Fourier analysis evaluates the spectral density of the relative numbers of new infections (case rates per 10,000 inhabitants) versus frequency or versus period. Wavelet analysis does not assume stationarity in the time series. Thus, it allows the study of localized periodic behavior. In particular, we look for regions of high power in the frequency-time plot. The calculations for wavelet analyses of new infections were done in R. In WaveletComp, the null hypothesis that there is no periodicity in the series is tested via p-values obtained from simulation, where the model to be simulated can be chosen from a range of options [10]. The algorithm analyzes the frequency structure of uni- or bivariate time series using the Morlet wavelet. The time series to be analyzed was standardized, after detrending, in order to obtain a measure of the wavelet power that is relative to unit-variance white noise and directly comparable to results of other time series. Detrending is accomplished using polynomial regression. Where indicated, all graphs are normalized to the same y-axis scale.
Bivariate wavelet analysis: We conducted bivariate analysis of lagged data (t versus t+7 or t+14 or t+21) for joint periodicity. The concepts of cross-wavelet analysis provide tools for comparing the frequency contents of two time series as well as for drawing conclusions about their synchronicity at certain periods and across certain ranges of time.
While cross-wavelet power corresponds to covariance in the time domain, wavelet coherence is a time-series measure similar to correlation. Two waves are coherent if they have a constant relative phase. The bivariate analysis results include the cross-wavelet power plot, the wavelet coherence plot, the average power plot and the phase difference image. The cross-wavelet power and coherence plots contain arrows showing the area of significant joint periods (significance level = 0.05). The direction of these arrows indicates the direction of phase differences. Up-right pointing arrows indicate that the two series are in phase and the x(t) series leads, while down-right pointing arrows indicate that the two series are in phase and the x(t+n) series leads. Similarly, up-left pointing arrows express that the two series are out of phase and the x(t+n) series leads, while down-left pointing arrows express that the two series are out of phase and the x(t) series leads. The arrows are only plotted within white contour lines indicating significance at the 10% level. A more explicit global view of the phase difference can be produced, with values in (π/2, π) and (-π, -π/2) indicating out of phase and values in (-π/2, π/2) indicating in phase. The time-averaged cross-wavelet power provides a summarized view on the shared periods, the corresponding power and the statistical significance. Cross-wavelet plots may mark areas significant due to one series swinging widely, rather than two series sharing a joint period.
To avoid this false positive readout, it is more appropriate to examine wavelet coherence plots. Wavelet coherence is analogous to a coefficient of correlation. It has a value range between 0 and 1 and it shows statistical significance only in areas where the two series actually share jointly significant periods.
Return plots: From the total numbers of new infections, we generated return plots with increasing lags, plotting daily changes x(t+1), …, x(t+7) versus x(t) and weekly changes x(t+14), …, x(t+49) versus x(t). Short time lags tend to cluster around the 45° angle, whereas increasing time delays reveal the structure of the oscillations. When graphed in 3 dimensions, these diagrams can aid in reconstructing the underlying attractor.
Autocorrelation: A time series sometimes repeats patterns or has other properties, whereby earlier values display some relation to later values. The autocorrelation statistic (serial correlation statistic) measures the degree of that affiliation as it refers to linear dependence. The magnitude of its dimensionless number reflects the extent of similarity. The autocorrelation R_m at lag m is the ratio of autocovariance to variance,
$$\text{autocorrelation} = \frac{\text{autocovariance}}{\text{variance}}$$
$$R_m = \frac{\frac{1}{N}\sum_{t=1}^{N-m}(x_t - \bar{x})(x_{t+m} - \bar{x})}{\frac{1}{N}\sum_{t=1}^{N}(x_t - \bar{x})^2}$$
where N is the number of observations and \bar{x} is the mean of the series. Autocorrelation coefficients range from -1 to +1, with +1 indicating perfect synchrony and -1 reflecting exact mirror images. An absence of any correlation yields R_m = 0.
Box counting dimension: The dimension of a fractal is best described as a non-integer. The dimension is a quantitative measure for the evaluation of the geometric complexity of objects. A general relationship assumes
$$\text{dimension} \propto \frac{\log(\text{number of increments})}{\log(1/\text{scale size})}$$
Here, the characteristic of dimension is that it specifies the rate at which the number of increments varies with scale size. We calculated the box counting dimension after binning into 16 x 16 squares of 2-dimensional return plots with various lags.
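The return plot, autocorrelation and box counting steps described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the authors' implementation (the study's calculations were done in R). The function names, the synthetic `ramp` series, and the choice of estimating the dimension as a least-squares slope over grids of 2, 4, 8 and 16 boxes per side are assumptions made for the example.

```python
import math


def autocorrelation(x, m):
    """R_m: lag-m autocovariance divided by the variance, both scaled by 1/N."""
    n = len(x)
    mean = sum(x) / n
    autocov = sum((x[t] - mean) * (x[t + m] - mean) for t in range(n - m)) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    return autocov / variance


def return_plot(x, lag):
    """Points (x(t), x(t+lag)) of a 2-dimensional return plot."""
    return [(x[t], x[t + lag]) for t in range(len(x) - lag)]


def occupied_boxes(points, k):
    """Count the cells of a k x k grid over the data range that hold a point."""
    values = [v for p in points for v in p]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    cells = set()
    for px, py in points:
        i = min(int((px - lo) / span * k), k - 1)
        j = min(int((py - lo) / span * k), k - 1)
        cells.add((i, j))
    return len(cells)


def box_counting_dimension(points, grids=(2, 4, 8, 16)):
    """Least-squares slope of log(box count) against log(1/scale) = log(k).

    The set of grid sizes is an assumption for this sketch.
    """
    xs = [math.log(k) for k in grids]
    ys = [math.log(occupied_boxes(points, k)) for k in grids]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den


# Hypothetical series: a steady linear ramp-up of daily new cases.
ramp = [10 * t for t in range(100)]
print(autocorrelation(ramp, 1))
print(box_counting_dimension(return_plot(ramp, 7)))
```

For a steadily rising series such as `ramp`, the return plot is a straight line, so the estimated dimension stays near 1 and the lag-1 autocorrelation stays near 1.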
Average mutual information: The average mutual information (ami) represents a non-linear correlation function, which indicates how much common information is shared by the measurements of x(t) and x(t+n). The average mutual information was calculated with the mutual function of the R package tseriesChaos. It estimates the mutual information index for a specified number of lags. The joint probability distribution function is estimated with a simple bi-dimensional density histogram.
Embedding dimension: Using the R package nonlinearTseries, we first apply the timeLag function to determine the optimal time lag τ based on the average mutual information and then the estimateEmbeddingDim function to assess the optimal embedding dimension m. The optimal set of regressors related to x(t) is then x(t-τ), …, x(t-(m-1)τ), x(t-mτ).
Results
1. Comparison across Countries
Across countries, a wide spectrum of measures was taken to curb the spread of SARS-CoV-2. This resulted in a range of very different progression curves when graphing the numbers of new infections over time (Figure 1).
India, Brazil, Sweden, Italy and the United States have been considered hard-hit, each for its own internal reasons. France, Germany, Poland (over a long period), and South Korea had tighter control and a less aggressive spread. All curves display close to linear ramp-up phases, followed by more or less irregular oscillations. The levels of success at suppressing the new infection rates diverged among countries, and several are experiencing a second peak. Wavelet methodology aids in studying periodic phenomena in time series, particularly in the presence of potential frequency changes over time. For cross-country evaluations, all graphs were plotted on the same scale (Figure 2A). Each country was also plotted on its own scale (Figure 2B). The univariate analysis of the time course for the countries under study shows prominence of the recent upswing in France (heat intensity on the right margin of the graph). By contrast, there is comparatively more successful management by Italy, Germany, Poland and South Korea through October 2020. India, Brazil, Sweden, and the United States display cyclical fluctuations of various durations, none of which have been contained. A period of 7 days is prominent in the fluctuations of most countries, which may reflect real cyclicity or weekly reporting habits. The worldwide data are displayed in Figure S1. For cross-country comparisons, we converted the new infection total numbers to new infection rates by relating them to 10,000 members of the population (Figure 3A). Similarly, complex systems can be analyzed with Fourier analysis. We first plotted Fourier power spectra versus frequency for the rates of new infections (Figure 3B). Spectral density range (high in Brazil, low in South Korea) and frequency distribution provide a readout for infectious spread.
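The conversion to rates per 10,000 inhabitants and the power spectrum versus frequency can be illustrated as follows. This is a minimal sketch in standard-library Python, not the authors' R workflow. The function names, the synthetic weekly series, the hypothetical population figure, and the normalization of the power (squared DFT magnitude divided by N, after removing the mean) are assumptions; other periodogram conventions differ by constant factors.

```python
import cmath
import math


def rate_per_10k(cases, population):
    """Relate daily new cases to 10,000 members of the population."""
    return [c / population * 10_000 for c in cases]


def periodogram(x):
    """Return (frequency in cycles per day, power) pairs for k = 1 .. N//2.

    Uses a direct discrete Fourier transform of the mean-removed series,
    which is adequate for series of a few hundred daily values.
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        spectrum.append((k / n, abs(coeff) ** 2 / n))
    return spectrum


# Hypothetical daily counts with a 7-day reporting cycle, population 1,000,000.
cases = [100 + 10 * math.sin(2 * math.pi * t / 7) for t in range(70)]
rates = rate_per_10k(cases, 1_000_000)
peak_frequency, _ = max(periodogram(rates), key=lambda fp: fp[1])
print(1 / peak_frequency)  # dominant period in days
```

Plotting power against `1 / frequency` instead of `frequency` gives the view versus period, in which a weekly reporting rhythm appears as a peak at 7 days.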
The spectral density of the normalized rates (identically scaled y-axes) (Figure 3C) confirmed good management of the pandemic spread in Germany, Poland, and South Korea (and to some degree in Italy). Despite the progressive increase in the numbers of infections in India, on a population basis, control has apparently not been lost through October 2020. By contrast, the power spectra for Brazil, Sweden, and France are reflective of potentially adverse developments. The United States display an anomaly with a periodic behavior that has a prominent cycle around 100 days.
To gain a better understanding of the dynamics of disease spread, we investigated progressive numbers of new infections in comparison to their increasing time lags. This approach may reveal periodicities or aid in the visualization of attractors. Expectedly, short time delays were associated with little change. With a lag time of about 7 days onward, distinct patterns emerged among countries. According to bivariate wavelet analysis for time-delayed data series (including the cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image), Italy, Germany and South Korea shared significant joint periods of 1-2 months in the comparison x(t) versus x(t+7). South Korea has comparatively high power and significant shared periods around 3 weeks at the early stage; later, the significant shared periods are also 1-2 months. The remaining countries all have segments of shorter periods (around 7 days) and longer periods shared.
For India, Brazil, France, USA and Poland, the shared 7-day period appears significant only in the later part of the series. Similar results are observed in the analyses for x(t) versus x(t+14) and x(t) versus x(t+21). The phase difference plots show that in the shared longer periods, x(t) is mostly in phase with x(t+7), while the series gradually become out of phase in x(t) versus x(t+14) and x(t) versus x(t+21), thus making longer lags more discriminating and informative (Figure 4A and Figure S2A,B). A reduction in cross-wavelet power levels is apparent in Italy, Germany and South Korea. Poland and France are experiencing recent increases. India, Brazil and the USA have had protracted periods of high cross-wavelet power levels. Containment is associated with longer periodicity in the distribution of cross-wavelet power. This is the case for South Korea, Germany and Italy. High cross-wavelet power around a periodicity of 7 days is reflective of poor control. To generate informative return plots, we utilized 3 dimensions, which allows for the visualization of two lags from x(t) (or from a later start point) and may reveal the pattern of an attractor. In this depiction, a rapid increase or decrease in new infections is reflected in a close-to-straight line, oscillations generate a near-toroid attractor, while successful management shrinks the torus and moves it closer to the origin. Initially, we evaluated multiple time delays. Most discriminating were x(t)/x(t+7)/x(t+14), x(t+3)/x(t+7)/x(t+14), and x(t+5)/x(t+14)/x(t+28) (Figure 4B). The progressive increase in new cases over the time period in India is reflected in a predominantly linear curve on each scale. The wide fluctuations in Brazil generate a largely disordered appearance. Disorder is also apparent in Sweden. France initially managed the pandemic well, but is experiencing a dramatic upswing, which obscures order.
Cyclical patterns, suggesting the outlines of attractors, are apparent in USA, Italy, Germany, and South Korea (where most data points are concentrated near the origin). Poland initially displayed a well-contained attractor, but the recent substantial upswing in new infections is reflected in a linear progression from there (for separate analyses of the two phases, see Figure S3). We also calculated the embedding dimensions for the lagged data (Figure 4C). Germany has the highest embedding dimension of 10, followed by Poland with 9. Several countries have an embedding dimension of 7, including Brazil, Sweden, USA and South Korea. Italy and France have an embedding dimension of 5. India is unusual due to its longer lag period of 24 days. When the lag period is set at 7 days, the embedding dimension of India is also equal to 7. For the worldwide data, the calculated embedding dimension is 7 with a time lag of 1 (not shown).
The autocorrelation of two data strings with short time lags is expected to be high (approaching 1.0) because there is little opportunity for dramatic change (high infection rates on day t likely produce similarly high numbers on the consecutive day t+1, while low numbers are followed by few new infections on the next day). Autocorrelation may remain high for extended lags in the initial ramp-up and at the oscillatory stage, depending on the regularity of the fluctuations.
A society that succeeds in curbing the disease spread will leave the highly correlated initial ramp-up and consecutive oscillatory phases, thus displaying a gradual decrease in values at the longer lags. The decline in the autocorrelation numbers of progressively lagged data by country appeared to be reflective of the stringency with which the pandemic was addressed (Figure 5A). From a lag of 6 onward, Poland and South Korea have substantially declining values (although due to the recent steep upswing in new infections, Poland deviates from the trend at very long lags); Germany shows a dramatic lowering at a lag of 21 and above. By contrast, India and Brazil stay uniformly high. So do the global numbers, which are inherently heterogeneous.
The average mutual information reflects information shared by the measurements of x(t) and x(t+n). Expectedly, it declines with increasing lag. Poland starts with a relatively low value (1.15 at t versus t+1) and shows a rapid decrease with longer lag. It then stays at a low level of around 0.15 from lags of 21 to 49 days. France displays a gradually decreasing trend with the average mutual information starting at 1.60 and ending at 0.34 at the lag of 49 days. India shows a pattern similar to that of France but with much higher average mutual information (due to the constant uptick in numbers), ranging between 2.61 and 1.37. Four other countries, Germany, USA, Sweden and Brazil, all express relatively flat average mutual information values, staying around levels of 2.20 for the USA and Brazil, 1.5 for Germany, and 1.3 for Sweden. Reflecting progressively improved control, Italy and South Korea also have decreasing trends, but much flatter, at 1.96 to 1.36 for Italy and 1.26 to 0.66 for South Korea, respectively (Figure 5B).
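How a histogram-based average mutual information between x(t) and x(t+n) can be computed is sketched below in standard-library Python. This is not the tseriesChaos implementation used in the study. The function name, the default of 16 bins per axis, the use of natural logarithms, and the synthetic series are assumptions made for illustration.

```python
import math
from collections import Counter


def average_mutual_information(x, lag, bins=16):
    """Mutual information (in nats) between x(t) and x(t+lag).

    The joint distribution is estimated with a two-dimensional histogram
    of equal-width bins; the marginals are the row and column sums.
    """
    lo, hi = min(x), max(x)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series

    def cell(v):
        return min(int((v - lo) / span * bins), bins - 1)

    pairs = [(cell(x[t]), cell(x[t + lag])) for t in range(len(x) - lag)]
    n = len(pairs)
    joint = Counter(pairs)
    marginal_now = Counter(a for a, _ in pairs)
    marginal_later = Counter(b for _, b in pairs)
    return sum(
        (count / n)
        * math.log(count * n / (marginal_now[a] * marginal_later[b]))
        for (a, b), count in joint.items()
    )


# Hypothetical series: a noiseless 14-day oscillation in daily new cases.
series = [100 + 50 * math.sin(2 * math.pi * t / 14) for t in range(140)]
for lag in (1, 7, 14, 21):
    print(lag, average_mutual_information(series, lag))
```

Because the estimate depends on the number of bins and on the length of the series, values computed this way are comparable only among series that were processed with the same settings.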
A rapid increase in new infections is reflected in a small fractal dimension (practically approximated by the box counting dimension with values between 1 and 2) of the 2-dimensional return plots with progressive lags. Intermediate phases are characterized by higher fractal dimensions (approaching 2), depending on the nature of the oscillations. Conversely, successful management through the reduction in new infections should be reflected in a contraction of the attractor on the return plot, which is assessable through the box counting dimension. A trend is displayed in the comparisons from shorter to longer lag periods. Distinct management strategies across different countries generate a heterogeneous pattern worldwide, rendering the fractal dimension high regardless of the lag in x(t+n) versus x(t) plots. Steep increases in new infections (Poland, India) have dimensions close to 1. Intermediate phases are characterized by higher numbers. Successful fights against the pandemic (South Korea) result in declining dimensions with increasing lag (Figure 5C).
2. Comparison across US States
Within the USA, individual states have encountered a rather wide range of progression phenotypes in the spread of new COVID-19 infections (Figure 6). This is due to variations in international connectedness and population density (reflected in the early peaks in the Northeastern states New York and Massachusetts), holiday travel (Florida), policy decisions and other factors. Wavelet analysis of new infections (one scale across all states) shows good control (right side of the graph) after initial affliction (left area) for Massachusetts and New York, which, having had early spikes in new infections, achieved good success in containment.
Through the observation period, control has not been maintained in Ohio. The periodicity in individual states (each on their own scales) is poorly defined, except for Florida and Ohio, where 7 days yield a prominent signal (Figure 7A,B). We normalized the new infection numbers to rates by relating them per 10,000 inhabitants (Figure 8A). Figure 8B shows the periodogram for the 6 states under investigation with frequencies between 0 and 0.10 (the graph is almost flat for the higher frequencies). There exist clear heterogeneous patterns in the comparison among these states. New York and Massachusetts display steadily decreasing spectral density values from the longest period to around 1-2 weeks (corresponding to a frequency range around 0.07-0.14). Florida and Texas share similar patterns with a few low spikes in their periodograms after the first 3 highest ones. The graph for California flattens out after the lowest three frequencies, with the longest period (the whole series) having the highest value. Ohio’s pattern is quite unique with fluctuating values from the longest periods through around 5-6 weeks. The Fourier power spectrum for the infection rates (Figure 8C) indicates similar periodic patterns as in the periodograms of Figure 8B. These patterns are less prominent due to the adjustment to the same y-axis scale (the scale reflects the magnitude of the positive rates, the shape shows the evolution of the disease). We conducted bivariate wavelet analysis on the time-lagged data (Figure 9A and Figure S4).
The shared synchronicity segments between x(t) and x(t+n) can be grouped into shorter periods (around 7 days) and longer periods (approximately 3 weeks, 1 month, 2 months). New York does not display substantial joint short periods. Ohio and Texas mainly have correlation at the end of the series around the 7-day period. Massachusetts experiences joint periodicity around the 7-day period at the early stage of the series. Florida and California have joint periods in the middle of the observation time frame. The levels of average cross-wavelet power are higher in states with poor control (x-axes scales for Florida, Ohio). The peak power shifts toward higher periodicity with improved control (y-axes scales for New York, Massachusetts). The return plots in 3 dimensions, utilizing the same time lags as for the countries, seemed to reflect contraction of the attractor in Massachusetts, cyclicity in New York, Florida and California, no containment in Texas, and an ejecting diagonal in Ohio which may reflect exacerbation (Figure 9B). The embedding dimension varies among states, such that the most contained states (New York, Massachusetts) have the lowest embedding dimension (Table 1). The autocorrelation for return plots of increasing lags shows a progressive decline in the numbers of New York and Massachusetts, which implemented strong containment measures after having been afflicted early. The values decline less steeply for Texas and California. Ohio displays an anomaly with increasing values for very long lags. The state, while not heavily afflicted on a per capita basis, never achieved containment, only a stationary level, and has since experienced another wave (Figure 10A). Up to a maximum lag of 49 days, the average mutual information for the 6 US states under study ranges between 1.0 and 2.0. Overall, all states show a slightly decreasing pattern except for California, which is relatively level at a value of 2.0 (Figure 10B).
Unexpectedly, the box counting dimension (Figure 10C) is less discerning than it was for the evaluation across countries. This may be due to the much lower power conveyed by smaller population sizes.
Discussion
In the present investigation we find that the analysis tools for observed complex data can aid in the interpretation of pandemic spread across communities. Difficulties in analyzing the non-linear patterns of infectious disease spread may be tamed by applying the tools of complex systems research. The approach can reveal patterns where a simple time course of new cases does not. Further, non-linear analysis allows the study of various facets of the process, depending on whether the starting data are new cases, hospitalizations, deaths or other readouts. Maps can be generated and evaluated for their fractal dimensions [11]. The operational approximation of Lyapunov exponents may be meaningful, although they were largely uninformative for the present study (Supplemental Figure S5).
Among the countries analyzed, South Korea has had the most successful control of the pandemic spread according to low intensity in univariate wavelet analysis, low spectral density range in Fourier analysis, low spectral density of the normalized rates, a reduction in cross-wavelet power levels according to bivariate wavelet analysis and longer periodicity in the distribution of cross-wavelet power.
Further, declining box counting dimensions, autocorrelation values with increasing time lag, and decreasing trends (at a low slope) in average mutual information confirm containment. Cyclical patterns in return plots, suggesting the outlines of attractors, are apparent and most data points are concentrated near the origin of the graph. Germany exhibited good management through October 2020 according to univariate wavelet analysis, spectral density in the power spectrum of the normalized rates, a reduction of cross-wavelet power levels in bivariate wavelet analysis, longer periodicity in the distribution of cross-wavelet power, a dramatic lowering of autocorrelation values at a lag of 21 and above, and relatively flat average mutual information values, staying around levels of 1.5. Cyclical patterns in return plots suggest the outlines of an attractor. Good control by Italy consecutive to the early impact and through October 2020 is reflected in low intensity and fluctuation when applying univariate wavelet analysis, in a reduction of cross-wavelet power levels for bivariate wavelet analysis of time-delayed data, longer periodicity in the distribution of cross-wavelet power, and decreasing trends (at a low slope) in average mutual information. Cyclical patterns in return plots, suggesting the outlines of an attractor, are apparent. Poland had two distinct phases. By univariate wavelet analysis and density in the power spectrum of normalized rates, there was indication of good management through October 2020. According to bivariate wavelet analysis for time-delayed data series and return plots, the recent substantial upswing in new infections is reflected, which also results in box counting dimensions close to 1. From a lag of 6 onward, Poland
It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.06.425544doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425544 http://creativecommons.org/licenses/by/4.0/ has substantially declining autocorrelation values, although due to the recent steep upswing in new infections, the trend reverses at very long lags. The average mutual information starts with a relatively low value (1.15 at t versus t+1) and shows a rapid decrease with longer lag, staying level from lags of 21 to 49 days. In the United States, univariate wavelet analysis displays cyclical fluctuations of various durations, none of which have been contained. According to bivariate wavelet analysis for time-delayed data series, there have been protracted periods of high cross- wavelet power levels. Cyclical patterns in return plots, suggesting the outlines of attractors, are apparent. The USA expresses relatively flat average mutual information values, staying around levels of 2.20. In France, univariate wavelet analysis of the time course shows prominence of the recent upswing (heat intensity on the right margin of the graph), the power spectrum is reflective of potentially adverse developments. The second wave of infections is apparent in bivariate wavelet analysis and in the obscured order in return plots. France displays a gradually decreasing trend of average mutual information. India expresses cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. On a population bases, the spectral density suggests that control has not been lost through October 2020. Bivariate wavelet analysis shows protracted periods of high cross-wavelet power levels, return plots reflect the progressive increase in new cases over the time period in a predominantly linear curve on each scale, box counting dimensions are close to 1, and autocorrelation values stay uniformly high with increasing time lag. 
India displays a gradually decreasing trend of average mutual information.

Brazil experiences cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. By Fourier analysis, the spectral density range is high. The power spectrum is indicative of potentially adverse developments. According to bivariate wavelet analysis, there have been protracted periods of high cross-wavelet power levels. In return plots, the wide fluctuations generate a largely disordered appearance. The autocorrelation values stay uniformly high. Brazil expresses relatively flat average mutual information values, staying around levels of 2.20.

Sweden shows cyclical fluctuations of various durations in univariate wavelet analysis, none of which have been contained. The power spectrum is reflective of potentially adverse developments. In return plots, disorder is apparent. Sweden expresses relatively flat average mutual information values.

Prima facie, the curves of new infections versus time for three Western European countries, France, Italy, and Germany, appear similar. Complex systems analysis reveals the upswing in France to be much more perilous than the increases in the curves of new infections in the other two countries. The management of infectious spread also requires improvements in the United States, Sweden and Brazil.

The selection of the observation period can dramatically influence the results. Poland was initially very successful in containing the pandemic, but then experienced a substantial upswing.
Analyzing these two phases individually or in conjunction yields very different data sets, which inform about distinct aspects of the infectious progression.

The fluctuations of new infections in an epidemic or a pandemic make it difficult to evaluate whether a decline reflects true containment (“rounding the corner”) or just the calm before another wave. The readouts of non-linear systems analysis can aid in making such a distinction. A complex occurrence that experiences containment will strive toward a point attractor in phase space and move toward the origin. Such a progression is represented in a declining fractal dimension, and the transition from fluctuations (often associated with a torus attractor) toward limitation of new cases is expected to reduce the autocorrelation.

One constraint of complex systems analysis is the need for large data sets. In this regard, the availability of about 230 data points (daily new cases March through October 2020) for each geographic area in this study is somewhat low. The robustness of pertinent studies increases with larger data sets over time. Reporting errors could have a non-trivial impact, and may be reflected in the frequent occurrence of a peak at 7 days in the spectral analysis (possibly indicating weekly totals). This problem can be addressed by utilizing moving averages. The homogeneity or heterogeneity in management by the community under study determines the noise level. The worldwide numbers of new infections have considerable background noise due to varying patterns across countries.
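The moving-average remedy mentioned above for the 7-day reporting peak can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: the function name `trailing_moving_average`, the 7-day window, and the toy daily counts are all assumptions made for the example.

```python
def trailing_moving_average(counts, window=7):
    """Return the trailing moving average of a daily count series.

    The first (window - 1) days are dropped because no full window
    is available for them yet.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    running_sum = 0.0
    for i, value in enumerate(counts):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= counts[i - window]
        if i >= window - 1:
            averaged.append(running_sum / window)
    return averaged


# Hypothetical counts with a weekly reporting artifact (low weekend numbers).
weekly_pattern = [100, 120, 130, 125, 110, 40, 30]
daily_cases = weekly_pattern * 3

# A 7-day window always covers one full reporting cycle, so the weekly
# oscillation cancels out of the smoothed series.
smoothed = trailing_moving_average(daily_cases, window=7)
```

A window equal to the reporting period is the natural choice here, since each averaged value then contains exactly one sample from every weekday.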
Acknowledgements

GFW is supported by NIH grant CA224104.

References

[1] Christakis NA. Apollo’s Arrow: The Profound and Enduring Impact of Coronavirus on the Way We Live. New York (Hachette Book Group) 2020.
[2] Mehta M, Julaiti J, Griffin P, Kumara S. Early Stage Machine Learning–Based Prediction of US County Vulnerability to the COVID-19 Pandemic: A Machine Learning Approach. JMIR Public Health and Surveillance 2020;6:e19446.
[3] Wang K, Ding L, Yan Y, Dai C, Qu M, Jiayi D, Hao X. Modelling the initial epidemic trends of COVID-19 in Italy, Spain, Germany, and France. PLoS ONE 2020;15:e0241743.
[4] Bin S, Sun G, Chen C-C. Spread of Infectious Disease Modeling and Analysis of Different Factors on Spread of Infectious Disease Based on Cellular Automata. Int J Environ Res Public Health 2019;16:4683.
[5] Blasius B. Power-law distribution in the number of confirmed COVID-19 cases. Chaos 2020;30:093123.
[6] Abarbanel HDI. Analysis of Observed Chaotic Data. Switzerland (Springer Nature) 1995.
[7] Chakraborty I, Maity P. COVID-19 outbreak: Migration, effects on society, global environment and prevention. Science of the Total Environment 2020;728:138882.
[8] Bertacchini F, Bilotta E, Pantano PS. On the temporal spreading of the SARSCoV-2. PLoS ONE 2020;15:e0240777.
[9] White ER, Hébert-Dufresne L. State-level variation of initial COVID-19 dynamics in the United States. PLoS ONE 2020;15:e0240648.
[10] Roesch A, Schmidbauer H. WaveletComp: Computational Wavelet Analysis. R package version 1.1. 2018. https://CRAN.R-project.org/package=WaveletComp
[11] Păcurar C-M, Necula B-R. An analysis of COVID-19 spread based on fractal interpolation and fractal dimension. Chaos, Solitons & Fractals 2020;139:110073.

Tables and Figures

Figure 1: Time-Course of Disease Spread by Country. Numbers of new cases, x(t), per day versus time (t, indicating the date). Shown are the curves for (top to bottom, left to right) the globe, India, Brazil, Sweden, Italy, USA, France, Germany, Poland, and South Korea. Note the different scales of the y-axes.

Figure 2: Univariate Wavelet Analysis. Cross-Wavelet Power Spectrum in the Time-Period Domain. The x-axis (index) displays the time progression, whereas the y-axis depicts the length of the period. White contour lines indicate significance of periodicity on the 0.1 level for probability of error. Lines represent the ridge of cross-wavelet power. The color bar reveals the power gradient. A) All countries on the same scale. B) Each country on its own scale.

Figure 3: Fourier Analysis. A) New Infection Rates. Daily reported new numbers of infections divided by 10,000 inhabitants. The x-axis shows the calendar date. B) Power Spectrum. Fourier power spectra versus frequency for new infections per 10,000 inhabitants per day in each of 9 countries. C) Normalized Power Spectrum. Spectral density (y-axis) versus period in days (x-axis) for infection rates per 10,000 inhabitants.
The curve shows the smoothed spectral density estimates. All y-axes have the same scale.

Figure 4: Time-Lagged Data Analysis. A) Bivariate Wavelet Analysis. Shown are cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right in each row). Time-lagged data were used for x(t)/x(t+14) (for the lags x(t)/x(t+7) and x(t)/x(t+21) see Figure S2). White contour lines indicate significance for joint periodicity, black arrows depict the phase difference in the areas with significant joint periods. The solid red dots on the average power plot (the third from the left) depict significant joint periods at a probability of error of 0.1. Where shown, the color bars reveal the ranges of cross-wavelet power levels. B) Return Plots in 3 Dimensions. Time-lagged return plots in 3 dimensions are shown, from left to right, for x(t)/x(t+7)/x(t+14), x(t+3)/x(t+7)/x(t+14), and x(t+5)/x(t+14)/x(t+28). Each country of interest has its own row. C) Embedding Dimension. The plots show how Cao’s algorithm uses 2 functions in order to estimate the embedding dimension from the time series (the E1(d) and E2(d) functions), where d denotes the dimension.

Figure 5: Readouts of Complexity for Lagged Data on COVID-19 Spread by Country. A) Autocorrelation. Bar graph of the autocorrelation in COVID-19 spread with each bar color representing a different country. The selected time lags are indicated on the x-axis; all are calculated versus x(t). B) Average Mutual Information.
Bar graph of average mutual information in COVID-19 spread with each bar color representing a different country. The selected time lags are indicated on the x-axis; all are calculated versus x(t). C) Fractal Dimensions. Box counting dimensions are calculated for 2-dimensional return plots of increasing lags, x(t+7) versus x(t) through x(t+35) versus x(t). Nine countries are evaluated, and the worldwide numbers are shown on the left. Poland is represented twice, over the entire evaluation period through 12 October 2020 (which contains a steep incline) and over the shorter phase of containment through 18 September 2020 (cont. = contained period).

Figure 6: Time-Course of Disease Spread for Individual US States. Numbers of new cases, x(t), per day versus time (t, indicating the date). Shown are the curves for (top to bottom, left to right) Massachusetts, New York, Florida, Texas, California, and Ohio.

Figure 7: Univariate Wavelet Analysis. Wavelet power spectrum in the time-period domain. Contour lines indicate significance of periodicity at a significance level of 0.1. Black lines indicate the ridge of wavelet power. The color bar reveals the power gradient. A) All states on the same scale. B) Each state on its own scale.

Figure 8: Fourier Analysis. A) New Infection Rates. Daily reported new numbers of infections divided by 10,000 inhabitants (infection rates). The x-axis shows the calendar date. B) Power Spectrum. Periodogram plot on the series of the new infection rates. The x-axis is the frequency (per day) and the y-axis represents the spectral density. The y-axis ranges vary among graphs. C) Normalized Power Spectrum. Spectral density versus period (in days) for infection rates. All y-axes have the same scale.

Figure 9: Time-Lagged Data Analysis by US State. A) Bivariate Wavelet Analysis.
Shown are cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). Time-lagged data were used for x(t)/x(t+14) (for the lags x(t)/x(t+7) and x(t)/x(t+21) see Figure S4). White contour lines indicate significance of joint periodicity, black arrows indicate the phase difference in the areas with significant joint periods. The solid red dots on the average power plot (the third from the left) reflect significant joint periods at a significance level of 0.1. B) Return Plots in 3 Dimensions. Time-lagged return plots in 3 dimensions are shown, from left to right, for x(t)/x(t+7)/x(t+14), x(t+3)/x(t+7)/x(t+14), and x(t+5)/x(t+14)/x(t+28). Each state under investigation has its own row.

Figure 10: Readouts of Complexity for Time-Lagged Data by U.S. State. Six US states have been evaluated. A) Autocorrelation. Bar graph of the autocorrelation in COVID-19 spread with each bar color representing a different US state. The selected time lags are indicated on the x-axis; all are calculated versus x(t). B) Average Mutual Information. Bar graph of average mutual information in COVID-19 spread with each bar color representing a different state. The selected time lags are indicated on the x-axis; all are calculated versus x(t). C) Fractal Dimensions. Box counting dimensions are calculated for 2-dimensional return plots of increasing lags, x(t+7) versus x(t) through x(t+35) versus x(t).

Table 1: Embedding Dimension for Time-Lagged Data by U.S. State.
Embedding dimensions were calculated according to Cao’s algorithm, which uses 2 functions in order to estimate the embedding dimension from the time series. The table shows the calculated time lags and embedding dimensions for each U.S. state under study.

Supplement

Figure S1: Power Spectrum and Univariate Wavelet Analysis for Worldwide New Cases. A) Wavelet analysis and model fit (minimum power level: 0, significance level: 0.05, only coi: false, only ridge: false). B) Fourier analysis.

Figure S2: Bivariate Wavelet Analysis by Country. The graphs represent cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). Time-lagged data were used for x(t)/x(t+7) (A) and x(t)/x(t+21) (B). White contour lines depict significance of joint periodicity. Black arrows reflect the phase difference in the areas with significant joint periods. The solid red dots on the average power plot (the third from the left) indicate significant joint periods at a probability of error of 0.1. The color bars reveal the cross-wavelet power levels.

Figure S3: Return Plots in 3 Dimensions for Poland. New infections per day. Top) Entire Observation Period. 10th March 2020 through 7th November 2020. Middle) Contained Phase. Partial time frame through 18th September 2020. Bottom) Exacerbating Phase. Partial time frame from 1st September 2020.

Figure S4: Bivariate wavelet analysis by US state.
The graphs display cross-wavelet power plot, wavelet coherence plot, average power plot and phase difference image (from left to right on each row). Time-lagged data were used for x(t)/x(t+7) (A) and x(t)/x(t+21) (B). White contour lines indicate significance of joint periodicity. Black arrows indicate the phase difference in the areas with significant joint periods. The solid red dots on the average power plot (the third from the left) indicate significance at a level of 0.1.

Figure S5: Evolution of Lyapunov exponents over time. For a discrete mapping x(t+1) = F(x(t)) we calculate the local expansion of the flow by considering the difference of 2 trajectories. The Lyapunov characteristic exponent can be approximated as $\lambda \approx \ln\left(|x_{n+1} - y_{n+1}| / |x_n - y_n|\right)$ for 2 points $x_n$, $y_n$ close to each other on the trajectory [https://www.math.tamu.edu/~mpilant/math614/Matlab/Lyapunov/LorenzSpectrum.pdf]. The changes of Lyapunov exponents are presented for the return plots of lags x(t+6) versus x(t), x(t+14) versus x(t), x(t+21) versus x(t), and x(t+35) versus x(t). A) Countries. Shown are ranges over 250 days. B) US States. Shown are ranges over 200 days. Mass. = Massachusetts.
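The two-trajectory approximation of the Lyapunov exponent quoted in the caption of Figure S5 can be illustrated with a short sketch. It is not the code used for the figure: the function name `local_lyapunov` and the two toy linear maps are assumptions chosen so that the expected exponents (ln 2 and -ln 2) are known in advance.

```python
import math


def local_lyapunov(step, x0, y0, n_steps):
    """Per-step estimates ln(|x_{n+1} - y_{n+1}| / |x_n - y_n|).

    `step` is the discrete map F, and x0, y0 are two nearby starting
    points on the trajectory.
    """
    x, y = x0, y0
    estimates = []
    for _ in range(n_steps):
        x_next, y_next = step(x), step(y)
        separation = abs(x - y)
        next_separation = abs(x_next - y_next)
        if separation == 0 or next_separation == 0:
            break  # the logarithm is undefined once the trajectories coincide
        estimates.append(math.log(next_separation / separation))
        x, y = x_next, y_next
    return estimates


# Toy maps (assumptions for illustration only, not epidemic data).
# Doubling map: neighbouring trajectories diverge, so the exponent is positive.
expanding = local_lyapunov(lambda x: 2.0 * x, 0.1, 0.1001, 5)
# Halving map: trajectories converge toward the origin, so it is negative.
contracting = local_lyapunov(lambda x: 0.5 * x, 0.1, 0.1001, 5)
```

A negative exponent corresponds to trajectories contracting toward a point attractor, which is the behaviour the Discussion associates with containment.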
10_1101-2021_01_06_425546 ---- bioRxiv.org - the preprint server for Biology

10_1101-2021_01_06_425550 ---- Improving variant calling using population data and deep learning

Nae-Chyun Chen1,‡,∗, Alexey Kolesnikov2, Sidharth Goel2, Taedong Yun2, Pi-Chuan Chang2,†, and Andrew Carroll2,†,∗

1Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
2Google Health, Palo Alto, CA 94304 and Cambridge, MA 02142, USA
∗Corresponding authors: cnaechy1@jhu.edu; awcarroll@google.com
†These authors contributed equally to this work.
‡Work performed while an intern at Google Health.

January 6, 2021

Abstract

Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample.
These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we modify DeepVariant to add a new channel encoding population allele frequencies from the 1000 Genomes Project. We show that this model reduces variant calling errors, improving both precision and recall. We assess the impact of using population-specific or diverse reference panels. We achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.

1 Background

Variant calling [1–3] identifies the positions in an individual genome which differ from a reference or population, and is used to characterize a single sample or build large research cohorts [4, 5]. Variant calling is non-trivial because of sequencing errors, systematic errors in mapping to repetitive and variable regions [6], and imbalanced sampling of alleles needed to distinguish a heterozygous variant from a homozygous one.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.06.425550; this version posted January 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Variant calling can be improved by jointly genotyping multiple samples together [7–9], but the raw sequence data for a cohort is not always available, and this process is computationally expensive. Instead, large-scale reference panels from a wide range of populations can provide similar information [4, 5].
Recent studies use such information to improve alignment accuracy and reduce biases in alignment [10–12], but there has been little work to incorporate population data into variant calling.

Because far more variants are transmitted than arise de novo, real variants in a population tend to recur at various frequencies [13], while false positives are often either not seen elsewhere in a population, or are seen with a consistent signature [14]. Researchers use this knowledge to filter variant calls, often with rules which lose recall for a gain in precision [15]. More sophisticated machine-learning filtering methods are used in larger cohorts, such as gnomAD, but these also trade recall for precision and operate only on variant calls and summary information [4].

We reason that including population-level information at an earlier stage in variant calling, when the full read-level data is available, might allow for more effective use of population data. To do this, we adapted DeepVariant [2], which represents BAM information as a multi-dimensional pileup and uses a Convolutional Neural Network (CNN) to call variants. Because DeepVariant learns the features important for variant classification directly from the data, it allows us to feed in the population allele information as an additional channel.

We trained population-aware models and compared them with the default DeepVariant v1.1 models, which are agnostic of population information. The population-aware approach reduces the number of errors for all tested datasets, including WGS and WES reads, when using the allele frequencies from 1000Genomes. The error reduction is larger for lower-coverage read sets. While traditional filtering approaches will increase precision at the expense of recall, we observe improvements to both precision and recall with this method.
When incorporating population data, it is also important for fairness and equity to understand how it changes the accuracy of methods for individuals with ancestries outside of those used in the development of the population resources. It is known that many genomic databases have collected more data for the European population than others [16–18]. We demonstrate that even using frequencies from a genetically distinct population, the population-aware model still performs similarly to the baseline. We find that a reference panel consisting of all ancestries in the 1000 Genomes Project (1000Genomes) outperforms a reference panel with only one of the 1000Genomes population groups, even when that population matches the sample being called. This implies that maximizing the diversity of ancestries in population resources has the potential to improve variant calling for all populations.

The Genome in a Bottle (GIAB) truth sets used to train DeepVariant are from European, Ashkenazi, and Asian ancestry. To assess whether the addition of the reference panel information improves variant calling for populations outside of the populations represented in training, we use high quality PacBio HiFi [19] data from the Human Genome Structural Variation Consortium for an individual of Puerto Rican ancestry as an evaluation set. We show that an Illumina model using the reference panel has superior concordance with the highly accurate PacBio HiFi variant calls compared to an Illumina model without the reference panel.
2 Results

2.1 Population information improves DeepVariant performance

DeepVariant converts input from a BAM file into a pileup image with 6 channels, representing 1) bases, 2) base qualities, 3) mapping quality, 4) strand, 5) supports variant, and 6) base differs from reference. We modified DeepVariant v1.1 to take an additional input channel, the allele frequency (AF) of the variant [20]. We trained DeepVariant models with and without the AF channel with the testing samples held out.

We first compared the whole-genome sequencing (WGS) variant calling accuracy for sample HG003, sequenced with 35x coverage from the PrecisionFDA v2 Truth Challenge [21], using the latest GIAB v4.2.1 truth set [22] (Figure 1). HG003 is not used in the training of these DeepVariant models, and so acts as an independent holdout to evaluate their quality. The population-aware model is more accurate than default DeepVariant v1.1 in both precision and recall for both types of variants. It has an overall error reduction of 1,514 (4.8%). For SNPs, the error rate (defined as 1-F1 score) decreases from 0.0041 to 0.0038; for indels, the error rate decreases from 0.0044 to 0.0043. Notably, the population-aware model improves SNP false discovery rate (FDR, defined as 1-precision) from 0.0019 to 0.0015, equivalent to an error reduction of 1,096 (17.7%) variants.

We then down-sampled the HG003 reads from 35x to 21x to evaluate the performance of the models with lower-coverage datasets. The population-aware method demonstrates a larger improvement in accuracy over default DeepVariant v1.1, reducing overall errors by 5,119 (9.5%). The error rate decreases from 0.0062 to 0.0056 for SNPs, and from 0.0124 to 0.0113 for indels. As with the 35x read set, the population-aware model shows the strongest improvement in reducing false-positive SNPs, lowering FDR from 0.0040 to 0.0031, equivalent to 3,015 (22.5%) errors.
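The three error measures used above (error rate as 1-F1, FDR as 1-precision, and the false negative rate, FNR, as 1-recall) follow directly from counts of true positives, false positives, and false negatives. The sketch below shows the arithmetic; the function name `error_rates` and the counts in the usage example are hypothetical and are not taken from the paper's evaluation.

```python
def error_rates(tp, fp, fn):
    """Return 1-F1, FDR (1-precision) and FNR (1-recall) from call counts."""
    if tp <= 0:
        raise ValueError("at least one true positive is needed to define F1")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"1-F1": 1 - f1, "FDR": 1 - precision, "FNR": 1 - recall}


# Hypothetical counts for illustration only.
rates = error_rates(tp=9980, fp=20, fn=20)
```

Because F1 is the harmonic mean of precision and recall, 1-F1 falls only if the combined count of false positives and false negatives falls relative to true positives, which is why it serves as a single summary of both error types.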
We further evaluated the performance of the models using two whole-exome sequencing (WES) datasets from a recently released set of genome and exome data [23] (Figure 2). For both WES datasets, the population-aware model outperforms DeepVariant v1.1 in overall number of errors. It reduces overall errors by 53 (9.9%) for the IDT dataset, and by 13 (6.5%) for the Oslo dataset. Its SNP error rate on the Oslo dataset is slightly higher, increasing from 0.00087 to 0.00092, but this difference is smaller than the gain for indels on that dataset. The population-aware model tends to have a larger lead on precision for both types of variants compared to the baseline, but still has similar or better recall.

Figure 1: WGS variant calling error rates for HG003 (panels: INDEL and SNP; metrics: 1-F1, 1-Precision (FDR), 1-Recall (FNR); columns: v1.1-35x, AF-35x, v1.1-21x, AF-21x). All results are evaluated using the GIAB v4.2.1 truth set in the high-confidence regions. v1.1: DeepVariant v1.1; AF: the population-aware model that uses the allele-frequency channel. The column label suffixes show the average coverage of the read sets. Lower values correspond to better accuracy.
Figure 2: WES variant calling error rate for HG003 (panels: INDEL and SNP; metrics: 1-F1, 1-Precision (FDR), 1-Recall (FNR); columns: v1.1-IDT, AF-IDT, v1.1-Oslo, AF-Oslo). The IDT results (“*-IDT”) are GRCh38-based and evaluated using the GIAB v4.2.1 truth set; the Oslo datasets (“*-Oslo”) are GRCh37-based and evaluated using the GIAB v3.3.2 truth set. v1.1: DeepVariant v1.1; AF: the population-aware model that uses the allele-frequency channel. Lower values correspond to better accuracy.

2.2 Model-specific errors for population-aware models

Intuitively, population information helps DeepVariant decide whether to make a call based on the commonness of a variant, especially for cases where the variant calling confidence levels are low. With a population-aware model, a variant caller should be more likely to make a positive variant call for a candidate with high allele frequency, and less likely to make a call when seeing a rare candidate variant.

To understand the influence of allele frequencies in the model, we design an analysis framework to compare a population-agnostic model with a population-aware model. We call this a model-specific error analysis. We stratify the errors into three groups: population-resolved, population-induced and common. The population-resolved variants are called correctly with the allele frequency model, but called incorrectly when using the baseline model.
We say such errors are “rescued” by population information. The population-induced errors are specific to the population-aware model, i.e. they are induced by the extra features. The common group contains errors made by both models. The common errors are viewed as ones that are more difficult to solve without major changes in the data processing pipeline, such as the variant caller, upstream computational methods, or sequencing technology. Thus, in this analysis we focus on the first two groups. For simplicity, we only considered bi-allelic calls in this analysis, which account for the majority of overall errors.

We used the 35x HG003 WGS dataset to perform the model-specific error analysis. After extracting model-specific erroneous calls, we matched the calls with the 1000Genomes variants to obtain associated allele frequencies. We first examined the relationship between allele frequency (AF) and variant allele fraction (VAF), which is the fraction of reads supporting an alternate allele in a given sample, of each false-positive call. There is an observable distinction between the population-induced group and the population-resolved group in the VAF-AF plots (Figure 3, left and middle panels). Among the population-resolved false-positive errors, more than two-thirds (71.0%) are uncommon (allele frequency ≤ 5%) among the 1000Genomes samples, whereas only 11.4% of the population-induced false positives are uncommon variants. This observation supports the hypothesis that the population-aware model uses allele frequency to adjust its variant calls.

We then investigated bi-allelic false-negative errors, as shown in the right panel in Figure 3. Variant allele fractions for false negatives are not always available because many false negatives are not identified as variant candidates, for reasons including low read coverage, incorrect mapping or insufficient sensitivity in variant candidate discovery.
Thus, we only evaluated the allele frequency distribution for false negatives. We noticed a significant difference in the number of common variants (with greater than 5% allele frequency). Among all population-resolved false negatives, 94.6% (1,683 out of 1,780) are common variants. For population-induced false negatives, 59.2% (607 out of 883) are uncommon. The model-specific analysis highlights the difference between the DeepVariant models with and without the AF channel. With the additional population information, DeepVariant is capable of adjusting the calls according to the commonness of a variant and shows improvements in both precision and recall.

Figure 3: Errors specific to a population-agnostic model (in blue) and a population-aware model (in red) using 35x HG003 WGS data.

2.3 Performance on zero-frequency variants

A potential concern for population-aware variant calling models is an increased false-negative rate for novel alleles. Since it is not trivial to define a set of truly novel variants in the 1000 Genomes Project, we extracted variants with zero allele frequency to investigate the impact when population information is included in a variant calling model. Using the GIAB v4.2.1 truth set, there are 32,256 (1.0%) SNPs and 3,193 (0.6%) indels that have zero allele frequency for sample HG003. We then use the zero-frequency variant set to evaluate recall of actual variant calls using hap.py [3]. We observed that recall on zero-frequency variants is lower than on the remaining variants for all DeepVariant models, regardless of variant type and of whether population information is used.
With 35x reads, the false-negative rate (FNR, or 1-recall) of the population-agnostic model is 0.1855 for SNPs and 0.2474 for indels (Figure 4). The FNRs further increase to 0.1945 for SNPs and 0.2643 for indels when using the population-aware model. When using 21x reads, the drop in accuracy gets larger for both types of variants. This is consistent with our analysis that the population-aware DeepVariant model requires stronger evidence (higher-quality pileup images) to call zero-frequency variants, thus reducing recall. Further, the population information has a stronger influence on variant calling for low-coverage datasets. Despite these disadvantages, the negative impact on zero-frequency variants is small compared to the overall error reduction.

Figure 4: The false negative rate (FNR) of zero-frequency variants for HG003 with different models (panels: INDEL and SNP; columns: v1.1, AF, v1.1-21x, AF-21x). Lower values correspond to better accuracy.

To better understand the zero-frequency variants, we called variants using the DeepVariant PacBio model with the PrecisionFDA v2 35x HG003 read set sequenced with the PacBio HiFi technology [21]. The FNRs for the zero-frequency variants improve to 0.0481 for SNPs and 0.0868 for indels. The large difference in recall/FNR indicates that many of the zero-frequency variants are hard to genotype using Illumina reads, and may not be novel mutations relative to samples in reference panels. In the future, reference panels utilizing high-quality long reads will likely provide better allele frequency estimates and improve the population-aware model performance.
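The zero-frequency evaluation in this section can be illustrated with a minimal sketch that splits truth variants by their cohort allele frequency and computes the FNR of each group. This is not the hap.py-based procedure used in the study: it assumes exact matching on a (chromosome, position, REF, ALT) key, it treats variants absent from the cohort as zero-frequency, and all names and data below are hypothetical.

```python
def fnr_by_frequency_class(truth, called, cohort_af):
    """Return the false-negative rate for zero- and nonzero-frequency truth variants.

    truth:     iterable of variant keys (chrom, pos, ref, alt)
    called:    set of variant keys produced by a caller
    cohort_af: dict mapping variant key -> allele frequency in the reference panel
    """
    counts = {"zero": [0, 0], "nonzero": [0, 0]}  # [missed, total] per group
    for variant in truth:
        # Assumption: a variant missing from the panel counts as zero-frequency.
        group = "zero" if cohort_af.get(variant, 0.0) == 0.0 else "nonzero"
        counts[group][1] += 1
        if variant not in called:
            counts[group][0] += 1
    return {
        group: (missed / total if total else None)
        for group, (missed, total) in counts.items()
    }


# Hypothetical toy data for illustration only.
truth = [
    ("chr20", 100, "A", "G"),
    ("chr20", 200, "C", "T"),
    ("chr20", 300, "G", "A"),
    ("chr20", 400, "T", "TA"),
]
called = {truth[0], truth[1], truth[2]}
cohort_af = {truth[0]: 0.3, truth[1]: 0.1, truth[3]: 0.0}

rates = fnr_by_frequency_class(truth, called, cohort_af)
print(rates)
```

Running the same function on the calls of a population-agnostic and a population-aware model would give the kind of per-group comparison summarized in Figure 4.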
2.4 Assessing biases using different 1000Genomes populations

It is important to understand if the inclusion of population information reduces DeepVariant’s performance for populations that are not well represented, especially when they have a large genomic difference with the reference panel. We first note that Ashkenazi Jewish, the ethnicity of HG003, is not among the 26 ethnicities collected by 1000Genomes. Using a testing sample not in the reference panel reduces the risk of bias. Second, we ran inference with the population-aware model using different reference panels of allele frequencies. We split the 1000Genomes samples into five groups based on the superpopulation labels (African, AFR; Admixed American, AMR; East Asian, EAS; European, EUR; South Asian, SAS) and calculated allele frequencies for each superpopulation. We show that all population-aware approaches outperform the baseline for SNPs but underperform it for indels when evaluated using HG003 (Figure 5). When considering the overall number of errors, only the model inferred with EAS frequencies makes more errors than the baseline, but the deficit (494, or 1.6%) is small.

We also compared the performance of using different superpopulation frequencies and observed a correlation between variant calling accuracy and the distance between the tested sample and ethnicity groups. According to the principal component (PC) analysis performed by gnomAD v3 [4], Ashkenazi Jewish is closer to the European populations and is farther from East Asian and African in the PC1-PC2 space. We observed that using frequencies from a genetically closer population usually resulted in higher variant calling
accuracy. Using EUR frequencies outperforms using other population frequencies, only falling behind using the entire 1000Genomes. On the other hand, using EAS frequencies results in the highest number of errors among all population-aware methods. We point out that using 1000Genomes frequencies from all populations results in the lowest number of errors among all population-aware results, suggesting that using a diverse population is more advantageous than finding a genetically similar group. This finding echoes our previous statement that we anticipate the population-aware variant calling model to improve further with larger-scale and more diverse population callsets.

Figure 5: Variant calling accuracy when inferring 35x Illumina reads from HG003 using default DeepVariant v1.1 (v1.1), allele frequencies in the entire 1000Genomes (All) and five 1000Genomes superpopulations (EUR, AMR, SAS, AFR and EAS). Panels: INDEL and SNP; metrics: 1-F1, 1-Precision (FDR), 1-Recall (FNR). Lower values correspond to better accuracy.

2.5 Silver-standard truth set for HG00733

Genome in a Bottle (GIAB) truth variant sets provide gold standards to benchmark variant callers, but so far only three samples (HG002-HG003-HG004, the Ashkenazi trio) have curated calls in difficult-to-map regions, added in the v4.2.1 release [22]. Further, the samples are from the same ancestry, making it challenging to perform a generalized benchmarking considering the genetic diversity of the human population.
To deal with this difficulty, it is desirable to have other high-quality variant sets from non-GIAB samples, preferably from ancestries not covered by GIAB. Thus, we called variants using the DeepVariant PacBio model with 32x high-coverage PacBio HiFi reads [24] for HG00733, a Puerto Rican (labelled as PUR under the AMR superpopulation in 1000Genomes) sample. The DeepVariant PacBio model has a SNP F1 score higher than 99.9% and is one of the most accurate models using PacBio HiFi data [22].

Figure 6: Variant calling results when evaluated using HG00733 data, compared to the PacBio-DeepVariant silver-standard truth set (panel: SNP; metrics: 1-F1, 1-Precision (FDR), 1-Recall (FNR); columns: v1.1-30x, AF-30x, v1.1-18x, AF-18x). Lower values correspond to better accuracy.

We used the DeepVariant HG00733 PacBio SNP calls as a “silver-standard” truth set and benchmarked the performance for models using Illumina reads. We used 30x Illumina WGS reads sequenced by the New York Genome Center to test all HG00733 models. We excluded the Puerto Rican population when calculating allele frequencies to avoid biases in favor of the population-aware models: because 1000Genomes includes PUR samples, we excluded all of them and re-calculated allele frequencies for both the full 1000Genomes panel and the AMR superpopulation. The population-aware model has a lower SNP error rate (0.0041 vs. 0.0043), FDR (0.0022 vs. 0.0023) and FNR (0.0059 vs. 0.0062) than the baseline for HG00733 (Figure 6). The number of SNP errors is reduced by 1,353 (4.82%).
Similar to the finding using HG003, the population-aware model performs strongly with a down-sampled (18x) read set. The error rate for the 18x read set is reduced from 0.0056 to 0.0051, and the SNP error reduction is 3,145 (8.5%). We also tested the model using different superpopulation frequencies (Figure 7). All but the EAS population-aware model have lower SNP error rates than the baseline. When inferred using the EAS allele frequencies, the SNP error rate increased from 0.0043 to 0.0044, equivalent to 878 (3.1%) more errors. All population-aware models, including EAS, outperform the baseline on FDR, and only EAS has a higher FNR than the baseline (0.0066 vs. 0.0062).

Figure 7: Number of SNP errors when evaluated using 30x WGS reads from a Puerto Rican sample HG00733 (panel: SNP; metrics: 1-F1, 1-Precision (FDR), 1-Recall (FNR); columns: v1.1, All, EUR, AMR, SAS, AFR, EAS). All models other than v1.1 are population-aware, inferred using allele frequencies from different populations. Lower values correspond to better accuracy.

3 Discussion

We designed a new population-aware DeepVariant model which can incorporate both base- and read-level information with the population information. We find that population-aware models reduce error rates by 4.9% for WGS and 6.5-9.9% for WES compared to population-agnostic baselines (default DeepVariant v1.1). The relative advantage of the population-aware model increases at lower coverage (4.9% reduction at 35x and 9.5% at 21x). The increased accuracy at lower coverage suggests that population information is
most valuable in difficult examples, where read-level information alone may not be sufficient for confident calling. In population sequencing projects, this finding could be relevant to the question of whether to sequence more individuals at lower coverage, or fewer at a higher coverage. When sequencing a species without a reference panel, it is possible that sequencing more, diverse individuals at lower coverage could still retain accuracy comparable to traditional methods which do not incorporate population information in calling.

We evaluate potential biases introduced by population information in variant calling by comparing population-aware models that use allele frequencies from different 1000Genomes superpopulations. This experiment simulates a scenario where the tested sample is genetically distinct from the reference panel. Only one population-aware method (inferred with EAS frequencies) underperforms the baseline in total number of errors, and only by a small deficit. Furthermore, using allele frequencies calculated from the entire 1000Genomes outperforms population-specific methods. This finding implies that a diverse population can provide more benefits than a homogeneous one, even when the homogeneous population is genetically more similar to the tested sample.

This finding may inform efforts to build population- or country-specific resources. Increasing the number of samples for a given population will improve accuracy for that population, but the inclusion of samples from diverse populations will also improve the resource. We believe that the accuracy of the population-aware model can further improve with a larger and more diverse population callset in the future, reinforcing the benefit of collaboration between nation-scale efforts.

We provide an additional “silver-standard” SNP set for HG00733, a Puerto Rican sample from a population not present in the labeled training data.
We used high-coverage PacBio HiFi reads and an accurate DeepVariant PacBio model to generate this high-quality call set. This method can provide high-confidence SNP calls for non-GIAB samples and increase population diversity when assessing variant calling results. Similar to the results using HG003 data, we show that the proposed model has strong performance compared to the baseline, and suffers only a slight loss of accuracy when inferred using a distinct population. When more high-coverage PacBio HiFi data become available in the future, the high-quality calls generated by DeepVariant can provide a more diversified dataset for variant calling benchmarking and downstream analysis.

Despite greater overall accuracy, we note that the population-aware model underperforms on variants with zero allele frequencies in 1000Genomes. Although the disadvantage is small compared to the overall gain, this result suggests that the decision of whether to use population-aware models should consider the end goal. If reducing potential false positives is a larger concern, the use of a population-aware method could be recommended, but if the goal is to maximize recall of rare or novel variants, traditional methods could be preferred. We also notice that all tested Illumina models performed poorly on the zero-frequency variants, regardless of whether population information was used. By analyzing the variants with PacBio reads, we point out that many zero-frequency variants in 1000Genomes are located in difficult-to-map regions, but are likely not genetically novel in the population.
This suggests that the power of population-aware methods should increase as large panels of long-read population data become available.

4 Methods

4.1 Training the model

We trained the model following the procedure described in [2], with additional Illumina WGS datasets included [23]. Variants in chromosomes 1 to 19 are used as the training examples, and those in chromosomes 21 and 22 are used for tuning. Variants in chromosome 20 are never used in the training process.

4.2 Datasets

The model is evaluated using the GIAB v4.2.1 truth set for HG003 across whole genomes [22]. We also generated another high-quality SNP set using DeepVariant v0.10 and HG00733 PacBio HiFi data [24] across the whole genome. We used the intersection of high-confidence regions of HG002, HG003, and HG004 (GIAB v4.2.1) as the high-confidence regions for the HG00733 SNP set. The read sets used for experiments are listed in Table 1 and the read sets for supporting experiments are provided in Table 2.

4.3 Allele matching algorithm

When incorporating population information in DeepVariant, we need to match a variant candidate with a cohort variant. However, this is not a straightforward task since a variant can be represented in multiple formats [3, 26]. A common approach is to normalize variants, such as using bcftools norm [27], but that is not sufficient for complicated

Table 1: Testing datasets.
Sample | Ethnicity | Truth variant | Dataset
HG003 | Ashkenazi Jewish | v4.2.1 (GRCh38) | 35x Illumina WGS [22]; 100x Illumina WES [23]
HG003 | Ashkenazi Jewish | v3.3.2 (GRCh37) | 300x Illumina WES [25]
HG00733 | Puerto Rican | DeepVariant v0.10 PacBio SNP calls (GRCh38) | 30x Illumina WGS (NYGC)

Table 2: Other datasets used in this study.

Sample | Ethnicity | Dataset
HG003 | Ashkenazi Jewish | 35x PacBio HiFi [22]
HG00733 | Puerto Rican | 32x PacBio HiFi [24]

cases. We designed an algorithm that constructs local haplotypes and performs precise allele matching (Figure 8). The algorithm starts by querying all cohort variants $V_C$ that overlap a window $[start_v, end_v)$, where $start_v$ and $end_v$ are the starting and ending positions of a variant candidate $v$, respectively. The queried cohort variants and the candidate variant form the set $V \equiv \{v\} \cup V_C$. Then the window is extended to the smallest starting position and the largest ending position within $V$, giving $[start_V, end_V)$, where $start_V \equiv \min_{u \in V} start_u$ and $end_V \equiv \max_{w \in V} end_w$. The local reference haplotype is queried from the reference genome in the window $[start_V, end_V)$. For each variant allele in $V$, its allele haplotype is updated in this window. If there is a perfect match between a cohort allele haplotype and a candidate allele haplotype, the allele frequency of the cohort allele is added to an allele frequency dictionary, using the alternate allele of the candidate variant as the key. Afterwards, DeepVariant looks up the dictionary when processing reads that overlap the candidate variant.

4.4 Allele frequency channel for DeepVariant

To take full advantage of the CNN-based classifier of DeepVariant, allele frequencies need to be encoded in pileup images. We apply a logarithmic transformation to gain resolution for low-frequency signals. For each variant candidate, an additional allele frequency channel is added to the pileup image. In this channel, a read is colored by the transformed frequency of its allele at the variant candidate position.
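The idea of a logarithmic encoding can be illustrated with a minimal sketch. The text does not give the exact transformation, so everything below is an assumption made for illustration: the function name `af_to_intensity`, the floor frequency `min_af`, the base-10 logarithm and the 0 to 255 intensity range are hypothetical and do not describe DeepVariant's actual implementation.

```python
import math


def af_to_intensity(af: float, min_af: float = 1e-5, max_value: int = 255) -> int:
    """Map an allele frequency in [0, 1] to a pixel intensity on a log scale.

    Hypothetical encoding: frequencies at or below `min_af` map to 0,
    a frequency of 1.0 maps to `max_value`, and values in between are
    spaced evenly in log10(af).
    """
    if af <= 0.0:
        return 0
    clipped = min(max(af, min_af), 1.0)
    # log10(clipped) / log10(min_af) is 1 at min_af and 0 at af = 1.
    scaled = 1.0 - math.log10(clipped) / math.log10(min_af)
    return round(scaled * max_value)


# Each tenfold change in frequency shifts the intensity by the same amount,
# so rare alleles remain distinguishable from one another.
intensities = [af_to_intensity(af) for af in (1e-4, 1e-3, 1e-2, 1e-1, 1.0)]
print(intensities)
```

With a linear mapping to the same range, frequencies of 0.0001 and 0.001 would both round to intensity 0; a log scale keeps them apart, which is the resolution gain for low-frequency signals that the text refers to.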
A read can carry multiple alternate alleles with different frequencies, so its color intensity may vary across pileup images where the variant candidates differ. An alternative method to encode allele frequencies is to include the information as features in the fully-connected layers [28], but this approach sacrifices the capability to combine allele frequencies with base- and read-level information and thus is not adopted.

Figure 8 (diagram content): Variant candidate: position=85, REF=ATTCCAG, ALT=AT. Cohort variant 1: position=80, REF=TTTCCA, ALT=T,TTTCCATTCCA, AF=3.99E-4,2E-4. Cohort variant 2: position=86, REF=TTCCAG, ALT=T, AF=1.198E-3. Reference:80-92 TTTCCATTCCAG. Updated haplotypes: candidate (ALT=AT): TTTCCAT; cohort variant 1: TTTCCAG and TTTCCATTCCATTCCAG; cohort variant 2: TTTCCAT. Candidate frequency: AT: 0.001198 (shown rounded as 0.0012).

Figure 8: An example for the allele matching algorithm. This algorithm first queries cohort variants overlapped with the variant candidate. These cohort variants and the candidate determine the window where haplotypes are updated. The frequencies of matched allele haplotypes are then updated for the variant candidate as a dictionary. In this diagram, haplotypes are updated with dashes to keep sequences aligned for better visualization.
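The allele matching procedure of Section 4.3 can be sketched in a few lines and checked against the Figure 8 example. This is an illustrative reimplementation, not DeepVariant's code: the `Variant` class, the `fetch_reference` callback and the choice to sum frequencies when several cohort alleles yield the same haplotype are assumptions. Positions use half-open windows `[start, end)`, and the reference string and variants are taken from Figure 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record; `afs` holds one frequency per ALT."""
    start: int
    ref: str
    alts: tuple
    afs: tuple = ()

    @property
    def end(self) -> int:
        return self.start + len(self.ref)  # half-open end position


def haplotype(variant, alt, ref_seq, win_start):
    """Reference window with the REF allele of `variant` replaced by `alt`."""
    left = ref_seq[: variant.start - win_start]
    right = ref_seq[variant.end - win_start:]
    return left + alt + right


def match_allele_frequencies(candidate, cohort, fetch_reference):
    """Return {candidate ALT: cohort allele frequency} for perfectly matching haplotypes."""
    overlapping = [c for c in cohort
                   if c.start < candidate.end and candidate.start < c.end]
    variants = [candidate] + overlapping
    win_start = min(v.start for v in variants)
    win_end = max(v.end for v in variants)
    ref_seq = fetch_reference(win_start, win_end)

    cohort_haplotypes = {}
    for c in overlapping:
        for alt, af in zip(c.alts, c.afs):
            hap = haplotype(c, alt, ref_seq, win_start)
            # Assumption: frequencies of identical haplotypes are summed.
            cohort_haplotypes[hap] = cohort_haplotypes.get(hap, 0.0) + af

    frequencies = {}
    for alt in candidate.alts:
        hap = haplotype(candidate, alt, ref_seq, win_start)
        if hap in cohort_haplotypes:
            frequencies[alt] = cohort_haplotypes[hap]
    return frequencies


# Figure 8 example: reference bases for positions 80-92 (half-open).
REFERENCE_OFFSET = 80
REFERENCE = "TTTCCATTCCAG"


def fetch_reference(start, end):
    return REFERENCE[start - REFERENCE_OFFSET: end - REFERENCE_OFFSET]


candidate = Variant(85, "ATTCCAG", ("AT",))
cohort = [
    Variant(80, "TTTCCA", ("T", "TTTCCATTCCA"), (3.99e-4, 2e-4)),
    Variant(86, "TTCCAG", ("T",), (1.198e-3,)),
]
frequencies = match_allele_frequencies(candidate, cohort, fetch_reference)
print(frequencies)
```

The candidate allele AT and the ALT of the cohort variant at position 86 both produce the haplotype TTTCCAT, so the candidate inherits that cohort allele's frequency even though the two records differ in position and REF/ALT representation.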
In practice, dash-free haplotypes are generated by the allele matching algorithm.

To enable the allele frequency channel, users need to enable flag --use_allele_frequency and provide DeepVariant cohort variants in VCF format with flag --population_vcfs.

4.5 Model-specific error analysis

We compared actual variant calls with GIAB v4.2.1 truth variants using bcftools isec. Variants specific to actual calls are regarded as false positives, and those specific to the truth set are regarded as false negatives. We generated the false-positive and false-negative sets for two models, and then applied bcftools isec again to obtain model-specific false positives and false negatives. For both sets, we applied the allele matching algorithm to obtain allele frequencies for the variants. For the false-positive sets, we extracted variant allele fractions from the VCF files generated by DeepVariant.

4.6 1000Genomes frequencies from the DeepVariant-GLnexus pipeline

We used the 1000Genomes reference panel generated with the DeepVariant-GLnexus pipeline (v3) [8] for all population-aware experiments, including training and inferring the models. We filled missing genotypes with reference genotypes using bcftools +missing2ref so that all variants have the same denominator.

5 Availability of data and materials

The DeepVariant source code is available at https://github.com/google/deepvariant under the BSD-3-Clause License. The PacBio-based HG00733 SNP set is available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/HG00733_SNP_set.
The pre-trained population-aware DeepVariant models are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WGS (WGS) and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WES (WES). The VCF files used in this study are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref (GRCh38) and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_GRCh37_missing2ref (GRCh37).

6 Ethics approval and consent to participate

Not applicable.

7 Consent for publication

Not applicable.

8 Competing interests

AK, SG, TY, PC and AC are employees of Google LLC and own Alphabet stock as part of the standard compensation package. This study was funded by Google LLC.
9 Funding

All compute resources used in this work were provided by Google, LLC. AK, SG, TY, PC and AC are full-time, salaried employees of Google, LLC. NC contributed to this work as a salaried intern of Google, LLC.
10 Acknowledgments

We thank Babak Alipanahi, Gunjan Baid, Daniel Cook, Alexander D’Amour, Hojae Lee, Cory McLean, Maria Nattestad and other colleagues at Google for their feedback on this manuscript and the project in general. The HG00733 Illumina data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1.

11 Authors’ contributions

NC, AK, PC and AC designed the method. NC, AK and PC implemented the software. NC and PC performed the experiment. NC, AK, SG, TY, PC and AC analyzed the results. NC, PC and AC wrote the manuscript. All authors read and approved the final manuscript.

References

1. DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., Del Angel, G., Rivas, M. A., Hanna, M., et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491 (2011).
2. Poplin, R., Chang, P.-C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P. T., et al. A universal SNP and small-indel variant caller using deep neural networks. Nature biotechnology 36, 983–987 (2018).
3. Krusche, P., Trigg, L., Boutros, P. C., Mason, C. E., Francisco, M., Moore, B. L., Gonzalez-Porta, M., Eberle, M. A., Tezak, Z., Lababidi, S., et al. Best practices for benchmarking germline small-variant calls in human genomes. Nature biotechnology 37, 555–560 (2019).
4. Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 5. 1000 Genomes Project Consortium et al. A global reference for human genetic varia- tion. Nature 526, 68–74 (2015). 6. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014). 7. Lin, M. F., Rodeh, O., Penn, J., Bai, X., Reid, J. G., Krasheninina, O. & Salerno, W. J. GLnexus: joint variant calling for large cohort sequencing. BioRxiv, 343970 (2018). 8. Yun, T., Li, H., Chang, P.-C., Lin, M. F., Carroll, A. & McLean, C. Y. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. bioRxiv (2020). 9. Poplin, R., Ruano-Rubio, V., DePristo, M. A., Fennell, T. J., Carneiro, M. O., Van der Auwera, G. A., Kling, D. E., Gauthier, L. D., Levy-Moonshine, A., Roazen, D., et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv, 201178 (2017). 10. Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reducing reference bias using multiple population reference genomes. BioRxiv (2020). 11. Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome biology 21, 1–28 (2020). 12. Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F., et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology 36, 875–879 (2018). 13. Witherspoon, D. J., Wooding, S., Rogers, A. R., Marchani, E. E., Watkins, W. S., Batzer, M. A. & Jorde, L. B. Genetic similarities within and between human populations. Genetics 176, 351–359 (2007). 14. Abramovs, N., Brass, A. & Tassabehji, M. Hardy-Weinberg Equilibrium in the Large Scale Genomic Sequencing Era. Frontiers in Genetics 11, 210 (2020). 15. Pedersen, B. S., Brown, J. M., Dashnow, H., Wallace, A. D., Velinder, M., Tvrdik, T., Mao, R., Best, H. 
D., Bayrak-Toydemir, P. & Quinlan, A. R. Effective variant filter- ing and expected candidate variant yield in studies of rare human disease. BioRxiv (2020). 16. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019). 17. Martin, A. R., Kanai, M., Kamatani, Y., Okada, Y., Neale, B. M. & Daly, M. J. Clinical use of current polygenic risk scores may exacerbate health disparities. Nature genetics 51, 584–591 (2019). 18. McGuire, A. L., Gabriel, S., Tishkoff, S. A., Wonkam, A., Chakravarti, A., Furlong, E. E., Treutlein, B., Meissner, A., Chang, H. Y., López-Bigas, N., et al. The road ahead in genetics and genomics. Nature Reviews Genetics 21, 581–596 (2020). 16 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.06.425550doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425550 19. Wenger, A. M., Peluso, P., Rowell, W. J., Chang, P.-C., Hall, R. J., Concepcion, G. T., Ebler, J., Fungtammasan, A., Kolesnikov, A., Olson, N. D., et al. Accurate circular con- sensus long-read sequencing improves variant detection and assembly of a human genome. Nature biotechnology 37, 1155–1162 (2019). 20. Carroll, A. & Chang, P.-C. Improving the Accuracy of Genomic Analysis with DeepVariant 1.0 https://ai.googleblog.com/2020/09/improving-accuracy-of- genomic-analysis.html. 2020. (accessed: 2020-12-11). 21. Olson, N. D., Wagner, J., McDaniel, J., Stephens, S. H., Westreich, S. T., Prasanna, A. G., Johanson, E., Boja, E., Maier, E. J., Serang, O., et al. precisionFDA Truth Chal- lenge V2: Calling variants from short-and long-reads in difficult-to-map regions. bioRxiv (2020). 22. Wagner, J., Olson, N. D., Harris, L., Khan, Z., Farek, J., Mahmoud, M., Stankovic, A., Kovacevic, V., Wenger, A. M., Rowell, W. J., et al. 
Benchmarking challenging small variants with linked and long reads. BioRxiv (2020). 23. Baid, G., Nattestad, M., Kolesnikov, A., Goel, S., Yang, H., Chang, P.-C. & Carroll, A. An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development. bioRxiv. eprint: https://www.biorxiv.org/content/early/ 2020/12/11/2020.12.11.422022.full.pdf. https://www.biorxiv. org/content/early/2020/12/11/2020.12.11.422022 (2020). 24. Porubsky, D., Ebert, P., Audano, P. A., Vollger, M. R., Harvey, W. T., Marijon, P., Ebler, J., Munson, K. M., Sorensen, M., Sulovari, A., et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nature Biotechnology. ISSN: 1546-1696. https://doi.org/10.1038/s41587- 020-0719-5 (Dec. 2020). 25. Zook, J. M., Catoe, D., McDaniel, J., Vang, L., Spies, N., Sidow, A., Weng, Z., Liu, Y., Mason, C. E., Alexander, N., et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific data 3, 1–26 (2016). 26. Sun, C. & Medvedev, P. VarMatch: robust matching of small variant datasets using flexible scoring schemes. Bioinformatics 33, 1301–1308 (2017). 27. Li, H. A statistical framework for SNP calling, mutation discovery, association map- ping and population genetical parameter estimation from sequencing data. Bioinfor- matics 27, 2987–2993 (2011). 28. Yi, R., Chang, P.-C., Baid, G. & Carroll, A. Learning from Data-Rich Problems: A Case Study on Genetic Variant Calling. arXiv preprint arXiv:1911.05151 (2019). 17 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. 
10_1101-2021_01_06_425560 ---- Review and performance evaluation of trait-based between-community dissimilarity measures

Title
Review and performance evaluation of trait-based between-community dissimilarity measures

Author details
Attila Lengyel1* & Zoltán Botta-Dukát2*
*Centre for Ecological Research, Institute of Ecology and Botany, Alkotmány u. 2-4., H-2163 Vácrátót, Hungary
1 corresponding author, lengyel.attila@ecolres.hu
2 botta-dukat.zoltan@ecolres.hu

The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. This version posted January 8, 2021.
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.06.425560

Abstract
1. In recent years a variety of indices have been proposed with the aim of quantifying functional dissimilarity between communities. These indices follow different approaches to account for between-species similarities in the calculation of community dissimilarity, yet they have all been proposed as straightforward tools.
2. In this paper we reviewed the trait-based dissimilarity indices available in the literature, contrasted the approaches they follow, and evaluated their performance in terms of correlation with an underlying environmental gradient using individual-based community simulations with different gradient lengths. We tested how strongly dissimilarities calculated by different indices correlate with environmental distances. Using random forest models we tested the importance of gradient length, the choice of data type (abundance vs. presence/absence), the transformation of between-species similarities (linear vs. exponential), and the dissimilarity index in predicting the correlation value.
3. We found that many indices behave very similarly and reach high correlation with environmental distances. Only a few indices (e.g. Rao’s DQ and representatives of the nearest neighbour approach) regularly performed worse than the others. By far the strongest determinant of correlation with environmental distance was the gradient length, followed by the data type. The choice of dissimilarity index and transformation method did not seem to be crucial when correlation with an underlying gradient is to be maximized.
4. Synthesis: We provide a framework of functional dissimilarity indices and discuss the approaches they follow. Although these indices are formulated in different ways and follow different approaches, most of them perform similarly well.
At the same time, sample properties (e.g. gradient length) determine the correlation between trait-based dissimilarity and environmental distance more fundamentally.

Keywords
beta diversity, dissimilarity index, distance metric, community ecology, functional traits

Abbreviations
CDF = cumulative distribution function, CWM = community-weighted mean, FDissim = functional dissimilarity, VIS = variable importance score

Introduction
Understanding and explaining the variation of living communities along dimensions of space and time has long been a focus of ecological research. The widely applied scheme by Whittaker (1960, 1972) to tackle questions of different aspects of community variation divides community diversity into alpha (within-community), beta (between-community) and gamma (across-community) components. It is no exaggeration to say that among these three, beta diversity has sparked the most controversy due to the multitude of ways in which it can be formulated (Tuomisto 2010a,b, Anderson et al. 2011, Podani & Schmera 2011, Baselga & Leprieur 2015). One of the most popular approaches to beta diversity builds upon quantification of variation between pairs of communities using dissimilarity indices (Anderson et al. 2006, Legendre & De Cáceres 2013, Ricotta 2017). A broad spectrum of such dissimilarity indices is available for many specific purposes, providing elementary tools for different fields of ecology and beyond (see reviews by Legendre & Legendre 1998, Podani
2000). Nevertheless, choosing from so many options requires a more or less subjective decision from the researcher, which may affect the final result of the analysis. Comparative reviews of dissimilarity indices (Faith et al. 1987, Koleff et al. 2003) and evaluations of the effects of methodological decisions (Lengyel & Podani 2015) are undoubtedly helpful in making these decisions.

The most popular, yet not exclusive, interpretations of diversity long considered species as variables unrelated to each other. In the last two decades, however, the functional approach to ecological questions has gained unprecedented attention (Díaz & Cabido 2001, McGill et al. 2006). This approach relies on the fact that species are not all maximally different from each other; rather, they can be considered related with respect to similarities in their traits, which are thought to represent their roles in ecosystems (Violle et al. 2007). The need to account explicitly for between-species relatedness generated a wave of methodological improvements that introduced new methods for the calculation of diversity. Alongside a lively scientific discussion on how functional alpha diversity can be appropriately quantified (Mason et al. 2005, Petchey & Gaston 2006, Villéger et al. 2008, Mouchet et al. 2010), suggestions were also made for the expression of functional beta diversity (Swenson 2011, Botta-Dukát 2018, Chao et al. 2019). Among them, a large variety of indices for calculating dissimilarity between pairs of communities on the basis of the traits of their species have been proposed (e.g. Ricotta & Burrascano 2008, Cardoso et al. 2014, Ricotta & Pavoine 2015).
Although these indices have been introduced as straightforward measures for revealing between-community dissimilarity on the basis of traits, they rest on very different concepts, and we still lack a comparative review of them.

In this paper we aim to provide an overview and a conceptual framework for the pairwise functional dissimilarity (hereafter called FDissim) measures available in the literature to the best of our knowledge. We (1) start with a short overview of the concept and indices of ecological (dis-)similarity without accounting for relatedness of species, then (2) review and classify FDissim indices according to their conceptual basis, and (3) test the performance of FDissim indices.

Short overview of taxon-based (dis-)similarity methods
Most FDissim measures are generalizations of simple indices which were originally designed for expressing dissimilarity based on species composition (that is, omitting similarities between species). We start the review of trait-based (dis-)similarity measures with a brief summary of these species-based indices. Then, we present a framework of approaches including several families of trait-based dissimilarity indices.

Species-based indices
Most indices can be written in either similarity (s) or dissimilarity (d=1-s) form, but when we do not consider it necessary to specify the form, we call them ‘resemblances’. In the case of presence/absence data, these indices are based on the well-known 2×2 contingency table whose cells represent the number of species shared (denoted by a), as well as the number of species occurring only in one of the communities (b and c).
The fourth cell of the contingency table, quantifying the number of shared absences, is disregarded by these indices and rarely used in ecological analyses (but see Tamás et al. 2001). All these indices express similarity as the proportion of shared diversity to total diversity. Hence, all of them range between 0 and 1. In the case of presence/absence data the number of shared species, a, in the numerator stands for shared diversity for all indices, while the denominators differ. In the Sørensen index (sS) the denominator is the arithmetic mean of the species numbers of the two communities, in the Ochiai index (sO) it is their geometric mean, in the Kulczynski index (sK) it is their harmonic mean, while in the Simpson index (sSi) it is the richness of the species-poorer community. If the two communities are equally species-rich, then these indices are equal, otherwise sS < sO < sK < sSi. In the Jaccard index (sJ), the denominator is the total number of species in the two communities, while in the Sokal & Sneath index (sSS) species occurring in a single community are taken into account with double weight. There is a direct and monotonic relationship between the Jaccard, Sørensen, and Sokal & Sneath indices (see Appendix S1). Table 1 summarizes the similarity and dissimilarity forms of the above indices.

For abundance data, the resemblance of two communities is derived from the summation of species-wise differences, with the simplest interpretations being the Euclidean and the Manhattan distances, respectively:

Eq. 1. $d_{\mathrm{Euc}}(j,k) = \sqrt{\sum_{i=1}^{S_{jk}} (x_{ij} - x_{ik})^2}$

Eq. 2. $d_{\mathrm{Man}}(j,k) = \sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|$

where xij and xik are the abundances of species i in communities j and k, and Sjk is the total number of species in j and k. For both indices, the minimum is 0, but the maximum of the Euclidean distance is the square root of the sum of squared abundances, while for the Manhattan distance the maximum is the sum of abundances. Their dependence on total abundance makes these index values difficult to compare across samples; therefore, indices including a standardization have become more popular in ecological studies. The standardization is possible in several ways. The first option is to standardize the raw species contributions to between-community dissimilarity (xij-xik), and then to sum them. To this end, each species-level difference in abundance should be divided by a scaling factor such that the maximal species-level difference is 1, and this maximum is reached if the species is present in only one of the compared communities. Summing xij and xik in the denominator satisfies this requirement and gives a well-known distance measure, the Canberra index:

Eq. 3. $d_{\mathrm{Can}}(j,k) = \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{x_{ij} + x_{ik}}$

However, the Canberra index still ranges between 0 and Sjk. According to Ricotta & Podani (2017), the normalized Canberra index can be derived by unweighted averaging of species contributions:

Eq. 4. $d_{\mathrm{NCan}}(j,k) = \frac{1}{S_{jk}} \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{x_{ij} + x_{ik}}$

Alternatively, species-level differences can be divided by max(xij, xik). This also yields unity if a species occurs in only one of the plots.
Ricotta & Podani (2017) called this the modified Canberra index, whose normalized version follows:

Eq. 5. $d_{\mathrm{NMCan}}(j,k) = \frac{1}{S_{jk}} \sum_{i=1}^{S_{jk}} \frac{|x_{ij} - x_{ik}|}{\max(x_{ij}, x_{ik})}$

When calculated from binary data, both the normalized Canberra and the normalized modified Canberra indices result in Jaccard dissimilarity.

A different way of standardization is possible if raw species-level differences are summed and divided by the sum of their theoretical maxima. In this case, the denominator can follow the logic of the Canberra index, thus leading to the Bray-Curtis index:

Eq. 6. $d_{\mathrm{BC}}(j,k) = \frac{\sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|}{\sum_{i=1}^{S_{jk}} (x_{ij} + x_{ik})}$

By analogy with the normalized modified Canberra index, the denominator may contain the maximum of the abundances instead of their sum, resulting in the formula known as the Marczewski-Steinhaus index:

Eq. 7. $d_{\mathrm{MS}}(j,k) = \frac{\sum_{i=1}^{S_{jk}} |x_{ij} - x_{ik}|}{\sum_{i=1}^{S_{jk}} \max(x_{ij}, x_{ik})}$

It is worth noting that the Bray-Curtis and Marczewski-Steinhaus indices calculated on presence/absence data return the values of the Sørensen index and the Jaccard index in dissimilarity form, respectively. Moreover, several abundance-based indices can be expressed if we generalize the a, b, and c quantities used in the definition of indices for presence/absence data (Tamás et al. 2001).

Eq. 8. $a' = \sum_{i=1}^{S_{jk}} \min(x_{ij}, x_{ik})$

Eq. 9. $b' = \sum_{i=1}^{S_{jk}} [\max(x_{ij}, x_{ik}) - x_{ik}]$

Eq. 10. $c' = \sum_{i=1}^{S_{jk}} [\max(x_{ij}, x_{ik}) - x_{ij}]$

Substituting a, b and c with a’, b’ and c’ in the formula of the Sørensen index gives Bray-Curtis, and doing so with the Jaccard index results in the Marczewski-Steinhaus index. Abundance versions of all other presence/absence indices can be created in the same manner.
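The substitution described above can be checked numerically. The following is a minimal Python sketch, not code from the paper: the function names and the example vectors are hypothetical, and it assumes that b’ collects the abundance found only in community j and c’ the abundance found only in community k. It computes a’, b’ and c’ and plugs them into the dissimilarity forms of the Sørensen and Jaccard indices.

```python
def generalized_abc(x_j, x_k):
    """Return the abundance-based analogues (a', b', c') of the 2x2 table cells.

    x_j and x_k are abundance vectors of two communities over the same
    species list (hypothetical inputs for illustration).
    """
    if len(x_j) != len(x_k):
        raise ValueError("both communities must use the same species list")
    pairs = list(zip(x_j, x_k))
    a = sum(min(u, v) for u, v in pairs)      # shared abundance
    b = sum(max(u, v) - v for u, v in pairs)  # excess abundance in community j
    c = sum(max(u, v) - u for u, v in pairs)  # excess abundance in community k
    return a, b, c


def bray_curtis(x_j, x_k):
    """Sorensen dissimilarity (b + c) / (2a + b + c) with a', b', c' plugged in."""
    a, b, c = generalized_abc(x_j, x_k)
    if 2 * a + b + c == 0:
        raise ValueError("both communities are empty")
    return (b + c) / (2 * a + b + c)


def marczewski_steinhaus(x_j, x_k):
    """Jaccard dissimilarity (b + c) / (a + b + c) with a', b', c' plugged in."""
    a, b, c = generalized_abc(x_j, x_k)
    if a + b + c == 0:
        raise ValueError("both communities are empty")
    return (b + c) / (a + b + c)


# Hypothetical abundance vectors for four species in two plots.
plot_j = [3, 0, 2, 5]
plot_k = [1, 4, 2, 0]
print(generalized_abc(plot_j, plot_k))
print(bray_curtis(plot_j, plot_k), marczewski_steinhaus(plot_j, plot_k))
```

With 0/1 vectors the same two functions return the Sørensen and Jaccard dissimilarities, which is the presence/absence special case mentioned in the text.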
A classification of FDissim indices
FDissim indices incorporate trait information into the calculation of dissimilarity in different ways. The simplest solution is when summary statistics or distributions are calculated for the two communities and a measure of distance or segregation is calculated between them. We call this the summary-based class, and in our review, we include two approaches within this, the typical value approach and the distribution-based approach. In the second class we include indices which utilize a symmetrical species by species (dis-)similarity matrix and link it directly through matrix operations with the compositional matrix. We call this the dissimilarity-based class, which includes the probabilistic, the ordinariness-based, the diversity partitioning, and the nearest neighbour approaches. The third class includes methods which make use of between-species (dis-)similarities for the classification of species; therefore, we call it the classification-based class. The classification either transforms the original structure of the dissimilarity matrix into discrete groups of species which can be used as functional types, or expresses dissimilarities in the form of a tree-graph where between-species dissimilarities are organized in an inclusive hierarchy. This is a widespread approach to accounting for phylogenetic relatedness, since phylogenies are commonly summarized in the form of cladograms. Such methods rely heavily on the algorithm chosen for the classification, including the decisions about the number of clusters and the method for breaking tied values.
Examples are provided by Hérault & Honnay (2007), Nipperess et al. (2010), and Cardoso et al. (2014), while a review is available by Pavoine (2016). As there is no general recommendation for the classification method, we omit this class from the framework detailed below and from the comparative test. The classification of trait-based dissimilarity indices and their main properties are summarized in Table 2.

Typical value approach
Indices following this approach represent each community with a typical trait value, and calculate a distance metric between them. The most commonly applied typical trait value is the community-weighted mean (CWM; Garnier et al. 2004). The rationale behind the CWM can be linked with the mass ratio hypothesis (Grime 1998), which states that the effect of species on ecosystem functioning is proportional to their relative abundances. Although several issues have emerged regarding its limited applicability in statistical inference (Hawkins et al. 2017, Peres-Neto et al. 2017, Zeleny 2018) and its neglect of within-community variation (Muscarella & Uriarte 2016), the difference in CWM is still considered a reliable indicator of robust changes in trait composition induced by selective forces like environmental matching or succession (De Bello et al. 2007, 2013, Kleyer et al. 2012). Ricotta et al. (2015) investigated the relatedness of the distance between CWMs with the probabilistic approach (see therein) and showed its applicability on phylogenetic data. Due to its tolerable requirements for computational capacity, Lengyel et al.
(2020) used the Euclidean distance between trait CWMs of phytosociological relevés for the trait-based numerical classification of grasslands of Poland with a sample size of 6985 sites and 885 species. Another advantage of this method is its Euclidean property. Besides the community-weighted mean, other typical values, e.g. the median or the mode, might be considered depending on the scaling of the trait variable and on specific research aims.

Distribution-based approach
Instead of typical values, the distribution of trait values is considered a more reliable representative of the trait composition and variability of a community. Continuous distributions can be defined by a density function and discrete distributions by the probabilities of the possible values, while both types can be characterized by a cumulative distribution function (CDF). A useful analogue of the distance between typical values might be the distance between discrete distributions, density functions or CDFs.

If data are available on intraspecific trait variation, trait values form a continuous distribution. First, separate density functions have to be fitted within each species. Then, the density function of the community-level distribution can be calculated as the weighted sum of the species-level density functions (Carmona, de Bello, Mason, & Lepš, 2016). If such data are not available, we can use relative abundances as estimates of the probabilities of the corresponding trait values. Pairs of trait values and their probabilities form a discrete distribution.

Similarity of density functions can be measured by their overlap (see Appendix S2 for an overview of overlap measures). Overlap functions between within-species trait distributions have already proved useful in the quantification of between-species niche segregation (MacArthur & Levins 1967, Mouillot et al. 2005) or trait-based dissimilarity of species (Lepš
2005) or trait-based dissimilarity of species (Lepš 220 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.06.425560doi: bioRxiv preprint https://doi.org/10.1101/2021.01.06.425560 11 et al. 2006, De Bello et al. 2013). Nevertheless, they are perfectly applicable to the 221 community level as well. 222 Gregorius et al. (2003) proposed an index called delta for the quantification differences 223 between discrete trait distributions. Delta is the minimal sum of frequencies shifted from one 224 trait state to another trait state, weighted by the differences between the respective states. 225 Minimizing the sum of shifted frequencies is known in linear programming as the 226 transportation problem (Hitchcock 1941). Due to its relatively high computational demand, it 227 is unfeasible for large compositional and trait data matrices typically used in ecological 228 research, therefore, we exclude this index from our comparison. 229 Difference between two CDFs can be calculated at each possible trait values (i.e. not only the 230 observed ones), then the sum of them can be used as a trait-based dissimilarity measure. In 231 Appendix S3 we introduce the distance between CDFs in more detail. 232 233 Maximally distinct communities 234 Species-based dissimilarities, except Euclidean, Manhattan and (non-normalized) Canberra 235 distances, equal unity, which is their maximum, when the two compared communities do not 236 share any species. In this context, we could call such communities maximally distinct. 237 However, when traits are considered, two communities can be similar, even if they do not 238 share any species. 
For example, if each species of community A is replaced by a similar species in community B, the two communities have no shared species, but from a functional point of view they are similar. In this context, two communities are maximally distinct when the similarity of any species from the first community to any species in the other community is zero. It is a desirable property for a functional similarity index to take the value 0 if and only if the two compared communities are maximally distinct.

Probabilistic approach
This approach can be traced back to the diversity framework proposed by Rao (1982), recently extended by Pavoine & Ricotta (2014). Rao’s within-community diversity is defined as the expected dissimilarity between two individuals randomly drawn from a single community:

Eq. 11. $Q(p) = \sum_i \sum_j p_i p_j \delta_{ij}$

where pi is the relative abundance of the ith species in the community and δij is the dissimilarity between species i and j. This has become a widely used index of functional alpha diversity (Botta-Dukát 2005). Likewise, a between-community component of diversity, Q(p,q), can be defined as the expected dissimilarity between two random individuals, each selected from a different community:

Eq. 12. $Q(p,q) = \sum_i \sum_j p_i q_j \delta_{ij}$

Between-community diversity can be expressed using the within-community diversity of the two original communities (Q(p) and Q(q)) and that of the community with mean relative abundances, $Q\left(\frac{p+q}{2}\right)$:

Eq. 13. $2Q\left(\frac{p+q}{2}\right) = 2\sum_i \sum_j \frac{p_i + q_i}{2} \frac{p_j + q_j}{2} \delta_{ij} = \frac{1}{2}\sum_i \sum_j (p_i p_j + q_i q_j + 2 p_i q_j)\delta_{ij} = \frac{Q(p) + Q(q)}{2} + Q(p,q)$

Subtracting the mean within-community diversity from the between-community diversity leads to Rao’s dissimilarity (also called DISC):

Eq. 14. $D_Q = \sum_i \sum_j p_i q_j \delta_{ij} - \frac{\sum_i \sum_j p_i p_j \delta_{ij} + \sum_i \sum_j q_i q_j \delta_{ij}}{2} = Q(p,q) - \frac{Q(p) + Q(q)}{2} = 2\left[Q\left(\frac{p+q}{2}\right) - \frac{Q(p) + Q(q)}{2}\right]$

where pi and qi are the relative abundances of species i in the two communities. Champely and Chessel (2002) proved that if δ has the squared Euclidean property, Rao’s quadratic entropy is a concave function, i.e. $Q\left(\frac{p+q}{2}\right)$ is higher than or equal to the mean of Q(p) and Q(q). Thus, under this condition, $D_Q \geq 0$. If $0 \leq \delta_{ij} \leq 1$, then $\sum_i \sum_j p_i q_j \delta_{ij}$, which is the weighted average of between-species distances, also has to be within this range. Therefore, $0 \leq D_Q \leq 1$. However, DQ may be much less than 1 even if the two communities are completely distinct, when Q(p) and Q(q) are high. Therefore, Pavoine & Ricotta (2014) suggested dividing DQ by its theoretical maximum (see equations 3 and 4 in Pavoine & Ricotta 2014). They recognized that the resulting indices are representatives of a broader family of indices, hereafter called dsimcom, which are implementations of Rao’s between-community and within-community components of diversity in the similarity formulae designed for presence/absence data. For this index, it is necessary to introduce the similarity between species, εij=1-δij.
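As an illustration of Eq. 11 to Eq. 14, here is a minimal Python sketch under stated assumptions; it is not the authors' implementation. The function names are hypothetical, and the example uses squared differences of a single hypothetical trait scaled to [0, 1] as δ, so that δ has the squared Euclidean property and stays within the unit range.

```python
def rao_q(p, q, delta):
    """Expected dissimilarity between one individual drawn from p and one from q.

    With q equal to p this is Rao's within-community diversity Q(p) (Eq. 11);
    otherwise it is the between-community component Q(p, q) (Eq. 12).
    """
    n = len(p)
    if len(q) != n or len(delta) != n:
        raise ValueError("p, q and delta must refer to the same species list")
    return sum(p[i] * q[j] * delta[i][j] for i in range(n) for j in range(n))


def rao_dissimilarity(p, q, delta):
    """D_Q = Q(p, q) minus the mean within-community diversity (Eq. 14)."""
    return rao_q(p, q, delta) - (rao_q(p, p, delta) + rao_q(q, q, delta)) / 2


# Hypothetical example: three species with one trait scaled to [0, 1].
traits = [0.0, 0.5, 1.0]
delta = [[(s - t) ** 2 for t in traits] for s in traits]

p = [0.5, 0.5, 0.0]  # relative abundances in the first community
q = [0.0, 0.5, 0.5]  # relative abundances in the second community
mean_pq = [(a + b) / 2 for a, b in zip(p, q)]

d_q = rao_dissimilarity(p, q, delta)
# Right-hand side of Eq. 14, computed via the community with mean abundances.
d_q_via_mean = 2 * (rao_q(mean_pq, mean_pq, delta)
                    - (rao_q(p, p, delta) + rao_q(q, q, delta)) / 2)
print(d_q, d_q_via_mean)
```

Both routes give the same value, which is the identity stated in Eq. 13. Because δ is a squared trait difference here, the result also equals the squared difference between the two community-weighted means.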
The expected similarity between individuals of different communities, $A = \sum_i \sum_j p_i q_j \varepsilon_{ij}$, is taken as analogous with the shared diversity, a, in the parametrization of the similarity indices for presence/absence data that disregard species properties, while the expected similarities within communities ($A + B = \sum_i \sum_j p_i p_j \varepsilon_{ij}$ and $A + C = \sum_i \sum_j q_i q_j \varepsilon_{ij}$) are analogous with the species numbers (a+b, a+c). In this way, Pavoine & Ricotta (2014) presented formulae following the Sokal & Sneath, Jaccard, Sørensen, and Ochiai indices. Additionally, a formula analogous with Whittaker's effective species turnover (β=γ/α-1; Whittaker 1972, Tuomisto 2010a) is suggested for two communities, which in similarity form is shown to be identical with the overlap index of Chiu et al. (2014). In this formulation γ=A+B+C and α=(2A+B+C)/2. Pavoine & Ricotta (2014) showed that members of the dsimcom family also provide meaningful values if absolute abundances, percentage values or binary occurrences are used instead of relative abundances.

When $\varepsilon_{ij}$ contains taxonomical similarities, its off-diagonal elements are 0, and A=a, B=b, and C=c.

It is worth noting the inherent link between $D_Q$ and CWMdis on the basis of the geometric interpretation by Pavoine (2012) and Ricotta et al. (2015). Pavoine (2012) showed that if between-species dissimilarities are in the form $\delta_{ij} = d_{ij}^2/2$ and $d_{ij}$ is Euclidean embeddable, $D_Q$ is half the squared Euclidean distance between the centroids of two communities, a function monotonically related with CWMdis, the simple Euclidean distance between centroids of communities. As Ricotta et al.
(2015) argue, if species relatedness is only described by a dissimilarity matrix, which is the common case in phylogenetic analyses, species can be mapped into a principal coordinate analysis ordination using $d_{ij}$. Given the Euclidean embeddable property of $d_{ij}$, this ordination should produce S-1 or fewer ordination axes, all with positive eigenvalues. Ordination scores for species can be used as traits, and therefore centroids of communities and (squared) Euclidean distances between communities can be calculated. In the special case when between-species dissimilarities are Euclidean distances, $D_Q$ must be equal with the Euclidean distance between the weighted averages of traits, that is, CWMdis.

It is also notable that Swenson et al. (2011) and Swenson (2011) use the quantity Q(p, q) as a standalone index of pairwise beta diversity and call it Dpw or "Rao's D". The latter name is misleading, since Rao (1982) himself denoted the DISC (or $D_Q$) index by $D_{ij}$. Q(p, q) measures dissimilarity between two communities, but the dissimilarity of a community from itself is not zero. Swenson (2011) also presents a standardized version of Q(p, q) under the name Rao's H. With this formula the dissimilarity of a community to itself is scaled to 1; however, its transformation to a meaningful scale where each community has dissimilarity value zero towards itself is not elaborated. Due to this drawback, we do not consider these indices in our review of functional dissimilarity measures.

Schmidt et al.
(2017) proposed probabilistic indices with weighted and unweighted versions for expressing community similarity on the basis of taxa interaction networks (called TINA, taxa interaction-adjusted) and phylogenetic relatedness (PINA, phylogenetic interaction-adjusted). TINA and PINA differ only in what type of data the interaction matrix contains. Notably, the functional formula of weighted TINA is identical with the Ochiai version of dsimcom. However, the unweighted TINA, abbreviated TU, is not a special case of TINA, which we consider an inconsistency. Therefore, we did not include TU as a separate index.

Ordinariness-based approach

With respect to functional alpha diversity, Leinster & Cobbold (2012) introduced the concept of species ordinariness, defined as the weighted sum of relative abundances of species similar to a focal species within the same community, or in other words, the expected similarity of an individual of the focal species and an individual chosen randomly from the same community. According to Ricotta & Pavoine (2015), it is straightforward to replace abundances with ordinariness values in the species-based (dis-)similarity indices. Following this concept, Ricotta & Pavoine (2015) introduced a new family of trait-based similarity measures called dissABC. dissABC applies the schemes of the Jaccard, Sørensen, Ochiai, Kulczynski, Sokal & Sneath, and Simpson indices. Either relative or absolute abundances can be chosen as input values. Species ordinariness values can be calculated either with respect to the pooled species list of the two communities under comparison, or to the total species list of the data matrix.

For species-based analyses, Ricotta & Podani (2017) suggested a general formula of distance measures in which community dissimilarity is calculated by the weighted averaging of species-level differences in abundance. From this formula, a normalized Canberra distance, the Bray-Curtis distance, the Marczewski-Steinhaus index, and an evenness-based dissimilarity index (Ricotta 2018) can be derived. According to Pavoine & Ricotta (2019), a meaningful dissimilarity index, called generalized_Tradidiss, can be designed by replacing species abundances with species ordinariness values. Additionally, this index contains a factor which weights the contribution of each species to the overall dissimilarity between the two communities. This weight can be set to give equal weight to all species or to weight them proportionally to their relative abundance in the pooled communities.

Diversity partitioning approach

Following the work of Hill (1973), a community with diversity of order q, ${}^{q}D$, is as diverse as a theoretical community containing ${}^{q}D$ equally abundant species. The order of diversity, q, expresses the weight given to differences in species abundance, with q = 0 representing the presence/absence case and q = ∞ considering only the relative abundance of the most abundant species in the community. Without accounting for interspecific similarities, there is emerging consensus that using effective numbers (also called number of equivalents) is a straightforward way of partitioning diversity into within-community (alpha), between-community (beta) and across-community (gamma) components (Jost 2007). Of these three, the between-community component, beta diversity, can be interpreted as a form of dissimilarity when applied to two communities (Ricotta 2017).
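The notion of an effective number can be made concrete with a short Python sketch. It uses the standard Hill number formula, which is not written out in the text and is therefore stated here as an assumption: ${}^{q}D = (\sum_i p_i^q)^{1/(1-q)}$, with the limits exp(Shannon entropy) for q = 1 and 1/max(p_i) for q = ∞. The function name and the example abundances are hypothetical.

```python
import math


def hill_number(p, q):
    """Effective number of species of order q for relative abundances p."""
    p = [x for x in p if x > 0]          # absent species do not contribute
    if q == 1:
        # Limit for q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(q):
        # Limit for q -> infinity: only the most abundant species matters.
        return 1 / max(p)
    return sum(x ** q for x in p) ** (1 / (1 - q))


even = [0.25, 0.25, 0.25, 0.25]      # four equally abundant species
uneven = [0.7, 0.1, 0.1, 0.1]        # same richness, one dominant species

for q in (0, 1, 2, math.inf):
    print(q, hill_number(even, q), hill_number(uneven, q))
```

For the even community every order returns four effective species, whereas for the uneven community the value drops from 4 at q = 0 towards 1/0.7 as q increases, which illustrates how q controls the weight given to abundance differences.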
Beta diversity can be derived from alpha and gamma diversity in a multiplicative (beta = gamma/alpha) or an additive way (beta = gamma - alpha). Jost (2007) and Chao et al. (2012) argued that multiplicative beta diversity is a useful way of quantifying community differentiation; however, because it scales between 1 and N (N being the number of communities), it is not comparable across samples containing different numbers of communities. To remove this dependence, they offer three solutions with which the value of multiplicative beta can be normed. Although N is always 2 for pairwise comparisons, it seems straightforward to follow these recommendations, since scaling between 0 and 1 has several advantages, and most other indices also share this property. The rescaling formulae of Chao et al. (2012) embody different concepts of community (dis-)similarity, which together we call the family of multiplicative beta indices. The first formula is the relative turnover rate per community, which is a linear transformation of beta to the normed scale.

Eq. 15. $\beta_{turnover}\langle q\rangle = \frac{{}^{q}\beta - 1}{N - 1}$

Here 0 means identical species composition, while 1 indicates totally distinct communities. In the pairwise comparison (N = 2), $\beta_{turnover}\langle q\rangle = {}^{q}\beta - 1$.

The second index measures homogeneity, and is a linear transformation of the inverse of beta. Because the complement of homogeneity is heterogeneity, we call its dissimilarity form $\beta_{heterogeneity}$:

Eq. 16. $\beta_{heterogeneity}\langle q\rangle = 1 - \frac{1/{}^{q}\beta - 1/N}{1 - 1/N}$

When N = 2, $\beta_{het}\langle q\rangle = 2 - 2/{}^{q}\beta$.
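The two rescalings can be checked numerically for the pairwise, presence/absence case. The sketch below is a minimal Python illustration. It assumes that Eq. 15 reads (β - 1)/(N - 1) and that Eq. 16 reads 1 - (1/β - 1/N)/(1 - 1/N), which for N = 2 reduce to the forms β - 1 and 2 - 2/β given in the text. For q = 0 it takes alpha as the mean species richness and gamma as the pooled richness. The function names and the counts a, b and c are hypothetical.

```python
def beta_turnover(beta, n=2):
    """Relative turnover rate per community: (beta - 1) / (n - 1)."""
    return (beta - 1) / (n - 1)


def beta_heterogeneity(beta, n=2):
    """Complement of homogeneity: 1 - (1/beta - 1/n) / (1 - 1/n)."""
    return 1 - (1 / beta - 1 / n) / (1 - 1 / n)


# Hypothetical pair of communities: a shared species, b and c unique ones.
a, b, c = 2, 1, 1
gamma = a + b + c              # pooled richness (q = 0)
alpha = (2 * a + b + c) / 2    # mean richness of the two communities
beta = gamma / alpha           # multiplicative beta, between 1 and 2

turnover = beta_turnover(beta)
heterogeneity = beta_heterogeneity(beta)

# Classical presence/absence dissimilarities for comparison.
sorensen = (b + c) / (2 * a + b + c)
jaccard = (b + c) / (a + b + c)

print(turnover, sorensen)
print(heterogeneity, jaccard)
```

Under these assumptions the turnover form coincides with the Sørensen dissimilarity and the heterogeneity form with the Jaccard dissimilarity for this example, and both rescalings map β = 1 to 0 and β = 2 to 1.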
With q = 0 (presence/absence case) the index is identical with the Jaccard index, while with q = ∞ (abundance case) it is the Morisita & Horn index.

The third index measures overlap between communities, whose counterpart is segregation; thus we call it $\beta_{segr}$:

Eq. 17. $\beta_{segregation}\langle q\rangle = 1 - \frac{(1/{}^{q}\beta)^{q-1} - (1/N)^{q-1}}{1 - (1/N)^{q-1}}$

With q = 0, $\beta_{segregation}\langle q\rangle = \beta_{turnover}\langle q\rangle$, and both give the Sørensen index.

According to Leinster & Cobbold (2012), it is possible to implement species similarities in the calculation of effective numbers. This way, the meaning of ${}^{q}D^{Z}$ is the diversity of a theoretical community with ${}^{q}D^{Z}$ equally abundant and maximally different species. Hence, both unevenness in the abundance structure and between-species similarities decrease the value of the effective species number. Because diversity is measured in effective numbers, it is possible to partition diversity into alpha, beta, and gamma fractions (Leinster & Cobbold 2012; Botta-Dukát 2018) in the multiplicative way. Then, this multiplicative beta can be rescaled using the formulae proposed by Chao et al. (2012). These indices behave consistently only if abundances are taken into account as relative abundances.

Nearest neighbour approach

The earliest representatives of this family were presented by Clarke & Warwick (1998) and Izsák & Price (2001), then by Ricotta & Burrascano (2008) and Ricotta & Bacaro (2010; see the DCW and DIP indices). Later, Ricotta et al. (2016) introduced a new, general family called PADDis. All these indices were primarily defined for presence-absence data.
The approach is based on a re-definition of the b and c quantities of the 2×2 contingency table. Treating species as maximally different, and taking X and Y as the two communities under comparison, b can be viewed as the total uniqueness of community X. The uniqueness of a single species in X is 1 if it is absent from Y; otherwise it is 0. Therefore, b is the sum of species uniqueness values. However, from a functional perspective, the uniqueness of a species present only in X should be between 0 and 1 if it is absent from Y but a similar species is present there. Therefore, it is possible to define the analogue of b which accounts for similarities between species:

Eq. 18. $B = \sum_{i \in X} \left(1 - \max_{j \in Y} \varepsilon_{ij}\right) = (a + b) - \sum_{i \in X} \max_{j \in Y} \varepsilon_{ij}$

The same logic applies to c, which is the uniqueness of community Y, where C expresses the degree of uniqueness:

Eq. 19. $C = \sum_{j \in Y} \left(1 - \max_{i \in X} \varepsilon_{ij}\right) = (a + c) - \sum_{j \in Y} \max_{i \in X} \varepsilon_{ij}$

Ricotta et al. (2016) define the A quantity as follows:

Eq. 20. $A = a + (b - B) + (c - C)$

Having A, B, and C defined as analogues of a, b, and c, it is now possible to design trait-based similarity measures following the logic of the Jaccard, Sørensen, Sokal & Sneath, Kulczynski, Ochiai and Simpson indices. It is notable that Ricotta et al. (2016) define A as a quantity that ensures that the components A, B and C add up to a + b + c, but with no explicit biological interpretation. Notably, DIP and DCW are identical with the Sørensen and Kulczynski forms of PADDis. The generalization of DIP and DCW to relative abundances, DCW(Q), was also derived by Ricotta & Bacaro (2010).
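A minimal Python sketch of the quantities in Eqs. 18 to 20 is given below. It assumes that the uniqueness of a species is one minus its similarity to the most similar species of the other community, and that A = a + b + c - B - C, which is how we read the statement that A makes the three components sum to a + b + c. The Sørensen-type dissimilarity (B + C)/(2A + B + C) is likewise an assumed analogue of the presence/absence formula. The function name, the species labels and the similarity values are hypothetical and are not taken from the PADDis implementation.

```python
def paddis_components(x, y, eps):
    """Return (A, B, C) for species sets x and y and a similarity lookup eps.

    eps[i][j] is the similarity of species i and j, in [0, 1], with eps[i][i] == 1.
    """
    x, y = set(x), set(y)
    a, b, c = len(x & y), len(x - y), len(y - x)
    # Uniqueness of a species: 1 minus its similarity to the nearest neighbour
    # in the other community (0 for shared species, because eps[i][i] == 1).
    big_b = sum(1 - max(eps[i][j] for j in y) for i in x)
    big_c = sum(1 - max(eps[i][j] for i in x) for j in y)
    big_a = a + b + c - big_b - big_c  # assumed form of Eq. 20
    return big_a, big_b, big_c


# Hypothetical similarities between three species.
eps = {
    "s1": {"s1": 1.0, "s2": 0.2, "s3": 0.8},
    "s2": {"s1": 0.2, "s2": 1.0, "s3": 0.4},
    "s3": {"s1": 0.8, "s2": 0.4, "s3": 1.0},
}
# Taxonomic case: every species is similar only to itself.
identity = {i: {j: float(i == j) for j in eps} for i in eps}

A, B, C = paddis_components({"s1", "s2"}, {"s2", "s3"}, eps)
sorensen_functional = (B + C) / (2 * A + B + C)

A0, B0, C0 = paddis_components({"s1", "s2"}, {"s2", "s3"}, identity)
sorensen_taxonomic = (B0 + C0) / (2 * A0 + B0 + C0)

print(sorensen_functional, sorensen_taxonomic)
```

In the taxonomic case the function returns A = a, B = b and C = c, so the classical Sørensen dissimilarity of 0.5 is recovered. With the hypothetical similarities, the unique species s1 and s3 are close to each other, and the dissimilarity drops to about 0.07.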
For these two versions, it is not necessary to explicitly define the A component. Using the relationships between the Jaccard, Sørensen, Kulczynski, Ochiai and Sokal & Sneath indices, it is theoretically possible to derive from DCW(Q) the extension of PADDis to relative abundances; however, the biological interpretation of A remains dubious in this framework.

Methods

The performance of FDissim indices can be reliably tested only on data sets with known background processes driving community assembly, a condition that can hardly be satisfied with real data. Therefore, we compared the performance of FDissim indices using simulated data sets. The data sets were generated using the comm.simul function of the comsimitv R package (Botta-Dukát & Czúcz 2016, Botta-Dukát 2020). This function follows an individual-based model for a meta-community comprising N communities and a regional pool of S species. Local communities include J individuals, and are distributed equidistantly along a continuous environmental gradient (with gradient values between 0 and 1). Each individual possesses three traits: an 'environmental' trait, a 'competitive' trait, and a neutral trait, all ranging on [0; 1]. Intraspecific variation in trait values is neglected in the simulation, that is, individuals belonging to the same species are identical. The environmental trait defines the optimum of the species along the environmental gradient. The closer the position of a community along the environmental gradient to the environmental trait value of a species, the more suitable it is for that species:

Eq. 21.
[Equation not recoverable from the source: the suitability of a community for a species, as a function of the distance between the community's position on the gradient and the species' environmental trait value, with width parameter σ.]

where σ (sigma) is adjustable so as to change the niche width of the species, and hence the length of the gradient (see later). The competitive trait represents the resource acquisition strategy of the individual. The more similar the values of this trait are between two individuals, the stronger the competition between them, which means that intraspecific competition is the strongest. The neutral trait has no effect on community assembly, thus it is not considered in our study. The simulation starts with the random assignment of all individuals of all communities to species. The second step is a 'disturbance' event, when one individual 'dies' in each community. This individual is to be replaced by an offspring of other individuals within the same community or of those in other communities. Each individual produces one offspring or does not reproduce. The probability of reproduction depends on the strength of competition. The offspring remains in the same community or randomly disperses into any of the other communities. Finally, the dead individual is replaced by one new individual from the seeds produced and dispersed. The probability that an individual of a certain species replaces the dead individual is defined by the number of seeds of that species and the suitability of the habitat. Steps between the disturbance event and the establishment of a new individual constitute a single 'generation'. Community composition is evaluated after a large number of generations. The strength of environmental filtering can be adjusted by the sigma parameter.
When sigma is 0, all species are maximally specialist, which means that they can occur only at the optimum point of the gradient (that is, at the exact value of the environmental trait). If sigma is infinity, species are maximally generalist and all points along the environmental gradient are equally suitable for them. Therefore, sigma is the parameter which defines the suitability of each point of the gradient for each species based on its distance from the respective optimum. We generated data sets with sigma values 0.01, 0.1, 0.25, 0.5, 1 and 5 in order to simulate situations with different strengths of environmental filtering. The number of communities was 30, each community comprised 200 individuals, the number of species in the species pool was 300, the simulation iterated for 100 generations, and we allowed no intraspecific trait variation. For all the other parameters, we used the default options.

However, it needs further explanation which real situations the six simulated levels of environmental filtering represent. To provide a reference and assist interpretation, we calculated two species-based beta-diversity measures, the multiplicative beta (Whittaker 1960) and the gradient length of the first axis of a detrended correspondence analysis (DCA) ordination (Hill & Gauch 1980; Appendix S5, Fig. S5.1). The former gives the number of distinct communities present in the total species pool of the gradient, while the latter is the minimal number of average niche breadths (also called turnover units) necessary for covering the total gradient length. Moreover, we plotted the abundance of species in the sample units along the gradient as a visual tool for assessing gradient length (Appendix S5, Fig. S5.2). All these methods indicated that with sigma = 0.01 the gradient is extremely long: there are more than 10 distinct communities and nearly 20 turnover units along the gradient.
Samples with such high beta diversity are very rare and special in real ecological research; therefore, findings from simulations with sigma = 0.01 are mostly of theoretical importance. Beta diversity values from sigma = 0.1 to sigma = 1 are more similar to real study situations, hence they should be more relevant for practice. At sigma = 5, environmental filtering is practically not operating, and between-community variation is driven by interspecific relations and chance.

We calculated between-species dissimilarities as the Gower distance between their environmental trait values, which in this case equals the Euclidean distance scaled to [0; 1]. These distances had to be transformed to similarities according to the requirements of the FDissim indices. Several formulae are available for this purpose; however, they may assume different functional relationships between similarity and distance. One formula we used is the linear transformation Similarity = 1 - Distance. Besides this, we also used Similarity = $e^{-u \times Distance}$, which assumes a curvilinear relationship between similarity and distance (Leinster & Cobbold 2012). With this exponential formula, it is possible to weight the importance of small Gower distances between species relative to large distances. By changing the parameter u, it is possible to adjust how steeply similarity decreases with increasing distance. We set u = 10, which leads to a relatively steep decline.
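The two transformations can be compared directly. The following Python sketch is only an illustration of the formulae Similarity = 1 - Distance and Similarity = e^(-u×Distance) with u = 10. The function names and the example distances are hypothetical.

```python
import math


def linear_similarity(distance):
    """Similarity = 1 - Distance, for distances scaled to [0, 1]."""
    return 1.0 - distance


def exponential_similarity(distance, u=10.0):
    """Similarity = exp(-u * Distance); larger u gives a steeper decline."""
    return math.exp(-u * distance)


# Hypothetical Gower distances between pairs of species.
for d in (0.0, 0.05, 0.1, 0.5, 1.0):
    print(f"{d:4.2f}  {linear_similarity(d):6.3f}  {exponential_similarity(d):8.6f}")
```

At a distance of 0.1 the linear similarity is still 0.9, whereas the exponential similarity has already dropped to about 0.37. At the maximal distance of 1 the exponential similarity is about 0.000045 rather than exactly zero.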
Although the minimal value of similarity after this transformation is higher than zero, we considered it negligibly low ($e^{-10} \approx 0.000045$), so we did not apply the transformation proposed by Botta-Dukát (2018). For all FDissim indices where it was necessary, we used as input the similarity matrix or a dissimilarity matrix calculated as Dissimilarity = 1 - Similarity. The dissimilarity matrix is identical with the Gower distance matrix if the similarities were calculated in a linear way, but in the other case it keeps the exponential relationship between distance and (dis-)similarity.

Dissimilarity matrices were calculated for the six community data sets with different sigma values, with the two functions transforming Gower distances, and across a broad range of available FDissim indices. For indices where either absolute or relative abundances could be taken into account, we opted for relative abundance for the sake of better comparability. With generalized_Tradidiss, we calculated the 'even' and the 'uneven' weighting versions. The entire analysis was run with abundance and presence/absence data. Some FDissim indices are only suitable for binary data, thus the numbers of indices applied to relative abundance and binary data were 25 and 31, respectively. In cases of indices handling both data types, we used exactly the same version of the index as with abundance data, hence communities with different numbers of species were given equal weight due to division by community totals.
Additionally, dissimilarity matrices were also calculated using the Bray-Curtis index (for binary data: Sørensen index in dissimilarity form) to provide a contrast against the case disregarding between-species dissimilarities.

Then, for each dissimilarity matrix, we conducted two types of analyses. Firstly, we compared how strongly the dissimilarity indices correlate with the environmental distance using Kendall tau rank correlation. This gives an estimate of how well a dissimilarity index reveals the monotonic relationship between trait composition of local communities and the environmental gradient. We visually assessed the shape of the relationship between dissimilarity and environmental distance in the case of the lowest sigma (i.e., the longest gradient), when the distortion of the linear relationship between the two is supposed to be the strongest. Then, to disentangle the effects of different methodological decisions and the sigma parameter on the correlation between FDissim indices and environmental distance, we fitted a random forest model. In this model the dependent variable was the Kendall tau correlation coefficient, while the independent variables were sigma, the data type (abundance vs. presence/absence), the transformation method for Gower distances (linear vs. exponential), and the FDissim method. Within approaches, FDissim methods were often strongly correlated, which resulted in very similar Kendall's tau values. Therefore, only the Sørensen/Bray-Curtis versions of dsimcom, dissABC, PADDis/DCW, generalized_Tradidiss with uneven weights, as well as βturnover, CWMdis, and CDFdis were included in this analysis. Variable importance scores (VIS) in the random forest were estimated by the permutation approach based on mean decrease in log-likelihood using the varimp function of the partykit package. The effects of the model terms were also illustrated by heat maps.

All statistical analyses were done in R (R Core Team 2019) using the FD (Laliberté & Legendre 2010, Laliberté et al. 2014), adiv (Pavoine 2020a,b), comsimitv (Botta-Dukát 2020), vegan (Oksanen et al. 2019), DescTools (Signorell et al. 2020), and partykit (Hothorn et al. 2006, Strobl et al. 2007, Strobl et al. 2008, Hothorn & Zeileis 2015) packages.

Results

Kendall tau correlation coefficients decreased as the strength of environmental filtering decreased (that is, with increasing sigma) in all examined cases. For FDissim indices which handled both data types, presence/absence data resulted in lower correlations than abundance data for all indices. For most indices, this difference was highest at intermediate values of sigma. These trends were consistent between the linear and the exponential transformations. Correlations for all indices at all sigma values with linear transformation are shown in Table 3 for abundance data and in Table 4 for presence/absence data.

In most simulation scenarios, the FDissim indices correlated more strongly with the environmental gradient than the species-based Bray-Curtis index. However, on several occasions, indices belonging to the nearest neighbour family performed more poorly than the species-based dissimilarity.
Notably, at the highest sigma and with presence/absence data, all indices showed correlations near zero, but among them the Bray-Curtis index had the highest correlation with environmental distance.

As expected, we found perfect rank correlations among the Jaccard, Sørensen, Sokal-Sneath and Whittaker's beta versions of dsimcom, among the Jaccard, Sørensen and Sokal-Sneath forms of dissABC, between DIP and the Sørensen form of PADDis (only for presence-absence data), between DCW and the Kulczynski form of PADDis (only for presence-absence data), and between DIP and DCW (for abundance data).

Dissimilarity indices showed various shapes of relationship with environmental distance (Appendix S4). At the strongest environmental filtering, all FDissim indices had dissimilarity values near zero at minimal environmental distance; only the species-based Bray-Curtis index had a dissimilarity near 0.4 at the smallest environmental distances. In the case of linear transformation of Gower distances and presence/absence data, an approximately linear relationship was found for CWMdis, CDFdis, DQ, the Sørensen and Ochiai forms of dsimcom, the Jaccard form of dissABC, the Marczewski-Steinhaus form of generalized_Tradidiss with both weighting versions, βheterogeneity and βsegregation, although most other indices showed only a small degree of distortion of the linear function (Figure S4.1). An exponential relationship was found for the evenness-based (PE) form of generalized_Tradidiss. Notably, the taxon-based Bray-Curtis index had the steepest asymptotic function among all.
In the case of exponential transformation, all other indices relying on between-species dissimilarities showed an asymptotic curve (Figure S4.2).

In the random forest, niche width (that is, sigma) acquired by far the highest variable importance score (VIS=0.114). The less important variables were the data type (VIS=0.0176), the dissimilarity method (VIS=0.0037) and the transformation (VIS=-0.00001). The heat map (Figure 1) also revealed a strong decrease in correlation with increasing sigma. It also clearly shows that in most cases abundance data resulted in significantly higher correlation than presence/absence data. The difference between the linear and exponential transformation methods was not always visible. Regarding variation between dissimilarity indices, the most striking pattern was the relatively poor performance of the PADDis/DCW indices. All but the latter index, combined with abundance data and linear transformation of dissimilarities, led to the highest correlation with environmental distance.

Discussion

General patterns in the correlation with environmental distance

We ran different simulation scenarios with varying strength of environmental filtering. We expected the correlation between FDissim indices and environmental distance to be the highest when environmental filtering is the strongest, and the correlation to become neutral when environmental filtering is not effective. When environmental filtering was strongest (that is, with minimal overlap of species niches along the environmental gradient), all FDissim indices correlated highly with the environmental gradient.
As expected, the correlation between trait dissimilarity and environmental distance decreased as filtering weakened; moreover, differences between families of indices became more apparent. This result suggests that all tested methods are able to reveal strong environmental filtering processes.

As the contribution of competitive exclusion and stochastic processes approaches or overrides that of environmental filtering, the correlation between FDissim indices and the background gradient becomes weaker. This decrease itself is not a drawback of the FDissim methods; rather, it is a consequence of our study design, since we applied a series of scenarios where the effect of niche filtering was decaying. However, we think that the degree of the decrease reflects the sensitivity of the FDissim indices to the underlying trait-environment relationship. Indices which showed high correlation with environmental distance could be capable of revealing the environmental signal even when it is weak. Actually, in our tests, most indices reached similarly high correlation, and there were only a few combinations of simulation parameters which resulted in a decreased correlation with environmental distance for some dissimilarity indices.

Determinants of the correlation based on the random forest model

The random forest model revealed that gradient length is the most important determinant of the correlation between dissimilarity and environmental distance, while methodological decisions had much lower variable importance.
These observations suggest that the absolute value of the correlation between dissimilarity and environmental distance depends primarily on the sample in hand, and can be influenced by methodological decisions only to a limited extent.

Correlations were stronger with abundance than with presence/absence data. This finding is at least partly attributable to our simulation design, in which community composition was driven by individual-based processes: birth, fitness difference, reproduction, and death. As a result, the relative abundances of species had to be proportional to their environmental suitability in the local community. Transforming such data to a binary scale loses meaningful information and weakens the correlation between dissimilarity in trait composition and the environmental background. In cases where the presences and absences of species respond more robustly to the main environmental gradient while relative abundances change stochastically, or where abundance estimates are inaccurate, the binary data type might be more straightforward.

Transforming between-species dissimilarities has the potential to satisfy distributional requirements, to approximate expert intuitions about the relatedness of species, or to customize sensitivity to functional difference with respect to specific research aims. For most indices, across the tested range of gradient length and data type, the exponential transformation resulted in a somewhat lower correlation than the linear transformation. More insight is provided by examining the shape of the relationships in addition to the correlation value itself. After linear transformation of Gower distances, most dissimilarity indices showed a linear or slightly curved function along environmental distance, although the scatter of the evenness-based generalized_Tradidiss deviated considerably from a straight line towards an exponentially increasing one. After exponential transformation of between-species trait dissimilarities, all indices in the direct dissimilarity-based class showed a rather steeply increasing asymptotic function. This result suggests that the exponential transformation of between-species dissimilarities can make FDissim indices more sensitive to smaller differences in functional composition. Naturally, summary-based indices (CWMdis, CDFdis) are not affected by this transformation, since they are not based on between-species dissimilarities.

Comparison of taxon-based vs. trait-based dissimilarity

The basic assumption of functional ecology is that the traits of individuals should be more closely related to ecological properties than their taxonomic status is. Following this argument, we expected trait-based dissimilarity measures to correlate more strongly with the environmental background than species-based indices. Conversely, a higher correlation for species-based than for trait-based dissimilarity would indicate a loss of information with the introduction of between-species similarity, which would be nonsensical since our data were simulated to possess a strong pattern in the trait-environment relationship. We used the Sørensen/Bray-Curtis index in its dissimilarity form as a reference method representing species-based dissimilarity calculations that disregard traits.
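A minimal sketch of the species-based reference calculation, for illustration only. It assumes the standard definition of Bray-Curtis dissimilarity (summed absolute abundance differences divided by summed total abundances), which reduces to the Sørensen dissimilarity on presence/absence data. The function names and the two example plots are hypothetical and are not taken from the study's code or data.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors.

    Both vectors must list the same species in the same order.
    """
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("both plots are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def to_presence_absence(x):
    """Convert abundances to a binary presence/absence vector."""
    return [1 if a > 0 else 0 for a in x]


# Hypothetical plots with four species (illustrative values only).
plot_1 = [10, 5, 0, 0]
plot_2 = [0, 5, 10, 0]

# Abundance form (Bray-Curtis).
bc = bray_curtis(plot_1, plot_2)

# Presence/absence form: the same formula on binary data gives
# the Sørensen dissimilarity (b + c) / (2a + b + c).
sor = bray_curtis(to_presence_absence(plot_1), to_presence_absence(plot_2))

print(bc, sor)
```

The calculation never looks at trait values, so replacing one species with a functionally similar one counts as a full change in composition.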
Our expectation that trait-based measures would correlate more strongly with the environmental background was fulfilled by all indices except the members of the nearest neighbour family (DIP, DCW and PADDis). We suspect two potential reasons behind the low performance of these indices. The first is the improper scaling factor used for standardizing the ‘operational part’ of the indices (see the description of the PADDis family and the discussion of it under “Within-family variation of indices”). Second, these indices rely on the quantities of minimally different species in the two communities under comparison. However, the minimum is a less robust descriptor of any sample distribution because of its dependency on sampling error; therefore, it might provide a poor representation of total community dissimilarity.

Although we did not include dissimilarity values at exactly zero distance, the y-intercept (also called the ‘nugget’) of the dissimilarity vs. environmental distance functions can be extrapolated with negligible error (Fortin & Dale 2005). Brownstein et al. (2012) argued that the nugget of the distance decay relationship is a direct estimate of the amount of chance in the variation between local communities. In this respect, it is worth noting that the nugget with the species-based Bray-Curtis index was near 0.4, while with all trait-based indices the nugget was near zero. This suggests that without accounting for species similarities, environmental distance between communities can be overestimated due to similar species replacing each other.
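The y-intercept discussed above can be illustrated with a short sketch. It assumes that a straight line fitted by ordinary least squares adequately describes the dissimilarity vs. environmental distance relationship near zero distance; the text does not state how the extrapolation was done, so this choice, the function name and all numbers below are assumptions made for illustration and are not results of the study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope). The intercept plays the role of the
    'nugget': the dissimilarity extrapolated to zero distance.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all distances are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical pairwise values (not data from the study).
env_distance = [1.0, 2.0, 3.0, 4.0]
species_based = [0.5, 0.6, 0.7, 0.8]   # imagined species-based dissimilarities
trait_based = [0.1, 0.2, 0.3, 0.4]     # imagined trait-based dissimilarities

nugget_species, slope_species = fit_line(env_distance, species_based)
nugget_trait, slope_trait = fit_line(env_distance, trait_based)

print(nugget_species, nugget_trait)
```

With these made-up values the two series have the same slope and differ only in their intercept, which is the quantity the nugget argument is concerned with.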
Within-family variation of indices

The perfect correlation between the Jaccard, Sørensen and Sokal-Sneath forms of the dsimcom and dissABC families was expected, since the original, taxon-based Jaccard, Sørensen and Sokal-Sneath indices are algebraically related, too (Janson & Vegelius 1981). However, for PADDis the Jaccard, Sørensen and Sokal-Sneath forms showed correlations below 1. In this family, the B and C components of the 2×2 contingency table are defined as measurable quantities with a clear interpretation: the sum of species uniqueness values within each community. The total diversity (A+B+C) is defined to be equal to the species richness of the pooled pair of communities (a+b+c), and the quantity A is derived by subtracting (B+C) from it. With this definition, A remains a virtual quantity with no biological interpretation. In PADDis indices, the trait-based quantities B and C appear in the numerator (the ‘operational part’ sensu Ricotta et al. 2016) of the indices, while the denominators (i.e., the ‘scaling factor’) use the taxon-based quantities a, b and c. We argue that the inconsistent behaviour of PADDis is due to the application of taxon-based quantities as scaling factors for trait-based operational parts. At the same time, we acknowledge that we see no obvious way either to define total diversity or shared diversity more realistically according to the uniqueness-based idea behind PADDis. In the generalized_Tradidiss family, the trait-based analogue of the Bray-Curtis index can be obtained by calculating the generalized Canberra distance with uneven weighting of species.
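The algebraic relationship among the taxon-based forms referred to at the start of this subsection can be checked with a small sketch. It assumes the standard dissimilarity forms on the contingency-table quantities a (shared), b and c (unshared): Jaccard (b+c)/(a+b+c), Sørensen (b+c)/(2a+b+c) and Sokal-Sneath 2(b+c)/(a+2(b+c)). The function names and example counts are hypothetical; exact rational arithmetic is used so that the identities hold without rounding error.

```python
from fractions import Fraction


def jaccard(a, b, c):
    """Jaccard dissimilarity: (b + c) / (a + b + c)."""
    return Fraction(b + c, a + b + c)


def sorensen(a, b, c):
    """Sørensen dissimilarity: (b + c) / (2a + b + c)."""
    return Fraction(b + c, 2 * a + b + c)


def sokal_sneath(a, b, c):
    """Sokal-Sneath dissimilarity: 2(b + c) / (a + 2(b + c))."""
    return Fraction(2 * (b + c), a + 2 * (b + c))


# Hypothetical (a, b, c) counts for three pairs of plots.
examples = [(5, 2, 1), (1, 4, 4), (10, 0, 3)]

for a, b, c in examples:
    j = jaccard(a, b, c)
    # Both other forms are functions of the Jaccard value alone.
    assert sorensen(a, b, c) == j / (2 - j)
    assert sokal_sneath(a, b, c) == 2 * j / (1 + j)
    print(j, sorensen(a, b, c), sokal_sneath(a, b, c))
```

Each form is an increasing function of the others as long as a, b and c are the same quantities in all three formulas; mixing trait-based numerators with taxon-based denominators breaks that premise.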
We expected this generalized Canberra distance to be perfectly correlated with the Marczewski-Steinhaus form of the generalized_Tradidiss index with uneven weighting, since the Bray-Curtis and Marczewski-Steinhaus indices are the abundance forms of the Sørensen and Jaccard indices, respectively. However, the correlation between them was lower. In the generalized_Tradidiss family, between-community dissimilarity is calculated as a weighted sum of standardized differences in species ordinariness values. Species ordinariness is calculated on the basis of species abundance and trait values; however, the weights used for adjusting species-level contributions are derived solely from abundances. Therefore, generalized_Tradidiss also follows a ‘hybrid’ approach in accounting for taxon-based vs. trait-based information. We argue that this is why the algebraic relationship between the original Sørensen and Jaccard indices does not apply to its Sørensen/Bray-Curtis-type and Jaccard/Marczewski-Steinhaus-type forms. To sum up, the Jaccard, Sørensen and Sokal-Sneath forms of certain families of indices do not satisfy the algebraic relationships they are supposed to, which opens space for potential confusion. These algebraic relations hold only if the A, B and C quantities are explicitly and consistently defined.

Each family of FDissim indices combines the abundance differences of species between plots and interspecific trait differences in its own way, while indices belonging to the same family differ in how they relate this amount of ‘unshared’ variation (summarized as the b and c portions of the contingency table) to the shared (a) variation. Some indices are able to handle abundances either as absolute or as relative abundances (e.g. dsimcom, generalized_Tradidiss, dissABC), while others divide absolute abundances by their sum over the respective community and thus work only with relative abundances. When indices in the former group are set to consider absolute abundances, they become sensitive to variation in the summed abundances of the communities under comparison. To place our tests on common ground, we simulated communities with an equal total number of individuals and set all indices, where relevant, to work with relative abundances. Hence, we removed the effect of differences in total abundance. The constant number of individuals might have increased the similarity between FDissim indices belonging to the same family, as well as the correlation with the environmental gradient. The sum of abundances, on whatever quantitative scale they are measured, may vary considerably in real study situations due to aggregated distribution of individuals or uneven sampling effort. Therefore, our findings are more likely to be valid for settings where the sum of abundances is relatively stable, e.g. when sampling effort is controlled and individuals are dispersed evenly, or when abundances are recorded on a percentage scale.

Limitations of our study

In our study, we simulated a research situation in a simplistic way. We applied only one environmental gradient, which operated as an environmental filter driving convergence on a single trait. In addition, we applied another trait that was constantly affected by a low level of competitive exclusion. These two traits were uncorrelated. Nevertheless, there was some effect of random drift on community composition due to the probabilistic components of the simulation algorithm.
We varied the strength of environmental filtering, so that its contribution relative to competitive exclusion and stochasticity differed among scenarios. In real research situations, local trait composition is influenced by a wide range of processes, including several abiotic and biotic filters acting simultaneously. Unless they are manipulated as parts of an experimental system, the full set of such filters is usually unknown to the researchers. The multiplicity of filters may reduce the ability of FDissim indices to recover trait-environment relationships. Further research should clarify how increasing complexity of the sample affects the behaviour of FDissim indices.

Conclusions

Considering the diversity of concepts they are built upon, FDissim indices showed unexpectedly low variation in performance. CWMdis, dsimcom and generalized_Tradidiss acquired the highest correlation with environmental distance in all simulation scenarios; therefore they seem to be equally suitable for quantifying pairwise beta diversity based on traits. Nevertheless, the most important determinant of the match between trait-based dissimilarity and environmental distance is the length of the trait gradient. Besides this, the data type (presence/absence vs. abundance) also affected the correlation more strongly than the choice of FDissim method. Extending the comparative tests of FDissim measures to more complex gradients and real data sets could offer further insight into their behaviour.

Data availability

Simulated data was generated using the comsimitv R package.
Our own functions for functional dissimilarity indices are available through the Zenodo public repository: 10.5281/zenodo.4323590.

Author contributions

A.L. designed and carried out the analysis and led the writing; Z.B.D. discussed the concept and the results, wrote parts of the manuscript and commented on it.

References

Anderson, M. J., Crist, T. O., Chase, J. M., Vellend, M., Inouye, B. D., Freestone, A. L., Sanders, N. J., Cornell, H. V., Comita, L. S., Davies, K. F., Harrison, S. P., Kraft, N. J. B., Stegen, J. C. & Swenson, N. G. (2011). Navigating the multiple meanings of β diversity: a roadmap for the practicing ecologist. Ecology Letters, 14(1), 19-28. doi:10.1111/j.1461-0248.2010.01552.x
Anderson, M. J., Ellingsen, K. E. & McArdle, B. H. (2006). Multivariate dispersion as a measure of beta diversity. Ecology Letters, 9(6), 683-693. doi:10.1111/j.1461-0248.2006.00926.x
Baselga, A. & Leprieur, F. (2015). Comparing methods to separate components of beta diversity. Methods in Ecology and Evolution, 6, 1069-1079. doi:10.1111/2041-210X.12388
Botta-Dukát, Z. & Czúcz, B. (2016). Testing the ability of functional diversity indices to detect trait convergence and divergence using individual-based simulation. Methods in Ecology and Evolution, 7, 114-126. https://doi.org/10.1111/2041-210X.12450
Botta-Dukát, Z. (2005). Rao's quadratic entropy as a measure of functional diversity based on multiple traits. Journal of Vegetation Science, 16, 533-540. https://doi.org/10.1111/j.1654-1103.2005.tb02393.x
Botta-Dukát, Z. (2018).
The generalized replication principle and the partitioning of functional diversity into independent alpha and beta components. Ecography, 41, 40-50. doi:10.1111/ecog.02009
Botta-Dukat, Z. (2020). comsimitv: Flexible Framework for Simulating Community Assembly. R package version 0.1.4. https://CRAN.R-project.org/package=comsimitv
Brownstein, G., Steel, J.B., Porter, S., Gray, A., Wilson, C., Wilson, P.G. & Wilson, J. B. (2012). Chance in plant communities: a new approach to its measurement using the nugget from spatial autocorrelation. Journal of Ecology, 100, 987-996. https://doi.org/10.1111/j.1365-2745.2012.01973.x
Cardoso, P., Rigal, F., Carvalho, J.C., Fortelius, M., Borges, P.A.V., Podani, J. & Schmera, D. (2014). Partitioning taxon, phylogenetic and functional beta diversity into replacement and richness difference components. Journal of Biogeography, 41, 749-761. doi:10.1111/jbi.12239
Carmona, C. P., de Bello, F., Mason, N. W. H. & Lepš, J. (2016). Traits without borders: Integrating functional diversity across scales. Trends in Ecology and Evolution, 31(5), 382-394. doi:10.1016/j.tree.2016.02.003
Champely, S. & Chessel, D. (2002). Measuring biological diversity using Euclidean metrics. Environmental and Ecological Statistics, 9, 167-177. https://doi.org/10.1023/A:1015170104476
Chao, A., Chiu, C. & Hsieh, T.C. (2012). Proposing a resolution to debates on diversity partitioning. Ecology, 93, 2037-2051. https://doi.org/10.1890/11-1817.1
Chao, A., Chiu, C.-H., Villéger, S., Sun, I-F., Thorn, S., Lin, Y.-C., Chiang, J.-M. & Sherwin, W. B. (2019).
An attribute-diversity approach to functional diversity, functional beta diversity, and related (dis)similarity measures. Ecological Monographs, 89(2), e01343. doi:10.1002/ecm.1343
Chiu, C.-H., Jost, L. & Chao, A. (2014). Phylogenetic beta diversity, similarity, and differentiation measures based on Hill numbers. Ecological Monographs, 84(1), 21-44.
Clarke, K.R. & Warwick, R.M. (1998). Quantifying structural redundancy in ecological communities. Oecologia, 113(2), 278-289.
De Bello, F., Carmona, C.P., Mason, N.W.H., Sebastià, M.-T. & Lepš, J. (2013). Which trait dissimilarity for functional diversity: trait means or trait overlap? Journal of Vegetation Science, 24, 807-819. doi:10.1111/jvs.12008
De Bello, F., Lepš, J., Lavorel, S. & Moretti, M. (2007). Importance of species abundance for assessment of trait composition: an example based on pollinator communities. Community Ecology, 8(2), 163-170. https://doi.org/10.1556/ComEc.8.2007.2.3
Díaz, S. & Cabido, M. (2001). Vive la différence: plant functional diversity matters to ecosystem processes. Trends in Ecology and Evolution, 16(11), 646-655. https://doi.org/10.1016/S0169-5347(01)02283-2
Faith, D. P., Minchin, P. R. & Belbin, L. (1987). Compositional dissimilarity as a robust measure of ecological distance. Vegetatio, 69, 57-68.
Fortin, M.-J. & Dale, M.R.T. (2005). Spatial Data Analysis: a Guide for Ecologists. Cambridge University Press, Cambridge.
Garnier, E., Cortez, J., Billès, G., Navas, M., Roumet, C., Debussche, M., Laurent, G., Blanchard, A., Aubry, D., Bellmann, A., Neill, C. & Toussaint, J. (2004).
Plant functional markers capture ecosystem properties during secondary succession. Ecology, 85, 2630-2637. doi:10.1890/03-0799
Gregorius, H.-R., Gillet, E.M. & Ziehe, M. (2003). Measuring Differences of Trait Distributions Between Populations. Biometrical Journal, 45, 959-973. https://doi.org/10.1002/bimj.200390063
Grime, J. P. (1998). Benefits of plant diversity to ecosystems: immediate, filter and founder effects. Journal of Ecology, 86, 902-910.
Hawkins, B.A., Leroy, B., Rodríguez, M.Á., Singer, A., Vilela, B., Villalobos, F., Wang, X. & Zelený, D. (2017). Structural bias in aggregated species-level variables driven by repeated species co-occurrences: a pervasive problem in community and assemblage data. Journal of Biogeography, 44, 1199-1211.
Hérault, B. & Honnay, O. (2007). Using life-history traits to achieve a functional classification of habitats. Applied Vegetation Science, 10(1), 73-80. https://doi.org/10.1111/j.1654-109X.2007.tb00505.x
Hill, M. O. & Gauch, H. G. (1980). Detrended Correspondence Analysis: An Improved Ordination Technique. Vegetatio, 42, 47-58.
Hill, M. O. (1973). Diversity and evenness: a unifying notation and its consequences. Ecology, 54(2), 427-432.
Hitchcock, F.L. (1941). Distribution of a product from several sources to numerous localities. Journal of Mathematical Physics, 20, 224-230.
Hothorn, T., Hornik, K., Van de Wiel, M. A. & Zeileis, A. (2006). A Lego System for Conditional Inference. The American Statistician, 60(3), 257-263.
Hothorn, T. & Zeileis, A. (2015). partykit: A Modular Toolkit for Recursive Partytioning in R.
Journal of Machine Learning Research, 16, 3905-3909. http://jmlr.org/papers/v16/hothorn15a.html
Izsák, C. & Price, R. G. (2001). Measuring β-diversity using a taxonomic similarity index, and its relation to spatial scale. Marine Ecology Progress Series, 215, 69-77.
Janson, S. & Vegelius, J. (1981). Measures of ecological association. Oecologia, 49(3), 371-376.
Jost, L. (2007). Partitioning diversity into independent alpha and beta components. Ecology, 88, 2427-2439.
Kleyer, M., Dray, S., Bello, F., Lepš, J., Pakeman, R.J., Strauss, B., Thuiller, W. & Lavorel, S. (2012). Assessing species and community functional responses to environmental gradients: which multivariate methods? Journal of Vegetation Science, 23, 805-821. doi:10.1111/j.1654-1103.2012.01402.x
Koleff, P., Gaston, K. J. & Lennon, J. J. (2003). Measuring beta diversity for presence-absence data. Journal of Animal Ecology, 72, 367-382. doi:10.1046/j.1365-2656.2003.00710.x
Laliberté, E. & Legendre, P. (2010). A distance-based framework for measuring functional diversity from multiple traits. Ecology, 91, 299-305.
Laliberté, E., Legendre, P. & Shipley, B. (2014). FD: measuring functional diversity from multiple traits, and other tools for functional ecology. R package version 1.0-12.
Legendre, P. & Legendre, L. (1998). Numerical ecology. Elsevier, Amsterdam, NL.
Legendre, P. & De Cáceres, M. (2013). Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters, 16, 951-963.
Leinster, T. & Cobbold, C.A. (2012).
Measuring diversity: the importance of species similarity. Ecology, 93, 477-489. doi:10.1890/10-2402.1
Lengyel, A. & Podani, J. (2015). Assessing the relative importance of methodological decisions in classifications of vegetation data. Journal of Vegetation Science, 26, 804-815. doi:10.1111/jvs.12268
Lengyel, A., Swacha, G., Botta-Dukát, Z. & Kacki, Z. (2020). Trait-based numerical classification of mesic and wet grasslands in Poland. Journal of Vegetation Science, 31, 319-330. https://doi.org/10.1111/jvs.12850
Lepš, J., de Bello, F., Lavorel, S. & Berman, S. (2006). Quantifying and interpreting functional diversity of natural communities: practical considerations matter. Preslia, 78, 481-501.
MacArthur, R. & Levins, R. (1967). Limiting similarity convergence and divergence of coexisting species. American Naturalist, 101, 377-387.
Mason, N. W. H., Mouillot, D., Lee, W. G. & Wilson, J. B. (2005). Functional richness, functional evenness and functional divergence: the primary components of functional diversity. Oikos, 111, 112-118. doi:10.1111/j.0030-1299.2005.13886.x
McGill, B., Enquist, B. J., Weiher, E. & Westoby, M. (2006). Rebuilding community ecology from functional traits. Trends in Ecology and Evolution, 21(4), 178-185.
Mouchet, M.A., Villéger, S., Mason, N.W.H. & Mouillot, D. (2010). Functional diversity measures: an overview of their redundancy and their ability to discriminate community assembly rules. Functional Ecology, 24, 867-876. doi:10.1111/j.1365-2435.2010.01695.x
Mouillot, D., Stubbs, W., Faure, M., Dumay, O., Tomasini, J.A., Wilson, J.B. & Chi, T.D. (2005).
Niche overlap estimates based on quantitative functional traits: a new family of non-parametric indices. Oecologia, 145, 345-353.
Muscarella, R. & Uriarte, M. (2016). Do community-weighted mean functional traits reflect optimal strategies? Proceedings of the Royal Society B, 283, 20152434. https://doi.org/10.1098/rspb.2015.2434
Nipperess, D.A., Faith, D.P. & Barton, K. (2010). Resemblance in phylogenetic diversity among ecological assemblages. Journal of Vegetation Science, 21, 809-820. doi:10.1111/j.1654-1103.2010.01192.x
Oksanen, J., Blanchet, F.G., Friendly, M., Kindt, R., Legendre, P., McGlinn, D., Minchin, P. R., O'Hara, R. B., Simpson, G. L., Solymos, P., Stevens, M. H. M., Szoecs, E. & Wagner, H. (2019). vegan: Community Ecology Package. R package version 2.5-6. https://CRAN.R-project.org/package=vegan
Pavoine, S. & Ricotta, C. (2014). Functional and phylogenetic similarity among communities. Methods in Ecology and Evolution, 5, 666-675.
Pavoine, S. & Ricotta, C. (2019). Measuring functional dissimilarity among plots: Adapting old methods to new questions. Ecological Indicators, 97, 67-72.
Pavoine, S. (2012). Clarifying and developing analyses of biodiversity: towards a generalisation of current approaches. Methods in Ecology and Evolution, 3, 509-518. doi:10.1111/j.2041-210X.2011.00181.x
Pavoine, S. (2016). A guide through a family of phylogenetic dissimilarity measures among sites. Oikos, 125, 1719-1732. doi:10.1111/oik.03262
Pavoine, S. (2020). adiv: An R package to analyse biodiversity in ecology. Methods in Ecology and Evolution, 11, 1106-1112.
https://doi.org/10.1111/2041-210X.
Peres-Neto, P.R., Dray, S. & ter Braak, C.J.F. (2017). Linking trait variation to the environment: critical issues with community-weighted mean correlation resolved by the fourth-corner approach. Ecography, 40, 806-816.
Petchey, O. L. & Gaston, K. J. (2006). Functional diversity: back to basics and looking forward. Ecology Letters, 9, 741-758. doi:10.1111/j.1461-0248.2006.00924.x
Podani, J. & Schmera, D. (2011). A new conceptual and methodological framework for exploring and explaining pattern in presence-absence data. Oikos, 120, 1625-1638. doi:10.1111/j.1600-0706.2011.19451.x
Podani, J. (2000). Introduction to the exploration of multivariate biological data. Backhuys, Leiden, NL.
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rao, C. R. (1982). Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology, 21, 24-43.
Ricotta, C. & Burrascano, S. (2008). Beta diversity for functional ecology. Preslia, 80, 61-71.
Ricotta, C. & Bacaro, G. (2010). On plot-to-plot dissimilarity measures based on species functional traits. Community Ecology, 11, 113-119.
Ricotta, C. & Podani, J. (2017). On some properties of the Bray-Curtis dissimilarity and their ecological meaning. Ecological Complexity, 31, 201-205.
Ricotta, C. & Pavoine, S. (2015). Measuring similarity among plots including similarity among species: an extension of traditional approaches. Journal of Vegetation Science, 26, 1061-1067.
doi:10.1111/jvs.12329
Ricotta, C. (2017). Of beta diversity, variance, evenness, and dissimilarity. Ecology and Evolution, 7, 4835-4843. https://doi.org/10.1002/ece3.2980
Ricotta, C. (2018). A family of (dis)similarity measures based on evenness and its relationship with beta diversity. Ecological Complexity, 34, 69-73. doi:10.1016/j.ecocom.2018.03.002
Ricotta, C., Bacaro, G., Caccianiga, M., Cerabolini, B.E.L. & Moretti, M. (2015). A classical measure of phylogenetic dissimilarity and its relationship with beta diversity. Basic and Applied Ecology, 16(1), 10-18. https://doi.org/10.1016/j.baae.2014.10.003
Ricotta, C., Podani, J. & Pavoine, S. (2016). A family of functional dissimilarity measures for presence and absence data. Ecology and Evolution, 6, 5383-5389. doi:10.1002/ece3.2214
Schmidt, T., Matias Rodrigues, J. & von Mering, C. (2017). A family of interaction-adjusted indices of community similarity. ISME Journal, 11, 791-807. https://doi.org/10.1038/ismej.2016.139
Signorell, A. et mult. al. (2020). DescTools: Tools for descriptive statistics. R package version 0.99.38.
Strobl, C., Boulesteix, A.L., Kneib, T., Augustin, T. & Zeileis, A. (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9(307). http://www.biomedcentral.com/1471-2105/9/307
Strobl, C., Boulesteix, A.L., Zeileis, A. & Hothorn, T. (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25
Swenson, N. G., Anglada-Cordero, P. & Barone, J. A. (2011).
Deterministic tropical tree community turnover: evidence from patterns of functional beta diversity along an elevational gradient. Proceedings of the Royal Society B, 278, 877-884.
Swenson, N. G. (2011). Phylogenetic Beta Diversity Metrics, Trait Evolution and Inferring the Functional Beta Diversity of Communities. PLoS ONE, 6(6), e21264. https://doi.org/10.1371/journal.pone.0021264
Tamás, J., Podani, J. & Csontos, P. (2001). An extension of presence/absence coefficients to abundance data: a new look at absence. Journal of Vegetation Science, 12, 401-410. doi:10.2307/3236854
Tuomisto, H. (2010a). A diversity of beta diversities: straightening up a concept gone awry. Part 1. Defining beta diversity as a function of alpha and gamma diversity. Ecography, 33, 2-22. doi:10.1111/j.1600-0587.2009.05880.x
Tuomisto, H. (2010b). A diversity of beta diversities: straightening up a concept gone awry. Part 2. Quantifying beta diversity and related phenomena. Ecography, 33, 23-45. doi:10.1111/j.1600-0587.2009.06148.x
Villéger, S., Mason, N.W.H. & Mouillot, D. (2008). New multidimensional functional diversity indices for a multifaceted framework in functional ecology. Ecology, 89, 2290-2301. doi:10.1890/07-1206.1
Violle, C., Navas, M.-L., Vile, D., Kazakou, E., Fortunel, C., Hummel, I. & Garnier, E. (2007). Let the concept of trait be functional! Oikos, 116, 882-892. doi:10.1111/j.0030-1299.2007.15559.x
Whittaker, R. H. (1960). Vegetation of the Siskiyou Mountains, Oregon and California. Ecological Monographs, 30, 280-338.
Whittaker, R. H. (1972).
Evolution and measurement of species diversity. Taxon, 21, 213–251. https://doi.org/10.2307/1218190
Zelený, D. (2018). Which results of the standard test for community weighted mean approach are too optimistic? Journal of Vegetation Science, 29, 953–966.

Tables and Figures

Table 1. Similarity and dissimilarity forms of resemblance indices for presence-absence data

| Name of the index | Similarity version | Dissimilarity version |
|---|---|---|
| Sørensen | $S = \frac{2a}{2a+b+c} = \frac{a}{a+(b+c)/2}$ | $D = \frac{b+c}{2a+b+c}$ |
| Ochiai | $S = \frac{a}{\sqrt{(a+b)(a+c)}}$ | $D = 1 - \frac{a}{\sqrt{(a+b)(a+c)}}$ |
| Kulczynski | $S = \frac{1}{2}\left(\frac{a}{a+b}+\frac{a}{a+c}\right) = \frac{a}{2/\left(\frac{1}{a+b}+\frac{1}{a+c}\right)}$ | $D = \frac{1}{2}\left(\frac{b}{a+b}+\frac{c}{a+c}\right)$ |
| Simpson | $S = \frac{a}{a+\min(b,c)}$ | $D = \frac{\min(b,c)}{a+\min(b,c)}$ |
| Jaccard | $S = \frac{a}{a+b+c}$ | $D = \frac{b+c}{a+b+c}$ |
| Sokal & Sneath | $S = \frac{a}{a+2(b+c)}$ | $D = \frac{2(b+c)}{a+2(b+c)}$ |

Table 2. Classification of trait-based dissimilarity indices. X marks in the input data type column indicate whether abundance (A), relative abundance (R), and presence-absence (P/A) data can be used as input.

| Class | Approach | Family | References | Input data type (A / R / P/A) | R function |
|---|---|---|---|---|---|
| Summary-based | Typical value | CWM-based | Ricotta et al. (2015) | X X X | FD:::functcomp |
| | Distribution-based | CDF-based | Appendix S3 | X X X | our new functions, see Data availability |
| Direct dissimilarity | Probabilistic | DISC/DQ | Rao (1982), Pavoine & Ricotta (2014) | X X X | adiv::SQ |
| | | dsimcom | Pavoine & Ricotta (2014) | X X X | adiv:::dsimcom |
| | Ordinariness-based | dissABC | Pavoine & Ricotta (2015) | X X X | adiv:::dissABC |
| | | generalized_Tradidiss | Pavoine & Ricotta (2019) | X X | adiv:::generalized_Tradidiss |
| | Diversity partitioning | multiplicative beta | Chao et al. (2012) | X | our new functions, see Data availability |
| | Nearest neighbour | DCW, DCW(Q) | Clarke & Warwick (1998), Ricotta & Bacaro (2010) | X X | our new functions, see Data availability |
| | | DIP | Izsák & Prince (2001), Ricotta & Bacaro (2010) | X X | our new functions, see Data availability |
| | | PADDis | Ricotta et al. (2016) | X | adiv:::PADDis |
| Classification-based | not discussed | not discussed | Hérault & Honnay (2007), Nipperess et al. (2010), Cardoso et al. (2014), Pavoine (2016) | | |

Table 3.
Kendall tau correlations between environmental distance and the functional dissimilarity measures at different values of sigma and with abundance data type

| Measure | Sigma=0.01 | Sigma=0.1 | Sigma=0.25 | Sigma=0.5 | Sigma=1 | Sigma=5 |
|---|---|---|---|---|---|---|
| CWMdis | 0.974 | 0.905 | 0.846 | 0.828 | 0.649 | 0.251 |
| CDFdis | 0.974 | 0.904 | 0.845 | 0.83 | 0.646 | 0.255 |
| D(Q) | 0.974 | 0.912 | 0.832 | 0.828 | 0.637 | 0.243 |
| dsimcom.SS | 0.974 | 0.911 | 0.832 | 0.829 | 0.638 | 0.243 |
| dsimcom.Jac | 0.974 | 0.911 | 0.832 | 0.829 | 0.638 | 0.243 |
| dsimcom.Sor | 0.974 | 0.911 | 0.832 | 0.829 | 0.638 | 0.243 |
| dsimcom.Och | 0.974 | 0.911 | 0.832 | 0.829 | 0.639 | 0.243 |
| dsimcom.Beta | 0.974 | 0.911 | 0.832 | 0.829 | 0.638 | 0.243 |
| dissABC.Jac | 0.967 | 0.899 | 0.82 | 0.813 | 0.617 | 0.243 |
| dissABC.Sor | 0.967 | 0.899 | 0.82 | 0.813 | 0.617 | 0.243 |
| dissABC.SS | 0.967 | 0.899 | 0.82 | 0.813 | 0.617 | 0.243 |
| dissABC.Och | 0.968 | 0.899 | 0.819 | 0.814 | 0.618 | 0.243 |
| dissABC.Kul | 0.968 | 0.898 | 0.819 | 0.814 | 0.619 | 0.243 |
| dissABC.Si | 0.954 | 0.867 | 0.789 | 0.791 | 0.616 | 0.243 |
| Tradidiss.GC.even | 0.974 | 0.908 | 0.816 | 0.829 | 0.626 | 0.245 |
| Tradidiss.MS.even | 0.974 | 0.908 | 0.814 | 0.828 | 0.623 | 0.243 |
| Tradidiss.PE.even | 0.974 | 0.907 | 0.828 | 0.831 | 0.639 | 0.25 |
| Tradidiss.GC.uneven | 0.967 | 0.901 | 0.827 | 0.815 | 0.622 | 0.244 |
| Tradidiss.MS.uneven | 0.966 | 0.899 | 0.823 | 0.813 | 0.618 | 0.243 |
| Tradidiss.PE.uneven | 0.969 | 0.905 | 0.837 | 0.821 | 0.637 | 0.249 |
| βturnover | 0.974 | 0.911 | 0.837 | 0.829 | 0.641 | 0.251 |
| βheterogeneity | 0.974 | 0.911 | 0.837 | 0.829 | 0.641 | 0.251 |
| βsegregation | 0.974 | 0.911 | 0.837 | 0.829 | 0.641 | 0.251 |
| DIP | 0.923 | 0.778 | 0.68 | 0.565 | 0.338 | 0.034 |
| DCW | 0.923 | 0.778 | 0.68 | 0.565 | 0.338 | 0.034 |
| Bray-Curtis (species-based) | 0.711 | 0.832 | 0.778 | 0.678 | 0.455 | 0.086 |
Table 4. Kendall tau correlations between environmental distance and the functional dissimilarity measures at different values of sigma and with presence/absence data type

| Measure | Sigma=0.01 | Sigma=0.1 | Sigma=0.25 | Sigma=0.5 | Sigma=1 | Sigma=5 |
|---|---|---|---|---|---|---|
| CWMdis | 0.944 | 0.818 | 0.691 | 0.568 | 0.353 | 0.014 |
| CDFdis | 0.943 | 0.820 | 0.700 | 0.594 | 0.306 | -0.001 |
| D(Q) | 0.941 | 0.818 | 0.701 | 0.59 | 0.313 | -0.003 |
| dsimcom.SS | 0.944 | 0.821 | 0.705 | 0.593 | 0.316 | -0.003 |
| dsimcom.Jac | 0.944 | 0.821 | 0.705 | 0.593 | 0.316 | -0.003 |
| dsimcom.Sor | 0.944 | 0.821 | 0.705 | 0.593 | 0.316 | -0.003 |
| dsimcom.Och | 0.944 | 0.822 | 0.707 | 0.592 | 0.323 | 0.000 |
| dsimcom.Beta | 0.944 | 0.821 | 0.705 | 0.593 | 0.316 | -0.003 |
| dissABC.Jac | 0.946 | 0.819 | 0.704 | 0.592 | 0.292 | -0.006 |
| dissABC.Sor | 0.946 | 0.819 | 0.704 | 0.592 | 0.292 | -0.006 |
| dissABC.SS | 0.946 | 0.819 | 0.704 | 0.592 | 0.292 | -0.006 |
| dissABC.Och | 0.946 | 0.819 | 0.704 | 0.591 | 0.293 | -0.006 |
| dissABC.Kul | 0.946 | 0.819 | 0.704 | 0.591 | 0.294 | -0.006 |
| dissABC.Si | 0.939 | 0.835 | 0.701 | 0.556 | 0.324 | 0.019 |
| Tradidiss.GC.even | 0.945 | 0.820 | 0.707 | 0.593 | 0.305 | -0.005 |
| Tradidiss.MS.even | 0.945 | 0.819 | 0.707 | 0.592 | 0.304 | -0.005 |
| Tradidiss.PE.even | 0.947 | 0.821 | 0.704 | 0.595 | 0.323 | -0.002 |
| Tradidiss.GC.uneven | 0.946 | 0.819 | 0.702 | 0.592 | 0.308 | -0.006 |
| Tradidiss.MS.uneven | 0.946 | 0.819 | 0.703 | 0.592 | 0.307 | -0.006 |
| Tradidiss.PE.uneven | 0.947 | 0.820 | 0.698 | 0.593 | 0.326 | -0.003 |
| βturnover | 0.943 | 0.817 | 0.699 | 0.585 | 0.331 | -0.003 |
| βheterogeneity | 0.943 | 0.817 | 0.699 | 0.585 | 0.331 | -0.003 |
| βsegregation | 0.943 | 0.817 | 0.699 | 0.585 | 0.331 | -0.003 |
| DIP | 0.905 | 0.696 | 0.597 | 0.435 | 0.158 | -0.017 |
| DCW | 0.904 | 0.694 | 0.593 | 0.431 | 0.160 | -0.017 |
| PADDis.Jac | 0.904 | 0.679 | 0.575 | 0.418 | 0.154 | -0.025 |
| PADDis.Sor | 0.905 | 0.696 | 0.597 | 0.435 | 0.158 | -0.017 |
| PADDis.SS | 0.902 | 0.662 | 0.546 | 0.396 | 0.144 | -0.034 |
| PADDis.Och | 0.904 | 0.697 | 0.596 | 0.436 | 0.159 | -0.017 |
| PADDis.Simp | 0.881 | 0.620 | 0.474 | 0.343 | 0.158 | 0.030 |
| PADDis.Kul | 0.904 | 0.694 | 0.593 | 0.431 | 0.160 | -0.017 |
| Sørensen (species-based) | 0.698 | 0.724 | 0.606 | 0.415 | 0.127 | 0.048 |

Figure 1. Heat maps showing the interactive effects of niche width (sigma), transformation of between-species dissimilarities (lin = linear, exp = exponential), data type (ABUND = abundance, P/A = presence/absence), and dissimilarity index (1 – CWMdis, 2 – CDFdis, 3 – DQ, 4 – dsimcom/Sørensen, 5 – dissABC/Sørensen, 6 – generalized_Tradidiss/generalized Canberra, uneven weighting, 7 – βturnover, 8 – DCW) on the correlation with environmental distance

10_1101-2021_01_06_425569 ---- 65426741

Metabolite discovery through global annotation of untargeted metabolomics data

Li Chen1,2, Wenyun Lu2,3, Lin Wang2,3, Xi Xing2,3, Xin Teng2, Xianfeng Zeng2,3, Antonio D. Muscarella2, Yihui Shen2, Alexis Cowan2,4, Melanie R. McReynolds2,3, Brandon Kennedy5, Ashley M. Lato6, Shawn R.
Campagna6, Mona Singh2,7, Joshua Rabinowitz2,3,4,#

1Institute of Metabolism and Integrative Biology, Fudan University, Shanghai, 200438, China.
2Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, 08544, USA.
3Department of Chemistry, Princeton University, Princeton, NJ, 08544, USA.
4Department of Molecular Biology, Princeton University, Princeton, NJ, 08544, USA.
5Lotus Separation LLC, Department of Chemistry, Princeton University, Princeton, NJ, 08544, USA.
6Department of Chemistry, The University of Tennessee at Knoxville, Knoxville, TN, 37996, USA.
7Department of Computer Science, Princeton University, Princeton, NJ, 08544, USA.
# Corresponding author, e-mail: joshr@princeton.edu

Abstract

A primary goal of metabolomics is to identify all biologically important metabolites. One powerful approach is liquid chromatography-high resolution mass spectrometry (LC-MS), yet most LC-MS peaks remain unidentified. Here, we present a global network optimization approach, NetID, to annotate untargeted LC-MS metabolomics data. We consider all experimentally observed ion peaks together, and assign annotations to all of them simultaneously so as to maximize a score that considers properties of peaks (known masses, retention times, MS/MS fragmentation patterns) as well as network constraints that arise from mass differences between peaks. Global optimization results in accurate peak assignment and trackable peak-peak relationships. Applying this approach to yeast and mouse data, we identify a half-dozen novel metabolites, including thiamine and taurine derivatives. Isotope tracer studies indicate active flux through these metabolites. Thus, NetID applies existing metabolomic knowledge and global optimization to annotate untargeted metabolomics data, revealing novel metabolites.
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.06.425569; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction

Metabolomics provides a snapshot of small-molecule concentrations in a biological system. In so doing, it reflects the integrated impact of genetics and the environment on metabolism. One important role of metabolomics is annotating previously unknown or underappreciated metabolites. For example, metabolomics facilitated identification of 2-hydroxyglutarate as an oncometabolite, eventually leading to the development of inhibitors of 2-hydroxyglutarate synthesis as anticancer agents1,2. Metabolomics also contributed to identification of a diversity of natural products3,4 and disease biomarkers5. A common experimental strategy in metabolomics is liquid chromatography-high resolution mass spectrometry (LC-MS). LC-MS metabolomics measures thousands of ion peaks, of which hundreds are associated with known metabolites. A much greater number of peaks, however, remain unannotated. The standard approach to peak annotation is to compare exact mass and either retention time or MS/MS fragmentation pattern to authenticated standards. To facilitate such comparisons, extensive chemical databases have been developed (e.g. METLIN6, HMDB7, MoNA8, KEGG9, PubChem10, ChEBI11 and NIST12), with software tools available for automated peak picking and database comparison. Modern software also includes features for annotating peaks arising from isotopes and adducts of known metabolites, based on co-elution and characteristic mass differences (e.g. XCMS13,14, GNPS15, MS-DIAL16, MZmine17, and CAMERA18).
Such peaks seem to account for at least half of non-background LC-MS features19,20. Despite this progress, a great number of unknown peaks remain, and figuring out their identities is a primary challenge in the field. One promising approach is network analysis, capitalizing on peak-peak relationships to increase annotation scope and accuracy. Connections can be drawn based on similar responses across experiments and/or MS2 similarity. Such connections can arise either through biochemical activities or mass spectrometry phenomena, such as isotopes, adducts, or in-source fragments. While distinct metabolites typically separate chromatographically, ions connected through mass spectrometry phenomena co-elute. Workflows employing the concept of molecular connectivity have been used to build networks (e.g., GNPS21,22, CliqueMS23, MetDNA24, BioCAn25, and IPA26), and are showing increasing utility for annotating metabolomics data in diverse contexts. For example, GNPS has been used broadly in identifying natural products. Existing algorithms generally focus on metabolite peaks with MS2 spectra available, using MS2 spectral data as the main annotation driver. This is an effective strategy for annotating high abundance peaks with informative MS2 spectra, such as major secondary metabolites. It is less effective, however, for many low abundance metabolomics peaks, due to poor quality or less informative MS2 spectra. We accordingly set out to develop a network algorithm for annotating the breadth of metabolomics peaks, capitalizing on available MS2 spectra but including also low
abundance peaks lacking MS2 spectra. Effective incorporation of peaks without MS2 spectra required making yet better use of peak-peak relationships to enhance annotation accuracy, which we achieved through the computational approach of global optimization: annotating peaks not one by one, but all at once, to take full advantage of all available information. This global optimization strategy had not previously been applied in the context of molecular networking analysis. To this end, we present the algorithm “NetID”. Similar to existing network analysis approaches, nodes are experimentally observed non-background ion peaks and connections are mass differences between peaks. We explicitly distinguish connections due to biotransformations (“biochemical connections” linking two metabolites) from those due to mass spectrometry phenomena (“abiotic connections” linking isotopes, adducts, and fragments to the metabolites from which they are derived). Peak annotation occurs in a single global optimization step, based on linear programming, that enforces a single formula assignment for each experimentally observed ion peak. Using this approach, we can annotate roughly 80% of untargeted metabolomics peaks, with a majority being isotopes and adducts of known metabolites. Through these efforts, we provide likely formulae for several hundred novel metabolites, and confirm the identities of a half-dozen species not currently in metabolomics databases.

Results

NetID algorithm

NetID involves three computational steps: initial annotation, scoring, and optimization (Figure 1). The workflow starts with a peak table that contains a list of peak m/z, RT, intensity, and (when available) associated MS2 spectra, with background peaks removed by comparing to a process blank sample.
Each peak defines a node in the network. In the initial annotation phase, we match every experimentally measured node m/z to formulae in the HMDB database. Peaks matching an HMDB formula within 10 ppm are annotated as seed nodes, from which we extend edges to build the network. Edges connect two nodes via gain or loss of specific chemical moieties (atoms). The atom differences can occur either due to metabolism (biochemical connections) or due to mass spectrometry phenomena (abiotic connections). For example, a difference of H2 suggests an oxidation/reduction relationship and defines a biochemical edge. A difference of Na-H suggests sodium adduct formation and is a type of abiotic edge (adduct edge). Other atom differences define other types of abiotic connections (isotope or fragment edges). Most atom differences are specific to biochemical, adduct, isotope, or fragment edges, but a few occur in multiple categories. For example, H2O loss can be either biochemical (enzymatic dehydration) or abiotic (in-source water loss). By integrating literature and in-house data, we assembled a list of 25 biochemical atom differences and 59 abiotic atom differences which together define all connections in the network (Supplementary Tables 1 and 2). Using these lists, starting from the seed nodes, we draw all feasible edges such that (i) Δm/z between the connected nodes matches the atom mass difference and (ii) only co-eluting peaks are connected by abiotic edges. Through the edge extension process, possible formulae are assigned to nodes outside the initial seeds.
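The seed-matching and edge-drawing rules described above can be illustrated with a minimal Python sketch. It is not the NetID implementation: the `Peak` class, the three-entry atom-difference table, the two-compound database, and the 0.1 min co-elution window are assumptions made for illustration, and peak masses are assumed to have already been converted to neutral monoisotopic masses. Only the 10 ppm tolerance and the two rules (mass difference must match; abiotic edges require co-elution) come from the text.

```python
from dataclasses import dataclass
from itertools import combinations

# Monoisotopic masses (Da) of the elements needed for this sketch.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
        "O": 15.9949146, "P": 30.9737616, "Na": 22.9897693}


def formula_mass(counts):
    """Monoisotopic mass of {element: count}; counts may be negative."""
    return sum(MONO[element] * n for element, n in counts.items())


@dataclass(frozen=True)
class Peak:
    name: str
    mass: float  # assumed: neutral monoisotopic mass, Da
    rt: float    # retention time, min


# Hypothetical mini-database and atom-difference table
# (the lists used in the paper are much longer).
DATABASE = {
    "adenosine": formula_mass({"C": 10, "H": 13, "N": 5, "O": 4}),
    "AMP": formula_mass({"C": 10, "H": 14, "N": 5, "O": 7, "P": 1}),
}
ATOM_DIFFS = {
    "H2": (formula_mass({"H": 2}), "biochemical"),
    "HPO3": (formula_mass({"H": 1, "P": 1, "O": 3}), "biochemical"),
    "Na-H": (formula_mass({"Na": 1, "H": -1}), "abiotic"),
}


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_seeds(peaks, database=DATABASE, tol_ppm=10.0):
    """Return {peak name: [database hits within tol_ppm]} for matched peaks."""
    seeds = {}
    for peak in peaks:
        hits = [name for name, mass in database.items()
                if abs(ppm_error(peak.mass, mass)) <= tol_ppm]
        if hits:
            seeds[peak.name] = hits
    return seeds


def draw_edges(peaks, tol_ppm=10.0, rt_window=0.1):
    """Connect peak pairs whose mass difference matches a listed atom difference.

    Abiotic edges are kept only if the two peaks co-elute
    (retention times within rt_window minutes).
    """
    edges = []
    for p, q in combinations(sorted(peaks, key=lambda pk: pk.mass), 2):
        delta = q.mass - p.mass        # q is the heavier peak
        tol = tol_ppm * 1e-6 * q.mass  # absolute tolerance in Da
        for label, (diff_mass, kind) in ATOM_DIFFS.items():
            if abs(delta - diff_mass) > tol:
                continue
            if kind == "abiotic" and abs(p.rt - q.rt) > rt_window:
                continue
            edges.append((p.name, q.name, label, kind))
    return edges


ado, amp = DATABASE["adenosine"], DATABASE["AMP"]
na_h = ATOM_DIFFS["Na-H"][0]
peaks = [
    Peak("a", amp * (1 + 2e-6), rt=5.00),           # AMP with +2 ppm error
    Peak("b", ado * (1 - 3e-6), rt=3.20),           # adenosine with -3 ppm error
    Peak("c", (ado + na_h) * (1 + 1e-6), rt=3.21),  # co-eluting sodium adduct of b
    Peak("d", ado + na_h, rt=6.50),                 # adduct-like mass, but not co-eluting with b
]
seeds = match_seeds(peaks)
edges = draw_edges(peaks)
```

In this toy input, peaks a and b become seeds, b and a are linked by a biochemical HPO3 edge, and b and c by an abiotic Na-H edge. Peak d has the same mass relationship to b as c does, but is left unconnected because it does not co-elute.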
A few rounds of edge extension suffice to give thorough coverage. Due to finite mass measurement precision, a single node may be assigned multiple contradictory formulae, which are resolved at the optimization step (see Methods). NetID then scores every node and edge annotation. Node annotations are scored based on precision of the m/z match to the molecular formula, precision of the retention time match to the known metabolite retention time and (when the relevant information is available) quality of the MS2 spectra match to the database structure. In addition, there is a bonus for matching a formula in HMDB and a penalty for breaking basic chemical rules (Seven Golden Rules for filtering molecular formulae27). Biochemical edges receive a positive score for MS2 spectral similarity between the connected nodes, and are otherwise unscored. Abiotic edges are scored based on precision of co-elution with the parent metabolite, connection type (adduct, isotope, etc.), and features specific to the connection type, such as expected natural abundance for isotope peaks (see Methods). The overall impact is to assign high scores to annotations that effectively align the experimentally observed ion peaks with prior metabolomics knowledge. With a score assigned to each potential node and edge annotation, we formulate the global network optimization problem as that of maximizing the network score under linear constraints that each node and edge has a single unique annotation and that these are consistent (e.g. peaks connected by an H2 edge must have formulae differing by 2H). Such optimization is readily performed by linear programming, with a typical runtime of hours in R on a personal computer, and results in an optimal and consistent network annotation.

Global network optimization

As an example of the utility of global network optimization, where all peaks and connections are simultaneously considered to enhance annotation accuracy, we present an example network containing five peaks (Figure 2A).
We first match experimental measurements to the database, annotating node a and node b as seed nodes adenosine monophosphate (AMP, C10H14N5O7P) and adenosine (C10H13N5O4), respectively. We also identify five possible connections between the five nodes. Two alternative networks are generated by extending annotations. In the left network, node c is annotated as the adenosine HCl adduct (C10ClH14N5O4), whereas in the right network, node c is annotated as a putative metabolite (C9H14N5O5P) resulting from CO2 loss from AMP. Node d is the 13C isotope of node c in both networks. Node e is annotated as the 37Cl isotope of node c in the left network, and is unannotated in the right network because there is no Cl atom in the parent molecule. The left network has higher total node and edge annotation scores than the right network, and thus is selected by NetID. This selection makes sense to an experienced mass spectrometrist: the 37Cl isotope signature in node e indicates that node c should contain Cl. The power of NetID is that it automatically captures such logic, and uses global computational optimization to extend such inferences across the network in an automated manner. To test the NetID workflow, we applied it to both yeast and liver datasets, in both positive and negative ionization mode (Figure 2B, 2C).
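The choice between the two candidate networks in the five-peak example can be reproduced on a toy scale. The sketch below is an illustration under stated assumptions, not NetID itself: it replaces the linear program with exhaustive enumeration (feasible only for a handful of nodes), uses made-up scores (1.0 per seed node, 0.5 per consistent edge, 0 for non-seed nodes), and the candidate labels and the identity of the five edges are hypothetical. What it keeps from the text is the constraint structure: each node receives at most one formula, and an edge counts only when the formulae assigned to its two nodes differ by exactly the edge's atom difference.

```python
from itertools import product

AMP = {"C": 10, "H": 14, "N": 5, "O": 7, "P": 1}
ADO = {"C": 10, "H": 13, "N": 5, "O": 4}

# Candidate annotations per node: (label, formula, hypothetical node score).
# None means the node is left unannotated.
CANDIDATES = {
    "a": [("AMP", AMP, 1.0)],
    "b": [("adenosine", ADO, 1.0)],
    "c": [("adenosine+HCl", {"C": 10, "H": 14, "Cl": 1, "N": 5, "O": 4}, 0.0),
          ("AMP-CO2", {"C": 9, "H": 14, "N": 5, "O": 5, "P": 1}, 0.0),
          None],
    "d": [("13C of adenosine+HCl",
           {"C": 9, "13C": 1, "H": 14, "Cl": 1, "N": 5, "O": 4}, 0.0),
          ("13C of AMP-CO2",
           {"C": 8, "13C": 1, "H": 14, "N": 5, "O": 5, "P": 1}, 0.0),
          None],
    "e": [("37Cl of adenosine+HCl",
           {"C": 10, "H": 14, "37Cl": 1, "N": 5, "O": 4}, 0.0),
          None],
}

# Candidate edges: (source, target, required difference target - source, score).
EDGES = [
    ("a", "b", {"H": -1, "P": -1, "O": -3}, 0.5),  # loss of HPO3
    ("b", "c", {"H": 1, "Cl": 1}, 0.5),            # HCl adduct
    ("a", "c", {"C": -1, "O": -2}, 0.5),           # loss of CO2
    ("c", "d", {"C": -1, "13C": 1}, 0.5),          # 13C isotope
    ("c", "e", {"Cl": -1, "37Cl": 1}, 0.5),        # 37Cl isotope
]


def formula_diff(target, source):
    """Element-wise difference target - source, dropping zero entries."""
    elements = set(target) | set(source)
    diff = {el: target.get(el, 0) - source.get(el, 0) for el in elements}
    return {el: n for el, n in diff.items() if n != 0}


def network_score(assignment):
    """Node scores plus scores of edges consistent with the assigned formulae."""
    score = sum(choice[2] for choice in assignment.values() if choice is not None)
    for source, target, required, edge_score in EDGES:
        src, dst = assignment[source], assignment[target]
        if (src is not None and dst is not None
                and formula_diff(dst[1], src[1]) == required):
            score += edge_score
    return score


def best_network():
    """Enumerate every one-annotation-per-node assignment; return the best."""
    nodes = list(CANDIDATES)
    best, best_score = None, float("-inf")
    for combo in product(*(CANDIDATES[n] for n in nodes)):
        assignment = dict(zip(nodes, combo))
        score = network_score(assignment)
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


best, best_score = best_network()
labels = {node: (choice[0] if choice else None) for node, choice in best.items()}
```

With these assumed scores the enumeration selects the HCl-adduct interpretation of node c, because only that choice lets the 37Cl edge to node e contribute. Mass alone cannot separate the two candidates for node c: the monoisotopic masses of C10H14ClN5O4 and C9H14N5O5P differ by less than 1 ppm. A real dataset has far too many nodes for enumeration, which is why the text formulates the problem as a linear program.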
Considering the example of negative mode yeast data with a total of 5,588 non-background peaks, in the initial annotation step, roughly 1,600 potential formulae were assigned to 1,400 peaks, with about 200 peaks receiving multiple formula annotations. These nodes were connected by just over 50,000 potential edges. Edge extension expanded coverage to over 5,000 nodes with an average of twelve potential formulae each, highlighting the importance of scoring and network optimization to assign proper formulae. After scoring node and edge annotations, global network optimization settled on about 4,800 unique node annotations. About 20% of the annotated peaks were metabolites, 14% were putative novel metabolites, and the rest were mass spectrometry phenomena, such as adducts, fragments, and isotopes. Nodes were connected by about 10,000 edges, roughly evenly split between biochemical and abiotic connections (Figure 2C, Supplementary Fig. 1A). More than 90% of annotated nodes fell into a single dominant connected network (Supplementary Fig. 1B), reflecting most peaks being connected to core metabolism. About 15% of peaks, however, remained unannotated. These unannotated peaks likely reflect deficiencies in our lists of allowed atom differences, including additional forms of mass spectrometry phenomena. For example, manual examination of the unconnected peaks revealed a dozen nickel adducts of known compounds (Supplementary Table 3). Importantly, the annotated peaks included several hundred novel metabolite formulae (Supplementary Fig. 2, Supplementary Data 1). Collectively, these provide a wealth of opportunities for metabolite discovery.

Thiamine-derived metabolites

NetID optimization provided not only a list of putative metabolites, but also connections linking these putative metabolites to known metabolites. In the yeast metabolomics dataset, we found three putative metabolites with total ion current > 10^5, connected in a subnetwork around thiamine.
Their formulae are C12H16N4O2S (thiamine+O), C14H18N4O2S (thiamine+C2H2O) and C14H20N4O2S (thiamine+C2H4O) (Figure 3A). While not found in HMDB, thiamine+O is documented in METLIN as a thiamine oxidation product, so we focused on the other two potential thiamine derivatives. MS/MS spectra of the putative thiamine+C2H2O and thiamine+C2H4O contained characteristic thiamine fragments. Both contained a classical pyrimidine fragment, with thiamine+C2H2O also containing an acetylated pyrimidine fragment, leading to a probable structure (Figure 3A,B). The structural assignment is further supported by the presence of an unmodified thiazole fragment. In contrast, thiamine+C2H4O lacked a classical unmodified thiazole fragment, instead showing a thiazole+C2H4O fragment (and a fragment with further water loss) (Figure 3A,B). Isotope tracing experiments further confirm that these two peaks contain thiamine. When fed [U-13C]glucose as the sole carbon source, yeast synthesize thiamine de novo, resulting in fully labeled thiamine species, with carbon counts matching the NetID formula assignments (Figure 3C). When unlabeled thiamine is added to the [U-13C]glucose culture media, yeast take up the unlabeled thiamine, resulting in unlabeled thiamine and M+2 labeled thiamine+C2H2O and thiamine+C2H4O species. Although discovered in yeast, these are conserved metabolites, found also in mammalian samples (Figure 3D). Acetylation is one of the 25 biochemical atom transformations allowed in NetID.
The addition of C2H4O is much less common biochemically, and was captured in NetID as two steps, acetylation followed by reduction. Accordingly, we looked into thiamine metabolism to explore how thiamine+C2H4O might be produced. Thiamine pyrophosphate is an important cofactor in pyruvate dehydrogenase (PDH, the entry step to the TCA cycle) (Figure 3E). De-pyrophosphorylation of the thiamine-containing intermediate of the PDH reaction yields a product that matches the proposed thiamine+C2H4O structure (Figure 3F). Based on this biochemical route, we realized that analogous products could be formed by α-ketoglutarate dehydrogenase (thiamine+C4H6O3) and branched-chain keto acid dehydrogenase (thiamine+C4H8O) (Figure 3F). Peaks at both of these exact masses were also experimentally observed, with isotope labeling results supporting their being thiamine-derived metabolites (Supplementary Fig. 3). Thus, NetID enabled the discovery of four novel thiamine-derived metabolites.

N-glucosyl-taurine

We similarly carried out NetID annotation of a mouse liver dataset. We observed multiple putative metabolite peaks linked to taurine, by apparent glucosylation (+C6H10O5), palmitylation (+C16H30O) and transamination (+O-NH3) (Figure 4A). The latter two, while missing in HMDB, were found in METLIN: N-palmitoyl taurine (C18H37NO4S) and sulfoacetaldehyde (C2H4O4S). To elucidate the structure of the putative taurine glucosylation product (C8H17NO8S), we chemically synthesized N-glucosyl-taurine. Synthetic N-glucosyl-taurine matched the retention time and MS/MS fragmentation pattern of the observed C8H17NO8S peak (Figure 4B,C). In liver samples of mice infused with [U-13C]glucose, C8H17NO8S appeared in M+6 form, suggesting active synthesis of N-glucosyl-taurine from circulating glucose (Figure 4D). N-glucosyl-taurine was not observed in yeast extract but was detected in multiple mouse tissues.
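The formula bookkeeping behind these taurine-linked annotations is element-wise arithmetic, sketched below in Python. Taurine's formula (C2H7NO3S) and the 13C-12C mass difference are standard values supplied here rather than taken from the text, and the helper names are illustrative only.

```python
import re

TAURINE = "C2H7NO3S"       # standard formula, not stated in the text
C13_MINUS_C12 = 1.0033548  # Da, mass difference between 13C and 12C


def parse(formula):
    """Parse a simple formula such as 'C6H10O5' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def apply_difference(formula, gain="", loss=""):
    """Add the atoms in `gain` and remove the atoms in `loss` from `formula`."""
    counts = parse(formula)
    for element, n in parse(gain).items():
        counts[element] = counts.get(element, 0) + n
    for element, n in parse(loss).items():
        counts[element] = counts.get(element, 0) - n
    if any(n < 0 for n in counts.values()):
        raise ValueError("atom difference removes atoms the formula lacks")
    return counts


def to_hill(counts):
    """Format a carbon-containing formula in Hill order: C, H, then alphabetical."""
    order = [el for el in ("C", "H") if counts.get(el)] + sorted(
        el for el, n in counts.items() if n and el not in ("C", "H"))
    return "".join(el + (str(counts[el]) if counts[el] > 1 else "") for el in order)


glucosylated = to_hill(apply_difference(TAURINE, gain="C6H10O5"))
palmitoylated = to_hill(apply_difference(TAURINE, gain="C16H30O"))
transaminated = to_hill(apply_difference(TAURINE, gain="O", loss="NH3"))
m6_shift = 6 * C13_MINUS_C12  # mass shift of the M+6 isotopologue, Da
```

The three results reproduce the formulae quoted in the text (C8H17NO8S, C18H37NO4S and C2H4O4S), and the M+6 form of the glucosylated product sits about 6.02 Da above the unlabeled peak, as expected when all six glucose carbons are 13C.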
Quantitation using the synthetic standard shows that liver has the highest level of glucosyl-taurine at ~170 μM (Figure 4E, Supplementary Fig. 4). This ranks among the few dozen most abundant liver metabolites.

Discussion

The advent of LC-MS metabolomics revealed tens of thousands of metabolite peaks not matching known formulae, raising the possibility that the majority of metabolites remained to be discovered. While the biosphere likely contains many novel metabolites, it has been increasingly recognized that most peaks in typical untargeted metabolomics studies do not arise from novel metabolites, but rather mass spectrometry phenomena. The goal of comprehensively annotating untargeted metabolomics peaks with molecular formulae has, however, remained elusive. One promising strategy for peak annotation involves building molecular networks where nodes are LC-MS peaks (with associated molecular formulae) and edges are atom transformations linking the peaks. Here we advance this strategy by combining metabolomics knowledge with computational global optimization. We explicitly differentiate biochemical connections reflecting metabolic activity and abiotic connections arising from mass spectrometry phenomena. By formulating the peak annotation challenge as a linear program, we identify an optimal network in light of all observed peaks. Rather than weeding out peaks from mass spectrometry phenomena like adducts and natural isotopes, this approach takes advantage of the information embedded in them.
It further provides traceable peak-peak relationships, which illuminate the basis for assigned formulae and suggest candidate structures. Applying this approach to untargeted LC-MS data from yeast and liver samples, we assign formulae to roughly three-quarters of all non-background peaks. In each of positive and negative mode, the annotated peaks cover about 1000 known metabolites, with on average more than four mass peaks for every metabolite (e.g. M+H plus three adduct or isotope peaks). This leaves a couple thousand unannotated peaks from each LC-MS run. Based on the observed ratio between peaks and metabolites, this likely corresponds to hundreds (but not thousands) of unidentified metabolites. This number may actually be lower, due to novel adducts (e.g. nickel adducts, which we discovered via careful examination of the unannotated peaks) or other mass spectrometry phenomena. Importantly, this approach has already generated likely formulae for many hundreds of putative novel metabolites (Supplementary Fig. 2, Supplementary Data 1), including a half-dozen for which we assign structures (Figures 3 and 4). A key benefit of molecular network-based annotation is the ability to steadily assimilate new information21,22. Each newly identified metabolite provides an additional anchor point for optimizing the network. Other data types can be seamlessly added. For example, compound class categorization based on MS/MS data28 or retention time prediction29 can be added to score nodes. Labeling similarity upon feeding different isotope-labeled nutrients could potentially be added to score edges. Global optimization, integrating all new information comprehensively with prior knowledge to arrive at optimal annotations, is novel and potentially transformative for the field more broadly.
The cycle of careful experimentation and focused computational method development holds the potential to identify most unknown metabolites over the coming decade, providing a robust blueprint of the metabolome (Figure 5).

Methods

Yeast metabolomics sample preparation and isotope labeling

S. cerevisiae strain FY4 was grown for at least 10 generations in minimal essential media containing 0.4% [U-12C] or [U-13C] glucose and 10 mM ammonium sulfate with or without 0.4 mg/L thiamine hydrochloride30. Then, in mid-exponential phase, 5 mL culture broth (OD600 = 0.80) was filtered and metabolites were extracted using 1 mL extraction buffer (40:40:20:0.5 acetonitrile:methanol:water:formic acid), followed by adding 88 μL neutralization buffer (15% NH4HCO3). The extracts were kept at −20 °C for at least 15 min to precipitate protein before centrifuging at 16,000 g for 10 min. The supernatant was used for LC-MS analysis.

Murine metabolomics sample preparation and intravenous infusion experiment

Animal studies followed protocols approved by the Princeton University Institutional Animal Care and Use Committee.
Twelve-month-old female wild-type C57BL/6 mice (The Jackson Laboratory, Bar Harbor, ME) on normal diet were sacrificed by cervical dislocation, and tissues were quickly dissected and snap frozen in liquid nitrogen with a precooled Wollenberger clamp. Frozen samples were then transferred from liquid nitrogen to a −80 °C freezer for storage. To extract metabolites, frozen liver tissue samples were first weighed (~20 mg each) and transferred to 2 mL round-bottom Eppendorf Safe-Lock tubes on dry ice. Samples were then ground into powder with a cryomill machine (Retsch, Newtown, PA) for 30 seconds at 25 Hz, and kept cold using liquid nitrogen. For every 25 mg of tissue, 922 μL extraction buffer (as above) was added to the tube, vortexed for 10 seconds, and allowed to sit on ice for 10 minutes. Then 78 μL neutralization buffer was added and the samples were vortexed. The samples were allowed to sit on ice for 20 minutes and then centrifuged at 16,000 g for 25 min at 4 °C. The supernatants were transferred to another Eppendorf tube and centrifuged at 16,000 g for another 25 min at 4 °C. The supernatants were transferred to glass vials for LC-MS analysis. A procedure blank was generated identically without tissue and was used later to remove the background ions.

Detailed methods for intravenous infusion of mice have been described previously31. Briefly, in vivo infusions were performed on 12–14-week-old C57BL/6 mice pre-catheterized in the right jugular vein (Charles River Laboratories). Mice were fasted for 6 h and then infused for 2.5 h with [U-13C]glucose (200 mM, 0.1 μL/min/g). The mouse infusion setup (Instech Laboratories) included a tether and swivel system so that the animal had free movement in the cage. Venous samples were taken from tail bleeds. At the end of the infusion, the mouse was euthanized by cervical dislocation and tissues were collected and extracted as above.
Serum metabolites were extracted by adding 100 μL methanol to 5 μL of serum and centrifuging for 20 min. The supernatant was used for LC-MS analysis.

LC-MS and LC-MS/MS

LC separation was achieved using a Vanquish UHPLC system (Thermo Fisher Scientific) with an XBridge BEH Amide column (150 × 2 mm, 2.5 µm particle size; Waters). Solvent A is 95:5 water:acetonitrile with 20 mM ammonium acetate and 20 mM ammonium hydroxide at pH 9.4, and solvent B is acetonitrile. The gradient is 0 min, 90% B; 2 min, 90% B; 3 min, 75% B; 7 min, 75% B; 8 min, 70% B; 9 min, 70% B; 10 min, 50% B; 12 min, 50% B; 13 min, 25% B; 14 min, 25% B; 16 min, 0% B; 20.5 min, 0% B; 21 min, 90% B; 25 min, 90% B. Total running time is 25 min at a flow rate of 150 µL/min. LC-MS data were collected on a Q-Exactive Plus mass spectrometer (Thermo Fisher) operating in full scan mode with an MS1 scan range of m/z 70-1000 and a resolving power of 160,000 at m/z 200. Other MS parameters are as follows: sheath gas flow rate, 28 (arbitrary units); aux gas flow rate, 10 (arbitrary units); sweep gas flow rate, 1 (arbitrary units); spray voltage, 3.3 kV; capillary temperature, 320 °C; S-lens RF level, 65; AGC target, 3E6; and maximum injection time, 500 ms.

To demonstrate the utility of including MS2 data in NetID analysis, 1479 and 803 MS2 spectra were obtained for selected peaks with intensity > $10^5$ in positive and negative ionization mode, respectively, from a previous liver dataset32.
Targeted MS2 spectra were collected using the PRM function at 25 eV HCD energy, with the other instrument settings being: resolution, 17500; AGC target, $10^6$; maximum IT, 250 ms; isolation window, 1.5 m/z.

Glucosyl-taurine synthesis

Glucosyl-taurine synthesis was carried out following previous literature reports with slight modifications33. In brief, dry methanol was obtained by distillation of HPLC-grade methanol (Fisher; HPLC grade 0.2 micron filtered) over CaH2 (Acros Organics; ca. 93% extra pure, 0-2 mm grain size). A flame-dried round-bottom flask equipped with a reflux condenser and stir bar was charged with 2.0 g taurine (Alfa Aesar; 99%), 3.1 g D-glucose (Acros Organics; ACS reagent), and 80 mL of dry methanol. This mixture was sonicated under an inert atmosphere for 30 minutes before being returned to the manifold for the reaction. To the fine suspension of taurine and glucose in dry methanol at room temperature, 4.0 mL of 5.4 M sodium methoxide in methanol (Acros Organics) was added via glass syringe. At this point, the suspension began to dissolve and, after 30 minutes, gave a clear and colorless solution. The solution was stirred vigorously under an inert atmosphere for 72 hours, which resulted in a faint peach-colored solution. This solution was chilled to 0 °C, ~200 mL of absolute ethanol (200 proof) was added, and precipitation was allowed to occur at this temperature for 30 minutes. Solvent was then removed by filtration over a glass filter (medium porosity), and the solid was washed with ~100 mL of absolute ethanol, affording a fine pale-yellow powder (2.4 g; crude material).

An NMR experiment was carried out to validate the structure of the synthesized N-glucosyl-taurine. Selective TOCSY experiments using DIPSI2 spin-lock and with added chemical shift filter34 were run on a Bruker Avance III HD NMR spectrometer equipped with a custom-made QCI-F cryoprobe (Bruker, Billerica, MA) at 800 MHz and at a controlled temperature of 295.2 K.
The sample was dissolved in DMSO-d6. The spectra shown on the plots are results of 200 ms SL mixing, 8 scans each. Data processing (MNova v.14, Mestrelab Research S.L., Santiago de Compostela, Spain) included zero filling, 1 Hz Gaussian apodization, and phase and baseline correction. NMR analysis suggests that the final crude material contains 5.2% N-glucosyl-taurine and unreacted substrates (Supplementary Figure 5).

NetID algorithm

I. Data preparation and input

LC-MS raw data files (.raw) were converted to mzXML format using ProteoWizard35 (version 3.0.11392). El-MAVEN (version 7.0) was used to generate a peak table containing m/z, retention time, and intensity for peaks. Parameters for peak picking were the defaults except for the following: mass domain resolution is 10 ppm; time domain resolution is 15 scans; minimum intensity is 1000; minimum peak width is 5 scans. The resulting peak table was exported to a .csv file. Redundant peak entries due to imperfect peak picking are removed if two peaks are within 0.1 min and their m/z difference is within 2 ppm. A background peak is removed if its intensity in the procedure blank sample is > 0.5-fold of that in the biological sample. The m/z of the remaining peaks are recalibrated by applying an absolute m/z adjustment factor $\varepsilon_{\mathrm{absolute}}$ (independent of measured m/z) and a relative m/z adjustment factor $\varepsilon_{\mathrm{relative}}$ (linearly dependent on measured m/z).
For each peak i, the recalibrated value $i_{m/z,\,\mathrm{adjusted}}$ is computed as

$$ i_{m/z,\,\mathrm{adjusted}} = i_{m/z,\,\mathrm{measured}} \times (1 + \varepsilon_{\mathrm{relative}}) + \varepsilon_{\mathrm{absolute}} \tag{1} $$

The $\varepsilon_{\mathrm{relative}}$ and $\varepsilon_{\mathrm{absolute}}$ values are fit via linear regression using measured m/z values of selected known metabolite ion peaks and their calculated m/z. That is, for each of these known metabolites k, we have the equation

$$ k_{m/z,\,\mathrm{calculated}} = k_{m/z,\,\mathrm{measured}} \times (1 + \varepsilon_{\mathrm{relative}}) + \varepsilon_{\mathrm{absolute}} \tag{2} $$

LC-MS/MS data were extracted from the mzXML files using lab-developed Matlab code. MS2 spectra may contain interfering product ions from co-eluting isobaric parent ions. These interfering product ions were removed by examining the extracted ion chromatogram (EIC) similarity between the product ions in MS2 data and the parent ion in MS1 data. A Pearson correlation coefficient of 0.8 was used as a cutoff to retain those product ions that have an EIC similar to that of the parent ion. The cleaned MS2 data were exported to Excel files for further processing.

Structures, formulae, m/z and MS2 spectra of metabolites were obtained from the Human Metabolome Database (HMDB, version 4.0), and retention times of selected metabolites were determined by running authentic standards using the above-mentioned LC-MS method.
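The m/z recalibration of Eqs. (1) and (2) is an ordinary least-squares fit of calculated against measured m/z over the known metabolite peaks. The sketch below illustrates that fit under stated assumptions; it is not the NetID code. The function names and the synthetic anchor values are hypothetical, and a hand-written regression stands in for whatever routine the authors used.

```python
def fit_recalibration(measured_mz, calculated_mz):
    """Least-squares fit of calculated = measured * (1 + eps_rel) + eps_abs.

    Returns (eps_relative, eps_absolute). Hypothetical helper, not NetID's API.
    """
    n = len(measured_mz)
    mean_x = sum(measured_mz) / n
    mean_y = sum(calculated_mz) / n
    sxx = sum((x - mean_x) ** 2 for x in measured_mz)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(measured_mz, calculated_mz))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope - 1.0, intercept


def recalibrate(mz, eps_relative, eps_absolute):
    """Apply Eq. (1) to a single measured m/z value."""
    return mz * (1.0 + eps_relative) + eps_absolute


# Synthetic anchors (made-up values): distort known m/z values with a known
# relative and absolute error, then check that the fit recovers both.
calculated = [100.0, 200.0, 400.0, 800.0]
true_rel, true_abs = 2e-6, -1e-4
measured = [(m - true_abs) / (1.0 + true_rel) for m in calculated]

eps_rel, eps_abs = fit_recalibration(measured, calculated)
adjusted = [recalibrate(m, eps_rel, eps_abs) for m in measured]
```

In practice the anchors would be the measured and calculated m/z of confidently identified metabolite ions, and the fitted pair would then be applied to every peak in the table.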
The NetID algorithm requires three types of input files: a peak table (in .csv format) recording m/z, retention time, and intensity for peaks; an atom difference rule table (in .csv format) containing a list of 25 biochemical atom differences and 59 abiotic atom differences, which together define all connections in the network (Supplementary Tables 1, 2); and metabolite information files containing structure, formula, m/z and MS2 spectra of HMDB metabolites and retention times of selected metabolites under different LC conditions. An exemplary peak table from the yeast dataset, the atom difference rule table and the HMDB metabolite information file are provided in Supplementary Data 2.

II. Initial annotation of nodes and edges in the network

The first step of the NetID algorithm is to make an initial annotation for seed nodes, determine possible annotations for other nodes, and determine edges in the network.

Each peak is a node in the network. We compare the experimentally measured m/z for each node to those of all metabolite formulae in the HMDB database. When the m/z difference is within 10 ppm, candidate formulae and HMDB IDs are assigned to the node, and this node is defined as a primary seed node. A primary seed node can contain more than one candidate formula and HMDB ID if all are within the m/z difference range.

Edges connect two nodes via gain or loss of specific atoms. We assembled a list of 25 biochemical atom differences and 59 abiotic atom differences which together define all connections in the network (Supplementary Tables 1, 2). Let each of these differences be denoted by $D_i$. For each node u, if there is a node v such that the difference in the measured m/z of the nodes matches one of those in the list of atom mass differences, we add an edge between u and v.
That is, if $u_{m/z}$ and $v_{m/z}$ are the experimentally measured m/z for the peaks corresponding to nodes u and v respectively (assuming $v_{m/z} > u_{m/z}$ for simplicity), then there is an edge between these nodes if there is some difference $D_i$ such that

$$ |v_{m/z} - u_{m/z} - D_i| < v_{m/z} \times 10\ \mathrm{ppm} \tag{3} $$

If $D_i$ is an abiotic difference, an edge is added only if the retention times of the two nodes are additionally within 0.2 min of each other. That is, if $u_{RT}$ and $v_{RT}$ are the retention times for u and v respectively, then it is required that

$$ |v_{RT} - u_{RT}| < 0.2\ \mathrm{min} \tag{4} $$

The candidate formula set of each node then expands as formulae propagate from its neighboring nodes through edge atom differences. For example, when applying the atom difference of edge (u, v) to the formula assigned to primary seed node u, we can derive a new candidate formula for the connected node v. If the derived formula's calculated m/z is within 5 ppm of node v's measured m/z, then a new candidate formula is added for node v. Iterating the process over all candidate formulae of node u through edge (u, v) further expands the candidate formulae for node v.

We apply the above extension process to the formulae of all primary seed nodes through atom difference edges, and these new candidate formulae can themselves be used for another round of extension. Note that a primary seed node is treated like any other node during the subsequent rounds of extension, and may itself be assigned new formulae.
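The edge rule of Eqs. (3) and (4) can be illustrated with a short sketch. This is a simplified illustration, not the NetID implementation: the function name, the tuple layout and the four example peaks are hypothetical, and the rule table holds only two entries. The two mass differences are monoisotopic values computed from standard atomic masses (HPO3, and Na replacing H); treating the latter as an entry of the abiotic table is an assumption made for the example.

```python
from itertools import combinations

PPM = 1e-6


def find_edges(peaks, atom_differences, mz_tol_ppm=10.0, rt_tol_min=0.2):
    """Return (u, v, name) edges between peaks, with v the heavier peak.

    peaks: list of (mz, rt) tuples.
    atom_differences: list of (name, mass_difference, is_abiotic) tuples.
    """
    edges = []
    for u, v in combinations(range(len(peaks)), 2):
        if peaks[u][0] > peaks[v][0]:
            u, v = v, u  # keep v as the heavier node, as in Eq. (3)
        (u_mz, u_rt), (v_mz, v_rt) = peaks[u], peaks[v]
        for name, mass, is_abiotic in atom_differences:
            # Eq. (3): the m/z difference must match within the ppm window.
            if abs(v_mz - u_mz - mass) >= v_mz * mz_tol_ppm * PPM:
                continue
            # Eq. (4): abiotic edges additionally require co-elution.
            if is_abiotic and abs(v_rt - u_rt) >= rt_tol_min:
                continue
            edges.append((u, v, name))
    return edges


# Illustrative rule table (two entries only).
differences = [
    ("HPO3", 79.966332, False),  # biochemical: phosphorylation
    ("Na-H", 21.981945, True),   # abiotic: sodium replacing a proton
]

# Made-up peaks as (m/z, retention time in min).
peaks = [
    (180.000000, 5.00),  # 0: parent
    (259.966332, 7.00),  # 1: parent + HPO3, different RT is allowed
    (201.981945, 5.05),  # 2: parent + Na - H, co-eluting
    (201.981945, 6.00),  # 3: same mass shift but not co-eluting
]

edges = find_edges(peaks, differences)
```

Peak 3 receives no edge because an abiotic relationship is only accepted when the two peaks co-elute, whereas the biochemical edge to peak 1 carries no retention time requirement.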
To avoid duplicated effort in the extension process, we allow formulae of primary seed nodes and biotransformed formulae thereof to be extended through both biotransformation and abiotic atom difference edges, and do not allow abiotic candidate formulae to be further extended through biotransformation atom difference edges. The default extension process includes two rounds of biotransformation edge extensions and three rounds of abiotic edge extensions.

III. Scoring node annotations

NetID then scores every candidate node and edge annotation assigned in the initial annotation step. The node scoring system aims to assign high scores to annotations that align observed ion peaks with known metabolites based on m/z, retention time, MS/MS, and/or isotope abundances.

Let the set of candidate annotations for node u be denoted as $\{a_1, \ldots, a_i, \ldots, a_N\}$. For each node u and each of its candidate annotations $a_i$, let $S(u, a_i)$ denote the score of candidate annotation $a_i$ for node u. The scoring components for candidate node annotations are defined as follows:

(a) $S_{m/z}(u, a_i)$ is negative when the measured m/z differs from the calculated m/z of the assigned molecular formula. A larger ppm difference between calculated formula m/z and measured m/z results in a lower score. The default scale factor is -0.5. Let $a_{i,m/z}$ be the calculated formula m/z of annotation $a_i$, and $u_{m/z}$ be the measured m/z of node u; then

$$ S_{m/z}(u, a_i) = -0.5 \times \frac{|u_{m/z} - a_{i,m/z}|}{u_{m/z}} \times 10^{6} \tag{5} $$

(b) $S_{RT}(u, a_i)$ is positive if the measured RT for the peak corresponding to node u matches a known standard. A smaller difference between known and measured RT results in a higher score. Let $a_{i,RT}$ be the known RT of annotation $a_i$, and $u_{RT}$ be the measured RT of node u; then

$$ S_{RT}(u, a_i) = \begin{cases} 1 - |u_{RT} - a_{i,RT}|, & \text{if } |u_{RT} - a_{i,RT}| < 0.5\ \mathrm{min} \\ 0, & \text{otherwise} \end{cases} \tag{6} $$

(c) $S_{MS2}(u, a_i)$ is positive if the measured MS2 spectrum of node u matches the database MS2 spectrum of annotation $a_i$. A dot product scoring function is used to score the MS2 spectra similarity24.
The intensities of the fragment ions in the MS2 spectra are rescaled so that the highest fragment ion is set to 1. MS2 spectra are represented as $W = [\text{relative intensity of MS2 ions}]^{n}[\text{m/z value}]^{m}$, with n = 1, m = 0. The dot product (DP) and the score for an MS2 match, $S_{MS2}(u, a_i)$, are defined below, where $W_u$ is the measured spectrum of node u and $W_{a_i}$ is the database spectrum of annotation $a_i$.

$$ DP = \frac{\left(\sum W_u W_{a_i}\right)^{2}}{\sum W_u^{2} \times \sum W_{a_i}^{2}} \tag{7} $$

$$ S_{MS2}(u, a_i) = \begin{cases} DP, & \text{if } DP > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{8} $$

(d) $S_{database}(u, a_i)$ is positive if the annotated formula $a_i$ exists in HMDB. We give a positive score to a primary seed node annotation if that annotated formula exists in HMDB.

$$ S_{database}(u, a_i) = \begin{cases} 0.5, & \text{if } a_i \text{ in HMDB} \\ 0, & \text{otherwise} \end{cases} \tag{9} $$

(e) $S_{missing\_isotope}(u, a_i)$ is negative if an isotopic peak is missing. We penalize a formula annotation if it passes the intensity threshold (default $5 \times 10^{4}$) but does not have isotopic peaks of specified elements. The default isotope being evaluated is 37Cl. Other isotopes, such as 13C or 18O, can be included by users.

$$ S_{missing\_isotope}(u, a_i) = \begin{cases} -1, & \text{if the isotopic peak is missing} \\ 0, & \text{otherwise} \end{cases} \tag{10} $$

(f) $S_{rule}(u, a_i)$ is negative if annotation $a_i$ violates basic chemical rules. We strongly penalize formulae that violate basic chemical rules, including a negative RDBE (ring and double bond equivalents) and unlikely element ratios in metabolites (O/P < 3, O/Si < 2).

$$ S_{rule}(u, a_i) = \begin{cases} -10, & \text{if chemical rules are violated} \\ 0, & \text{otherwise} \end{cases} \tag{11} $$

(g) $S_{derivative}(u, a_i)$ is positive if the annotation $a_i$ is derived from a parent peak p with an annotation h that has a high score $S_{parent}(p, h)$, which is calculated by summing the scores in (a)-(f) for S(p, h).
$$ S_{derivative}(u, a_i) = S_{parent}(p, h) - 0.5 \tag{12} $$

$$ S_{parent}(p, h) = S_{m/z}(p, h) + S_{RT}(p, h) + S_{MS2}(p, h) + S_{database}(p, h) + S_{missing\_isotope}(p, h) + S_{rule}(p, h) \tag{13} $$

This is particularly helpful in annotating abiotic peaks. For example, the annotation of a glutamate sodium adduct will be given a positive $S_{derivative}$ when its parent node is annotated as glutamate with a high $S_{parent}$ score.

A final score $S(u, a_i)$ for each candidate annotation $a_i$ of node u is calculated by summing the scores in (a)-(g).

$$ S(u, a_i) = S_{m/z}(u, a_i) + S_{RT}(u, a_i) + S_{MS2}(u, a_i) + S_{database}(u, a_i) + S_{missing\_isotope}(u, a_i) + S_{rule}(u, a_i) + S_{derivative}(u, a_i) \tag{14} $$

Note that for each node u, one of the candidate "annotations" corresponds to no annotation being chosen for that node. The node score for this null annotation is 0 by default, and can be set to a negative value to promote choosing actual annotations.

IV. Scoring edge annotations (biological, adduct, isotope)

The edge scoring system aims to assign high scores to edge annotations that correctly capture biochemical connections between metabolites (based on MS2 spectra similarity) and abiotic connections between metabolites and their mass spectrometry phenomena derivatives, such as isotopes and adducts. Biochemical, isotope, and adduct edge annotations are the most common types; other, less common abiotic connection types are described in the subsequent section.
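To make the node scoring of Section III concrete before the edge scores are defined, the sum in Eq. (14) can be sketched as a single function. This is an illustrative sketch, not the NetID code: the function name, its arguments and the example values are hypothetical. It reads Eq. (5) as the absolute ppm error scaled by -0.5 and Eq. (6) as one minus the absolute RT difference, which are assumptions about the intended formulas, and it takes the MS2 dot product as a precomputed input.

```python
def score_node_annotation(measured_mz, calculated_mz,
                          measured_rt=None, known_rt=None,
                          ms2_dot_product=None, in_hmdb=False,
                          missing_isotope=False, violates_rules=False,
                          parent_score=None):
    """Sum the node score components (a)-(g) for one candidate annotation."""
    # (a) m/z error in ppm, scaled by the default factor -0.5
    s_mz = -0.5 * abs(measured_mz - calculated_mz) / measured_mz * 1e6

    # (b) retention time match to a known standard, within 0.5 min
    s_rt = 0.0
    if measured_rt is not None and known_rt is not None:
        rt_diff = abs(measured_rt - known_rt)
        if rt_diff < 0.5:
            s_rt = 1.0 - rt_diff

    # (c) MS2 match, counted only when the dot product exceeds 0.5
    s_ms2 = 0.0
    if ms2_dot_product is not None and ms2_dot_product > 0.5:
        s_ms2 = ms2_dot_product

    # (d) formula present in HMDB
    s_database = 0.5 if in_hmdb else 0.0

    # (e) expected isotopic peak missing, (f) basic chemical rules violated
    s_missing_isotope = -1.0 if missing_isotope else 0.0
    s_rule = -10.0 if violates_rules else 0.0

    # (g) derived from a parent annotation with score parent_score
    s_derivative = parent_score - 0.5 if parent_score is not None else 0.0

    return (s_mz + s_rt + s_ms2 + s_database
            + s_missing_isotope + s_rule + s_derivative)


# Made-up example: 1 ppm mass error, RT within 0.1 min of a standard,
# formula present in HMDB, no MS2 spectrum and no parent annotation.
total = score_node_annotation(measured_mz=200.0000, calculated_mz=200.0002,
                              measured_rt=5.1, known_rt=5.0, in_hmdb=True)
```

Here the 1 ppm error contributes -0.5, the RT match about 0.9 and the database match 0.5, for a total of about 0.9. The null annotation is not passed through this function; it is simply given the default score of 0.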
For each pair of nodes u and v such that there is an edge (u, v), let the sets of candidate formulae for nodes u and v be denoted as $\{a_1, \ldots, a_i, \ldots, a_N\}$ and $\{b_1, \ldots, b_j, \ldots, b_M\}$, respectively, and let the set of candidate atom differences for edge (u, v) be $\{D_1, \ldots, D_k, \ldots, D_K\}$. Let $S(u, v, a_i, b_j, D_k)$ be the score of choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v). Note that $S(u, v, a_i, b_j, D_k)$ is set to 0 if atom difference $D_k$ does not represent the formula difference of $a_i$ and $b_j$.

$$ S(u, v, a_i, b_j, D_k) = 0, \quad \text{if } a_i - b_j \neq D_k $$

The scoring components for candidate edge annotations are defined as follows:

(h) When nodes u and v have experimentally measured MS2 spectra, $S_{MS2\_similarity}(u, v, a_i, b_j, D_k)$ is defined for a biochemical edge, and is a positive score if the two connected nodes u and v have MS2 similarity, given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$. $S_{MS2\_similarity}$ is determined using the dot product (DP), as described in the previous section, and the reverse dot product (DP_R), which evaluates the neutral ion loss similarity in the MS2 spectra24. A reverse MS2 spectrum is represented as $R = [\text{relative intensity of MS2 ions}]^{n}[\text{parent m/z} - \text{measured m/z value}]^{m}$, with n = 1, m = 0. Here $W_u$, $W_v$ and $R_u$, $R_v$ are the forward and reverse MS2 spectra of nodes u and v.

$$ DP = \frac{\left(\sum W_u W_v\right)^{2}}{\sum W_u^{2} \times \sum W_v^{2}} \tag{15} $$

$$ DP\_R = \frac{\left(\sum R_u R_v\right)^{2}}{\sum R_u^{2} \times \sum R_v^{2}} \tag{16} $$

$$ S_{MS2\_similarity}(u, v, a_i, b_j, D_k) = \begin{cases} \max(DP, DP\_R), & \text{if } \max(DP, DP\_R) > 0.3 \\ 0, & \text{otherwise} \end{cases} \tag{17} $$

(i) $S_{co\_elution}(u, v, a_i, b_j, D_k)$ is defined for an abiotic edge, and is a negative score if the RTs of the two connected nodes differ by more than a threshold (0.05 min), given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$.

$$ S_{co\_elution}(u, v, a_i, b_j, D_k) = \begin{cases} -5 \times |u_{RT} - v_{RT}|, & \text{if } |u_{RT} - v_{RT}| \geq 0.05\ \mathrm{min} \\ 0, & \text{otherwise} \end{cases} \tag{18} $$

(j) $S_{type}(u, v, a_i, b_j, D_k)$ is defined for all edges, given that the formula difference of $a_i$ and $b_j$ matches the atom difference defined by $D_k$, and is a non-negative score depending on the connection type of the edge, which is defined by $D_k$ and includes biotransformation, adduct, isotope and fragment (Supplementary Tables 1, 2). The magnitude of the scores reflects the empirical confidence in the annotation type when certain atom differences occur, and can be adjusted by the user.

$$ S_{type}(u, v, a_i, b_j, D_k) = \begin{cases} 0, & \text{if } D_k \in \text{biotransformation} \\ 0.5, & \text{if } D_k \in \text{adduct} \\ 2, & \text{if } D_k \in \text{isotope} \\ 0.3, & \text{if } D_k \in \text{fragment} \end{cases} \tag{19} $$

(k) For each $D_k \in$ isotope, $S_{isotope\_intensity}(u, v, a_i, b_j, D_k)$ is defined for an isotope edge (u, v) where $b_j$ is the isotopic derivative of $a_i$ with atom difference $D_k$, and is a negative score if the measured isotope peaks deviate from the expected natural abundance. The score for an isotope edge depends on how likely the ratio of measured to expected isotopic intensity ($\mathrm{Ratio}_{isotope}$) is to be observed in an empirical normal distribution $N(1, \sigma_{isotope})$. Isotopes of all elements included in the atom difference table are evaluated. Below, $r_{expected}(a_i, b_j, D_k)$ denotes the intensity ratio expected from natural abundance.

$$ \mathrm{Ratio}_{isotope} = \frac{v_{intensity} / u_{intensity}}{r_{expected}(a_i, b_j, D_k)} \tag{20} $$

$$ S_{isotope\_intensity}(u, v, a_i, b_j, D_k) = \log \frac{p\left(\mu = \mathrm{Ratio}_{isotope} \mid N(1, \sigma_{isotope})\right)}{p\left(\mu = 1 \mid N(1, \sigma_{isotope})\right)} \tag{21} $$
$\sigma_{isotope}$ is empirically defined as below, so that when the measured isotope intensity is close to the detection limit, a larger $\sigma_{isotope}$ (a widened distribution, which is more tolerant of discrepancy) is used.

$$ \sigma_{isotope} = 0.2 + 10^{(\ldots)} \tag{22} $$

A final edge annotation score $S(u, v, a_i, b_j, D_k)$ for choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v) is calculated by summing the scores in (h)-(k), if other, less common abiotic connection types are not considered (see next section).

$$ S(u, v, a_i, b_j, D_k) = S_{MS2\_similarity}(u, v, a_i, b_j, D_k) + S_{co\_elution}(u, v, a_i, b_j, D_k) + S_{type}(u, v, a_i, b_j, D_k) + S_{isotope\_intensity}(u, v, a_i, b_j, D_k) \tag{23} $$

V. Additional abiotic edge types

LC-MS metabolomics may include additional abiotic relationships. In orbitrap data, these include oligomers, multi-charge species, heterodimers, in-source fragments of known or unknown metabolites36, and ringing artifact peaks surrounding high intensity ions20,37. These relationships were included in NetID as additional edge types, which are evaluated for all m/z pairs within a predefined RT range (0.2 min).

(l) Oligomer and multi-charge species. An oligomer/multi-charge edge is assigned between two nodes u and v if their m/z satisfy

$$ |v_{m/z} - n \times u_{m/z}| < u_{m/z} \times 10\ \mathrm{ppm}, \quad n \in \{\text{positive integers}\} \tag{24} $$

(m) Heterodimer. A heterodimer peak (node v) may be observed when one abundant metabolite (node u) forms an ion cluster with another ion species (node t). We examine nodes that have intensity above $10^{5}$, and assign a heterodimer edge between two nodes u and v if their m/z difference satisfies

$$ |(v_{m/z} - u_{m/z}) - t_{m/z}| < u_{m/z} \times 10\ \mathrm{ppm} \tag{25} $$

(n) In-source fragments. Fragmentation peaks may be observed when one abundant metabolite breaks up into fragments during the ionization process.
Database MS2 spectra of known metabolites can be used to identify known ion fragmentation peaks36. If candidate annotation $b_j$ of node v is annotated with an HMDB ID associated with a database MS2 spectrum, and the m/z of node u matches a fragment m/z in the MS2 spectrum of $b_j$, then a database fragment edge connects the two nodes. That is,

$$ u_{m/z} \in \text{Database MS2 spectrum of candidate annotation } b_j \text{ of node } v \tag{26} $$

Measured MS2 spectra can be used to identify unknown ion fragmentation peaks. If node v is associated with a measured MS2 spectrum, and the m/z of another node u matches a fragment m/z in that MS2 spectrum, then an experiment fragment edge connects the two nodes. That is,

$$ u_{m/z} \in \text{Measured MS2 spectrum of node } v \tag{27} $$

(o) Ringing artifacts. Ringing peaks are artifact peaks (node v) often observed on both sides of the m/z of an intense ion peak (node u) in Fourier-transform MS instruments, including the orbitrap. We examine nodes that have intensity above $10^{6}$, and assign a ringing artifact edge between two nodes if they satisfy

$$ 50\ \mathrm{ppm} < \frac{|v_{m/z} - u_{m/z}|}{u_{m/z}} < 1000\ \mathrm{ppm}, \qquad \frac{u_{intensity}}{v_{intensity}} > 50 \tag{28} $$

Scoring of these additional abiotic edges follows the same rules described in the "Scoring edge annotations" section, with additional $S_{type}$ values defined as below.
$$ S_{type}(u, v, a_i, b_j, D_k) = \begin{cases} 0.5, & \text{if } D_k \in \text{oligomer or multi-charge} \\ 0, & \text{if } D_k \in \text{heterodimer} \\ 0.3, & \text{if } D_k \in \text{database MS2 fragment} \\ 1, & \text{if } D_k \in \text{measured MS2 fragment} \\ 2, & \text{if } D_k \in \text{ringing artifacts} \end{cases} \tag{29} $$

A final edge annotation score $S(u, v, a_i, b_j, D_k)$ for choosing candidate formula $a_i$ for node u, candidate formula $b_j$ for node v and candidate atom difference $D_k$ for edge (u, v) is calculated by summing the scores in (h)-(o).

$$ S(u, v, a_i, b_j, D_k) = S_{MS2\_similarity}(u, v, a_i, b_j, D_k) + S_{co\_elution}(u, v, a_i, b_j, D_k) + S_{type}(u, v, a_i, b_j, D_k) + S_{isotope\_intensity}(u, v, a_i, b_j, D_k) \tag{30} $$

VI. Global network optimization using linear programming

Using the scores assigned to each candidate node and edge annotation, our goal is to find annotations for each node so as to maximize the sum of the scores across the network, under the constraints that each node is assigned a single annotation and that the network annotation is consistent. We use linear programming to solve this optimization problem optimally, as described next.

For each node u and each of its candidate formulae $a_i$, we define a binary node decision variable $x_{u,a_i}$ to denote whether candidate formula $a_i$ is selected as the annotation for node u. That is,

$$ x_{u,a_i} = \begin{cases} 1, & \text{if node } u \text{ is annotated with formula } a_i \\ 0, & \text{otherwise} \end{cases} \tag{31} $$

We define a binary decision variable $c_{u,v,a_i,b_j,D_k}$ to denote whether candidate formulae $a_i$ and $b_j$ are chosen for nodes u and v, and the candidate atom difference $D_k$ corresponds to the formula difference of candidate formulae $a_i$ and $b_j$ of the connected nodes u and v.
That is,

$$ c_{u,v,a_i,b_j,D_k} = \begin{cases} 1, & \text{if } a_i \text{ and } b_j \text{ are chosen for nodes } u \text{ and } v \text{ respectively, and } a_i - b_j = D_k \\ 0, & \text{otherwise} \end{cases} \tag{32} $$

We constrain the optimization so that each node has a single annotation, and an edge annotation is selected if and only if its atom difference matches the formula difference of the selected node annotations. As a result, the node and edge binary variables should satisfy

$$ \sum_{i} x_{u,a_i} = 1 \tag{33} $$

$$ c_{u,v,a_i,b_j,D_k} \leq x_{u,a_i}, \qquad c_{u,v,a_i,b_j,D_k} \leq x_{v,b_j} \tag{34} $$

$$ c_{u,v,a_i,b_j,D_k} \geq x_{u,a_i} + x_{v,b_j} - 1 \tag{35} $$

For all variables defined above, we add the constraint that they are either 0 or 1.

With each candidate node and edge annotation being scored, the objective of the optimization is to find values for all variables $x_{u,a_i}$ and $c_{u,v,a_i,b_j,D_k}$ so as to maximize the sum of all node scores and edge scores in the network while satisfying the constraints.

$$ \text{Maximize: } \sum x_{u,a_i} \times S(u, a_i) + \sum c_{u,v,a_i,b_j,D_k} \times S(u, v, a_i, b_j, D_k) \tag{36} $$

The optimization result provides a string of binary numbers that denote whether each candidate node or edge annotation is selected for the globally optimal network. IBM ILOG CPLEX Optimization Studio (Version 12.8.0 or later) is used to solve the linear programming problem. The cplexAPI package for R is used to call the CPLEX optimization function in an R environment. For the yeast datasets and using the above scoring parameters, optimization finishes within an hour on a standard laptop. Depending on the number of peaks in the data tables, the entries in the atom difference tables, and the parameters involved in scoring, runtimes during internal testing ranged from minutes to 48 h.
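The optimization problem of this section can be illustrated on a toy network. The sketch below is not the NetID solver: it swaps CPLEX for exhaustive enumeration, which is only feasible for a handful of nodes, and every node name, annotation label and score in it is made up. It relies on the observation that the two upper-bound constraints and the lower-bound constraint together force each edge variable c to equal the product of its two node variables, so enumerating one annotation per node covers every feasible solution.

```python
from itertools import product


def optimize_network(node_scores, edge_scores):
    """Pick one annotation per node to maximize node plus edge scores.

    node_scores: {node: {annotation: score}}; None is the null annotation.
    edge_scores: {(u, a, v, b): score} for annotation pairs whose formula
                 difference matches a candidate atom difference.
    """
    nodes = list(node_scores)
    best_total, best_assignment = float("-inf"), None
    # One annotation per node, i.e. sum_i x[u, a_i] = 1 for every node u.
    for choice in product(*(node_scores[n] for n in nodes)):
        assignment = dict(zip(nodes, choice))
        total = sum(node_scores[n][a] for n, a in assignment.items())
        # An edge term counts only when both endpoint annotations are chosen,
        # which is exactly what the constraints on c enforce.
        total += sum(score for (u, a, v, b), score in edge_scores.items()
                     if assignment[u] == a and assignment[v] == b)
        if total > best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total


# Toy network: nodes "a" and "b" have database matches; node "c" can be
# explained through "a" or through "b" by mutually incompatible annotations.
node_scores = {
    "a": {"fa": 0.5, None: 0.0},
    "b": {"fb": 0.5, None: 0.0},
    "c": {"fc1": -0.1, "fc2": -0.3, None: 0.0},
}
edge_scores = {
    ("a", "fa", "b", "fb"): 0.0,   # biotransformation edge
    ("a", "fa", "c", "fc1"): 0.5,  # adduct-type edge
    ("b", "fb", "c", "fc2"): 0.3,  # fragment-type edge
}

assignment, total = optimize_network(node_scores, edge_scores)
```

The annotation fc1 wins because its edge score outweighs its small node penalty, giving a total of about 1.4 against 1.0 for either alternative. A real dataset with thousands of nodes needs an integer programming solver, which is the role CPLEX plays in the text.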
Code availability

NetID was developed mainly in R, and used a mixture of IBM ILOG CPLEX Optimization Studio, Matlab and Python. NetID code is available for non-commercial use on GitHub at https://github.com/LiChenPU/NetID, under the GNU General Public License v3.0. An R Shiny app is provided to visualize the network results from NetID in a local environment, along with a detailed user guide and example files (Supplementary Note 1, Supplementary Data 2).

Acknowledgement

This work was supported by a Department of Energy (DOE) grant (no. DE-SC0012461 to J.D.R.), the Center for Advanced Bioenergy and Bioproducts Innovation (grant no. DE-SC0018420, subcontract to J.D.R.) and NIH grant R50CA211437 to W.L. M.R.M. is funded by the Howard Hughes Medical Institute and Burroughs Wellcome Fund via the PDEP and Hanna H. Gray Fellows Programs. We thank Istvan Pelczer at the NMR facility of the Department of Chemistry, Princeton University, for the NMR analysis, and X. Su for scientific discussion and help. The Center for Advanced Bioenergy and Bioproducts Innovation and the Center for Bioenergy Innovation are both U.S. Department of Energy Bioenergy Research Centers supported by the Office of Biological and Environmental Research in the DOE Office of Science. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the U.S. Department of Energy.

Competing interests

The authors declare no competing interests.

Author contributions

L.C., M.S. and J.D.R. conceived the project. L.C. developed the NetID algorithm. W.L., L.W., X.Z., A.C. and M.M. performed the mouse experiments. L.W., W.L. and L.C. performed the yeast experiments. L.C., W.L., L.W. and X.X. analyzed LC-MS and LC-MS/MS data. X.T., A.M. and Y.S. contributed to coding development. B.K., A.M.L., and S.R.C. provided chemical synthesis of taurine-related compounds. L.C. and J.D.R. wrote the manuscript.
All authors discussed the results and commented on the manuscript.

Figure legends

Figure 1. A global network optimization approach for untargeted metabolomics data annotation (NetID). The input data are LC-MS peaks with m/z, retention times, intensities and optional MS2 spectra. The output is a molecular network with peaks (nodes) assigned unique formulae and connected by edges reflecting atom differences arising either through enzymatic reaction (biochemical connection) or mass spectrometry phenomenon (abiotic connection). Peaks are classified as “metabolite” (M+H or M-H peak of formula found in HMDB), “putative metabolite” (formula not found in HMDB but with biochemical connection to a metabolite), or “artifact” (only abiotic connection to a metabolite). The NetID algorithm involves three steps. Initial annotation first matches peaks to HMDB formulae. These seed annotations are then extended through edges to cover most nodes, with the majority of nodes receiving multiple formula annotations. Each node and edge annotation is then scored based on match to known masses, retention times, and MS/MS fragmentation patterns. Global network optimization maximizes the sum of node scores and edge scores, while enforcing a unique formula for each node and a unique transformation relationship for each edge.

Figure 2. Utility of global network optimization. (A) An example network demonstrating the value of the global optimization step in NetID. Node a and node b match HMDB formulae and are connected by an edge of phosphate (HPO3). Node c can be connected to either node a or node b through mutually incompatible annotations, resulting in two different candidate networks. The table below the two candidate networks shows the annotations and scoring criteria for each, with the left network preferred because it has more good node and edge annotations. (B) Visualization of the optimal network obtained from negative mode LC-MS analysis of Baker’s yeast, containing 4851 nodes and 9699 connections. Metabolite and putative metabolite peaks are in green and artifact peaks in purple. (C) Summary table of NetID annotations of negative and positive mode LC-MS data from Baker's yeast and mouse liver.

Figure 3. NetID reveals thiamine-derived metabolites in yeast. (A) Subnetwork surrounding thiamine. Nodes, connections, and formulae are direct output of NetID. Boxes with structures were manually added. (B) MS2 spectra of thiamine, thiamine+C2H2O, and thiamine+C2H4O, with proposed structures of the major fragments. (C) Labeling fraction of thiamine and its derivatives, in [U-13C]glucose with and without unlabeled thiamine in the medium. (D) The thiamine derivatives are also found in mouse tissues and urine. (E) Proposed mechanism for formation of thiamine+C2H4O. Pyruvate dehydrogenase (PDH) decarboxylates pyruvate, and adds the resulting [C2H4O] unit (in red) to thiamine. (F) The same enzymatic mechanism occurs in oxoglutarate dehydrogenase (OGDH) and branched-chain α-ketoacid dehydrogenase complex (BCKDC), and generates thiamine+C4H6O3 and thiamine+C4H8O respectively.

Figure 4.
NetID discovers mammalian taurine derivatives. (A) Subnetwork surrounding taurine from mouse liver extract data. Nodes, connections, and formulae are direct output of NetID. Boxes with structures were manually added. (B) LC-MS chromatogram of N-glucosyl-taurine standard and the putative glucosyl-taurine from liver extract. (C) MS2 spectrum of glucosyl-taurine peak from liver extract (top), and synthetic N-glucosyl-taurine standard (bottom). (D) Isotope labeling pattern of putative glucosyl-taurine in mice, infused via jugular vein catheter for 2 h with [U-13C]glucose. (E) Absolute N-glucosyl-taurine concentration in murine serum and tissues.

Figure 5. NetID applies global optimization for metabolomics data annotation and metabolite discovery.

10_1101-2021_01_07_425697 ---- Capsule network for protein ubiquitination site prediction

Qiyi Huang1,2¶, Jiulei Jiang3¶, Yin Luo2*, Weimin Li4&, Ying Wang3
1 School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, Ningxia, China
2 School of Life Sciences, East China Normal University, Shanghai 200444, China
3 School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou 215500, Jiangsu, China
4 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
* Corresponding author. E-mail: yluo@bio.ecnu.edu.cn (YL)
¶ These authors contributed equally to this work.
& This author also contributed equally to this work.

Copyright: © 2020 Huang et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This project is supported by the National Key R&D Program of China (2018YFE0194000), the National Natural Science Foundation of China (61762002), and the National Statistical Science Research Project (2020LY074).

Competing Interests: The authors have declared that no competing interests exist.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.07.425697; this version posted January 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Abstract

Ubiquitination is one of the most important protein posttranslational modifications and is involved in many biological processes. Traditional ubiquitination site determination methods are expensive and time-consuming, whereas computational prediction methods can predict ubiquitination sites accurately and efficiently. This study combined a convolutional neural network and a capsule network to design a deep learning model, “Caps-Ubi,” for multispecies ubiquitination site prediction. Two encoding methods, one-of-K encoding and continuous amino acid encoding, were used to characterize the sequence pattern of ubiquitination sites. The proposed Caps-Ubi predictor achieved an accuracy of 0.91, a sensitivity of 0.93, a specificity of 0.89, a Matthews correlation coefficient of 0.83, and an area under the receiver operating characteristic curve of 0.96, outperforming the other tested predictors.

Introduction

Ubiquitination is an important posttranslational modification of proteins, consisting of the covalent binding of ubiquitin to a variety of cellular proteins. Ubiquitin was discovered in 1975 by Goldstein et al. [1]; it is a small protein composed of 76 amino acids [2]. Ubiquitination is the process of covalently binding the lysine of a substrate protein to the small ubiquitin molecule under the action of a series of enzymes. Three enzymes are involved in the process: the E1 activating enzyme, the E2 conjugating enzyme, and the E3 ligase. Ubiquitination plays a very important role in basic processes such as signal transduction, cell diseases, DNA repair, and transcription regulation [3–6]. Because of these important biological characteristics, identifying potential ubiquitination sites helps to understand protein regulation and molecular mechanisms. Determining ubiquitination sites with traditional experimental techniques such as mass spectrometry [7] and antibody recognition [8] is costly and time-consuming. Therefore, it is necessary to develop a computational method that can accurately and efficiently recognize protein ubiquitination sites.

In recent years, several computational methods have been developed to predict potential ubiquitination sites. Huang et al. [9] used amino acid composition (AAC), a position weighting matrix, amino acid pair composition (AAPC), a position-specific scoring matrix (PSSM), and other information to develop a predictor called UbiSite using a support vector machine (SVM). Nguyen et al. [10] used an SVM to combine three kinds of information (AAC, evolution information, and AAPC) to develop a predictor. Qiu et al. [11] developed a new predictor called “iUbiq-Lys” that uses sequence evolution information and a gray system model. Chen et al. [12] also applied an SVM to build the UbiProber predictor. Wang et al. [13] introduced physical–chemical attributes into an SVM to develop the ESA-UbiSite predictor. Radivojac et al.
[14] developed the predictor UbPred using a random forest algorithm. Lee et al. [15] developed UbSite using efficient radial basis functions. All of those machine learning-based methods and predictors have promoted the development of ubiquitination site prediction research and achieved good prediction performance. However, most of them rely on manual feature selection, which may lead to imperfect features [16], and their datasets are small despite the large volume of accumulated biomedical data.

Deep learning, a state-of-the-art machine learning approach, can handle large-scale data well. Its multilayer networks and nonlinear mapping operations can fit the complex structure of data. In recent years, deep learning has developed rapidly [16] and has been successfully applied in various fields of bioinformatics [17,18]. Some deep learning-based methods have been used for ubiquitination site identification. For example, Fu et al. [19] applied the one-hot and composition of k-spaced amino acid pairs encoding methods to develop DeepUbi with text-CNN. Liu et al. [20] used deep transfer learning methods to develop the DeepTL-Ubi predictor for multispecies ubiquitination site prediction. He et al. [21] established a multimodel predictor using one-hot encoding, physical–chemical properties of amino acids, and a PSSM.

Although various ubiquitination site predictors and tools have been developed, there are still some limitations, and their accuracy and other performance measures must be further improved.
In this paper, a deep learning model, “Caps-Ubi,” is proposed that uses a capsule network for protein ubiquitination site prediction. In Caps-Ubi, the protein fragments are first encoded with the one-of-K and continuous amino acid methods. Then three convolutional layers and the capsule network layer are used as a feature extractor to obtain the functional domains in the protein fragments and finally to produce the prediction result. Relative to existing tools, Caps-Ubi offers a significant improvement in prediction performance. Researchers could use the predictor to select potential ubiquitination candidate sites and verify them experimentally, which will narrow the range of protein candidates and save time.

Materials and methods

Benchmark dataset

The ubiquitination dataset came from the largest online protein lysine modification database, PLMD 3.0, which contains 20 types of protein lysine modifications. The database has 53,501 proteins and 284,780 protein lysine modification sites, including 25,103 proteins and 121,742 ubiquitination sites. To eliminate errors caused by homologous sequences, we used CD-HIT [22] to filter out homologous sequences with sequence similarities greater than 40%. We obtained 12,100 proteins and 54,586 ubiquitination sites, which were used as the positive sample set. Based on those annotated sequences, 427,305 nonubiquitinated sites were extracted from the proteins as the negative sample set, and CD-HIT-2D [23] was used to filter out sequences with more than 50% similarity to those in the positive sample set. To establish a balanced training model, we randomly selected the same number of negative samples as in the positive sample set and used 90% of the data as the training and validation sets and 10% as the independent test set. Finally, 53,999 data on ubiquitination sites and 50,315 data on nonubiquitination sites were obtained. The final data division is shown in Table 1.

Table 1. Data of protein ubiquitination sites

Dataset       No. of positive data    No. of negative data
Training      44,214                  44,214
Validation    4,913                   4,913
Testing       5,459                   5,459

Input sequence coding

The coding method directly affects the quality of the prediction results; a good feature can capture the correlation between the ubiquitination features and the targets from peptide sequences [24]. Encoding converts the sequence information into numerical form, on which the deep learning model is then trained. In this study, two methods were used to encode the amino acid sequence around the protein ubiquitination site: one-of-K encoding and continuous amino acid encoding.

One-of-K encoding

The one-of-K encoding method was adopted for protein fragments, and each protein fragment was encoded into an m × k 2D matrix, where m is the number of amino acids in each sequence (that is, the length of the input sequence) and k is the number of amino acid types. There are 20 kinds of common amino acids. When a fragment was shorter than the window length, it was padded with “-” on its left or right side, and the “-” was treated as an additional amino acid type, so the alphabet consisted of 21 symbols (k = 21).

Continuous coding of amino acids

The continuous amino acid coding method [25] was proposed by Venkatarajan; it uses 237 physical-chemical properties to quantitatively characterize the 20 amino acids. Five principal components were used to capture the variation in these 237 physical-chemical properties.
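As an illustration of the one-of-K scheme from the preceding subsection, the following minimal Python sketch extracts a padded fragment around a candidate lysine and encodes it as an m × 21 matrix. The function names, the ordering of the alphabet and the example window size are assumptions made for illustration and are not taken from the authors' code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 common amino acids (ordering is an assumption)
ALPHABET = AMINO_ACIDS + "-"          # "-" is the padding symbol, giving k = 21


def extract_fragment(protein, site, half_window):
    """Return the window centred on 0-based position `site`, padded with '-' at the ends."""
    left = protein[max(0, site - half_window):site]
    right = protein[site + 1:site + 1 + half_window]
    return ("-" * (half_window - len(left)) + left + protein[site]
            + right + "-" * (half_window - len(right)))


def one_of_k(fragment):
    """Encode a fragment of length m as an m x 21 matrix with a single 1 per row."""
    index = {symbol: i for i, symbol in enumerate(ALPHABET)}
    matrix = []
    for symbol in fragment:
        row = [0] * len(ALPHABET)
        row[index[symbol]] = 1  # raises KeyError for symbols outside the 21-letter alphabet
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    fragment = extract_fragment("MKTAYK", site=1, half_window=3)
    print(fragment)
    print(one_of_k(fragment)[0])
```

Non-standard residue codes are deliberately not handled here; a real pipeline would need an explicit policy for them.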
In this paper, each amino acid is represented by a 6D vector: the first five dimensions hold the five principal components listed in Table 1 of [25], and the last dimension marks a gap in the input protein fragment of length m. The gap is represented by a dash “-”: when the sequence does not reach the window length, the bit at the padded position is coded as 1; otherwise, it is 0. Finally, each protein fragment is coded into an m × 6 2D matrix. This continuous coding scheme comprehensively considers the physical and chemical properties of amino acids and has a smaller dimension than one-of-K coding. The smaller input dimension leads to a relatively simple network structure, which helps to avoid overfitting.

Capsule network

In a CNN, the pooling layer can extract valuable information from the data, but some location information is lost [26]. Also, a CNN outputs scalar values in neurons, and the information represented by scalar neurons is limited and cannot reflect the spatial position relation of the internal features of the neural network. To solve the problems of scalar neurons, in 2017 Hinton proposed a deep learning architecture called a capsule network [27]. The main building module of a capsule network is the capsule [28], which is a group of neurons whose outputs form a vector. The length of the capsule vector represents the probability of the existence of an entity (the longer the vector, the greater the probability), and the direction of the vector represents the state of the entity.
The capsule network provides a distinctive and powerful deep learning building block that can better model the complex relations within a neural network. A CNN applies scalar activation functions, such as the rectified linear unit (ReLU), sigmoid, and tanh, whereas the capsule network uses a vector activation function called squash:

\[ v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \tag{1} \]

where $v_j$ is the output of capsule $j$, and $s_j$ is the weighted sum of the input vectors of capsule $j$. This function compresses the vector length into the interval [0, 1) while preserving its direction, which can be regarded as a compression and reallocation of the vector length. Except in the first capsule layer, the input $s_j$ of a capsule is the weighted sum of the prediction vectors $\hat{u}_{j|i}$ from the capsules in the layer below, and each prediction vector is obtained by multiplying the output $u_i$ of a lower-layer capsule by the weight matrix $w_{ij}$:

\[ s_j = \sum_i c_{ij} \, \hat{u}_{j|i} \tag{2} \]

\[ \hat{u}_{j|i} = w_{ij} \, u_i \tag{3} \]

where $c_{ij}$ is the coupling coefficient, obtained from $b_{ij}$ by a softmax transformation:

\[ c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{4} \]

By Eq. (4), the coupling coefficients between capsule $i$ in the lower layer and all capsules in the layer above sum to 1. The coupling coefficients are determined by a dynamic routing mechanism; the pseudocode is as follows:

procedure ROUTING($\hat{u}_{j|i}$, r, l)
    for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow 0$
    for r iterations do:
        for all capsules i in layer l: $c_i \leftarrow \mathrm{softmax}(b_i)$
        for all capsules j in layer (l + 1): $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$
        for all capsules j in layer (l + 1): $v_j \leftarrow \mathrm{squash}(s_j)$
        for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
    return $v_j$

The loss function of the capsule network is the margin loss:

\[ L_k = T_k \max(0, m^{+} - \|V_k\|)^2 + \lambda (1 - T_k) \max(0, \|V_k\| - m^{-})^2 \tag{5} \]

where $K$ is the number of categories, $T_k$ is the true label (1 for ubiquitinated, 0 for non-ubiquitinated), and $\|V_k\|$ is the output length of the kth capsule, which is the predicted probability of the kth class. The upper boundary $m^{+}$ is 0.9: a class that is present is penalized when its capsule length falls below $m^{+}$ (a false negative). The lower boundary $m^{-}$ is 0.1: a class that is absent is penalized when its capsule length exceeds $m^{-}$ (a false positive). $\lambda$ is a proportional coefficient of 0.5 that down-weights the loss for absent categories, which prevents the capsule vector lengths of all categories from shrinking in the early stage of training. The total loss is the sum of the losses of the $K$ categories.

Architecture design

As shown in Figure 1, the proposed model contains two identical subnetworks that process the one-of-21 and amino acid continuous encodings, respectively. After each subnetwork has been trained, the features of the two are merged to form the final output. Each subnetwork consists of the same three 1D convolutional layers (Conv1, Conv2, Conv3) and a capsule network layer. The first convolutional layer (Conv1) comprises 256 1D convolution kernels with a length of 1 and a step size of 1 and uses the ReLU activation function. A convolution kernel with a length of 1 first appeared in Network in Network [29]; it can reduce the complexity of the model and allows the network to be made deeper and wider. In this study, it acted as a feature filter and could pool features in the two encoding modes. The second convolutional layer, Conv2, is a conventional convolutional layer with 256 1D convolution kernels with a length of 7 and a step size of 1; it functions as a local feature detector that converts the protein sequence input into corresponding local features. The output of Conv2 can be interpreted as functional-domain features of the protein, and it is used as the input of the next layer, Conv3. The third convolutional layer, Conv3, has 256 1D convolution kernels with a length of 11 and a step size of 1. It uses the ReLU activation function and dropout with a rate of 0.3. Dropout is used to prevent the model from overfitting and to improve its generalization ability. These two convolutional layers increase the feature representation ability of the capsule network and convert the original features of protein fragments into more advanced and abstract features. Then the local features of Conv2 are used as the input of the PrimaryCapsule network layer.
The dimension of each capsule in the PrimaryCapsule layer is 8, the step size is 1, the convolution kernel length is 20, and the squash activation function is used. The last layer, LabelCapsule, consists of capsules with a dimension of 10 and represents the two states of the input protein fragment: ubiquitination site or non-ubiquitination site. Finally, the outputs of the two subnetworks are merged to give the final prediction result.

Figure 1. Network structure of the proposed model

Model training

For model training, we used the Adam [30] optimization algorithm. Adam automatically adapts the learning rate of each parameter, which speeds up training and improves the stability of the model. The learning rate was 0.003, the exponential decay rate for the first-moment estimate was 0.9, and the exponential decay rate for the second-moment estimate was 0.999. The dynamic routing mechanism was consistent with that in the original paper [26]. The number of routing iterations was 3, and the margin loss of Eq. (5) was used as the loss function of the model. The model was trained for 50 epochs. The deep learning framework used was Keras 2.1.4. Keras is a highly modular deep learning framework written in Python that runs on top of a backend such as TensorFlow or Theano; it supports both CPU and GPU. The programming language was Python 3.5, and the model was trained and tested on a Windows 10 system equipped with an Nvidia RTX 2060 GPU.
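The squash activation (Eq. 1), the dynamic routing procedure, and the margin loss (Eq. 5) described in the Capsule network section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Keras implementation: the function names, the tensor layout of the prediction vectors, and the small constant `eps` are choices made here for illustration; only the three routing iterations and the constants 0.9, 0.1, and 0.5 come from the text.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-9):
    """Eq. (1): rescale each vector to a length in [0, 1), keeping its direction.

    eps is an illustrative safeguard against division by zero (an assumption).
    """
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def softmax(b, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_routing(u_hat, r=3):
    """Routing by agreement between two capsule layers.

    u_hat: prediction vectors with assumed shape (n_in, n_out, dim_out),
           i.e. u_hat[i, j] is the prediction of lower capsule i for upper capsule j.
    Returns the upper-layer capsule outputs v with shape (n_out, dim_out).
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                      # b_ij <- 0
    for _ in range(r):
        c = softmax(b, axis=1)                       # Eq. (4): for each i, sum over j is 1
        s = np.einsum("ij,ijd->jd", c, u_hat)        # Eq. (2): weighted sum over i
        v = squash(s)                                # Eq. (1)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # agreement u_hat_{j|i} . v_j
    return v


def margin_loss(t, v_len, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Eq. (5), summed over the K classes.

    t: true labels (1 or 0 per class); v_len: capsule lengths ||V_k||.
    """
    present = t * np.maximum(0.0, m_plus - v_len) ** 2
    absent = lam * (1.0 - t) * np.maximum(0.0, v_len - m_minus) ** 2
    return float(np.sum(present + absent))
```

For example, calling `dynamic_routing` on an array of shape `(n_in, 2, 10)` would return two 10-dimensional capsules, matching the two output states and the capsule dimension quoted for the LabelCapsule layer; their lengths can then be passed to `margin_loss` together with a one-hot label.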
Results

Model evaluation and performance indicators

A confusion matrix is a visual display tool used to evaluate the quality of classification models. Each row of the matrix represents the actual condition of the sample, and each column represents the condition predicted by the model. The matrix contains four values: TP, the number of true positives; TN, the number of true negatives; FP, the number of false positives; and FN, the number of false negatives. The following indicators based on the confusion matrix are usually used to evaluate the prediction performance of the model:

\[ Sn = \frac{TP}{TP + FN} \]

\[ Sp = \frac{TN}{TN + FP} \]

\[ Acc = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \]

Here, Sn is the sensitivity, which measures the prediction performance on positive samples; Sp is the specificity, which measures the prediction performance on negative samples; Acc is the accuracy, the fraction of all samples that are classified correctly; and MCC is the Matthews correlation coefficient, which gives an overall evaluation of the model. The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are also commonly used to evaluate binary classifiers: the larger the AUC value, the better the model performance.

Experimental results

First, we ran a series of experiments to select the window size of the protein fragments.
Because the correlation information between amino acids has a direct effect on the prediction results, we needed to determine an appropriate window size. Previous studies directly used empirical values such as 21, 33, or 49. However, different data models and classifiers tend to favor different window sizes [31]. Therefore, we varied the window length n over the range 21 to 75 and ran a series of experiments with the different window lengths. For each window length, we encoded all training data in the two input modes and trained the respective subnetworks. We then selected the window size according to the prediction results on the validation set. Figure 2 shows the performance for various window sizes in the one-of-21 and amino acid continuous encoding modes.

Figure 2. Accuracy on the validation set for various window lengths

In Figure 2, the abscissa represents the window length, and the ordinate represents the accuracy of the model. It can be seen from Figure 2 that both encoding modes reached their highest accuracy at a window length of 51. Therefore, we set the window length of this model to 51. To compare the performance of the model under different encoding schemes, we compared the capsule network with a CNN that had a similar hierarchical structure and the same training set size. In the CNN, only the PrimaryCapsule layer was replaced with a Conv3 layer, and the LabelCapsule layer was replaced with a 128 × 1 fully connected layer. The comparison results are shown in Table 2.

Table 2. Comparison of various coding schemes

Feature | Model | Acc (%) | Sn (%) | Sp (%) | AUC | MCC
One-of-21 | CapsNet | 89.51 | 93.70 | 85.31 | 0.96 | 0.80
One-of-21 | CNN | 84.93 | 86.39 | 82.93 | 0.93 | 0.70
Amino acid continuous | CapsNet | 90.06 | 91.88 | 88.23 | 0.96 | 0.80
Amino acid continuous | CNN | 83.83 | 85.25 | 82.41 | 0.91 | 0.68
One-of-21 and amino acid continuous | CapsNet | 90.47 | 93.66 | 87.27 | 0.96 | 0.81
One-of-21 and amino acid continuous | CNN | 84.67 | 82.62 | 86.72 | 0.93 | 0.70

Acc, accuracy of the model; Sn, sensitivity of the model; Sp, specificity of the model; AUC, area under curve; MCC, Matthews correlation coefficient.

From Table 2, it can be concluded that the capsule network's accuracies were 5.39%, 7.43%, and 6.85% higher in relative terms (4.58, 6.23, and 5.80 percentage points) than those of the CNN under the one-of-21, amino acid continuous, and combined encodings, respectively, indicating that the capsule network's internal modeling of hierarchical relations gives it an advantage over the CNN. Among the three settings, the capsule network performed best with the combined one-of-21 and amino acid continuous encodings: the proposed Caps-Ubi model achieved an accuracy, sensitivity, specificity, area under curve, and Matthews correlation coefficient of 91.23%, 93.11%, 89.34%, 0.96, and 0.83, respectively. These results were obtained on balanced data. The ROC curve of Caps-Ubi on the test set is shown in Figure 3, which indicates that its predictions were close to the true labels.

Figure 3. Receiver operating characteristic curve of Caps-Ubi on the test set

We trained the model on balanced data, but in the experimentally verified ubiquitination and non-ubiquitination datasets [19] the ratio of positive to negative peptides is 1:8, so we also tested Caps-Ubi on natural-distribution data. The test results are shown in Table 3. According to these results, the performance was slightly worse than that on the balanced data.
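Because the natural-distribution test set is imbalanced (1:8), it can help to see how the confusion-matrix indicators named in the Model evaluation section respond to such data. The following is a minimal sketch; the function name and all counts are illustrative assumptions and are not values from this study.

```python
import math


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero (a convention assumed here)."""
    return num / den if den else 0.0


def confusion_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc, and MCC from the four confusion-matrix counts."""
    sn = _ratio(tp, tp + fn)                      # sensitivity: recall on positives
    sp = _ratio(tn, tn + fp)                      # specificity: recall on negatives
    acc = _ratio(tp + tn, tp + tn + fp + fn)      # overall accuracy
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)        # Matthews correlation coefficient
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical 1:8 test set with 100 positives and 800 negatives.
print(confusion_metrics(tp=60, tn=720, fp=80, fn=40))

# A degenerate classifier that labels every sample negative on the same set.
print(confusion_metrics(tp=0, tn=800, fp=0, fn=100))
```

In the second call, accuracy is about 0.89 even though sensitivity and MCC are both 0, which is why Sn, Sp, and MCC are worth reading alongside accuracy when the classes are imbalanced.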
Table 3. Results of testing Caps-Ubi under natural-distribution data

Protein fragments | Acc (%) | Sn (%) | Sp (%) | AUC | MCC | Positive–negative ratio
1,000 | 53.75 | 0.08 | 0.99 | 0.70 | 0.19 | 1:8
10,000 | 53.30 | 0.12 | 0.95 | 0.59 | 0.12 | 1:8

Acc, accuracy of the model; Sn, sensitivity of the model; Sp, specificity of the model; AUC, area under curve; MCC, Matthews correlation coefficient.

Comparison with other methods

In the past 10 years, many researchers have contributed to the prediction and study of protein ubiquitination sites. We compared the proposed model with other sequence-based prediction tools. The corresponding data and results are shown in Table 4, which shows that the performance of the Caps-Ubi model exceeded that of the best-performing deep learning model, DeepUbi, and of several other prediction models. The accuracy, sensitivity, and specificity of Caps-Ubi were 2.36, 3.31, and 1.24 percentage points higher than those of DeepUbi, respectively, and its area under curve and Matthews correlation coefficient were each 0.05 higher.

Table 4. Proposed Caps-Ubi compared with other methods

Predictor | Acc (%) | Sn (%) | Sp (%) | AUC | MCC
UbiPred | 84.44 | 83.44 | 85.43 | 0.85 | 0.69
UbSite | 74.5 | 65.5 | 74.8 | – | –
CKSAAP_UbSite | 73.4 | 69.85 | 76.96 | 0.81 | 0.47
UbiProber | – | 37.0 | 90.0 | 0.77 | 0.63
iUbiq-Lys | 82.14 | 80.56 | 99.39 | – | 0.50
DeepUbi | 88.98 | 89.8 | 88.10 | 0.91 | 0.78
Caps-Ubi | 91.34 | 93.11 | 89.34 | 0.96 | 0.83

Acc, accuracy of the model; Sn, sensitivity of the model; Sp, specificity of the model; AUC, area under curve; MCC, Matthews correlation coefficient.

Conclusion and outlook

In this paper, a new deep learning model for predicting protein ubiquitination sites is proposed, using one-of-K and amino acid continuous coding modes. We used the largest available protein ubiquitination site dataset, and the experimental results above verify the effectiveness of this model. The operation of the model has four main steps: encoding protein sequences, constructing convolutional layers, constructing a capsule network layer, and constructing an output layer. The capsule network introduces a new building block for deep learning. Relative to a CNN, the capsule network, which uses a dynamic routing mechanism to update parameters, requires more training time, but the time required for prediction is similar. The capsule network can also characterize the complex relations among amino acids in various sequence positions and can explore the internal data distribution related to biochemical significance.
The proposed Caps-Ubi prediction tool will facilitate the sequence analysis of ubiquitination and can also be used to identify other posttranslational modification sites in proteins. In the future, we will study other features that may better extract sample attributes to construct deeper models.

References

1. Goldstein G, Scheid M, Hammerling U, Schlesinger DH, Niall HD, Boyse EA. Isolation of a polypeptide that has lymphocyte-differentiating properties and is probably represented universally in living cells. Proc Natl Acad Sci U S A. 1975;72:11-15.
2. Wilkinson KD. The discovery of ubiquitin-dependent proteolysis. Proc Natl Acad Sci U S A. 2005;102:15280-15282.
3. Hicke L, Schubert HL, Hill CP. Ubiquitin-binding domains. Nat Rev Mol Cell Biol. 2005;6:610-621.
4. Hicke L. Protein regulation by monoubiquitin. Nat Rev Mol Cell Biol. 2001;2:195-201.
5. Pickart CM. Ubiquitin enters the new millennium. Mol Cell. 2001;8:499-504.
6. Haglund K, Dikic I. Ubiquitylation and cell signaling. EMBO J. 2005;24:3353-3359.
7. Peng J, Schwartz D, Elias JE, et al. A proteomics approach to understanding protein ubiquitination. Nat Biotechnol. 2003;21:921-926.
8. Gentry MS, Worby CA, Dixon JE. Insights into Lafora disease: malin is an E3 ubiquitin ligase that ubiquitinates and promotes the degradation of laforin. Proc Natl Acad Sci U S A. 2005;102(24):8501-8506.
9. Huang CH, Su MG, Kao HJ, Jhong JH, Weng SL, Lee TY. UbiSite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. BMC Syst Biol. 2016;10 Suppl 1(Suppl 1):6.
10. Nguyen VN, Huang KY, Huang CH, Lai KR, Lee TY. A New Scheme to Characterize and Identify Protein Ubiquitination Sites. IEEE/ACM Trans Comput Biol Bioinform. 2017;14:393-403.
11. Qiu WR, Xiao X, Lin WZ, Chou KC. iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J Biomol Struct Dyn. 2015;33:1731-1742.
12. Chen X, Qiu JD, Shi SP, Suo SB, Huang SY, Liang RP. Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites. Bioinformatics. 2013;29:1614-1622.
13. Wang JR, Huang WL, Tsai MJ, Hsu KT, Huang HL, Ho SY. ESA-UbiSite: accurate prediction of human ubiquitination sites by identifying a set of effective negatives. Bioinformatics. 2017;33:661-668.
14. Radivojac P, Vacic V, Haynes C, et al. Identification, analysis, and prediction of protein ubiquitination sites. Proteins. 2010;78(2):365-380.
15. Lee TY, Chen SA, Hung HY, Ou YY. Incorporating distant sequence features and radial basis function networks to identify ubiquitin conjugation sites. PLoS One. 2011;6:e17331.
16. Wang D, Zeng S, Xu C, et al. MusiteDeep: a deep-learning framework for general and kinase specific phosphorylation site prediction. Bioinformatics. 2017;33:3909-3916.
17. Shaw D, Chen H, Jiang T. DeepIsoFun: a deep domain adaptation approach to predict isoform functions. Bioinformatics. 2019;35(15):2535-2544.
18. Sun D, Wang M, Feng H, Li A. Prognosis prediction of human breast cancer by integrating deep neural network and support vector machine: Supervised feature extraction and classification for breast cancer prognosis prediction. 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE; 2018.
19. Fu H, Yang Y, Wang X, Wang H, Xu Y. DeepUbi: a deep learning framework for prediction of ubiquitination sites in proteins. BMC Bioinformatics. 2019;20:86.
20. Liu Y, Li A, Zhao XM, Wang M. DeepTL-Ubi: A novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. Methods. 2020;S1046-2023(20)30156-0.
21. He F, Wang R, Li J, Bao L, Xu D, Zhao X. Large-scale prediction of protein ubiquitination sites using a multimodal deep architecture. BMC Syst Biol. 2018;12(Suppl 6):109.
22. Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26:680-682.
23. Huang CH, Su MG, Kao HJ, Jhong JH, Weng SL, Lee TY. UbiSite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. BMC Syst Biol. 2016;10 Suppl 1(Suppl 1):6.
24. Plewczynski D, Tkacz A, Wyrwicz LS, Rychlewski L. AutoMotif server: prediction of single residue post-translational modifications in proteins. Bioinformatics. 2005;21:2525-2527.
25. Venkatarajan MS, Braun W. New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical–chemical properties. Molecular modeling annual. 2001;7(12):445-453.
26. Dombetzki LA. An overview over capsule networks. Network Architectures and Services. 2018.
27. Sabour S, Frosst N, Hinton GE. Dynamic Routing Between Capsules. 2017.
28. Hinton GE, et al. Transforming Auto-encoders. International Conference on Artificial Neural Networks. Springer, Finland; 2011. pp. 44–51.
29. Lin M, Chen Q, Yan S. Network in network. arXiv preprint arXiv:1312.4400. 2013.
30. Kingma D, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

10_1101-2021_01_07_425716 ---- Comprehensive comparison of transcriptomes in SARS-CoV-2 infection: alternative entry routes and innate immune responses

Yingying Cao1∗, Xintian Xu2, Simo Kitanovski1, Lina Song3, Jun Wang1, Pei Hao2,4∗, Daniel Hoffmann1∗

1Bioinformatics and Computational Biophysics, Faculty of Biology and Center for Medical Biotechnology, University of Duisburg-Essen, Essen 45141, Germany
2Key Laboratory of Molecular Virology and Immunology, Institut Pasteur of Shanghai, Center for Biosafety Mega-Science, Chinese Academy of Sciences, Shanghai 200031, China
3Translational Skin Cancer Research, German Consortium for Translational Cancer Research, Essen, Germany
4The Joint Program in Infection and Immunity: a. Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou 510623, China; b. Institut Pasteur of Shanghai, Chinese Academy of Sciences, Shanghai 200031, China

∗To whom correspondence should be addressed; E-mail: daniel.hoffmann@uni-due.de, phao@ips.ac.cn, yingying.cao@uni-due.de.

The pathogenesis of COVID-19 emerges as complex, with multiple factors leading to injury of different organs. Several studies on underlying cellular processes have produced contradictory claims, e.g. on SARS-CoV-2 cell entry or innate immune responses.
However, clarity in these matters is imperative for therapy development. We therefore performed a meta-study with a diverse set of transcriptomes under infections with SARS-CoV-2, SARS-CoV and MERS-CoV, including data from different cells and COVID-19 patients. Using these data, we investigated viral entry routes and innate immune responses. First, our analyses support the existence of cell entry mechanisms for SARS-CoV and SARS-CoV-2 other than the ACE2 route, with evidence of inefficient infection of cells without expression of ACE2; expression of TMPRSS2/TMPRSS4 is unnecessary for efficient SARS-CoV-2 infection, with evidence of efficient infection of A549 cells transduced with a vector expressing human ACE2. Second, we find that innate immune responses in terms of interferons and interferon-stimulated genes are strong in relevant cells, for example Calu3 cells, but vary markedly with cell type, virus dose, and virus type.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.07.425716; this version posted January 7, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Introduction

Coronaviruses are non-segmented positive-sense RNA viruses with a genome of around 30 kilobases. The genome has a 5’ cap structure along with a 3’ poly(A) tail, which allows it to act as mRNA for translation of the replicase polyproteins. The replicase gene occupies approximately two thirds of the entire genome and encodes 16 non-structural proteins (nsps). The remaining third of the genome contains open reading frames (orfs) that encode accessory proteins and four structural proteins: spike (S), envelope (E), membrane (M), and nucleocapsid (N) (1).
Over the past 20 years, three epidemics or pandemics of life-threatening diseases have been caused by three closely related coronaviruses: severe acute respiratory syndrome coronavirus (SARS-CoV), which emerged with nearly 10 % mortality (2, 3) in 2002-2003 and spread to 26 countries before being contained; Middle East respiratory syndrome coronavirus (MERS-CoV), with mortality around 34 % (4, 5), starting in 2012 and since then spreading to 27 countries; and SARS-CoV-2, emerging in late 2019 (6), which has caused many millions of confirmed cases and > 1 million deaths worldwide (7). Infection with SARS-CoV, MERS-CoV or SARS-CoV-2 can cause a severe acute respiratory illness with similar symptoms, including fever, cough, and shortness of breath. SARS-CoV-2 is a new coronavirus, but its similarity to SARS-CoV (amino acid sequences about 76% identical (8)) and MERS-CoV suggests comparisons to these earlier epidemics. Despite the difference in the total number of cases caused by SARS-CoV and SARS-CoV-2 (3, 7) due to different transmission rates, the outbreak caused by SARS-CoV-2 resembles the outbreak of SARS: both emerged in winter and were linked to exposure to wild animals sold at markets. Although MERS-CoV has high morbidity and mortality rates, the lack of autopsies from MERS-CoV cases has hindered our understanding of MERS-CoV pathogenesis in humans. Until now there are no specific anti-SARS-CoV-2, anti-SARS-CoV or anti-MERS-CoV therapeutics approved for human use. There are several points of attack for potential anti-SARS-CoV-2/SARS-CoV/MERS-CoV therapies, e.g.
intervention in cell entry mechanisms to prevent virus invasion, or acting on the host immune system to kill the infected cells and thus prevent replication of the invading viruses. A better understanding of virus entry mechanisms and the immune responses can therefore guide the development of novel therapeutics. Virus entry into host cells is the first step of the viral life cycle. It is an essential component of cross-species transmission and an important determinant of virus pathogenesis and infectivity (9, 10), and also constitutes an antiviral target for treatment and prevention (11). It seems that SARS-CoV and SARS-CoV-2 use similar virus entry mechanisms (12). The infection of target cells by SARS-CoV or SARS-CoV-2 was initially identified to occur by cell-surface membrane fusion (13, 14). Some later studies have shown that SARS-CoV can also infect cells through receptor-mediated endocytosis (15, 16). Both mechanisms require the S protein of SARS-CoV or SARS-CoV-2 to bind to angiotensin converting enzyme 2 (ACE2), and the S protein of MERS-CoV to bind to dipeptidyl peptidase 4 (DPP4) (17), through their respective receptor-binding domains (RBD) (18). In addition to ACE2 and DPP4, some recent studies suggest that there are possibly other coronavirus-associated receptors and factors that facilitate the infection of SARS-CoV-2 (19), including the cell surface proteins Basigin (BSG or CD147) (20) and CD209 (21). Recently, clinical data have revealed that SARS-CoV-2 can infect several organs where ACE2 expression could not be detected in healthy individuals (22, 23), which highlights the need for closer inspection of virus entry mechanisms.
The binding of the S protein to a cell-surface receptor is not sufficient for infection of the host cell (24). In the cell-surface membrane fusion mechanism, after binding to the receptor, the S protein requires proteolytic activation by cell-surface proteases such as TMPRSS2, TMPRSS4, or other members of the TMPRSS family (14, 25, 26), followed by the fusion of virus and target cell membranes. In the alternative receptor-mediated endocytosis mechanism, the endocytosed virion undergoes an activation step in the endosome, resulting in the fusion of virus and endosome membranes and the release of the viral genome into the cytoplasm. The endosomal cysteine proteases cathepsin B (CTSB) and cathepsin L (CTSL) (27) might be involved in the fusion of virus and endosome membranes. The availability of these proteases in target cells largely determines whether viruses infect the cells through cell-surface membrane fusion or receptor-mediated endocytosis. How the presence of these proteases affects the efficiency of infection with SARS-CoV-2, SARS-CoV and MERS-CoV remains elusive.

When the virus enters a cell, it may trigger an innate immune response, a crucial component of the defense against viral invasion. Compounds that regulate innate immune responses can be introduced as antiviral agents (10). The innate immune response is initiated when pattern recognition receptors (PRRs) such as Toll-like receptors (TLRs) and cytoplasmic retinoic acid-inducible gene I (RIG-I) like receptors (RLRs) recognize molecular structures of the invading virus (28, 29). This pattern recognition activates several signaling pathways and then downstream transcription factors such as interferon regulatory factors (IRFs) and nuclear factor κB (NF-κB). Transcriptional activation of IRFs and NF-κB stimulates the expression of type I (α or β) and type III (λ) interferons (IFNs). IFN-α (IFNA1, IFNA2, etc.), IFN-β (IFNB1) and IFN-λ (IFNL1-4) are important cytokines of the innate immune responses. IFNs bind to and induce signaling through their corresponding receptors (IFNAR for IFN-α/β and IFNLR for IFN-λ), and subsequently induce expression of IFN-stimulated genes (ISGs) (e.g. MX1, ISG15 and OASL) and pro-inflammatory chemokines (e.g. CXCL8 and CCL2) to suppress viral replication and dissemination (30, 31). A dysregulated inflammatory host response results in acute respiratory distress syndrome (ARDS), a leading cause of COVID-19 mortality (32).

One attractive therapy option to combat COVID-19 is to harness the IFN-mediated innate immune responses. Clinical trials with type I and type III IFNs for treatment of COVID-19 have been conducted and many more are still ongoing (33, 34). In this regard, the kinetics of the secretion of IFNs in the course of SARS-CoV-2 infection needs to be defined. Unfortunately, some results on the host innate immune responses to SARS-CoV-2 are apparently at odds with each other (35–39), e.g. it is unclear whether SARS-CoV-2 infection induces low IFNs and moderate ISGs (35), or robust IFN responses and markedly elevated expression of ISGs (36–39). This has to be clarified. The use of IFNs as a treatment in COVID-19 is now a subject of debate as well (40). Thus, the kinetics of IFN secretion relative to the kinetics of virus replication need to be thoroughly examined to better understand the biology of IFNs in the course of SARS-CoV-2 infection and thus provide guidance to identify the temporal window of therapeutic opportunity.
We have collected and analyzed a diverse set of publicly available transcriptome data (35, 41–45): (1) bulk RNA-Seq data with different types of cells, including human non-small cell lung carcinoma cell line (H1299), human lung fibroblast-derived cells (MRC5), human alveolar basal epithelial carcinoma cell line (A549), A549 cells transduced with a vector expressing human ACE2 (A549-ACE2), primary normal human bronchial epithelial cells (NHBE), heterogeneous human epithelial colorectal adenocarcinoma cells (Caco2), and African green monkey (Chlorocebus sabaeus) kidney epithelial cells (Vero E6) infected with SARS-CoV-2, SARS-CoV and MERS-CoV (Table 1); (2) RNA-Seq data of lung samples, peripheral blood mononuclear cell (PBMC) samples, and bronchoalveolar lavage fluid (BALF) samples of COVID-19 patients and their corresponding healthy controls (Table 1 and Table 2). Using this collection, we systematically evaluated the replication and transcription status of virus in these cells, expression levels of coronavirus-associated receptors and factors, as well as the innate immune responses of these cells during virus infection.

Results

Different infection efficiency of SARS-CoV-2, SARS-CoV and MERS-CoV in different cell types

The RNA-Seq data for all samples can be aligned to the genome of the corresponding virus to evaluate the infection efficiency in cells, estimated by the mapping rate to the virus genome, i.e. the percentage of viral RNAs in intracellular RNAs.
To assess the infection efficiency of SARS-CoV-2, SARS-CoV, and MERS-CoV in different types of cells, we collected and analyzed a comprehensive public dataset of RNA-Seq data from cells infected with these viruses at 24 hours post infection (hpi) with comparable multiplicity of infection (MOI) (Table 1). MOI refers to the number of viruses that are added per cell in infection experiments. For example, if 2000 viruses are added to 1000 cells, the MOI is 2. Our analysis shows that the infection efficiency of viruses can be both cell type dependent and virus dose dependent (Fig. 1). MERS-CoV can efficiently infect MRC5 and Vero E6 cells. However, the infection efficiency is strongly influenced by MOI within the same type of cells. Cells infected at a low MOI, say 0.1, have significantly lower mapping rates than those infected at a high MOI, say 3 (Fig. 1). For SARS-CoV and SARS-CoV-2, the infection efficiency is strongly influenced by cell type. For SARS-CoV-2, there is efficient virus infection in A549-ACE2, Calu3, Caco2, and Vero E6 cells, but not in A549, H1299, or NHBE cells (Fig. 1 and Table S1). The mapping rates in A549, H1299, and NHBE cells are low even at high MOIs (Fig. 1 and Table S1). Similar to SARS-CoV-2, the infection by SARS-CoV is also cell type dependent: Vero E6 cells and Calu3 cells show high mapping rates to the SARS-CoV genome, but the mapping rates of SARS-CoV in MRC5 and H1299 cells are close to zero even at the high MOI of 3 (Fig. 1 and Table S1).
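The two quantities used in this subsection, MOI and mapping rate, are simple ratios. The Python sketch below is illustrative only: the function names are hypothetical, and the read counts in the second example are invented for demonstration and are not taken from the study.

```python
def multiplicity_of_infection(virus_particles: float, cells: float) -> float:
    """MOI: number of virus particles added per cell."""
    if cells <= 0:
        raise ValueError("cells must be positive")
    return virus_particles / cells


def mapping_rate(viral_reads: int, total_reads: int) -> float:
    """Percentage of sequenced reads that align to the virus genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * viral_reads / total_reads


# Example from the text: 2000 viruses added to 1000 cells give an MOI of 2.
print(multiplicity_of_infection(2000, 1000))

# Hypothetical counts: 15,600 viral reads among 1,000,000 reads in total.
print(mapping_rate(15_600, 1_000_000))
```

In practice the read counts would come from the alignment step described in the Methods rather than being entered by hand.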
Because “total RNA” libraries (see Methods/Data collection) additionally include negative-strand templates of the virus, their mapping rates are usually much higher than those of libraries prepared with the polyA+ selection method under the same conditions (Fig. 1 and Table S1).
Evidence for multiple entry mechanisms for SARS-CoV-2 and SARS-CoV
To examine the detailed replication and transcription status of these viruses in the cells, we calculated the number of reads (depth) mapped to each site of the corresponding virus genome (Fig. 2). For better comparison, these read numbers were log10 transformed. The replication and transcription of MERS-CoV, SARS-CoV-2 and SARS-CoV share an uneven pattern of expression along the genome, typically with a minimum depth in the first half of the viral genome and the maximum towards the end. The regions with very high depth include, in particular, the coding regions for the structural proteins S, E, M, and N, as well as the first coding regions with nsp1 and nsp2. BALF samples of COVID-19 patients are an exception and show a more irregular, fluctuating behavior along the genome (Fig. 2B). This deviation from the cellular expression pattern is not surprising, because BALF is not a well-organized tissue but a mixture of many components, some of which will probably digest viral RNA.
Interestingly, the uneven transcription pattern seen in efficient infections with SARS-CoV-2, SARS-CoV, and MERS-CoV is also visible for inefficient infection with SARS-CoV-2 in A549, NHBE, and H1299 cells, and SARS-CoV in H1299 and MRC5 cells (Fig.
2C, D), although there the total mapping rates to the corresponding virus genomes are much lower (Fig. 1).
To further elucidate the corresponding entry mechanisms for different types of cells, we examined the expression levels of those receptors and proteases that have already been described as facilitating target cell infection (Fig. 3). Our analysis shows that MERS-CoV can efficiently infect MRC5 and Vero E6 cells (Fig. 1 and Fig. 2E), both of which express DPP4 (Fig. 3A). Compared to Vero E6 cells, however, MRC5 cells infected with MERS-CoV have higher expression levels of DPP4 (Fig. 3A) but lower mapping rates to the virus genome (Fig. 1). These observations show that higher expression levels of the receptor (DPP4) do not guarantee higher MERS-CoV infection efficiency in cells. This is also true for the SARS-CoV-2 receptor ACE2, whose expression is three orders of magnitude higher in A549-ACE2 cells than in Vero E6 cells (Fig. 3B), while both cell types produce about the same amount of virus (Fig. 1).
Although SARS-CoV-2 can efficiently infect A549-ACE2 cells (Fig. 1 and Fig. 2), these cells show no expression of TMPRSS2 or TMPRSS4 (Fig. 3C, D), which are needed for the canonical cell-surface membrane fusion mechanism (Fig. 3J). However, there are considerable expression levels of CTSB and CTSL (Fig. 3E, F), which are involved in endocytosis (Fig. 3J). In A549, H1299, and MRC5 cells, which do express small amounts of SARS-CoV-2 and SARS-CoV virus (Fig. 1, Fig. 2C, D), there is no ACE2 expression at all (Fig. 3B). This could point to an alternative ACE2-independent entry mechanism for SARS-CoV-2 and SARS-CoV (Fig. 3J).
Since there were already reports about alternative SARS-CoV-2 receptors such as BSG/CD147 and CD209 (20, 21), we examined their expression in these cells as well (Fig. 3G, H). In all cells, the expression of BSG is at a similar level of 2-3 (Fig. 3G), and the expression of CD209 is very low. Certainly, CD209 and BSG alone cannot explain the differences in virus expression (Fig. 1), nor can we exclude other low-efficiency entry mechanisms. It could be, for example, that relatively inefficient alternative entry paths are often present but are masked in some cells by more efficient entry via ACE2/TMPRSS.
To gain a comprehensive overview, we clustered cells with respect to gene expression levels of coronavirus-associated receptors and factors (Fig. 3I), and summarized conceivable mechanisms accordingly (Fig. 3J). Since all cells show high expression levels of CTSB and CTSL, the major differences between these cells lie in the expression levels of ACE2, TMPRSS2 and TMPRSS4.
Cell-surface membrane fusion (Fig. 3J, 1a) might be mainly used in SARS-CoV-2 infection of Calu3, Caco2, and NHBE cells, where there is low to moderate expression of ACE2 and moderate expression of TMPRSS2 and TMPRSS4. Endocytosis (Fig. 3J, 1b) might be mainly used in SARS-CoV-2 infection of A549-ACE2 cells, where ACE2 is expressed at high levels but there is no expression of TMPRSS2 or TMPRSS4. An alternative ACE2-independent route (Fig. 3J, 1c), in the absence of ACE2, TMPRSS2, or TMPRSS4, could be mainly employed in SARS-CoV-2 infection of MRC5, A549, and H1299 cells. Note that although the expression pattern of coronavirus-associated receptors and factors in NHBE cells is similar to that in Caco2 cells, NHBE cells are not infected efficiently by SARS-CoV-2. Vero E6 cells have moderate expression of ACE2 and low expression of TMPRSS2 and TMPRSS4, so all of the entry mechanisms mentioned above could contribute to SARS-CoV-2 infection of Vero E6 cells.
Strength of IFN/ISG response varies between cell lines and viruses, with strong response to SARS-CoV-2 in relevant cells
As a virus enters a cell, it may trigger an innate immune response, i.e. the cell may start expressing various types of innate immunity molecules at different strengths. There is currently an intense debate about which of these molecules, especially IFNs and ISGs, are expressed and how strongly (35–39). We therefore focused our analysis on innate immunity molecules such as IFNs, ISGs, and pro-inflammatory cytokines. To broaden the basis for conclusions, we analyzed, apart from cell lines, bulk RNA-Seq data of lung, PBMC, and BALF samples of COVID-19 patients, and single-cell RNA-Seq data of BALF samples from moderate and severe COVID-19 patients; for each type of patient data, we also included healthy controls. Gene expression was compared quantitatively in terms of TPM (transcripts per million), as well as log fold changes (logFC) with respect to healthy controls (human samples) or mock-infected cultures (cell lines) (Fig. S1, Fig. S2).
The heatmap and clustering dendrogram of the logFC of IFNs, ISGs and pro-inflammatory cytokines in Fig. 4A reveal broadly two groups of samples with fundamentally different expression of ISGs, IFNs, and pro-inflammatory cytokines.
The top cluster in Fig. 4A comprises samples that show a weaker innate immune response, including the two PBMC samples of COVID-19 patients; A549, NHBE, Caco2, and H1299 cells infected with SARS-CoV-2; A549-ACE2 cells infected with SARS-CoV-2 at the lower MOI (0.2); MRC5 cells infected with SARS-CoV; and MRC5 and Vero E6 cells infected with MERS-CoV. The bottom cluster in Fig.
4A comprises samples that show a stronger innate immune response, including BALF and lung samples of COVID-19 patients, Calu3 cells infected with SARS-CoV-2, A549-ACE2 cells infected with SARS-CoV-2 at the higher MOI (2), as well as Vero E6 cells infected with SARS-CoV-2 and SARS-CoV. Most of the samples in the bottom cluster show markedly elevated levels of ISGs and elevated pro-inflammatory cytokines. Four samples in the bottom cluster are an exception, namely Lung.1/2 and BALF.1/2, which show a mixture of up- and down-regulation of ISGs and pro-inflammatory cytokines. In this respect, these four samples from patients with unknown COVID-19 severity differ from the BALF samples from moderate and severe COVID-19 patients.
IFN expression is also not upregulated in most of the lung, PBMC and BALF samples of COVID-19 patients for which no information about infection severity is available. However, we estimated the severity of their infection by aligning all the samples to the SARS-CoV-2 genome. There are no (0.00%) reads mapping to the SARS-CoV-2 genome in the PBMC samples. For the two BALF samples, there are low mapping rates (1.56% and 0.65%) to the SARS-CoV-2 genome. The expression levels of ACE2 in these tissues (PBMC, lung and BALF samples) of healthy individuals are around zero (Fig. S8), which explains why there are almost no virus reads in these tissues. One of the two lung samples (accession number: SAMN14563387) has slightly upregulated IFNL1 (Fig. S6), which was not noted in the original publication (35), although the total mapping rates to the virus genome are 0.00% for both of these lung samples.
We then checked the detailed coverage along the virus genome. A small number of virus reads aligned to the SARS-CoV-2 genome in this sample (Fig. S7). Unlike the other lung samples, which did not express ACE2, this lung sample expressed ACE2 at a considerable level (5.45 TPM, Table S2). This result implies that when SARS-CoV-2 successfully enters the lung, or when the lung tissue chosen for sequencing is successfully infected by SARS-CoV-2, IFNs (at least IFNL1) can be upregulated.
Calu3 cells infected with SARS-CoV and SARS-CoV-2, and A549-ACE2 cells infected with SARS-CoV-2 at a high MOI of 2, have upregulated IFNB1, IFNL1, IFNL2 and IFNL3 (Fig. 4B-E). A549, H1299, NHBE (Fig. 4B-E), and MRC5 cells (Fig. S3), which do not support efficient virus infection, show no upregulation of IFNs. Low levels of IFN expression are also observed in Caco2 cells, which are efficiently infected with SARS-CoV and SARS-CoV-2. The same is true for A549-ACE2 cells infected with SARS-CoV-2 at a low MOI of 0.2. In Vero E6 cells, IFNL1 is also upregulated upon infection with SARS-CoV and SARS-CoV-2, but not with MERS-CoV (Fig. 4F). In BALF samples of moderate and severe COVID-19 patients, upregulation of IFNs is not as obvious as in Calu3 cells, but is still present in some patients.
These observations demonstrate that the innate immune response depends in complex ways on cell line, viral dose, and virus. Several studies (36–39) reported robust IFN responses and markedly elevated expression of ISGs in SARS-CoV-2 infection of different cells and patient samples.
Conversely, the study by (35) concluded that a weak IFN response and moderate ISG expression are characteristic of SARS-CoV-2 infection. This apparent contradiction can be resolved if we consider that Ref. (35) generalized from patient samples and cells that were only weakly infected, and that in such cases the host, in fact, responds with low levels of IFNs and ISGs. On the other hand, Ref. (35) treated efficiently infected cells, such as Calu3 and A549-ACE2 (at an MOI of 2), as exceptions. However, our meta-analysis shows that these are not exceptions but are typical of severely infected target cells, which have robust IFN responses and ISG expression (cluster 2 in Fig. 4A).
Discussion
One attractive potential anti-SARS-CoV-2 therapy is intervention in the cell entry mechanisms (12). However, the entry mechanisms of SARS-CoV-2 into human cells are partly unknown. During the last few months, scientists have confirmed that SARS-CoV-2 and SARS-CoV both use human ACE2 as entry receptor, and human proteases like TMPRSS2 and TMPRSS4 (8, 14, 25) and lysosomal proteases like CTSB and CTSL (27) as entry activators. Since ACE2 is beneficial in cardiovascular diseases such as hypertension or heart failure (46), treatments targeting ACE2 could have a negative effect. Inhibitors of CTSL (47) or TMPRSS2 (14) are seen as potential treatment options for SARS-CoV and SARS-CoV-2. However, alternative coronavirus-associated receptors and factors, including BSG/CD147 (20) and CD209 (21), have recently been proposed to facilitate virus invasion. Additionally, clinical data on SARS-CoV-2 infection have shown that SARS-CoV-2 can infect several organs where ACE2 expression could not be detected (22, 23), urging us to explore other potential entry routes.
First, our analyses here have shown that even without expression of TMPRSS2 or TMPRSS4, high SARS-CoV-2 infection efficiency in cells is possible (Fig. 1A, C) with considerable expression levels of CTSB and CTSL (Fig. 2E, F). This suggests receptor-mediated endocytosis (15, 16, 27) as an alternative major entry mechanism. Given this TMPRSS-independent route, TMPRSS inhibitors will likely not provide complete protection. The studies designed to predict the tropism of SARS-CoV-2 by profiling the expression levels of ACE2 and TMPRSS2 across healthy tissues (48, 49) may need to be reconsidered as well.
Second, the evidence presented in our study suggests a further, possibly undiscovered entry mechanism for SARS-CoV-2 and SARS-CoV (Fig. 2). Although BSG/CD147 has recently been proposed as an alternative receptor (20), later experiments found no evidence supporting the role of BSG/CD147 as a putative spike-binding receptor (50). The expression patterns of BSG/CD147 in different types of cells observed in our study could not explain the difference in virus loads observed in these cells either. CD209 and CD209L were recently reported as attachment factors that contribute to SARS-CoV-2 infection in human cells as well (21). However, CD209 expression in the cell lines included here is low. Another reasonable hypothesis is that the inefficient ACE2-independent entry mechanism we observed could be macropinocytosis, an endocytic pathway that does not require receptors (51). So far there is no direct evidence for the involvement of macropinocytosis in the SARS-CoV-2 and SARS-CoV entry mechanism. To confirm such an involvement, specific experiments are needed.
Moreover, this ACE2-independent entry mechanism only enables inefficient infection by SARS-CoV and SARS-CoV-2 (Fig. 2) and therefore cannot be a major entry mechanism.
Fig. 3J summarizes the outcomes of our study with respect to entry mechanisms. The observations across the broad range of transcriptome data can only be explained if there are several entry routes. This is certainly a challenge to be reckoned with in the development of antiviral therapeutics (52).
Another attractive potential anti-SARS-CoV-2 point of attack is supporting the human innate immune system to kill the infected cells and thus disrupt viral replication. Not surprisingly, research in this area is flourishing but sometimes generates conflicting results, especially on the involvement of type I and III IFNs and ISGs (35–39). The results of our analyses could help to resolve the confusion about the involvement of IFNs and ISGs.
We found that immune responses in Calu3 cells infected with SARS-CoV and SARS-CoV-2 resemble those of BALF samples of moderate and severe COVID-19 patients, with elevated levels of type I and III IFNs, robust ISG induction as well as markedly elevated pro-inflammatory cytokines, in agreement with recent studies (36–39). However, this picture differs from the one reported by (35), with low levels of IFNs and moderate ISGs. This latter study was partially based on A549 cells and NHBE cells with nearly no ACE2 expression and a very low mapping rate to the viral genome, and on lung samples of two patients (both show a 0.00% mapping rate to the virus genome). Hence, given that there was no efficient virus infection in these cells, the low levels of IFNs and ISGs were to be expected.
However, in one of the lung samples sequenced by (35) (accession number: SAMN14563387), we observed a slight upregulation of IFNL1 (Fig. S6), which was not noted in the original publication, together with considerable ACE2 expression (5.45 TPM, Table S2) and a few virus reads aligned to the SARS-CoV-2 genome (Fig. S7). This result suggests that levels of IFNs and ISGs are associated with viral load and severity of virus infection.
We found low induction of IFNs and moderate expression of ISGs in PBMC samples and BALF samples of COVID-19 patients (Fig. 4, Fig. S5). In these PBMC samples, there are no (0.00%) virus reads mapping to the SARS-CoV-2 genome. The failure to detect virus reads in these three PBMC samples can be explained by the absence of efficient entry routes (e.g. no expression of ACE2 in PBMC samples of healthy individuals, Fig. S8), or by the cell types being otherwise incompatible with viral replication. This observation is consistent with the studies on SARS-CoV (53–55) that found abortive infections of macrophages, monocytes, and dendritic cells; moreover, replication of SARS-CoV in PBMC samples is also self-limiting. However, due to the limited number of PBMC, BALF and lung samples included in this study, and the lack of information on the infection stage and infection severity of these COVID-19 patients, the assessment of IFNs and ISGs as well as of SARS-CoV-2 infection in these samples may not be representative of the host response against SARS-CoV-2. Future studies that also include other affected organs from more patients with different infection stages and severity are necessary for a better understanding of the immune responses.
Several unexpected observations need further investigation. First, A549-ACE2 and Caco2 cells are efficiently infected at low MOIs of 0.2 and 0.3, respectively (Fig. 1), but fail to upregulate IFN expression (Fig. 4B-E). Their cellular immune responses are more similar to those of cells that cannot support efficient virus infection (Fig. 4A). These results suggest that in Caco2 and A549-ACE2 cells the invasion of SARS-CoV-2 or SARS-CoV at low MOI shuts down or fails to activate the innate immune system.
Based on the results above, multiple factors, including disease severity, organ, cell type and virus dose, contribute to the variability in the innate immune responses. For a better characterization of the innate immune responses, a more comprehensive profiling is necessary, including patients with infections at different stages, with different levels of severity, and with different clinical outcomes. Further, a larger array of cell types should be profiled over time after infection with different virus doses. In this way we would be better able to understand the kinetics of IFNs and ISGs in response to SARS-CoV-2 infection.
In summary, our study has comparatively analyzed an extensive data collection from different cell types infected with SARS-CoV-2, SARS-CoV and MERS-CoV, and from COVID-19 patients. We have presented evidence for multiple SARS-CoV-2 entry mechanisms. We could also resolve apparent conflicts on innate immune responses in SARS-CoV-2 infection (35–39) by drawing upon a larger set of cell types and infection severities.
The results emphasize the complexity of interactions between host and SARS-CoV-2, offer new insights into the pathogenesis of SARS-CoV-2, and can inform the development of antiviral drugs.
Materials and Methods
Data collection
After the successful release of the virus genome into the cytoplasm, a negative-strand genomic-length RNA is synthesized as the template for replication. Negative-strand subgenome-length mRNAs are also formed from the virus genome as discontinuous RNAs and are used as the templates for transcription. In the public data we collected for the analysis, two main library preparation methods were used to remove the highly abundant ribosomal RNAs (rRNA) from total RNA before sequencing: polyA+ selection and rRNA depletion (56). Coronavirus genomic and subgenomic mRNAs are known to carry a polyA tail at their 3’ ends. PolyA+ RNA-Seq therefore captures (1) virus genomic sequence from virus replication, i.e. genomic RNAs replicated from the negative-strand template, and (2) subgenomic mRNAs from virus transcription. rRNA-depletion RNA-Seq captures (1) virus genomic sequence from virus replication, i.e. both the genomic RNAs replicated from the negative-strand template and the negative-strand templates themselves, and (2) subgenomic mRNAs from virus transcription. Unless stated otherwise, polyA+ selection was used; “total RNA” indicates that the rRNA-depletion method was used to prepare the sequencing libraries.
The raw FASTQ data of different cell types infected with SARS-CoV-2, SARS-CoV and MERS-CoV, and of lung samples of COVID-19 patients and healthy controls, were retrieved from NCBI (57) (https://www.ncbi.nlm.nih.gov/) and ENA (58) (https://www.ebi.ac.uk/ena) (accession numbers GSE147507 (35), GSE56189, GSE148729 (41) and GSE153940 (59)). The raw FASTQ data of PBMC and BALF samples of COVID-19 patients and corresponding controls were downloaded from BIG Data Center (60) (https://bigd.big.ac.cn/) (accession number CRA002390) (42), and the raw FASTQ data for BALF healthy control samples were downloaded from NCBI (accession numbers SRR10571724, SRR10571730, and SRR10571732 under project PRJNA434133 (43)). The preprocessed single-cell RNA-Seq data of BALF samples from 6 severe COVID-19 patients and 3 moderate COVID-19 patients were downloaded from NCBI with accession number GSE145926 (44). The preprocessed single-cell RNA-Seq data of a BALF sample from a healthy control were retrieved from NCBI (accession number GSM3660650 under project PRJNA526088 (45)). Detailed information about these public datasets is available in the supplementary file: Supplementary.pdf
For analysis, the human GRCh38 release 99 transcriptome and the green monkey (Chlorocebus sabaeus) ChlSab1.1 release 99 transcriptome and their corresponding annotation GTF files were downloaded from ENSEMBL (61) (https://www.ensembl.org). The reference virus genomes were downloaded from NCBI: SARS-CoV-2 (GenBank: MN985325.1), SARS-CoV (GenBank: AY278741.1), MERS-CoV (GenBank: JX869059.2).
Data analysis workflow
The workflow of this study is summarized in Fig. S1 and Fig. S2 in the supplementary file: Supplementary.pdf. The quality of the raw FASTQ data was examined with FastQC (62). Trimmomatic-0.36 (63) was used to remove adapters and filter out low-quality reads with parameters “-threads 4 -phred33 ILLUMINACLIP:adapters.fasta:2:30:10 HEADCROP:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:36”.
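As a rough illustration of how the trimming parameters quoted above fit into a Trimmomatic call, the Python sketch below assembles the argument list for a single-end run. The jar location, the input and output file names, and the helper function name are assumptions made for this example; the text does not specify them, and paired-end data would need Trimmomatic's PE mode with two inputs and four outputs instead.

```python
import shlex

# Trimming steps exactly as quoted in the text, in the same order.
TRIM_STEPS = [
    "ILLUMINACLIP:adapters.fasta:2:30:10",
    "HEADCROP:10",
    "LEADING:20",
    "TRAILING:20",
    "SLIDINGWINDOW:4:20",
    "MINLEN:36",
]


def trimmomatic_se_command(fastq_in, fastq_out, jar="trimmomatic-0.36.jar"):
    """Build the argument list for a single-end (SE) Trimmomatic run.

    The jar path and file names are placeholders, not values from the study.
    """
    return [
        "java", "-jar", jar, "SE",
        "-threads", "4", "-phred33",
        fastq_in, fastq_out,
        *TRIM_STEPS,
    ]


cmd = trimmomatic_se_command("sample.fastq.gz", "sample.trimmed.fastq.gz")
print(shlex.join(cmd))  # show the command line without running it
# To execute it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than one string avoids shell quoting issues if file names contain spaces.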
The clean RNA sequencing reads were then pseudo-aligned to the reference transcriptome and quantified using Kallisto (version 0.43.1) (64) with parameters “-b 30 --single -l 180 -s 20” for single-end sequencing data and with parameter “-b 30” for paired-end sequencing data. Expression levels were calculated and summarized as transcripts per million (TPM) at the gene level with Sleuth (65), and logFC was then calculated for each condition.
The single-cell RNA-Seq data were summarized across all cells to obtain “pseudo-bulk” samples. The R packages EDASeq (66) and org.Hs.eg.db (67) were used to obtain gene lengths, and TPM was calculated with the “calculateTPM” function of the R package scater (68). logFC was then calculated for each patient.
The clean RNA-Seq data were also aligned to the virus genome with Bowtie 2 (69) (version 2.2.6) to create aligned BAM files and to obtain the mapping rates to the virus genomes. SAMtools (70) (version 1.5) was then used for sorting and indexing the aligned BAM files. The “SAMtools depth” command was used to produce the number of aligned reads per site along the virus genome.
The heatmap in Fig. 3I was made with the pheatmap R package (71); the “complete” clustering method was used for clustering the rows and “euclidean” distance was used to measure the cluster distance. The heatmap in Fig. 4A was made with the ComplexHeatmap R package (72); the “complete” clustering method was used for clustering the rows and columns and “euclidean” distance was used to measure the cluster distance.
References
1. A. R. Fehr, S. Perlman, Coronaviruses (Springer, 2015), pp. 1–23.
2. T. Kuiken, R. A. Fouchier, M. Schutten, G. F. Rimmelzwaan, G.
Van Amerongen, D. Van Riel, J. D. Laman, T. De Jong, G. Van Doornum, W. Lim, A. E. Ling, P. K. Chan, J. S. Tam, M. C. Zambon, R. Gopal, C. Drosten, S. Van Der Werf, N. Escriou, J. C. Manuguerra, K. Stöhr, J. S. Peiris, A. D. Osterhaus, Newly discovered coronavirus as the primary cause of severe acute respiratory syndrome. The Lancet 362, 263–270 (2003).
3. WHO, Summary of probable SARS cases with onset of illness from 1 November 2002 to 31 July 2003.
4. A. M. Zaki, S. Van Boheemen, T. M. Bestebroer, A. D. Osterhaus, R. A. Fouchier, Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. New England Journal of Medicine 367, 1814–1820 (2012).
5. WHO, Middle East respiratory syndrome coronavirus (MERS-CoV) – Saudi Arabia.
6. F. Wu, S. Zhao, B. Yu, Y. M. Chen, W. Wang, Z. G. Song, Y. Hu, Z. W. Tao, J. H. Tian, Y. Y. Pei, M. L. Yuan, Y. L. Zhang, F. H. Dai, Y. Liu, Q. M. Wang, J. J. Zheng, L. Xu, E. C. Holmes, Y. Z. Zhang, A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020).
7. WHO, WHO coronavirus disease (COVID-19) dashboard.
8. X. Xu, P. Chen, J. Wang, J. Feng, H. Zhou, X. Li, W. Zhong, P. Hao, Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission. Science China Life Sciences 63, 457–460 (2020).
9. S. Belouzard, J. K. Millet, B. N. Licitra, G. R. Whittaker, Mechanisms of coronavirus cell entry mediated by the viral spike protein. Viruses 4, 1011–1033 (2012).
10. Z. Lou, Y. Sun, Z. Rao, Current progress in antiviral strategies. Trends in pharmacological sciences 35, 86–102 (2014).
11. E. Teissier, F. Penin, E.-I.
Pécheur, Targeting cell entry of enveloped viruses as an antiviral strategy. Molecules 16, 221–250 (2011).
12. I. S. Mahmoud, Y. B. Jarrar, W. Alshaer, S. Ismail, SARS-CoV-2 entry in host cells-multiple targets for treatment and prevention. Biochimie (2020).
13. Z. Qinfen, C. Jinming, H. Xiaojun, Z. Huanying, H. Jicheng, F. Ling, L. Kunpeng, Z. Jingqiang, The life cycle of SARS coronavirus in Vero E6 cells. Journal of medical virology 73, 332–337 (2004).
14. M. Hoffmann, H. Kleine-Weber, S. Schroeder, N. Krüger, T. Herrler, S. Erichsen, T. S. Schiergens, G. Herrler, N. H. Wu, A. Nitsche, M. A. Müller, C. Drosten, S. Pöhlmann, SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell (2020).
15. Z.-Y. Yang, Y. Huang, L. Ganesh, K. Leung, W.-P. Kong, O. Schwartz, K. Subbarao, G. J. Nabel, pH-dependent entry of severe acute respiratory syndrome coronavirus is mediated by the spike glycoprotein and enhanced by dendritic cell transfer through DC-SIGN. Journal of virology 78, 5642–5650 (2004).
16. H. Wang, P. Yang, K. Liu, F. Guo, Y. Zhang, G. Zhang, C. Jiang, SARS coronavirus entry into host cells through a novel clathrin- and caveolae-independent endocytic pathway. Cell research 18, 290–301 (2008).
17. W. Widagdo, S. Sooksawasdi Na Ayudhya, G. B. Hundie, B. L. Haagmans, Host determinants of MERS-CoV transmission and pathogenesis. Viruses 11, 280 (2019).
18. F. Li, Structure, function, and evolution of coronavirus spike proteins. Annual review of virology 3, 237–261 (2016).
19. M. Singh, V. Bansal, C. Feschotte, A single-cell RNA expression map of human coronavirus entry factors. bioRxiv (2020).
20. K. Wang, W.
Chen, Y.-S. Zhou, J.-Q. Lian, Z. Zhang, P. Du, L. Gong, Y. Zhang, H.-Y. Cui, J.-J. Geng, B. Wang, X.-X. Sun, C.-F. Wang, X. Yang, P. Lin, Y.-Q. Deng, D. Wei, X.-M. 20 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 Yang, Y.-M. Zhu, K. Zhang, Z.-H. Zheng, J.-L. Miao, T. Guo, Y. Shi, J. Zhang, L. Fu, Q.-Y. Wang, H. Bian, P. Zhu, Z.-N. Chen, Sars-cov-2 invades host cells via a novel route: Cd147-spike protein. BioRxiv (2020). 21. R. Amraie, M. A. Napoleon, W. Yin, J. Berrigan, E. Suder, G. Zhao, J. Olejnik, S. Gum- muluru, E. Muhlberger, V. Chitalia, N. Rahimi, Cd209l/l-sign and cd209/dc-sign act as receptors for sars-cov-2 and are differentially expressed in lung and kidney epithelial and endothelial cells. BioRxiv (2020). 22. F. Hikmet, L. Méar, Å. Edvinsson, P. Micke, M. Uhlén, C. Lindskog, The protein expression profile of ace2 in human tissues. Molecular Systems Biology 16, e9610 (2020). 23. L. Zou, F. Ruan, M. Huang, L. Liang, H. Huang, Z. Hong, J. Yu, M. Kang, Y. Song, J. Xia, Q. Guo, T. Song, J. He, H. L. Yen, M. Peiris, J. Wu, Sars-cov-2 viral load in upper respiratory specimens of infected patients. New England Journal of Medicine 382, 1177– 1179 (2020). 24. G. Simmons, J. D. Reeves, A. J. Rennekamp, S. M. Amberg, A. J. Piefer, P. Bates, Char- acterization of severe acute respiratory syndrome-associated coronavirus (sars-cov) spike glycoprotein-mediated viral entry. Proceedings of the National Academy of Sciences 101, 4240–4245 (2004). 25. R. Zang, M. F. G. Castro, B. T. McCune, Q. Zeng, P. W. Rothlauf, N. M. Sonnek, Z. Liu, K. F. Brulois, X. Wang, H. B. Greenberg, M. S. Diamond, M. A. Ciorba, S. P. Whelan, S. Ding, Tmprss2 and tmprss4 promote sars-cov-2 infection of human small intestinal en- terocytes. 
Science immunology 5 (2020). 26. P. Zmora, M. Hoffmann, H. Kollmus, A.-S. Moldenhauer, O. Danov, A. Braun, M. Winkler, K. Schughart, S. Pöhlmann, Tmprss11a activates the influenza a virus hemagglutinin and 21 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 the mers coronavirus spike protein and is insensitive against blockade by hai-1. Journal of Biological Chemistry 293, 13863–13873 (2018). 27. X. Ou, Y. Liu, X. Lei, P. Li, D. Mi, L. Ren, L. Guo, R. Guo, T. Chen, J. Hu, Z. Xiang, Z. Mu, X. Chen, J. Chen, K. Hu, Q. Jin, J. Wang, Z. Qian, Characterization of spike glyco- protein of sars-cov-2 on virus entry and its immune cross-reactivity with sars-cov. Nature communications 11, 1–12 (2020). 28. Y.-M. Loo, M. Gale Jr, Immune signaling by rig-i-like receptors. Immunity 34, 680–692 (2011). 29. A. G. Bowie, I. R. Haga, The role of toll-like receptors in the host response to viruses. Molecular immunology 42, 859–867 (2005). 30. C. Chiang, M. U. Gack, Post-translational control of intracellular pathogen sensing path- ways. Trends in immunology 38, 39–52 (2017). 31. A. Park, A. Iwasaki, Type i and type iii interferons–induction, signaling, evasion, and ap- plication to combat covid-19. Cell Host & Microbe (2020). 32. Q. Ruan, K. Yang, W. Wang, L. Jiang, J. Song, Clinical predictors of mortality due to covid- 19 based on an analysis of data of 150 patients from wuhan, china. Intensive care medicine 46, 846–848 (2020). 33. I. F. N. Hung, K. C. Lung, E. Y. K. Tso, R. Liu, T. W. H. Chung, M. Y. Chu, Y. Y. Ng, J. Lo, J. Chan, A. R. Tam, H. P. Shum, V. Chan, A. K. L. Wu, K. M. Sin, W. S. Leung, W. L. Law, D. C. Lung, S. Sin, P. Yeung, C. C. Y. Yip, R. R. Zhang, A. Y. F. Fung, E. Y. W. Yan, K. H. Leung, J. D. Ip, A. W. H. 
Chu, W. M. Chan, A. C. K. Ng, R. Lee, K. Fung, A. Yeung, T. C. Wu, J. W. M. Chan, W. W. Yan, W. M. Chan, J. F. W. Chan, A. K. W. Lie, 22 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 O. T. Y. Tsang, V. C. C. Cheng, T. L. Que, C. S. Lau, K. H. Chan, K. K. W. To, K. Y. Yuen, Triple combination of interferon beta-1b, lopinavir–ritonavir, and ribavirin in the treatment of patients admitted to hospital with covid-19: an open-label, randomised, phase 2 trial. The Lancet 395, 1695–1704 (2020). 34. E. Andreakos, S. Tsiodras, Covid-19: lambda interferon against viral load and hyperin- flammation. EMBO Molecular Medicine p. e12465 (2020). 35. D. Blanco-Melo, B. E. Nilsson-Payant, W. C. Liu, S. Uhl, D. Hoagland, R. Møller, T. X. Jordan, K. Oishi, M. Panis, D. Sachs, T. T. Wang, R. E. Schwartz, J. K. Lim, R. A. Albrecht, B. R. TenOever, Imbalanced host response to sars-cov-2 drives development of covid-19. Cell (2020). 36. Z. Zhou, L. Ren, L. Zhang, J. Zhong, Y. Xiao, Z. Jia, L. Guo, J. Yang, C. Wang, S. Jiang, D. Yang, G. Zhang, H. Li, F. Chen, Y. Xu, M. Chen, Z. Gao, J. Yang, J. Dong, B. Liu, X. Zhang, W. Wang, K. He, Q. Jin, M. Li, J. Wang, Heightened innate immune responses in the respiratory tract of covid-19 patients. Cell Host & Microbe (2020). 37. A. Broggi, S. Ghosh, B. Sposito, R. Spreafico, F. Balzarini, A. Lo Cascio, N. Clementi, M. de Santis, N. Mancini, F. Granucci, I. Zanoni, Type iii interferons disrupt the lung epithelial barrier upon viral recognition. Science (2020). 38. L. Wei, S. Ming, B. Zou, Y. Wu, Z. Hong, Z. Li, X. Zheng, M. Huang, L. Luo, J. Liang, X. Wen, T. Chen, Q. Liang, L. Kuang, H. Shan, X. 
Huang, Viral invasion and type i inter- feron response characterize the immunophenotypes during covid-19 infection. Available at SSRN 3555695 (2020). 39. J. Y. Zhang, X. M. Wang, X. Xing, Z. Xu, C. Zhang, J. W. Song, X. Fan, P. Xia, J. L. Fu, S. Y. Wang, R. N. Xu, X. P. Dai, L. Shi, L. Huang, T. J. Jiang, M. Shi, Y. Zhang, A. Zumla, 23 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 M. Maeurer, F. Bai, F. S. Wang, Single-cell landscape of immunological responses in pa- tients with covid-19. Nature Immunology pp. 1–12 (2020). 40. E. Sallard, F. X. Lescure, Y. Yazdanpanah, F. Mentre, N. Peiffer-Smadja, Type 1 interferons as a potential treatment against covid-19. Antiviral Research p. 104791 (2020). 41. E. Wyler, K. Mösbauer, V. Franke, A. Diag, T. G. Lina, R. Arsie, F. Klironomos, D. Kopp- stein, S. Ayoub, C. Buccitelli, A. Richter, I. Legnini, A. Ivanov, T. Mari, S. D. Giudice, P. P. Jan, A. M. Marcel, D. Niemeyer, M. Selbach, A. Akalin, N. Rajewsky, C. Drosten, M. Landthaler, Bulk and single-cell gene expression profiling of sars-cov-2 infected human cell lines identifies molecular targets for therapeutic intervention. bioRxiv (2020). 42. Y. Xiong, Y. Liu, L. Cao, D. Wang, M. Guo, A. Jiang, D. Guo, W. Hu, J. Yang, Z. Tang, H. Wu, Y. Lin, M. Zhang, Q. Zhang, M. Shi, Y. Liu, Y. Zhou, K. Lan, Y. Chen, Transcrip- tomic characteristics of bronchoalveolar lavage fluid and peripheral blood mononuclear cells in covid-19 patients. Emerging microbes & infections 9, 761–770 (2020). 43. D. Michalovich, N. Rodriguez-Perez, S. Smolinska, M. Pirozynski, D. Mayhew, S. Ud- din, S. Van Horn, M. Sokolowska, C. Altunbulakli, A. Eljaszewicz, B. Pugin, W. Barcik, M. Kurnik-Lucka, K. A. Saunders, K. D. Simpson, P. 
Schmid-Grendelmeier, R. Ferstl, R. Frei, N. Sievi, M. Kohler, P. Gajdanowicz, K. B. Graversen, K. Lindholm Bøgh, M. Ju- tel, J. R. Brown, C. A. Akdis, E. M. Hessel, L. O’Mahony, Obesity and disease severity magnify disturbed microbiome-immune interactions in asthma patients. Nature communi- cations 10, 1–14 (2019). 44. M. Liao, Y. Liu, J. Yuan, Y. Wen, G. Xu, J. Zhao, L. Cheng, J. Li, X. Wang, F. Wang, L. Liu, I. Amit, S. Zhang, Z. Zhang, Single-cell landscape of bronchoalveolar immune cells in patients with covid-19. Nature medicine pp. 1–3 (2020). 24 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 45. C. Morse, T. Tabib, J. Sembrat, K. L. Buschur, H. T. Bittar, E. Valenzi, Y. Jiang, D. J. Kass, K. Gibson, W. Chen, A. Mora, P. V. Benos, M. Rojas, R. Lafyatis, Proliferating spp1/mertk- expressing macrophages in idiopathic pulmonary fibrosis. European Respiratory Journal 54 (2019). 46. C. Tikellis, M. Thomas, Angiotensin-converting enzyme 2 (ace2) is a key modulator of the renin angiotensin system in health and disease. International journal of peptides 2012 (2012). 47. G. Simmons, D. N. Gosalia, A. J. Rennekamp, J. D. Reeves, S. L. Diamond, P. Bates, Inhibitors of cathepsin l prevent severe acute respiratory syndrome coronavirus entry. Pro- ceedings of the National Academy of Sciences 102, 11876–11881 (2005). 48. S. Lukassen, R. L. Chua, T. Trefzer, N. C. Kahn, M. A. Schneider, T. Muley, H. Winter, M. Meister, C. Veith, A. W. Boots, B. P. Hennig, M. Kreuter, C. Conrad, R. Eils, Sars-cov-2 receptor ace 2 and tmprss 2 are primarily expressed in bronchial transient secretory cells. The EMBO journal 39, e105114 (2020). 49. R. Ueha, T. Sato, T. Goto, A. Yamauchi, K. Kondo, T. 
Yamasoba, Expression of ace2 and tmprss2 proteins in the upper and lower aerodigestive tracts of rats. bioRxiv (2020). 50. J. Shilts, G. J. Wright, No evidence for basigin/cd147 as a direct sars-cov-2 spike binding receptor. bioRxiv (2020). 51. J. Mercer, A. Helenius, Virus entry by macropinocytosis. Nature cell biology 11, 510–520 (2009). 52. D. L. McKee, A. Sternberg, U. Stange, S. Laufer, C. Naujokat, Candidate drugs against sars-cov-2 and covid-19. Pharmacological Research p. 104859 (2020). 25 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 53. H. K. Law, C. Y. Cheung, H. Y. Ng, S. F. Sia, Y. O. Chan, W. Luk, J. M. Nicholls, J. Peiris, Y. L. Lau, Chemokine up-regulation in sars-coronavirus–infected, monocyte-derived hu- man dendritic cells. Blood 106, 2366–2374 (2005). 54. C. Y. Cheung, L. L. M. Poon, I. H. Y. Ng, W. Luk, S.-F. Sia, M. H. S. Wu, K.-H. Chan, K.-Y. Yuen, S. Gordon, Y. Guan, J. S. M. Peiris, Cytokine responses in severe acute respiratory syndrome coronavirus-infected macrophages in vitro: possible relevance to pathogenesis. Journal of virology 79, 7819–7826 (2005). 55. L. Li, J. Wo, J. Shao, H. Zhu, N. Wu, M. Li, H. Yao, M. Hu, R. H. Dennin, Sars-coronavirus replicates in mononuclear cells of peripheral blood (pbmcs) from sars patients. Journal of Clinical Virology 28, 239–244 (2003). 56. W. Zhao, X. He, K. A. Hoadley, J. S. Parker, D. N. Hayes, C. M. Perou, Comparison of rna-seq by poly (a) capture, ribosomal rna depletion, and dna microarray for expression profiling. BMC genomics 15, 1–11 (2014). 57. E. W. Sayers, R. Agarwala, E. E. Bolton, J. R. Brister, K. Canese, K. Clark, R. Connor, N. Fiorini, K. Funk, T. Hefferon, J. B. Holmes, S. Kim, A. Kimchi, P. A. Kitts, S. Lathrop, Z. Lu, T. L. 
Madden, A. Marchler-Bauer, L. Phan, V. A. Schneider, C. L. Schoch, K. D. Pruitt, J. Ostell, Database resources of the national center for biotechnology information. Nucleic acids research 36, D13–D21 (2007). 58. R. Leinonen, R. Akhtar, E. Birney, L. Bower, A. Cerdeno-Tárraga, Y. Cheng, I. Cleland, N. Faruque, N. Goodgame, R. Gibson, G. Hoad, M. Jang, N. Pakseresht, S. Plaister, R. Rad- hakrishnan, K. Reddy, S. Sobhany, P. T. Hoopen, R. Vaughan, V. Zalunin, G. Cochrane, The european nucleotide archive. Nucleic acids research 39, D28–D31 (2010). 26 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 59. L. Riva, S. Yuan, X. Yin, L. Martin-Sancho, N. Matsunaga, L. Pache, S. Burgstaller- Muehlbacher, P. D. De Jesus, P. Teriete, M. V. Hull, M. W. Chang, J. F. W. Chan, J. Cao, V. K. M. Poon, K. M. Herbert, K. Cheng, T. T. H. Nguyen, A. Rubanov, Y. Pu, C. Nguyen, A. Choi, R. Rathnasinghe, M. Schotsaert, L. Miorin, M. Dejosez, T. P. Zwaka, K. Y. Sit, L. Martinez-Sobrido, W. C. Liu, K. M. White, M. E. Chapman, E. K. Lendy, R. J. Glynne, R. Albrecht, E. Ruppin, A. D. Mesecar, J. R. Johnson, C. Benner, R. Sun, P. G. Schultz, A. I. Su, A. García-Sastre, A. K. Chatterjee, K. Y. Yuen, S. K. Chanda, Discovery of SARS- CoV-2 antiviral drugs through large-scale compound repurposing. Nature 586, 113–119 (2020). 60. Z. Zhang, et al., Database resources of the national genomics data center in 2020. Nucleic Acids Research 48, D24 (2020). 61. A. D. Yates, et al., Ensembl 2020. Nucleic acids research 48, D682–D688 (2020). 62. S. Andrews, Fastqc: a quality control tool for high throughput sequence data (2010). 63. A. M. Bolger, M. Lohse, B. Usadel, Trimmomatic: a flexible trimmer for illumina sequence data. 
Bioinformatics 30, 2114–2120 (2014). 64. N. L. Bray, H. Pimentel, P. Melsted, L. Pachter, Near-optimal probabilistic rna-seq quan- tification. Nature biotechnology 34, 525–527 (2016). 65. H. Pimentel, N. L. Bray, S. Puente, P. Melsted, L. Pachter, Differential analysis of rna-seq incorporating quantification uncertainty. Nature methods 14, 687 (2017). 66. D. Risso, K. Schwartz, G. Sherlock, S. Dudoit, Gc-content normalization for rna-seq data. BMC bioinformatics 12, 480 (2011). 27 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.07.425716doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425716 67. M. Carlson, S. Falcon, H. Pages, N. Li, org. hs. eg. db: Genome wide annotation for human. R package version 3 (2017). 68. D. J. McCarthy, K. R. Campbell, A. T. Lun, Q. F. Wills, Scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. Bioinformatics 33, 1179–1186 (2017). 69. B. Langmead, S. L. Salzberg, Fast gapped-read alignment with bowtie 2. Nature methods 9, 357 (2012). 70. H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The sequence alignment/map format and samtools. Bioinformatics 25, 2078– 2079 (2009). 71. R. Kolde, pheatmap: Pretty Heatmaps (2019). R package version 1.0.12. 72. Z. Gu, R. Eils, M. Schlesner, Complex heatmaps reveal patterns and correlations in multi- dimensional genomic data. Bioinformatics 32, 2847–2849 (2016). Acknowledgements: The authors thank professor Ke Xu from Wuhan University and professor Dimitri Lavillette from Institut Pasteur of Shanghai for helpful conversations. Funding: This work was partially funded by grant 01Kl20185B (SECOVIT) of the German Federal Ministry of Education and Research. 
Author Contributions: Pei Hao and Yingying Cao conceived the research. Daniel Hoffmann, Pei Hao, and Yingying Cao designed the analyses. Yingying Cao and Xintian Xu conducted the analyses. All authors wrote the manuscript.

Competing Interests: The authors declare that they have no competing financial interests.

Data and materials availability: Additional data and materials are available online.

Figures and Tables:

Table 1. Data of cell lines (cells) included in this study

| Virus | Virus strain | Virus dose (MOI) | Time | Replicates | Species of origin | Cell type | Library preparation | Accession number |
|---|---|---|---|---|---|---|---|---|
| SARS-CoV-2 | USA-WA1/2020 | 2 | 24h | 3 | Homo sapiens | NHBE | polyA+ selection | GSE147507 |
| Mock | Mock | Mock | 24h | 3 | Homo sapiens | NHBE | polyA+ selection | GSE147507 |
| SARS-CoV-2 | USA-WA1/2020 | 0.2 | 24h | 3 | Homo sapiens | A549 | polyA+ selection | GSE147507 |
| Mock | Mock | Mock | 24h | 3 | Homo sapiens | A549 | polyA+ selection | GSE147507 |
| SARS-CoV-2 | USA-WA1/2020 | 2 | 24h | 3 | Homo sapiens | A549 | polyA+ selection | GSE147507 |
| Mock | Mock | Mock | 24h | 3 | Homo sapiens | A549 | polyA+ selection | GSE147507 |
| SARS-CoV-2 | USA-WA1/2020 | 0.2 | 24h | 3 | Homo sapiens | A549-ACE2 | polyA+ selection | GSE147507 |
| Mock | Mock | Mock | 24h | 3 | Homo sapiens | A549-ACE2 | polyA+ selection | GSE147507 |
| SARS-CoV-2 | USA-WA1/2020 | 2 | 24h | 3 | Homo sapiens | A549-ACE2 | polyA+ selection | GSE147507 |
| Mock | Mock | Mock | 24h | 3 | Homo sapiens | A549-ACE2 | polyA+ selection | GSE147507 |
| SARS-CoV-2 | USA-WA1/2020 | 2 | 24h | 3 | Homo sapiens | Calu3 | polyA+ selection | GSE147507 |
| Mock | Mock | Mock | 24h | 3 | Homo sapiens | Calu3 | polyA+ selection | GSE147507 |
| SARS-CoV-2 | Munich/BavPat1/2020 | 0.3 | 24h | 2 | Homo sapiens | Calu3 | rRNA-depletion | GSE148729 |
| Mock | Mock | Mock | 24h | 2 | Homo sapiens | Calu3 | rRNA-depletion | GSE148729 |
| SARS-CoV-2 | Munich/BavPat1/2020 | 0.3 | 24h | 2 | Homo sapiens | Calu3 | polyA+ selection | GSE148729 |
| Mock | Mock | Mock | 24h | 2 | Homo sapiens | Calu3 | polyA+ selection | GSE148729 |
| SARS-CoV-2 | Munich/BavPat1/2020 | 0.3 | 24h | 2 | Homo sapiens | Caco2 | polyA+ selection | GSE148729 |
| Mock | Mock | Mock | 24h | 2 | Homo sapiens | Caco2 | polyA+ selection | GSE148729 |
| SARS-CoV-2 | Munich/BavPat1/2020 | 0.3 | 24h | 2 | Homo sapiens | H1299 | polyA+ selection | GSE148729 |
| Mock | Mock | Mock | 36h^ | 2 | Homo sapiens | H1299 | polyA+ selection | GSE148729 |
| SARS-CoV-2 | USA-WA1/2020 | 0.3 | 24h | 2* | Chlorocebus sabaeus | Vero E6 | rRNA-depletion | GSE153940 |
| Mock | Mock | Mock | 24h | 3 | Chlorocebus sabaeus | Vero E6 | rRNA-depletion | GSE153940 |
| SARS-CoV | Frankfurt strain | 0.3 | 24h | 2 | Homo sapiens | Calu3 | polyA+ selection | GSE148729 |
| SARS-CoV | Frankfurt strain | 0.3 | 24h | 2 | Homo sapiens | Calu3 | rRNA-depletion | GSE148729 |
| SARS-CoV | Frankfurt strain | 0.3 | 24h | 2 | Homo sapiens | Caco2 | polyA+ selection | GSE148729 |
| SARS-CoV | Frankfurt strain | 0.3 | 24h | 2 | Homo sapiens | H1299 | polyA+ selection | GSE148729 |
| SARS-CoV | Urbani strain | 0.1 | 24h | 3 | Homo sapiens | MRC5 | polyA+ selection | GSE56189 |
| SARS-CoV | Urbani strain | 3 | 24h | 3 | Homo sapiens | MRC5 | polyA+ selection | GSE56189 |
| SARS-CoV | Urbani strain | 0.1 | 24h | 3 | Chlorocebus sabaeus | Vero E6 | polyA+ selection | GSE56189 |
| SARS-CoV | Urbani strain | 3 | 24h | 3 | Chlorocebus sabaeus | Vero E6 | polyA+ selection | GSE56189 |
| MERS-CoV | EMC/2012 | 0.1 | 24h | 3 | Homo sapiens | MRC5 | polyA+ selection | GSE56189 |
| MERS-CoV | EMC/2012 | 3 | 24h | 3 | Homo sapiens | MRC5 | polyA+ selection | GSE56189 |
| MERS-CoV | EMC/2012 | 0.1 | 24h | 3 | Chlorocebus sabaeus | Vero E6 | polyA+ selection | GSE56189 |
| MERS-CoV | EMC/2012 | 3 | 24h | 3 | Chlorocebus sabaeus | Vero E6 | polyA+ selection | GSE56189 |
| Mock | Mock | Mock | 24h | 3 | Homo sapiens | MRC5 | polyA+ selection | GSE56189 |
| Mock | Mock | Mock | 24h | 3 | Chlorocebus sabaeus | Vero E6 | polyA+ selection | GSE56189 |

^ No 24h mock control samples were available for H1299 cells; 36h mock control samples were used instead.
* There are three replicates, but only two of them were available for download when the manuscript was in preparation.

Table 2. Data of COVID-19 patients included in this study

| Individuals | Tissue | Data type | Accession number |
|---|---|---|---|
| 2 | bronchoalveolar lavage fluid from COVID-19 patients | bulk RNA-Seq | CRA002390 |
| 3 | bronchoalveolar lavage fluid from healthy negative control | bulk RNA-Seq | PRJNA434133^ |
| 3 | peripheral blood mononuclear cells from COVID-19 patients | bulk RNA-Seq | CRA002390 |
| 3 | peripheral blood mononuclear cells from healthy negative control | bulk RNA-Seq | CRA002390 |
| 2 | lung biopsy from postmortem COVID-19 patients | bulk RNA-Seq | GSE147507 |
| 2 | lung biopsy from healthy negative control | bulk RNA-Seq | GSE147507 |
| 6 | bronchoalveolar lavage fluid from COVID-19 patients (severe) | single cell RNA-Seq | GSE145926 |
| 3 | bronchoalveolar lavage fluid from COVID-19 patients (moderate) | single cell RNA-Seq | GSE145926 |
| 1 | bronchoalveolar lavage fluid from healthy negative control | single cell RNA-Seq | PRJNA526088* |

^ Three samples under project PRJNA434133 were used: SRR10571724, SRR10571730, and SRR10571732.
* One sample with accession number GSM3660650 under project PRJNA526088 was used.

[Figure 1: dot and bar plot. y-axis: mapping rate to virus genome (%), 0 to 100; x-axis: conditions (cell line and MOI); colour legend: MERS-CoV, SARS-CoV, SARS-CoV-2.]

Fig. 1. Mapping rate to virus genome. The dots represent the mapping rates to the virus genome for each individual replicate under the given conditions (cell line, MOI, and virus). Bar heights are mean mapping rates to the virus genome for each condition.

Fig. 2. The number of reads mapped to the corresponding virus genome. (A-E) The dot plots show the number of reads mapped to each site of the corresponding virus genome. The annotation of the genome of each virus is from NCBI (SARS: GCF_000864885.1, SARS-CoV-2: GCF_009858895.2, MERS: GCF_000901155.1). Labels in grey title bars correspond to conditions as in Fig. 1.

[Figure 3: panels A-H are dot plots of log10(TPM+1) per cell line (A549, A549.ACE2, Caco2, Calu3, H1299, MRC5, NHBE, VeroE6) for DPP4 (A), ACE2 (B), TMPRSS2 (C), TMPRSS4 (D), CTSB (E), CTSL (F), BSG (G) and CD209 (H). Panel I is a heatmap of log10(TPM+1) for ACE2, BSG, CD209, CTSB, CTSL, DPP4, TMPRSS11A, TMPRSS11B, TMPRSS11D, TMPRSS11E, TMPRSS11F, TMPRSS13, TMPRSS15, TMPRSS2, TMPRSS3, TMPRSS4, TMPRSS5, TMPRSS6 and TMPRSS7 across the same cell lines, with cluster labels 1a, 1b and 1c. Panel J is a schematic.]

Fig. 3. The expression levels of the receptors and proteases. (A-H) Each dot represents the expression value in each sample. (I) Heatmap of the expression levels of coronavirus-associated receptors and factors in different cell types. Labels 1a, 1b, 1c mark cell clusters that likely share entry routes sketched in panel J. (J) Entry mechanisms involved in SARS-CoV-2 entry into cells. Schematic is based on a figure by Vega Asensio - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=88682468.

[Figure 4: panel A is a heatmap of logFC (colour scale -2 to 2) for IFN, ISG and pro-inflammatory cytokine genes across cell line and patient samples, with sample clusters 1 and 2. Panels B-E are dot and bar plots of TPM for IFNB1 (B), IFNL1 (C), IFNL2 (D) and IFNL3 (E) in A549-ACE2, A549, Calu3, Caco2, H1299 and NHBE cells (mock versus SARS-CoV-2 or SARS-CoV at the given MOI). Panel F shows TPM of IFNB1, IFNL1, IFNL2 and IFNL3 in BALF from healthy, moderate and severe cases. Panel G shows TPM of IFNL1 in VeroE6 cells for SARS-CoV-2, SARS-CoV and MERS-CoV.]

Fig. 4. Expression levels of genes related to immune responses. (A) Heatmap of the logFC of IFNs, ISGs and pro-inflammatory cytokines. The clustering of samples produces a cluster 1 (top) with little IFN/ISG expression, comprising MERS infections and non-infectable cells with SARS-CoV-1/2 (except for Caco2 cells), and a cluster 2 (bottom) with strong IFN/ISG expression, comprising SARS-CoV-1/2 infectable cells and patient samples. (B-G) Expression levels of IFNs. Each dot represents the expression value of a sample. Bars indicate mean expression levels (in TPM) of the respective IFN at different MOI values.

Supplementary Materials:

Additional information about public data

All data can be downloaded from public repositories; the three main sources are NCBI (57) (https://www.ncbi.nlm.nih.gov/), ENA (58) (https://www.ebi.ac.uk/ena), and BIG Data Center (60) (https://bigd.big.ac.cn/).
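As a rough illustration of how the accession numbers in Tables 1 and 2 relate to these repositories, the sketch below maps an accession prefix to the archive that typically issues it. The prefix-to-archive mapping follows common repository conventions and is an assumption here, not something stated in the manuscript; the function name `classify_accession` is hypothetical. SRR runs are often mirrored at ENA as well, which the sketch does not model.

```python
import re

# Assumed prefix conventions (not taken from the manuscript):
# GSE/GSM -> NCBI GEO, PRJNA -> NCBI BioProject, SRR -> NCBI SRA,
# CRA -> Genome Sequence Archive hosted by the BIG Data Center.
ACCESSION_PATTERNS = [
    (re.compile(r"^GSE\d+$"), "NCBI GEO series"),
    (re.compile(r"^GSM\d+$"), "NCBI GEO sample"),
    (re.compile(r"^PRJNA\d+$"), "NCBI BioProject"),
    (re.compile(r"^SRR\d+$"), "NCBI SRA run"),
    (re.compile(r"^CRA\d+$"), "BIG Data Center GSA"),
]


def classify_accession(accession: str) -> str:
    """Return the assumed source archive for an accession string."""
    # Strip whitespace and the footnote markers (^, *) used in the tables.
    cleaned = accession.strip().rstrip("^*")
    for pattern, archive in ACCESSION_PATTERNS:
        if pattern.match(cleaned):
            return archive
    return "unknown"


if __name__ == "__main__":
    for acc in ["GSE147507", "CRA002390", "PRJNA434133^", "SRR10571724"]:
        print(acc, "->", classify_accession(acc))
```

A lookup of this kind is only a convenience for organising downloads; the authoritative location of each dataset is the one given in the dataset descriptions.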
GSE147507 dataset (35)
From this dataset we downloaded: biological triplicates of primary human lung epithelium (NHBE) cells which were mock treated or infected with SARS-CoV-2 (USA-WA1/2020) at an MOI of 2; biological triplicates of transformed lung alveolar (A549) cells which were mock treated or infected with SARS-CoV-2 (USA-WA1/2020) at an MOI of 0.2 or 2; biological triplicates of transformed lung alveolar (A549) cells transduced with a vector expressing human ACE2, which were also mock treated or infected with SARS-CoV-2 (USA-WA1/2020) at an MOI of 0.2 or 2; biological triplicates of transformed lung-derived Calu-3 cells which were mock treated or infected with SARS-CoV-2 (USA-WA1/2020) at an MOI of 2; and COVID-19 patient samples: uninfected human lung biopsies derived from one male (age 72) and one female (age 60), used as control biological replicates, and lung samples derived from a single deceased male COVID-19 patient (age 74), which were processed in technical replicates. PolyA+ selection was used as the library preparation method to remove rRNAs before sequencing.

GSE148729 dataset (41)
From this dataset we downloaded biological replicates of Calu-3, Caco-2 and H1299 cells which were mock treated or infected with SARS-CoV-2 (patient isolate BetaCoV/Munich/BavPat1/2020/EPI_ISL_406862) or SARS-CoV (Frankfurt strain) at an MOI of 0.3. For Caco-2 and H1299 cells, polyA+ selection was used as the library preparation method to remove rRNAs before sequencing. For Calu-3 cells, two library preparation methods, polyA+ selection and rRNA depletion, were used to remove rRNAs before sequencing.

GSE153940 dataset (59)
From this dataset we downloaded RNA sequencing data of Vero E6 cells which were either mock-infected or infected with SARS-CoV-2 USA-WA1/2020 (MOI = 0.3) in three replicates. However, at the time of download, one sample (accession number GSM4658806) was not available. Cells were harvested at 24 hours after infection, and the rRNA-depletion method was used to extract RNA for sequencing.

GSE56189 dataset
From this dataset we downloaded biological triplicates of MRC5 and Vero E6 cells which were mock treated or infected with SARS-CoV (Urbani strain) or MERS-CoV (EMC/2012) at an MOI of 0.1 or 3. PolyA+ selection was used as the library preparation method to remove rRNAs before sequencing.

CRA002390 dataset (42)
This dataset is publicly available at https://bigd.big.ac.cn/gsa/browse/CRA002390. From this dataset we downloaded the raw FASTQ data of PBMC and BALF samples of COVID-19 patients and the corresponding PBMC controls.

PRJNA434133 dataset (43)
From this dataset we downloaded the raw FASTQ data for BALF healthy control samples with accession numbers SRR10571724, SRR10571730, and SRR10571732.

GSE145926 dataset (44)
From this dataset we downloaded the preprocessed single cell RNA-Seq data of BALF samples from 6 severe COVID-19 patients and 3 mild COVID-19 patients.

PRJNA526088 dataset (45)
From this dataset we downloaded the preprocessed single cell RNA-Seq data of a BALF sample from a healthy control with accession number GSM3660650.

Supplementary figures

Fig. S1. Workflow of bulk RNA-Seq.
[Fig. S1 graphic: flowchart elements: bulk RNA-Seq raw data; FastQC; Trimmomatic; bulk RNA-Seq clean data; align to virus genome (Bowtie2); Samtools; reads coverage along virus genome; pseudoalign to host transcriptome (Kallisto); Sleuth; gene level TPM values.]

Fig. S2. Workflow of single cell RNA-Seq data.
[Fig. S2 graphic: flowchart elements: count matrix of scRNA-Seq of COVID-19 patients; sum counts across all cells to obtain "pseudo-bulk" samples; obtain gene length (EDASeq, org.Hs.eg.db); obtain gene level TPM values (Scater).]

Fig. S3. Expression levels of IFNs in MRC5 cells infected with SARS-CoV and MERS-CoV.
[Fig. S3 graphic: dot plots of IFNB1, IFNL1, IFNL2 and IFNL3 expression (TPM, axis 0 to 5) in MRC5 cells for Mock, 0.1 MOI and 3 MOI, shown separately for SARS-CoV and MERS-CoV.]

Fig. S4. Expression levels of IFNs in BALF samples of patients.
[Fig. S4 graphic: dot plots of (A) IFNB1, (B) IFNL1, (C) IFNL2 and (D) IFNL3 expression (TPM) for Healthy.BALF and Patient.BALF.]

Fig. S5. Expression levels of IFNs in PBMC samples of patients.
[Fig. S5 graphic: dot plots of (A) IFNB1, (B) IFNL1, (C) IFNL2 and (D) IFNL3 expression (TPM) for Healthy.PBMC and Patient.PBMC.]

Fig. S6. Expression levels of IFNs in lung samples of patients.
[Fig. S6 graphic: dot plots of (A) IFNB1, (B) IFNL1, (C) IFNL2 and (D) IFNL3 expression (TPM) for Healthy.Lung and Patient.Lung.]

Fig. S7. The number of reads mapped to the SARS-CoV-2 genome in lung samples of patients.
[Fig. S7 graphic: per-position coverage plots for samples SAMN14563387 and SAMN14563388; x axis: genomic position (0 to 30000); y axis: SARS-CoV-2 reads (log10, 0 to 6); genome annotation track: nsp1, nsp2, nsp3, nsp4−nsp16, S, orf3a, E, M, orf6−8, N.]

Fig. S8. The expression levels of ACE2 in the PBMC, lung and BALF samples of healthy individuals.
[Fig. S8 graphic: dot plot of ACE2 expression (TPM, axis 0 to 5) for Healthy.BALF, Healthy.PBMC and Healthy.Lung.]

Additional files that are too large to be embedded into the .tex file: Table S1 (TableS1.xlsx); Table S2 (TableS2.csv).

10_1101-2021_01_07_425773 ---- A Self-Supervised Machine Learning Approach for Objective Live Cell Segmentation and Analysis
Michael C. Robitaille1, Jeff M. Byers1, Joseph A. Christodoulides1, Marc P. Raphael*1
1 Materials Science and Technology Division, U.S. Naval Research Laboratory, Washington D.C.
* Corresponding author: Marc.Raphael@nrl.navy.mil Abstract Machine learning algorithms hold the promise of greatly improving live cell image analysis by way of (1) analyzing far more imagery than can be achieved by more traditional manual approaches and (2) by eliminating the subjective nature of researchers and diagnosticians selecting the cells or cell features to be included in the analyzed data set. Currently, however, even the most sophisticated model based or machine learning algorithms require user supervision, meaning the subjectivity problem is not removed but rather incorporated into the algorithm’s initial training steps and then repeatedly applied to the imagery. To address this roadblock, we have developed a self-supervised machine learning algorithm that recursively trains itself directly from the live cell imagery data, thus providing objective segmentation and quantification. The approach incorporates an optical flow algorithm component to self-label cell and background pixels for training, followed by the extraction of additional feature vectors for the automated generation of a cell/background classification model. Because it is self-trained, the software has no user- adjustable parameters and does not require curated training imagery. The algorithm was applied to automatically segment cells from their background for a variety of cell types and five commonly used imaging modalities - fluorescence, phase contrast, differential interference contrast (DIC), transmitted light and interference reflection microscopy (IRM). The approach is broadly applicable in that it enables completely automated cell segmentation for long-term live cell phenotyping applications, regardless of the input imagery’s optical modality, magnification or cell type. 
Key Words: live cell imaging, segmentation, phenotyping, machine learning, unsupervised, classification

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.07.425773; this version posted January 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

Introduction

Live cell phenotyping is an information rich experimental approach, capable of providing mechanistic insights into cell biology1,2, guiding drug development3 and elucidating disease pathologies4,5. The wealth of information available from live cell microscopy results from the fact that there are numerous optical modalities that can be integrated within a given experiment – from fluorescence imaging which provides spatio-temporal information on specific signaling pathways and organelles to label-free techniques such as phase contrast and differential interference contrast (DIC) which enable the visualization of whole cellular morphologies and dynamics. Each of these modalities provides its own outcome measures which can be viewed as static snapshots or dynamic variations within the four-dimensional space of x, y, z and time6. However, compared to genotyping - its synergistic partner technique - live cell phenotyping remains a far more subjective science. The generation of genomic sequencing data and its analysis can now be achieved autonomously by employing a combination of robotics and microfluidics for sample preparation and machine learning algorithms for data collection and interpretation. In contrast, the extraction of quantitative information from live cell imagery by manual means is still commonplace in live cell microscopy, a fact which speaks to the human visual system’s adeptness at detecting small changes and low contrast features with high fidelity.
But with automated live cell microscopes now able to collect high resolution imagery for days on end, the resulting data files can quickly grow to tens of gigabytes, leaving the analyst with an overwhelming amount of imagery to work through.7 Furthermore, if the analyst is not blinded to the experimental design, unconscious bias can creep into the data extraction process. Enter computational algorithms capable of extracting the relevant outcome variables from the imagery in an automated fashion.8-10 Broadly speaking, the algorithms are often classified as model based approaches (e.g. Cell Profiler)11, and machine learning algorithms, (e.g. U-Net, ilastik)12-14. Neither approach is completely autonomous when it comes to cell segmentation: model-based approaches require the manual tuning of multiple parameters, while machine learning requires the user provide curated data from which the algorithm is trained. Once tuned or trained, the software is able to process far more imagery than could be achieved manually - but there is still a human-in-the-loop. It is just that the manual contribution has been moved to the front end for training purposes and is then continuously reapplied by the algorithm. Algorithms that are tuned or trained at the onset can problematically miss relevant features as the cellular phenotypes or background characteristics evolve, inadvertently skewing the analysis. For instance, variations in label intensity (e.g. photobleaching, quenching) or new morphological features that were not present during the initial training (e.g. differentiation, mitosis, blebbing) can go undetected if not retrained with a freshly curated data set or parameters that capture the offending features. 
In the same way, temporal variations in the background illumination intensity or homogeneity can also result in improper cell segmentation.7 Especially concerning is that the user-supervised training process is inherently subjective in nature and can cause unconscious biases to be effectively baked in to the extracted data by the training process. To optimize objectivity and efficiency, an essential goal is to develop software that can accept imagery from any optical modality, labeled or unlabeled, and extract the cellular features of interest without input from the user. As participants in a synthetic biology real-time reproducibility project administered by U.S. Defense Advanced Research Projects Agency (DARPA), referred to as Independent Verification & Validation (IV&V), we have recently experienced all of these algorithmic limitations and how they can result in large amounts of data either being incorrectly segmented, subjectively segmented, or left unanalyzed due to time constraints.15 The program involves a wide range of cell types (amoeboid to eukaryotic) from multiple cell biology laboratories; multiple imaging modalities – both fluorescent and label free; and objective magnifications ranging from 10X to 100X. The cumbersome process of retraining supervised machine learning software to match this variety of conditions proved impractical and a human-in-the loop training step was deemed too subjective. The challenge then was to develop a completely automated segmentation algorithm for live cell microscopy applications. In particular, the image analysis software should be ‘self-supervised’, meaning it trains itself to classify cells versus background and then regularly updates this training so that it can adapt to evolving intensities and morphologies. 
The software was required to segment a variety of cell types from live cell imagery given the most common imaging modalities as inputs - phase contrast, transmitted light, DIC, fluorescence and interference reflection microscopy (IRM) – and to do so without user-adjustable parameters or user-selected training imagery. It was additionally required that the generated models adapt to changing cell phenotypes and lighting conditions for long-term imaging applications (hours to days).

Methods

To replace more manual model based and machine learning training approaches for segmenting cells with an automated, self-supervised algorithm, we took advantage of the one phenotypic feature which is present in live cell microscopy no matter what the modality: motion. From the nanoscale diffusion of proteins and vesicles to the migration of cells that are tens of microns in length, the ever present dynamics captured by live cell microscopy make it ideal for applying optical flow (OF) algorithms designed to identify not just spatial intensity features in a given frame but also the variation or ‘flow’ of those features from frame to frame.
The central assumption in optical flow algorithms is that the overall image intensity will remain constant if the time difference between frames is reasonably small.16 This leads to the following time-derivative constraint equation:

$$\frac{d}{dt} I(x, y, t) = 0 \;\rightarrow\; \underbrace{\frac{dx}{dt}}_{u} \frac{\partial I}{\partial x} + \underbrace{\frac{dy}{dt}}_{v} \frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0$$

where $I(x, y, t)$ is the in-plane image intensity at time $t$, $u$ and $v$ being the optical flow in the x and y directions, respectively. The methods used to solve this constraint equation are matched with the imaging goal, such as reducing jitter in imagery taken from helicopters, aligning medical imagery or, in the case of this study, cell motion segmentation. In testing a range of optical flow algorithms for cell segmentation, we found the Farnebäck method to be the most robust due to its sensitivity to object deformation – a natural fit for cells which are morphologically variable.17,18 OF assumptions may or may not be met for fluorescence time-lapse imagery applications in which extended time intervals are sometimes employed to avoid phototoxicity or photobleaching.6,19 For this reason, it was important that our technique be co-validated with label free techniques such as transmitted light and phase contrast which are minimally invasive. Overlays of less frequently accumulated fluorescence imagery with cells segmented using a label-free imaging channel is then straightforward. Furthermore, there has been an increased appreciation for the morphological information label-free approaches can provide as a result of algorithmic-based phenotyping.20-22 Our approach to self-supervised learning and automated model generation begins with utilizing the Farnebäck OF method as a means of classification bootstrapping (Fig 1).
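
The brightness-constancy constraint introduced above gives one equation in two unknowns (u, v) per pixel, so in practice it has to be pooled over many pixels. The following is a minimal sketch, assuming a single global translation between two synthetic frames and solving the pooled constraint by least squares. It is not the Farnebäck method used in this work; the function names, the test image and the shift values are all hypothetical and chosen only for illustration.

```python
import math

def synthetic_frame(width, height, shift_x=0.0, shift_y=0.0):
    """Smooth test image I(x, y), optionally translated by (shift_x, shift_y)."""
    return [[math.sin(0.3 * (x - shift_x)) + math.cos(0.2 * (y - shift_y))
             for x in range(width)]
            for y in range(height)]

def estimate_global_flow(frame_prev, frame_next):
    """Least-squares (u, v) for I_x*u + I_y*v + I_t = 0 over interior pixels."""
    height, width = len(frame_prev), len(frame_prev[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Spatial derivatives: central differences averaged over both frames.
            ix = 0.25 * ((frame_prev[y][x + 1] - frame_prev[y][x - 1])
                         + (frame_next[y][x + 1] - frame_next[y][x - 1]))
            iy = 0.25 * ((frame_prev[y + 1][x] - frame_prev[y - 1][x])
                         + (frame_next[y + 1][x] - frame_next[y - 1][x]))
            # Temporal derivative: frame difference (time step of one frame).
            it = frame_next[y][x] - frame_prev[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("image gradients do not constrain both flow components")
    # Solve the 2x2 normal equations [[sxx, sxy], [sxy, syy]] [u, v]^T = -[sxt, syt]^T.
    u = (-sxt * syy + sxy * syt) / det
    v = (-syt * sxx + sxy * sxt) / det
    return u, v

frame_a = synthetic_frame(32, 32)
frame_b = synthetic_frame(32, 32, shift_x=0.4, shift_y=-0.3)
u_est, v_est = estimate_global_flow(frame_a, frame_b)
```

For this sub-pixel shift the estimate lands close to the imposed translation. Dense methods such as Farnebäck's instead return a flow vector per pixel, which is what a per-pixel flow-magnitude labeling requires.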
Typical segmentation strategies involve utilizing static information in a single image at time frame (t), which can have difficulty distinguishing ‘cell’ from ‘background’ pixels in a generalizable manner (Fig 1a). In contrast, our approach begins with an OF calculation based on images from consecutive time frames (t-1, t). This enables us to leverage the ubiquitous nature of intracellular motion and build a dynamics-based feature vector: pixels with the highest flow are automatically labeled as ‘cell’ pixels, those with the lowest flow are automatically labeled as ‘background’ pixels, and those that do not fit either category remain unlabeled (Fig 1b,c). We note that this automatic self-labeling is broadly applicable in that it is not dependent on principles of any specific optical modality, cell type, or phenotype. The OF-based self-labeling approach outputs a set of ‘cell’ and ‘background’ labeled pixels which are then used to generate additional entropy and gradient feature vectors at each time point. These static feature vectors are used to train and generate a classifier model which, in the final step, is applied to all pixels in the image for cell segmentation. The algorithm is written in stand-alone MATLAB script and utilizes functions from the Image Processing, Statistics and Machine Learning, and Computer Vision Toolboxes.

Fig. 1 Overview of the optical flow self-labeling strategy.
(a) The vast majority of cell segmentation techniques utilize single image frames and the static information contained within as means to distinguish ‘cell’ from ‘background’, oftentimes represented in a histogram. The self-supervised algorithm utilizes optical flow as a means to self-label pixels in an automated fashion. (b) Due to the prevalence of intracellular dynamics in time-lapse live cell imagery, optical flow can be calculated for each pair of consecutive images (t-1, t). The optical flow can then be represented as vectors associated with each pixel (right). (c) The magnitude of the optical flow then offers a means to distinguish cells from their background, as shown in the bivariate histogram which co-plots the pixel intensity of a single image at t to the optical flow vector magnitudes calculated between consecutive images (t-1, t). Pixels with the highest flow can be automatically labeled ‘cell’ (left of the green dashed line) and those with the lowest flow can be labeled ‘background’ (right of the yellow dashed line). Pixels that do not meet either criteria remain unlabeled, while the self-labeled pixels are used to create a training data set for classification. Time increment: 600 sec, scale bar = 20 µm.

The self-supervised training approach is illustrated in Fig 2 using time lapse DIC imagery of multiple (top) and a single highlighted (bottom) MDA-MB-231 cell. From the raw imagery (Fig 2a,b), many portions of individual cells appear to blend in with the background. However, when the OF self-labeling strategy is applied, the algorithm automatically identifies pixels with high flow magnitude, highlighted as green pixels (Fig 2c,d), which are selected as having the highest probability of correctly being labeled ‘cell’. To automatically label the background, the algorithm over segments, that is, a liberal (low) OF threshold is employed which captures motion from not only the cell but also from nearby background pixels as well.
The algorithm sets these pixel values to zero and labels the pixels in which no significant motion was detected as ‘background’ (Fig 2c,d yellow pixels). Once labeled ‘cell’ or ‘background’ in this unsupervised manner by OF (dynamic features from image pair (t-1, t)), entropy and gradient feature vectors (static features from image at t) are generated for each of these training pixels using their local neighborhood of pixels (S.I., Fig S2). These additional feature vectors are then used to train and generate a Naïve Bayesian classifier model which is applied to the entire image in a pixel-wise fashion. The information gained from the entropy and gradient feature vectors enables pixels which were left unlabeled in the OF training steps (Fig 2c,d grey pixels) to be classified. The contrast enhanced image (Fig 2b) and model-generated segmentation (Fig 2f, teal pixels) show that the algorithm is able to segment the cell with high fidelity (DIC image/segmented boundary overlay, Fig 2g). Importantly, this labeling, training and classifying procedure occurs recursively on each successive pair of (t-1, t) images, enabling the classifier model to adapt to changing backgrounds and phenotypes. By using optical flow to label the highest flow pixels as ‘cells’ and lowest flow pixels as ‘background’, the labeling process has become automated (or ‘self-supervised’) and no manual inputs or training images are needed. For extremely low contrast imagery there can be too few training pixels labeled ‘cell’ for robust segmentation to occur given the initial OF threshold setting.
In such cases, the algorithm calculates the entropy associated with ‘cell’ pixels and iteratively reduces the OF threshold until the associated ‘cell’ entropy feature vector is well distinguished from that of the ‘background’ entropy feature vector.

Fig. 2 Overview of the automated self-supervised learning algorithm. a. The contrast enhanced DIC image of several MDA-MB-231 cells and b. of a single highlighted cell illustrate the range of intensities inherent within the cells (20X objective). c. & d. Unsupervised learning via OF: a high OF threshold is used to select only those pixels exhibiting the highest flow magnitudes and label them as ‘cell’ (green pixels). Similarly, a low OF threshold is used to identify pixels with a much wider range of flow magnitudes than the high flow regime. The lowest flow magnitude pixels are labelled ‘background’ (yellow pixels). Pixels that exhibit OF in between these regimes remain unlabeled (gray pixels). e. & f. Supervised learning via self-labeled training data. The self-labeled pixels (green and yellow) are then used to generate static feature vectors, which are in turn used to train the classifier model. g. The blue outline is the resulting segmentation which outlines all pixels classified by the OF trained model as ‘cell’ and is also overlaid on the image in b. This process is repeated at every time step, thereby using the most recent imagery to update the training data. Scale bar: 25 µm (20X objective, time increment: 300 sec).

Results

The Fig 3 imagery shows the generality of this approach and also demonstrates how the self-supervised algorithm additionally automates commonly required manual inputs such as size filtering and hole filling. The segmented cells were processed from imagery acquired from a range of cell types, imaging modalities, magnifications and time increments (S.I. Table S1). The OF algorithm enabled a straightforward approach to automated size filtering which is a common user adjustable parameter in supervised machine learning approaches. To accomplish this, a stand-alone application of OF was applied to the imagery which lacked the added steps of self-tuning and model building described above. While some cell features are missed, this simpler, faster approach was found to be more than precise enough to estimate average cell size and to exclude much smaller objects, thus automating the size filtering process. Because extraneous debris often lacked the motion of the live cells, this debris was also automatically labeled as background by the OF algorithm. Fig 3a and b demonstrate the self-supervised code’s ability to size filter, while also adapting to cell types of differing sizes, by comparing the segmentation of human fibroblasts (10X, phase contrast) to those of the much smaller Dictyostelium amoeboid cells (10X, transmitted light), respectively. Extraneous debris features in the Hs27 imagery (Fig 3a, white arrows) are correctly identified as ‘background’, even though similar in size and intensity to the Dictyostelium cells of Fig 3b. The background inhomogeneities observed in Fig 3a and 3b, which could potentially be mislabeled as ‘cell’, are correctly identified because they remain relatively constant from frame t-1 to frame t.
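
The automated size filtering described above can be illustrated with a small sketch. It assumes that a binary motion mask from a stand-alone optical flow pass is already available, estimates a typical cell area from its connected components, and then discards much smaller objects from a segmentation mask. The 4-connectivity, the `min_fraction` cutoff and all function names are assumptions made for illustration; the source does not state how "much smaller" is defined, and this is not the authors' MATLAB implementation.

```python
from collections import deque

def to_mask(rows):
    """Convert strings of '0'/'1' into a binary mask (list of lists of ints)."""
    return [[int(ch) for ch in row] for row in rows]

def connected_components(mask):
    """Return the 4-connected foreground components as lists of (row, col)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components

def mean_component_area(motion_mask):
    """Average object area (in pixels) in the motion mask."""
    components = connected_components(motion_mask)
    if not components:
        raise ValueError("motion mask contains no objects")
    return sum(len(p) for p in components) / len(components)

def size_filter(mask, typical_area, min_fraction=0.25):
    """Keep objects of at least min_fraction * typical_area pixels (assumed cutoff)."""
    filtered = [[0] * len(row) for row in mask]
    for pixels in connected_components(mask):
        if len(pixels) >= min_fraction * typical_area:
            for y, x in pixels:
                filtered[y][x] = 1
    return filtered

# Two moving cells (9 and 12 pixels) detected by the motion mask.
motion_mask = to_mask(["1110000000",
                       "1110001111",
                       "1110001111",
                       "0000001111",
                       "0000000000"])
# The segmentation additionally contains a one-pixel piece of static debris.
segmented = to_mask(["1110000000",
                     "1110001111",
                     "1110001111",
                     "0000001111",
                     "0000100000"])
typical_area = mean_component_area(motion_mask)
filtered = size_filter(segmented, typical_area)
```

In this toy case the single debris pixel falls below the cutoff and is removed while both cells are retained. Deriving the size scale from the imagery itself, rather than from a user-supplied value, is what lets the same step serve cell types of very different sizes.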
The segmentation results of the MDA-MB-231 cells (10X, phase contrast) in Fig 3c illustrate the algorithm’s ability to adapt to a wide range of phenotypes, from rounded Fig 3c(i) to spread Fig 3c(ii), which is enabled without need for user input by continuously retraining the model on consecutive image pairs. The current instantiation of the software does not attempt to separate cells that are touching or close enough to be segmented as a single object. Well-developed approaches such as watershed transforms23 and level-set methods24 can be employed for such purposes. The algorithm works robustly for a range of optical modalities and magnifications as shown in Figs 3d-f. Figs 3d and 3e are segmentation results from IRM imagery (40X, Hs27 cell) and DIC imagery (20X, MDA-MB-231). As a fluorescence imaging example, a self-supervised segmentation of a GFP-actin labeled A549 cell at 100X magnification is shown in Fig 3f. As an additional option, OF can be applied not only as an algorithm labeling element, but also as a measurement tool, as shown in the Fig 3f vector plot. The plotted OF vectors (blue) display the magnitude and direction of the measured GFP labelled actin flow between frames. Such measurements have been shown to be useful for quantifying intracellular protein and calcium signaling dynamics.25-27

Fig. 3 Self-supervised segmentation for a range of cell types, microscope modalities, time resolutions and magnifications. a. phase contrast of Hs27 fibroblasts (10X objective, time increment: 1200 sec) b. transmitted light of Dictyostelium (10X objective, time increment: 60 sec) c.
phase contrast of MDA-MB-231 (10X objective, time increment: 600 sec) d. IRM image of a single Hs27 cell (40X objective, time increment: 600 sec) e. DIC image of MDA- MB-231 cells (20X objective, time increment: 120 sec ) f. fluorescence image of a single lifeAct (GFP-actin conjugate) transfected A549 cell (pseudo-colored) with the associated optical flow vector plot (100X objective, time increment: 10 sec). Insets i, ii, iii highlight boxed image regions. White arrows point to examples of debris that was correctly labelled ‘background’ due either to lack of motion or automated size filtering. Images have been contrast enhanced to highlight low contrast features and background inhomogeneities. DIC image (e) was additionally enhanced with a 105 and is also made available for use under a CC0 license. (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.07.425773doi: bioRxiv preprint https://doi.org/10.1101/2021.01.07.425773 sharpen filter to highlight interference induced shadowing of cell features. Scale bars: a, b, c: 50 µm; d, e: 25 µm; f: 10 µm. Hole filling, another often required manual input for model-based and machine learning algorithms, has also been automated by this approach. Common examples of when hole filling input is required include fluorescent labels that do not penetrate the nucleus or, for label-free microscopy modes such as phase contrast, large spread cells in which the algorithm has a difficult time associating the interference enhanced cell edges with the enclosed lamellipodia. We found that motion within cells was ubiquitously detected by OF, regardless of imaging modality or whether imaging the cell membrane, nucleus or cytoplasm. 
Because motion detection was far more common than not for a given pixel within an area labeled ‘cell’, a fixed morphological blurring tool (circular with a radius of 5 pixels) was found to robustly hole fill regardless of cell type or microscope configuration. The calculated cell area was found to be invariant for a range of blurring tool radii (Fig S2). In all cases, the use of optical flow to identify motion and the 5 pixel radius blurring tool was sufficient to correctly fill in the cell.

By re-training on every pair of consecutive images, the self-supervised algorithm remains accurate throughout long-term imaging applications, despite changes in background or cell phenotypes. This allows rich dynamic morphology and migration behavior to be readily collected and analyzed – a key point given the known inter-relationship between cellular shape and function.2,28,29 Furthermore, the emerging role that not just cell shape, but cell shape dynamics, play in fundamental biological processes is becoming increasingly clear.30 Fig 4 demonstrates how such quantitative morphological information is readily mined in a long-term imaging application. Fig 4a-c shows the tracking of several MDA-MB-231 cells segmented via the self-supervised approach under 10X phase contrast microscopy on cRGD functionalized gold coverslips.31 Fig 4a shows the labeled tracks of the cells’ centroids over the course of 400 minutes, with the corresponding initial and final images shown in Figs 4b,c. The cell associated with track 2 undergoes mitosis at approximately 320 minutes, creating two new tracks (5 and 6) for the daughter cells. Because the self-supervised approach automatically re-trains continuously on consecutive frame pairs, the morphological changes from Fig 4b to Fig 4c are quantified with high fidelity, as can be seen by plotting the segmented boundaries as a function of time (Fig 4d).
Fig. 4 Tracking of MDA-MB-231 cells under 10X phase contrast microscopy and time evolution of cell morphology through mitosis. a. The resulting tracks of multiple segmented cells from a single field of view over the course of 400 minutes. b. Corresponding images at times t = 0 min and c. 400 min. Track 2 undergoes mitosis resulting in tracks 5 and 6 of the daughter cells (blue line). d. (left) Time evolution of segmented morphology of track 2 (black) with the centroid of each shape denoted by an open circle until mitosis, after which the track splits into 5 (green) and 6 (blue), with the cell separation event denoted by a single red open circle. d. (right) Selected images showing raw data overlaid with the self-supervised segmentation throughout the mitosis event. (10X objective, time increment: 600 sec) Scale bar: 100 μm.

Discussion & Conclusions

There are numerous advantages to this self-supervised machine learning approach. The most obvious is that because the training data is generated by tracking motion, the approach can be used with any live cell imaging microscopy technique, whether labeled or label-free.
Also unique is the use of the optical flow labeled pixels to self-supervise the building of a classifier model, which in turn is modular with regard to the incorporated feature vectors. While we have employed only two feature vectors in this current instantiation of the classification code (gradient and entropy), there are many additional image features that can be added based on the application. We have also shown that the incorporation of OF enables the straightforward automation of morphological operations such as size filtering and hole filling, eliminating the need for manually tuning these parameters.

The automation described here is markedly different from machine learning approaches that require user-assisted training. The most time-consuming aspect of model-based tuning and machine learning approaches is the training process. The process is one of trial and error, requiring retraining if the model’s performance is not deemed adequate. The complete automation of both the training and segmentation algorithms not only saves time but also removes the chance of unconscious bias entering the training process. Because the training is conducted recursively with each new image, evolutions in phenotype and background structure over extended time periods are accounted for without the need for preprocessing.

The sum of all these advantages is segmentation under a wide range of magnifications, time resolutions, cell types and optical modalities that is both automated and robust. This results in the ability to track cells for hours or days and quantify a range of morphological and phenotypic features without the need for user input, thus having broad applicability throughout live cell microscopy. The crux of the introduced self-supervised approach is the use of the dynamic information embedded in each pixel – motion characterized via optical flow – as an elegant means to self-label cells versus background in time-lapse imagery.
While cellular dynamics has long been appreciated as information rich with regard to understanding cell function, our approach demonstrates that it also provides the means for robust segmentation – a foundational step for achieving quantitative and objective live cell analysis.

Acknowledgements

The authors gratefully acknowledge the Devreotes laboratory of Johns Hopkins University for the Dictyostelium discoideum cell line. M.C.R. gratefully acknowledges support from the National Research Council Research Associateship Program and the Jerome and Isabella Karle Distinguished Scholar Fellowship Program. Funding for this project was provided by the Office of Naval Research through the Naval Research Laboratory’s Basic Research Program and by the Biological Technology Office of the Defense Advanced Research Projects Agency.

Author Contributions

Michael C. Robitaille: conceptualization, methodology, investigation, data curation, software, visualization, and writing. Jeff M. Byers: conceptualization, methodology, formal analysis, and software. Joseph A. Christodoulides: resources, validation, and writing. Marc P. Raphael: conceptualization, funding acquisition, methodology, investigation, software, visualization, and writing.

Financial Conflicts of Interest

The authors do not have any conflicts of interest with this work.

References

1 Caicedo, J. C., Singh, S. & Carpenter, A. E. Applications in image-based profiling of perturbations. Current Opinion in Biotechnology 39, 134-142, doi:10.1016/j.copbio.2016.04.003 (2016).
2 Cadart, C., Zlotek-Zlotkiewicz, E., Le Berre, M., Piel, M.
& Matthews, H. K. Exploring the Function of Cell Shape and Size during Mitosis. Developmental Cell 29, 159-169, doi:10.1016/j.devcel.2014.04.009 (2014).
3 Zhou, X. B. & Wong, S. T. C. High content cellular imaging for drug development. IEEE Signal Processing Magazine 23, 170-174, doi:10.1109/msp.2006.1598095 (2006).
4 Zhong, J. et al. Persistent hepatitis C virus infection in vitro: Coevolution of virus and host. Journal of Virology 80, 11082-11093, doi:10.1128/jvi.01307-06 (2006).
5 Zhu, N. et al. Morphogenesis and cytopathic effect of SARS-CoV-2 infection in human airway epithelial cells. Nature Communications 11, doi:10.1038/s41467-020-17796-z (2020).
6 Skylaki, S., Hilsenbeck, O. & Schroeder, T. Challenges in long-term imaging and quantification of single-cell dynamics. Nature Biotechnology 34, 1137-1144, doi:10.1038/nbt.3713 (2016).
7 Caicedo, J. C. et al. Data-analysis strategies for image-based cell profiling. Nature Methods 14, 849-863, doi:10.1038/nmeth.4397 (2017).
8 Deep learning gets scope time. Nature Methods 16, 1195-1195, doi:10.1038/s41592-019-0670-x (2019).
9 Grys, B. T. et al. Machine learning and computer vision approaches for phenotypic profiling. Journal of Cell Biology 216, 65-71, doi:10.1083/jcb.201610026 (2017).
10 Moen, E. et al. Deep learning for cellular image analysis. Nature Methods 16, 1233-1246, doi:10.1038/s41592-019-0403-1 (2019).
11 Carpenter, A. E. et al. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biology 7, doi:10.1186/gb-2006-7-10-r100 (2006).
12 Al-Kofahi, Y., Zaltsman, A., Graves, R., Marshall, W. & Rusu, M. A deep learning-based algorithm for 2-D cell segmentation in microscopy images. BMC Bioinformatics 19, doi:10.1186/s12859-018-2375-z (2018).
13 Falk, T. et al. U-Net: deep learning for cell counting, detection, and morphometry (vol 16, pg 67, 2019). Nature Methods 16, 351-351, doi:10.1038/s41592-019-0356-4 (2019).
14 Sommer, C., Straehle, C., Kothe, U., Hamprecht, F. A.
& IEEE. in 2011 8th IEEE International Symposium on Biomedical Imaging: From Nano to Macro IEEE International Symposium on Biomedical Imaging 230-233 (2011).
15 Raphael, M. P., Sheehan, P. E. & Vora, G. J. A controlled trial for reproducibility. Nature 579, 190-192, doi:10.1038/d41586-020-00672-7 (2020).
16 Beauchemin, S. S. & Barron, J. L. The computation of optical flow. ACM Comput. Surv. 27, 433-467, doi:10.1145/212094.212141 (1995).
17 Farneback, G. in Image Analysis, Proceedings Vol. 2749 Lecture Notes in Computer Science (eds J. Bigun & T. Gustavsson) 363-370 (2003).
18 Robitaille, M. C., Byers, J. M., Christodoulides, J. A. & Raphael, M. P. Robust Optical Flow Algorithm for General, Label-free Cell Segmentation. bioRxiv, 2020.10.26.355958, doi:10.1101/2020.10.26.355958 (2020).
19 Schroeder, T. Long-term single-cell imaging of mammalian stem cells. Nature Methods 8, S30-S35, doi:10.1038/nmeth.1577 (2011).
20 Jaccard, N. et al. Automated Method for the Rapid and Precise Estimation of Adherent Cell Culture Characteristics from Phase Contrast Microscopy Images. Biotechnol. Bioeng. 111, 504-517, doi:10.1002/bit.25115 (2014).
21 Ounkomol, C., Seshamani, S., Maleckar, M. M., Collman, F. & Johnson, G. R. Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy. Nature Methods 15, 917-+, doi:10.1038/s41592-018-0111-2 (2018).
22 Vicar, T. et al. Cell segmentation methods for label-free contrast microscopy: review and comprehensive comparison. BMC Bioinformatics 20, 25, doi:10.1186/s12859-019-2880-8 (2019).
23 Wang, M. et al.
Novel cell segmentation and online SVM for cell cycle phase identification in automated microscopy. Bioinformatics 24, 94-101, doi:10.1093/bioinformatics/btm530 (2008).
24 Nath, S. K., Palaniappan, K. & Bunyak, F. in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2006, Pt 1 Vol. 4190 Lecture Notes in Computer Science (eds R. Larsen, M. Nielsen, & J. Sporring) 101-108 (2006).
25 Buibas, M., Yu, D., Nizar, K. & Silva, G. A. Mapping the Spatiotemporal Dynamics of Calcium Signaling in Cellular Neural Networks Using Optical Flow. Annals of Biomedical Engineering 38, 2520-2531, doi:10.1007/s10439-010-0005-7 (2010).
26 Delpiano, J. et al. Performance of optical flow techniques for motion analysis of fluorescent point signals in confocal microscopy. Machine Vision and Applications 23, 675-689, doi:10.1007/s00138-011-0362-8 (2012).
27 Lee, R. M. et al. Quantifying topography-guided actin dynamics across scales using optical flow. Mol. Biol. Cell 31, 1753-1764, doi:10.1091/mbc.E19-11-0614 (2020).
28 Meyers, J., Craig, J. & Odde, D. J. Potential for control of signaling pathways via cell size and shape. Current Biology 16, 1685-1693, doi:10.1016/j.cub.2006.07.056 (2006).
29 Rangamani, P. et al. Decoding Information in Cell Shape. Cell 154, 1356-1369, doi:10.1016/j.cell.2013.08.026 (2013).
30 Akanuma, T., Chen, C., Sato, T., Merks, R. M. H. & Sato, T. N. Memory of cell shape biases stochastic fate decision-making despite mitotic rounding. Nature Communications 7, doi:10.1038/ncomms11963 (2016).
31 Robitaille, M. C. et al. Problem of Diminished cRGD Surface Activity and What Can Be Done about It. ACS Applied Materials & Interfaces 12, 19337-19344, doi:10.1021/acsami.0c04340 (2020).
10_1101-2021_01_07_425782 ---- dynUGENE: an R package for uncertainty-aware gene regulatory network inference, simulation, and visualization

dynUGENE: an R package for uncertainty-aware gene regulatory network inference, simulation, and visualization

Tianyu Lu1,2 and Anjali Silva2,3,4

1 Department of Computer Science, University of Toronto, Toronto, Canada
2 Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
3 Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
4 Vector Institute, Toronto, Canada

Methods for gene regulatory network inference focus on network architecture identification but neglect model selection and simulation. We implement an extension to the dynGENIE3 algorithm that accounts for model uncertainty as an R package, providing users with an easy-to-use interface for model selection and gene expression profile simulation. Source code is available at https://github.com/tianyu-lu/dynUGENE with a detailed user guide. A webserver with interactive controls is available at https://tianyulu.shinyapps.io/dynUGENE/.

Gene regulatory network | network inference

Correspondence: tianyu.lu@mail.utoronto.ca

Introduction

Complex phenomena such as cell development and apoptosis emerge from coordinated dynamics of gene regulatory networks (GRNs). Inferring network structure from data can be used for hypothesis generation, revealing mechanisms in cell development and disease (Huang et al., 2009), and modelling network evolution (Crombach and Hogeweg, 2008). Accurate dynamical models allow us to predict the effects of network perturbations on biological function, for example to push cells out of a disease state (Karlebach and Shamir, 2010), or to design synthetic GRNs given the desired dynamics of a network (Hiscock, 2019).
The ideal model should be flexible enough to capture highly nonlinear interactions while not sacrificing model interpretability and computation time.

We present dynUGENE (dynamical Uncertainty-aware GEne NEtwork inference), an R package that extends the functionality of dynGENIE3, a state-of-the-art method for GRN inference (Geurts et al., 2018). We build on dynGENIE3 because it satisfies all three of our model desiderata. Existing extensions include TIMEOR and BENIN, which both incorporate heterogeneous data to improve network inference accuracy (Wonkap and Butler, 2020; Conard et al., 2020). Here, we take a different approach and instead account for uncertainty in dynGENIE3, allowing for stochastic gene expression simulations and parsimonious model selection. Our extension is available as an easy-to-use R package and also as an interactive web server.

Package Design

dynGENIE3 Background. dynGENIE3 poses GRN inference as a feature selection problem. It first trains random forests to predict the change in concentration of each species given the current concentrations of all species. Each interaction from species x_i to species x_j is associated with an importance score, calculated from the reduction in variance obtained by using x_i to predict the change in x_j. The importance score for an interaction, when normalized, is interpreted as the probability that the interaction exists. For a detailed treatment, see the vignette and Geurts et al. (2018).

Model Selection. The inferred network can be visualized as a p×p matrix where the entry [x_i, x_j] is the importance score of x_i for inferring x_j (Fig. 1). However, real GRNs are often not fully connected and the presence of an interaction is binary (Mangan et al., 2016). To address this, dynUGENE includes a function for model selection based on visualizing the Pareto front (Mangan et al., 2016). However, we note that the model at the sharp drop in the Pareto front is not always the best model (Supplementary Fig.
S1). We include an additional function on the web server where users can choose which interactions to mask. The masked networks can then be simulated, allowing for application-specific tuning of model complexity.

Model Simulation. The inferred networks and masked networks can be used to simulate gene expression profiles by numerically solving the system of ordinary differential equations learned by the random forests. In addition to deterministic simulations, we provide an option for stochastic simulations that accounts for the uncertainty in the random forests’ predictions. For stochastic simulations, instead of only taking the mean of a random forest’s predictions, we sample from the Gaussian N(µ, σ²) where µ is the mean and σ² is the variance of the random forest’s predictions.

Fig. 1: Bottom: inferred importance scores on the repressilator dataset for the 16th network in the step-wise column masks plot (Supplementary Fig. S2). Top: Simulated trajectory using the inferred network.

Lu et al. | bioRχiv | January 7, 2021 | 1–3

bioRxiv preprint, this version posted January 8, 2021; doi: https://doi.org/10.1101/2021.01.07.425782. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Provided Datasets. The dynUGENE package provides four example time-series datasets: repressilator, stochastic repressilator, Hodgkin-Huxley, and stochastic Hodgkin-Huxley (Elowitz and Leibler, 2000; Hodgkin and Huxley, 1952). These datasets were generated from systems of ordinary or stochastic differential equations. Details are provided in the vignette.
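As a rough illustration of the difference between the deterministic and stochastic simulation options described under Model Simulation, the sketch below integrates a toy system in which each species has a small ensemble of rate predictors. It is written in Python for brevity, does not use or reproduce the dynUGENE API, and rests on several assumptions that are not taken from the package: each ensemble member (a stand-in for one tree of a random forest) directly predicts the rate of change of its species, a fixed-step Euler scheme with step `dt` is used, and the spread is taken as the population standard deviation of the member predictions.

```python
import random
from statistics import fmean, pstdev


def simulate(x0, ensembles, dt=0.1, steps=50, stochastic=False, seed=None):
    """Euler-integrate dx/dt = f(x), where f comes from per-species ensembles.

    ensembles[j] is a list of callables; each maps the current state x to one
    prediction of d x[j] / dt (hypothetical stand-in for one tree of a forest).
    """
    rng = random.Random(seed)
    x = list(x0)
    trajectory = [list(x)]
    for _ in range(steps):
        rates = []
        for members in ensembles:
            preds = [member(x) for member in members]
            mu = fmean(preds)       # ensemble mean
            sigma = pstdev(preds)   # ensemble standard deviation
            # Deterministic: use the mean. Stochastic: draw from N(mu, sigma^2).
            # Note that random.gauss takes the standard deviation, not the variance.
            rates.append(rng.gauss(mu, sigma) if stochastic else mu)
        x = [xi + dt * ri for xi, ri in zip(x, rates)]
        trajectory.append(list(x))
    return trajectory


# Toy one-species system: three members that disagree slightly on a decay rate.
toy_ensembles = [[lambda x: -0.9 * x[0], lambda x: -1.0 * x[0], lambda x: -1.1 * x[0]]]

deterministic = simulate([1.0], toy_ensembles, steps=10)
stochastic = simulate([1.0], toy_ensembles, steps=10, stochastic=True, seed=1)
print(deterministic[-1], stochastic[-1])
```

Repeating the stochastic call with different seeds gives a family of trajectories whose spread reflects the disagreement among ensemble members, whereas the deterministic call always returns the same mean trajectory.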
The package also includes one steady state dataset, SynTReN300, taken from GRNdata (Bellot et al., 2020). Users can provide their own data as input following the format specified in ?inferNetwork.

Discussion

A requirement for dynGENIE3 and dynUGENE is that all species must be tracked through time. This requirement is difficult to satisfy in practice as there are often unknown species in a biological process of interest. Methods that can identify or approximate latent structure in partially-observed systems are more appropriate here (Hiscock, 2019). An omics treatment such as RNA-seq can cover breadth but current sequencing techniques require cells to be destroyed, thus making time series data collection difficult. Non-destructive sequencing techniques could address this issue.

The implementation of an inferred network as a gene circuit will require more thought. Even for networks with sparse interactions, the likelihood of finding a set of genes and proteins that satisfy the interaction strengths and activation or inhibitory effects is unknown. In fact, whether a species is an activator or inhibitor is not explicitly given in the interaction matrix. We can address this by posing dynUGENE as a constrained optimization problem where it is limited to using only a given set of parts (genes, promoters, ribosome binding sites, proteins, etc.), thus relating the importance scores with biological interaction strengths. We leave this for future work.

Data and code availability

Source code is available at https://github.com/tianyu-lu/dynUGENE with a detailed user guide. A webserver with interactive controls is available at https://tianyulu.shinyapps.io/dynUGENE/.

ACKNOWLEDGEMENTS

The authors thank the authors of dynGENIE3 for their work and Alan Moses for guidance.

FUNDING

This work was supported by a Postdoctoral Fellowship from Canadian Institutes of Health Research.

Bibliography

Sui Huang, Ingemar Ernberg, and Stuart Kauffman.
Cancer attractors: a systems view of tumors from a gene network dynamics and developmental perspective. In Seminars in cell & developmental biology, volume 20, pages 869–876. Elsevier, 2009.
Anton Crombach and Paulien Hogeweg. Evolution of evolvability in gene regulatory networks. PLoS Computational Biology, 4(7):e1000112, 2008.
Guy Karlebach and Ron Shamir. Minimally perturbing a gene regulatory network to avoid a disease phenotype: the glioma network as a test case. BMC Systems Biology, 4(1):15, 2010.
Tom W Hiscock. Adapting machine-learning algorithms to design gene circuits. BMC Bioinformatics, 20(1):1–13, 2019.
Pierre Geurts et al. dyngenie3: dynamical genie3 for the inference of gene networks from time series expression data. Scientific Reports, 8(1):1–12, 2018.
Stephanie Kamgnia Wonkap and Gregory Butler. Benin: Biologically enhanced network inference. Journal of Bioinformatics and Computational Biology, 18(03):2040007, 2020.
Ashley Mae Conard, Nathaniel Goodman, Yanhui Hu, Norbert Perrimon, Ritambhara Singh, Charles Lawrence, and Erica Larschan. Timeor: a web-based tool to uncover temporal regulatory mechanisms from multi-omics data. bioRxiv, 2020.
Niall M Mangan, Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Inferring biological networks by sparse identification of nonlinear dynamics. IEEE Transactions on Molecular, Biological and Multi-Scale Communications, 2(1):52–63, 2016.
Michael B Elowitz and Stanislas Leibler. A synthetic oscillatory network of transcriptional regulators. Nature, 403(6767):335–338, 2000.
Alan L Hodgkin and Andrew F Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500, 1952.
Pau Bellot, Catharina Olsen, and Patrick E Meyer. grndata: Synthetic Expression Data for Gene Regulatory Network Inference, 2020. R package version 1.20.0.
Carl Ganz. rintrojs: A wrapper for the intro.js library.
Journal of Open Source Software, 1(6):63, 2016.
Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz, and Bill Venables. gplots: Various R Programming Tools for Plotting Data, 2020. R package version 3.1.0.
Hadley Wickham. ggplot2: elegant graphics for data analysis. Springer, 2016.
Christopher Rackauckas and Qing Nie. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete and Continuous Dynamical Systems. Series B, 22(7):2731, 2017a.
Christopher Rackauckas and Qing Nie. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software, 5(1), 2017b.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.
10_1101-2021_01_07_425794 ---- Impact of gene annotation choice on the quantification of RNA-seq data

Impact of gene annotation choice on the quantification of RNA-seq data

David Chisanga 1,2,3,4, Yang Liao 1,2,3,4 and Wei Shi 1,2,3,5*

1Olivia Newton-John Cancer Research Institute, Heidelberg, Victoria, 3084, Australia, 2School of Cancer Medicine, La Trobe University, Bundoora, Victoria, 3083, Australia, 3Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, 3052, Australia, 4Department of Medical Biology, The University of Melbourne, Parkville, Victoria, 3010, Australia and 5School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, 3010, Australia

Abstract

RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium.
We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods.

*To whom correspondence should be addressed. Tel: +61 3 9496 5726; Fax: +61 3 9496 5334; Email: Wei.Shi@onjcri.org.au

bioRxiv preprint, this version posted January 8, 2021; doi: https://doi.org/10.1101/2021.01.07.425794. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1 INTRODUCTION

Gene expression profiling using RNA sequencing (RNA-seq) is a core activity in molecular biology. Comprehensive gene expression analysis in various settings is important for generating hypotheses for ongoing research, investigating drug effects in biological or clinical settings and as a diagnostic tool. A popular approach to gene-level quantification from RNA-seq data involves mapping reads to a reference genome and then counting the mapped reads associated with each gene [1, 2, 3, 4, 5]. The process of counting mapped reads to genes requires a database of known genes. A gene is only quantified if it or its components have genomic coordinates already defined with respect to the genome sequence in a process called annotation.
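The dependence of read counting on the annotation can be made concrete with a toy sketch. It is only an illustration under stated assumptions, not the pipeline used in this study: the coordinates, gene names and the two annotations (`annotation_a`, `annotation_b`) are made up, intervals are treated as 1-based and inclusive, and a read is counted only when it overlaps exons of exactly one gene.

```python
def overlaps(read, exon):
    """True if two (chrom, start, end) intervals share at least one base."""
    return read[0] == exon[0] and read[1] <= exon[2] and exon[1] <= read[2]


def count_reads(reads, annotation):
    """Count mapped reads per gene.

    annotation maps a gene name to a list of exon intervals. Reads that hit
    no gene, or more than one gene, are left uncounted (illustrative rule).
    """
    counts = {gene: 0 for gene in annotation}
    for read in reads:
        hits = [gene for gene, exons in annotation.items()
                if any(overlaps(read, exon) for exon in exons)]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# The same four mapped reads, counted against two hypothetical annotations
# that disagree on the exon boundaries of geneX and the exon set of geneY.
reads = [("chr1", 100, 150), ("chr1", 180, 230), ("chr1", 400, 450), ("chr1", 900, 950)]
annotation_a = {"geneX": [("chr1", 90, 200)], "geneY": [("chr1", 380, 500)]}
annotation_b = {"geneX": [("chr1", 90, 160)],
                "geneY": [("chr1", 380, 500), ("chr1", 880, 1000)]}

print(count_reads(reads, annotation_a))
print(count_reads(reads, annotation_b))
```

With identical reads, the two annotations assign different counts to both genes, which is the kind of annotation-driven difference in gene-level quantification that this paper sets out to measure.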
For each genome annotation model, a different set of annotation techniques and information sources are used and as such, these annotations vary in terms of comprehensiveness and accuracy of annotated genomic features. Annotation techniques often include computer-based predic- tions and/or evidence-based techniques such as manual curation [6, 7]. Computer-based predictions result in more complex gene models that have a higher proportion of predic- tive genomic features while evidence-based generated gene models are simpler with fewer genes and isoforms. Common annotation models for human and mouse genomes include Ensembl [8], RefSeq [9], GENCODE [10] and UCSC [11] annotations. Annotations are, therefore, an important component in an RNA-seq analysis as the results are dependent on what is known in the annotation database. Despite the importance of gene annotations in RNA-seq data analysis, very little re- search has been conducted to examine how differences in annotations impact on gene expression quantification, which is crucial for downstream analyses such as discovery of differentially expressed genes and identification of perturbed pathways. Previous studies compared the effect of human genome annotations from popular databases including En- sembl, GENCODE and RefSeq on various aspects of RNA-seq analysis and they showed that the choice of annotations had an impact on gene-level quantification in the RNA- seq analysis [12, 13]. However, these studies are out of date as they were based on old annotations and they also lacked a reliable ground truth for assessing the impact of annotation. Major annotation databases have undergone significant expansions over the years, thanks to the wide application of sequencing technologies and the massive amount of se- quencing data that have been generated across the world. However, it is unclear whether the quality of gene annotations have been successfully maintained. 
A recent study suggested that gene annotations have become less accurate and have lagged behind during this expansion [6]. This can be attributed to errors from sequencing experiments, sequence analysis or automation in the annotation process. It is important to systematically assess the accuracy of the new gene annotations generated in recent years to ensure the popular annotation databases can continue to be utilized by the community for RNA-seq analysis.

Furthermore, the use of different annotations in different studies makes it difficult for researchers to reproduce the findings from such studies. For example, large consortia such as the European Molecular Biology Laboratory (EMBL) use Ensembl in their studies while the National Center for Biotechnology Information (NCBI) tends to use RefSeq. Since this can significantly affect gene expression data, there is a need to develop a comprehensive understanding of how these differences in annotations affect gene-level expression quantification.

In this study, we compared three human gene annotations, including a recent Ensembl annotation (released in April 2020), a recent RefSeq annotation (released in August 2020) and an old RefSeq annotation (released in April 2015), to understand their impact on gene-level expression quantification in an RNA-seq data analysis pipeline. Although the old RefSeq annotation is no longer available at the NCBI RefSeq database, it has been included as part of Rsubread, a popular RNA-seq quantification toolkit, for quantifying human RNA-seq data.
We used a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC/MAQC III) consortium for this evaluation. We show that the use of RefSeq gene annotations led to better quantification accuracy than the use of the Ensembl annotation, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known genome-wide titration ratios of gene expression and microarray gene expression data. We also show that the older RefSeq annotation yielded higher quantification accuracy than the recent RefSeq annotation in our evaluations, suggesting that the recent expansion and changes made to the RefSeq annotation have led to a decline in annotation accuracy, resulting in less accurate quantification results. Furthermore, we investigated whether any normalization method can mitigate the differences in quantification results caused by the annotation differences. Our results show that the quantification differences remained almost the same no matter how the RNA-seq data were normalized.

2 MATERIALS AND METHODS

2.1 SEQC/MAQC data

The RNA-seq data used for evaluation in this study are a benchmark dataset generated by the Sequencing Quality Control (SEQC) project [1], the third stage of the MicroArray Quality Control (MAQC) study [14, 15]. The SEQC dataset includes the Universal Human Reference RNA (UHRR) as sample A and the Human Brain Reference RNA (HBRR) as sample B.
It also includes two other samples, C and D, which are combinations of A and B mixed in ratios of 3:1 (C) and 1:3 (D), respectively. The samples were sequenced in four replicate paired-end libraries using an Illumina HiSeq 2000 sequencer at the Australian Genomics Research Facility (AGRF). Each library contains ∼20 million 100bp read pairs.

A TaqMan real-time polymerase chain reaction (RT-PCR) dataset with expression values measured for over 1,000 genes, which was generated in the MAQC-I study [15], was used to validate the expression of the RNA-seq data in this study. The expression values were measured for both the UHRR and HBRR samples together with their respective combinations. Around 800–900 TaqMan RT-PCR genes, which had matching gene identifiers with expressed RNA-seq genes from different annotations, were included for assessing the accuracy of RNA-seq quantification. In addition, microarray data generated in the MAQC-I study with samples A to D hybridized to the Illumina Human-6 BeadChip microarrays were also used in the assessment. The TaqMan RT-PCR and Illumina microarray datasets are available as part of the Bioconductor package ‘seqc’ [16].

2.2 Annotations used

Three human gene annotations were included in this study: a recent Ensembl annotation, a recent RefSeq annotation and an old RefSeq annotation. All these annotations were generated based on the human reference genome GRCh38/hg38.

The Ensembl gene annotation used in this study was generated in April 2020. Its version number is 100. It was downloaded from ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz.

The recent RefSeq gene annotation used was released by the NCBI in August 2020. Its release number is 109.20200815 and it is part of the RefSeq release version 202. It was downloaded from the NCBI FTP site ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20200815/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz. We refer to this RefSeq annotation as ‘RefSeq-NCBI’ in this study.

The old RefSeq annotation included in this study was released by the NCBI in April 2015. It was released as part of the Patch 2 release of the GRCh38/hg38 genome build. This annotation has also been included in the popular RNA-seq quantification toolkit Rsubread [5] as the default annotation used for quantifying human RNA-seq data. The inclusion of this old RefSeq annotation allowed us to investigate how the annotation changes made recently to RefSeq affect the quantification result of RNA-seq data. The RefSeq annotation in Rsubread is slightly different from the original one in that the overlapping exons from the same gene were collapsed to form a single continuous exon for the gene in the Rsubread annotation. However, this difference will not change the gene-level RNA-seq quantification result because the set of exonic bases belonging to each gene is the same between the original annotation and the Rsubread annotation. As this old RefSeq annotation is no longer available for download at the NCBI FTP site, we instead used the Rsubread annotation in this study and we denote this annotation as ‘RefSeq-Rsubread’.

When matching genes from different annotations, we converted the gene identifiers using the Bioconductor package ‘org.Hs.eg.db’ [17] and then compared them to find common genes between annotations.
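The claim in Section 2.2 that collapsing overlapping exons leaves the set of exonic bases unchanged can be checked with a small sketch. This is an illustration only, not the code used to build the Rsubread annotation; the function names and the exon coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive as in GTF files.

```python
def collapse_exons(exons):
    """Merge overlapping exons of one gene into continuous exons.

    `exons` is a list of (start, end) pairs, 1-based and inclusive.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous exon: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exonic_bases(exons):
    """Return the set of distinct genomic positions covered by the exons."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


# Hypothetical gene: two overlapping exons and one separate exon.
original = [(100, 200), (150, 300), (500, 600)]
collapsed = collapse_exons(original)

print(collapsed)                                            # collapsed exon list
print(exonic_bases(original) == exonic_bases(collapsed))    # same exonic bases?
print(len(exonic_bases(collapsed)))                         # number of unique exonic bases
```

Because a fragment is assigned to a gene when it overlaps any exonic base of that gene, two annotations with identical exonic base sets per gene should give identical gene-level counts.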
2.3 Mapping, quantification and normalization of RNA-seq data

Analysis of the RNA-seq data was performed using Bioconductor R packages Rsubread and limma [5, 18, 19]. The human reference genome (GRCh38) from GENCODE (version 34 downloaded from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz) was indexed using the buildindex function in Rsubread v2.2.6 [5]. Sequencing reads were then mapped to the reference genome using the align function in Rsubread [5, 20]. During the alignment, the Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations were also included as an extra parameter to improve alignment.

Gene-level read counts were obtained with featureCounts [4, 5], a read count summarization function within the Rsubread package. The Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations were provided to featureCounts to generate read counts for genes included in these annotations respectively.

The gene-level read counts were transformed using the voom function in limma [18, 21] and then normalized using the library size [22], quantile [23] and trimmed mean of M-values (TMM) [24] methods, respectively, prior to performing further analysis. The library size normalization was performed by providing raw read counts to voom and then running voom with the ‘normalize.method’ parameter set to ‘none’. The quantile normalization was performed by providing raw read counts to voom and then running voom with the ‘normalize.method’ parameter set to ‘quantile’. For TMM normalization, we first calculated the TMM normalization factor for each library using the calcNormFactors method in edgeR [25]. Then we provided raw read counts and the TMM normalization factors to voom and ran it with the ‘normalize.method’ parameter set to ‘none’.
The log2CPM (log2 counts per million) values, produced by the voom function for each gene in each library, were converted to log2FPKM (log2 fragments per kilo exonic bases per million mapped fragments) expression values for further analysis.

2.4 Titration monotonicity

The RNA-seq data from the SEQC project have titration monotonicity built into them, such that a gene is considered to preserve titration monotonicity if the expression of the gene follows A ≥ C ≥ D ≥ B when its expression in sample A is greater than or equal to that in sample B, or follows A ≤ C ≤ D ≤ B when its expression in sample A is less than or equal to that in sample B. To test if the titration monotonicity is preserved, Equation (1) was used to compute the expected log2 fold-change for a gene in the comparison of C vs D given the log2 fold-change between A vs B.

$$E = \log_2\left(\frac{3 \times 2^{x} + 1}{2^{x} + 3}\right) \qquad (1)$$

where E is the expected log2 fold-change for C vs D and x is the log2 fold-change for A vs B. Expression levels of genes in the replicates of the same sample were averaged before fold change of gene expression was calculated between samples.

2.5 Validation

Gene expression data generated using TaqMan RT-PCR and Illumina’s BeadChip microarray were used to validate the gene-level quantification results from the RNA-seq analysis. Pearson correlation coefficients were computed to assess the concordance between the RNA-seq quantification data obtained from using different annotations and the gene expression data obtained from the RT-PCR and microarray experiments.
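The two numerical steps that feed this validation, converting log2CPM to log2FPKM and computing a Pearson correlation, can be sketched as follows. The conversion assumes the usual definition of FPKM, in which CPM is divided by the effective gene length in kilobases; the text does not spell out the exact formula used, so this is an assumption. The function names and all numbers below are made up for illustration.

```python
import math


def log2cpm_to_log2fpkm(log2cpm, effective_length):
    """Convert log2CPM to log2FPKM for a gene of `effective_length` bases.

    Assumes FPKM = CPM / (effective length in kilobases).
    """
    return log2cpm - math.log2(effective_length / 1000.0)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values for four genes: log2CPM, effective length, RT-PCR log2 expression.
log2cpm = [5.0, 7.5, 2.0, 9.0]
lengths = [2000, 4000, 500, 8000]
rtpcr = [3.8, 5.9, 2.7, 6.1]

log2fpkm = [log2cpm_to_log2fpkm(v, n) for v, n in zip(log2cpm, lengths)]
print(log2fpkm)
print(pearson(log2fpkm, rtpcr))
```

In practice the correlation would be computed per sample over the genes matched between the RNA-seq annotation and the RT-PCR or microarray data.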
The genome-wide built-in truth of titration monotonicity of gene expression in the RNA-seq data was also utilized to evaluate the quantification accuracy of RNA-seq data generated using different annotations.

2.6 Access to data and code

The data and analysis code used in this study can be accessed at the following URL: https://github.com/ShiLab-Bioinformatics/GeneAnnotation.

3 RESULTS

3.1 Discrepancy between different gene annotations

The Ensembl and NCBI RefSeq annotations are among the most widely used gene annotations for RNA-seq gene expression quantification. In this study, we downloaded recent Ensembl and RefSeq annotations and also used an older version of the RefSeq annotation to assess the impact of gene annotation choice on the accuracy of RNA-seq expression quantification. The inclusion of an older RefSeq annotation allowed us to investigate the accuracy of new annotation data generated in recent years, during which next-generation sequencing data have been used as a new data source for genome-wide annotation generation.

The Ensembl annotation used in this study was released in April 2020 and its version number is 100. The recent RefSeq annotation included in this study was released in August 2020. We refer to this annotation as ‘RefSeq-NCBI’ in this study. The older RefSeq annotation was released in April 2015, and it has also been included as part of the popular RNA-seq quantification toolkit ‘Rsubread’ for quantifying human RNA-seq data.
As this annotation is no longer available in the NCBI RefSeq database, we instead used the Rsubread RefSeq annotation in our evaluations and we denote this annotation as ‘RefSeq-Rsubread’.

As RNA-seq gene-level expression quantification is typically performed for genes that contain exons [3, 4, 5], in this study we focused only on the genes that have annotated exons in each annotation. Figure 1A shows that, as expected, the Ensembl annotation contains many more exon-containing genes than the two RefSeq annotations. The Ensembl annotation is known to contain a large number of computationally predicted genes whereas RefSeq genes were mainly annotated based on biological evidence. However, it is worth noting that the RefSeq-NCBI annotation still has >14,000 genes that are not included in the Ensembl annotation. Nearly 60% of the Ensembl genes were found to be absent from both RefSeq annotations. In total, 23,424 common genes were found between the three annotations. Most of the genes included in the RefSeq-Rsubread annotation can be found in the RefSeq-NCBI or Ensembl annotations.

We then examined the effective gene lengths in each annotation. The effective length of a gene is the total number of unique bases included in all the exons belonging to the gene. Figure 1B shows the distributions of effective lengths of genes in the three annotations. Around half of the Ensembl genes have an effective length less than 1,000 bases, whereas in the two RefSeq annotations only ∼25% of the genes are shorter than 1,000 bases. The median effective gene lengths in RefSeq-NCBI and RefSeq-Rsubread are ∼3,000 bases, which is much larger than that in Ensembl (∼1,000 bases). Although the Ensembl annotation contains many more genes than the two RefSeq annotations, it also contains a much higher percentage of short genes.

[Figure 1A gene counts: Ensembl only 35,885; RefSeq-NCBI only 10,479; RefSeq-Rsubread only 673; Ensembl and RefSeq-NCBI only 1,354; Ensembl and RefSeq-Rsubread only 20; RefSeq-NCBI and RefSeq-Rsubread only 4,278; all three 23,424.]

Figure 1: Concordance and differences between gene annotations. (A) Venn diagram showing genes that are common or unique in the Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations. (B) Boxplots showing the distribution of effective gene lengths (log2 scale) in each annotation. (C) Boxplots showing the differences in effective lengths of common genes between each pair of annotations. Values shown in the plots are the ratio of effective lengths of the same gene from two different annotations (log2 scale). (D) The size of transcriptome calculated from each annotation. Shown is the sum of effective gene lengths (x10^6) in each annotation.

We further performed gene-wise comparison of effective gene lengths using common genes between each pair of annotations. Although every annotation contains both longer and shorter genes in comparison to the corresponding genes from other annotations, the Ensembl genes were found to have a larger effective length than genes from the two RefSeq annotations overall (Figure 1C).
This is in contrast to the higher proportion of short genes observed in the Ensembl annotation (Figure 1B), which indicates that the Ensembl genes that are also present in the RefSeq-NCBI or RefSeq-Rsubread annotations tend to be longer than those Ensembl genes that can only be found in the Ensembl annotation. Although at least half of the genes were found to have a less than 2-fold length difference (less than 1 on the log2 scale) between annotations (Figure 1C), the length differences could exceed 64-fold (6 on the log2 scale). The RefSeq-NCBI genes seem to be slightly longer than the corresponding RefSeq-Rsubread genes overall. Ensembl and RefSeq-Rsubread were found to be the least concordant of the three annotations being compared.

Lastly, we compared the size of the transcriptome represented by each annotation. The transcriptome size of an annotation is computed as the sum of effective gene lengths from all the genes included in that annotation, which also represents the total number of exonic bases that were annotated in an annotation. Figure 1D shows that the Ensembl annotation has a larger transcriptome size than both the RefSeq-NCBI and RefSeq-Rsubread annotations. This is not surprising because the Ensembl annotation contains more genes and Ensembl genes common to other annotations are also longer in general. RefSeq-Rsubread has a much smaller transcriptome size than RefSeq-NCBI, indicating a significant expansion of the RefSeq-NCBI annotation in the past five years.
However, it is important to note that the RefSeq-Rsubread annotation is not a subset of the RefSeq-NCBI annotation, as demonstrated by the existence of RefSeq-Rsubread genes that are absent from the RefSeq-NCBI annotation, the difference in gene length distribution and the length differences of the same genes between the two annotations (Figure 1A-C). This indicates that not only were new genes added to the RefSeq annotation during the expansion, but existing genes were also modified. It is against this background that we sought to understand how these differences in the annotations affect the overall gene-level quantification results.

3.2 Fragments counted to annotated genes

We used a benchmark RNA-seq dataset generated by the SEQC project [1] to evaluate the impact of gene annotation on the accuracy of RNA-seq expression quantification. This dataset contains paired-end 100bp read data generated for four samples including a Universal Human Reference RNA sample (sample A), a Human Brain Reference RNA sample (sample B), a mixture sample with 75% A and 25% B (sample C) and a mixture sample with 25% A and 75% B (sample D).

We mapped the RNA-seq reads to the human genome GRCh38/hg38 using the Subread aligner [20, 5], and then counted the number of mapped fragments (read pairs) to each gene in each annotation using the featureCounts program [4, 5]. FeatureCounts assigns a mapped fragment to a gene if the fragment overlaps any of the exons in the gene. Figure 2 shows that across all the 16 libraries, the RefSeq-Rsubread annotation consistently has substantially more fragments assigned to it than the Ensembl and RefSeq-NCBI annotations.
This is surprising because RefSeq-Rsubread contains far fewer annotated genes and also has a significantly smaller transcriptome, compared to Ensembl and RefSeq-NCBI (Figure 1A,D). We then performed a detailed investigation into the mapping and counting results to find out what enabled RefSeq-Rsubread to achieve a higher percentage of successfully assigned fragments.

Figure 2: Barplots showing the percentage of fragments successfully assigned to genes in each annotation, out of all the fragments included in each library. The horizontal axis represents the sixteen SEQC RNA-seq libraries generated from the four samples ‘A’, ‘B’, ‘C’ and ‘D’. Each sample has four replicates that are numbered from 1 to 4.

Although gene annotations were utilized in mapping reads to the human reference genome, the use of different annotations was not found to affect the number of successfully aligned fragments for each library (Supplementary Figure S1). We found that when assigning fragments to genes using the Ensembl or RefSeq-NCBI annotation, more fragments could not be assigned because they did not overlap any genes (i.e. failed to overlap any exons included in any genes), even though there are more genes included in these annotations than in the RefSeq-Rsubread annotation (Supplementary Figure S2).
This is particularly the case for the fragment assignment in the human brain reference samples. We also found that the use of the Ensembl and RefSeq-NCBI annotations led to more fragments being unassigned due to assignment ambiguity, i.e. a fragment overlaps more than one gene (Supplementary Figure S3). This is likely because more genes overlap with each other (i.e. exons from different genes overlap with each other) in the Ensembl and RefSeq-NCBI annotations than in the RefSeq-Rsubread annotation. Our investigation revealed that less gene overlap in the RefSeq-Rsubread annotation and better compatibility of this annotation with the mapped fragments led to more fragments being successfully counted for each library in this dataset. Given that both the Universal Human Reference and Human Brain Reference samples used in this study are known to contain a very high number of expressed genes and the RNA-seq data generated from these samples are expected to cover most of the human transcriptome, our analysis suggests that the RefSeq-Rsubread annotation is likely to contain more transcribed regions in the genome than the other two annotations in general.

Figure 3: Boxplots comparing the intensity range of gene expression between the three annotations. All the genes from each annotation were included in the plots. Raw read counts of genes were transformed to log2FPKM values. A prior count of 0.5 was added to raw counts to avoid log-transformation of zero.

3.3 Intensity range of gene expression

We examined whether the gene annotation choice has an impact on the range of gene expression levels in the RNA-seq data. Raw gene counts of the SEQC data were converted to log2FPKM (log2 fragments per kilo exonic bases per million mapped fragments) values for all the genes included in each annotation. A prior count of 0.5 was added to the raw counts to avoid log-transformation of zero. Figure 3 shows that the two RefSeq annotations exhibit a desirably larger intensity range of gene expression than the Ensembl annotation, as shown by the larger boxes in the boxplots. It is surprising to see that the Ensembl genes have the smallest intensity ranges in all the libraries, given that the Ensembl annotation contains the largest number of genes among the three annotations being examined. In addition to the large intensity range, the RefSeq-Rsubread genes were also found to have a markedly higher median expression level than genes in the RefSeq-NCBI and Ensembl annotations.

3.4 Gene annotation discrepancy after expression filtering

As it is common practice to filter out genes that are lowly expressed or completely absent in an RNA-seq data analysis [2], we also set out to assess the differences between alternative annotations after excluding such genes. We excluded those genes that failed to have at least 0.5 CPM (counts per million) in at least four libraries (each sample has four replicates) in the analysis of the SEQC dataset. The expression-filtered data were also used for comparing the accuracy of quantification from using alternative annotations presented in the following sections.

The bar plot in Figure 4A shows that Ensembl has significantly more genes (and a higher proportion of genes) filtered out due to low or no expression, compared to RefSeq-NCBI and RefSeq-Rsubread. After expression filtering, the total numbers of remaining genes from the three annotations became more similar to each other. 16,472 genes were found to be common between the three annotations after filtering, accounting for 69%, 78% and 86% of the filtered genes in the Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations respectively (Figure 4B). Almost all the filtered genes in the RefSeq-Rsubread annotation can be found in the other two annotations.

After expression filtering, the median effective gene length increased to ∼4,000 bases for all annotations (Figure 4C), meaning that a higher proportion of short genes were removed due to low expression in every annotation. The median effective length of Ensembl genes became comparable to, or slightly higher than, those in the two RefSeq annotations, indicating that the Ensembl annotation contained a higher proportion of lowly expressed short genes than the two RefSeq annotations.
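The expression filter used in this section (at least 0.5 CPM in at least four libraries) can be sketched as below. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the counts are made up, and library sizes are passed in explicitly because the text does not state which library-size definition was used for the filter.

```python
def cpm(count, lib_size):
    """Counts per million for one gene in one library."""
    return count * 1e6 / lib_size


def keep_expressed(counts, lib_sizes, min_cpm=0.5, min_libs=4):
    """Return the genes with CPM >= `min_cpm` in at least `min_libs` libraries.

    `counts` maps a gene identifier to its list of raw counts, one per library,
    in the same order as `lib_sizes`.
    """
    kept = []
    for gene, row in counts.items():
        n_pass = sum(1 for c, n in zip(row, lib_sizes) if cpm(c, n) >= min_cpm)
        if n_pass >= min_libs:
            kept.append(gene)
    return kept


# Made-up example: five libraries of 10 million fragments each.
lib_sizes = [10_000_000] * 5
counts = {
    "geneA": [10, 8, 6, 7, 0],   # passes the CPM cutoff in four libraries
    "geneB": [10, 8, 6, 2, 0],   # passes in only three libraries
    "geneC": [0, 0, 0, 0, 0],    # not expressed
}

print(keep_expressed(counts, lib_sizes))
```

Requiring four libraries matches the number of replicates per sample, so a gene expressed in only one of the four samples can still pass the filter.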
When comparing the effective lengths of genes common to all three annotations after filtering, the Ensembl genes were found to have the largest median effective length and the RefSeq-Rsubread genes the smallest (Figure 4D). This is not surprising because the Ensembl annotation is known to be more aggressive than the RefSeq annotations and the RefSeq-Rsubread annotation is an old annotation that has not been updated in the last five years.

The expression filtering did not seem to affect the distribution of differences of effective gene lengths between each pair of annotations (using genes common to each pair of annotations), with Ensembl and RefSeq-Rsubread remaining the least concordant annotations (Figure 4E and Figure 1C). Using genes common to all three annotations after filtering gave distributions of gene length differences between each pair of annotations similar to those obtained using genes common to each pair of annotations (Figure 4F). As before filtering, the gene-wise length comparison performed after filtering also showed that overall the Ensembl genes had the largest gene lengths and the RefSeq-Rsubread genes had the shortest gene lengths.

[Figure 4A gene counts before / after filtering: Ensembl 60,683 / 23,869; RefSeq-NCBI 39,535 / 21,098; RefSeq-Rsubread 28,395 / 19,060. Figure 4B gene counts: Ensembl only 6,837; RefSeq-NCBI only 1,996; RefSeq-Rsubread only 376; Ensembl and RefSeq-NCBI only 489; Ensembl and RefSeq-Rsubread only 71; RefSeq-NCBI and RefSeq-Rsubread only 2,141; all three 16,472.]

Figure 4: Concordance and differences between gene annotations after filtering for lowly expressed genes. (A) Bar plot showing the differences in the number of genes included in each annotation before and after filtering for lowly expressed genes. (B) Venn diagram comparing genes from different annotations after filtering for lowly expressed genes. Distributions of effective gene lengths after filtering are shown for all genes in each annotation (C) and for genes that are common between all three annotations (D). Distributions of differences of effective gene lengths between annotations after filtering are shown for common genes between each pair of annotations (E) and for genes that are common between all three annotations (F).

3.5 Comparison of titration monotonicity preservation

To assess the impact of gene annotation choice on the accuracy of RNA-seq quantification results, we utilized as ground truth the inbuilt titration monotonicity in the SEQC data, the TaqMan RT-PCR data and the microarray data generated for the same samples, to evaluate which annotation gives rise to a better expression correlation of the RNA-seq quantification data with the truth. In this section, we compared the ability of Ensembl and the two RefSeq annotations to retain the inbuilt titration monotonicity in the RNA-seq dataset.

In Figure 5, the reference titration curve depicts the fold change that genes are expected to follow in sample C vs sample D based on the fold change in sample A vs sample B. This is computed using Equation (1) (see MATERIALS AND METHODS). We then calculated the Mean Squared Error (MSE) between the reference titration monotonicity and the titration monotonicity obtained from each annotation. A smaller MSE value means that the generated quantification data are closer to the truth. Figure 5 shows that the MSE computed for the RefSeq-Rsubread annotation is consistently lower than those computed for the Ensembl and RefSeq-NCBI annotations, regardless of whether filtering was applied or whether only common genes were included for comparison. RefSeq-Rsubread was also found to yield comparable or lower MSE compared to the other two annotations when the data were TMM or quantile normalized (Supplementary Figures S4 and S5), in addition to the library-size normalized data shown in Figure 5. These results demonstrated that the use of the RefSeq-Rsubread annotation led to better quantification accuracy for the RNA-seq data.
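The comparison above can be sketched in a few lines. Equation (1) follows from the mixing ratios: with C = 0.75A + 0.25B and D = 0.25A + 0.75B, the ratio C/D equals (3A + B)/(A + 3B), which gives Equation (1) after substituting A/B = 2^x. The MSE below is assumed to be the mean, over genes, of the squared difference between the observed C vs D log2 fold change and the value expected from Equation (1); the text does not give the exact formula, so this is an assumption, and the function names and numbers are hypothetical.

```python
import math


def expected_c_vs_d(x):
    """Equation (1): expected log2 fold change for C vs D given x = log2(A/B)."""
    return math.log2((3 * 2 ** x + 1) / (2 ** x + 3))


def preserves_monotonicity(a, c, d, b):
    """True if expression follows A >= C >= D >= B or A <= C <= D <= B."""
    return (a >= c >= d >= b) or (a <= c <= d <= b)


def titration_mse(ab_log2fc, cd_log2fc):
    """Mean squared deviation of observed C vs D fold changes from Equation (1)."""
    if len(ab_log2fc) != len(cd_log2fc) or not ab_log2fc:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [(cd - expected_c_vs_d(ab)) ** 2
              for ab, cd in zip(ab_log2fc, cd_log2fc)]
    return sum(errors) / len(errors)


# Made-up log2 fold changes for three genes.
ab = [-2.0, 0.0, 3.0]
cd = [-0.9, 0.1, 1.2]

print([round(expected_c_vs_d(x), 3) for x in ab])
print(titration_mse(ab, cd))
print(preserves_monotonicity(8.0, 7.1, 5.2, 4.0))
```

Note that the expected C vs D fold change is bounded by ±log2(3) however large the A vs B fold change is, which is why the reference curve in Figure 5 flattens at both ends.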
Figure 5: Titration monotonicity plots. The ability of Ensembl, RefSeq-NCBI and RefSeq-Rsubread to retain the titration monotonicity built into the SEQC RNA-seq data was measured using the Mean Squared Error (MSE) between the reference titration and the actual titration obtained from each annotation. The red curve in each plot represents the reference titration calculated using Equation (1). Plots in the top row include all the genes available in each annotation. Plots in the middle row include those genes that remained after filtering for lowly expressed genes in each annotation. Plots in the bottom row include genes that are common between the three annotations after the expression filtering was performed. In each plot, the horizontal axis represents the log2 fold changes of gene expression between samples A and B and the vertical axis represents the log2 fold changes of gene expression between samples C and D.

3.6 Validation against TaqMan RT-PCR data

The TaqMan RT-PCR dataset generated in the MAQC study [14, 15] was used to validate the gene-level quantification results from the RNA-seq dataset. This dataset contains measured expression levels for >1,000 genes in the four SEQC samples. The aim was to understand how well Ensembl and RefSeq annotated gene expression correlated with the TaqMan RT-PCR data.
[Figure 6: grid of panels. Columns: all genes after filtering; common genes after filtering. Rows: library size normalization; quantile normalization; TMM normalization. Horizontal axis: samples A, B, C, D. Vertical axis: correlation coefficient. Key: Ensembl, RefSeq-NCBI, RefSeq-Rsubread.]

Figure 6: Validation of RNA-seq against TaqMan RT-PCR dataset. Shown are Pearson correlation coefficients computed from comparing RNA-seq data against RT-PCR data, using the RT-PCR genes matched with each individual annotation (left column) or matched with all three annotations (right column). The rows represent the different RNA-seq normalization methods used. Lowly expressed genes in the RNA-seq data were filtered out before the correlation analysis was performed.

The RNA-seq data generated from each annotation were filtered to remove lowly expressed genes before being compared to the RT-PCR data. Numbers of matched genes between the RT-PCR data and the RNA-seq data were 856, 901 and 901 for Ensembl, RefSeq-NCBI and RefSeq-Rsubread, respectively. 846 RT-PCR genes were found to be common to all three annotations. The raw TaqMan RT-PCR data were log2-transformed before being compared to the filtered RNA-seq data. Pearson correlation analysis of the RNA-seq gene expression (log2FPKM values) and RT-PCR gene expression (log2 values), using the RT-PCR genes matched with each individual annotation, showed that the RefSeq-Rsubread annotation consistently yielded a higher correlation than the Ensembl and RefSeq-NCBI annotations, across all the samples and the three different normalization methods (left panel in Figure 6). The Ensembl annotation was found to produce the worst correlation in all these comparisons. When using the RT-PCR genes matched with all three annotations for comparison, RefSeq-Rsubread was again found to yield the highest correlation (right panel in Figure 6). Ensembl and RefSeq-NCBI were found to produce similar correlation coefficients. Taken together, results from this evaluation showed that the use of the RefSeq-Rsubread annotation led to a better concordance in gene expression between the RNA-seq data and the RT-PCR data, compared to the use of the Ensembl and RefSeq-NCBI annotations.

3.7 Validation against microarray data

An Illumina BeadChip microarray dataset, which was generated by the MAQC-I project [15] for the same samples as in the RNA-seq data used in this study, was used to further validate the gene-level RNA-seq quantification results obtained from different annotations. The microarray dataset was background corrected and normalized using the ‘neqc’ function in the limma package [26, 18]. Microarray genes were then matched to the RNA-seq genes included in the filtered RNA-seq data. 14,405, 14,561 and 14,508 microarray genes were found to be matched with RNA-seq genes from the Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations, respectively. 13,424 microarray genes were found to be present in all three annotations.
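The validation against both the RT-PCR and the microarray data rests on the same two steps: matching genes between the two platforms, then computing a Pearson correlation on log2 expression values. The following is a minimal sketch of those two steps, not the code used in the study. The function names and the dictionary-based input layout (gene identifier mapped to a log2 expression value) are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def correlate_matched(rnaseq_log2, reference_log2):
    """Correlate two platforms over the genes present in both.

    Both arguments are dicts mapping a gene identifier to a log2 expression
    value (hypothetical layout). Returns (correlation, number of matched genes).
    """
    matched = sorted(set(rnaseq_log2) & set(reference_log2))
    r = pearson(
        [rnaseq_log2[g] for g in matched],
        [reference_log2[g] for g in matched],
    )
    return r, len(matched)
```

Restricting the comparison to genes shared by all three annotations, as in the right-hand panels, would amount to intersecting the three gene sets before calling `correlate_matched`.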
For those microarray genes that contain more than one probe, a representative probe was selected for each of them. The representative probe selected for a gene had the highest mean expression value across the four samples among all the probes for that gene. A Pearson correlation analysis was then performed between microarray data and RNA-seq data for each of the three annotations. Both RNA-seq and microarray data include log2 expression values of genes. Figure 7 shows that the use of the RefSeq-Rsubread annotation consistently yielded the highest correlation between RNA-seq and microarray data in all the comparisons, regardless of which RNA-seq normalization method was used and whether all or only common matched genes were included in the evaluation. On the other hand, the use of the Ensembl annotation resulted in the worst correlation between RNA-seq data and microarray data in all the comparisons.

[Figure 7: grid of panels. Columns: all genes after filtering; common genes after filtering. Rows: library size normalization; quantile normalization; TMM normalization. Horizontal axis: samples A, B, C, D. Vertical axis: correlation coefficient. Key: Ensembl, RefSeq-NCBI, RefSeq-Rsubread.]

Figure 7: Validation of RNA-seq quantification results against microarray data.
Shown are Pearson correlation coefficients computed from comparing RNA-seq data against Illumina BeadChip microarray data, using the microarray genes matched with each individual annotation (left column) or matched with all three annotations (right column). Rows in the plots represent the different RNA-seq normalization methods used. Lowly expressed genes in the RNA-seq data were filtered out before the correlation analysis was performed. For those microarray genes that include more than one probe, a representative probe was selected and used for this analysis.

4 DISCUSSION

RNA-seq is now routinely used for genome-wide profiling of gene expression in biomedical research. The analysis of RNA-seq data relies on the accurate annotation of genes so that expression levels of genes can be accurately and reliably quantified. Several major gene annotation sources have been widely adopted in the field, such as the Ensembl and RefSeq annotations. The Ensembl and RefSeq annotations have been well maintained and are under continuous development. In particular, new gene information collected from next-generation sequencing technologies, such as RNA-seq, has been incorporated into the expansion of these annotations in recent years. However, differences between these annotations have raised concerns over the quality and reproducibility of RNA-seq data analyses.
There are particular concerns regarding the accuracy of new gene annotations generated with sequencing technologies, owing to known errors in the generation and analysis of sequencing data. To address these concerns, in this study we systematically assessed the differences in RNA-seq quantification results attributable to gene annotation discrepancies. The annotations evaluated in this study included recent Ensembl and NCBI RefSeq annotations and also an older version of the RefSeq annotation. We compared the recent and old RefSeq annotations to assess the quality of the new annotations that were added when sequencing technology was utilized at NCBI for curating RefSeq gene annotations.

Although the Ensembl annotation contains significantly more genes than both the recent and old RefSeq annotations, it was also found to have a much higher proportion of short genes. Interestingly, we found that a much higher fraction of these short genes in Ensembl were filtered out due to low or no expression in the analysis of the SEQC RNA-seq dataset, compared to the short genes included in the two RefSeq annotations. The SEQC RNA-seq data are a widely used benchmark dataset including the Human Brain Reference RNA and Universal Human Reference RNA samples, in which a very large number of genes are expressed, so that the entire human transcriptome is well covered.

The use of the RefSeq-Rsubread annotation (the older version of the RefSeq annotation used in this study) led to substantially more fragments being successfully counted to genes than the use of the RefSeq-NCBI (the recent RefSeq annotation used in this study) or Ensembl annotations.
A detailed investigation revealed that this was because (a) there is less overlap between genes in the RefSeq-Rsubread annotation, leading to less read assignment ambiguity, and (b) the RefSeq-Rsubread annotation contains more genes that are compatible with mapped fragments, even though the transcriptome represented by this annotation is much smaller than those represented by the RefSeq-NCBI and Ensembl annotations. Moreover, the quantification data obtained using RefSeq-Rsubread exhibited a desirably larger intensity range and a higher median expression level than the quantification data obtained using the other two annotations.

The evaluation of quantification accuracy, using the genome-wide titration monotonicity built into the RNA-seq data, the TaqMan RT-PCR data and the microarray data as ground truth, showed that overall the RefSeq-NCBI annotation yielded better quantification results than the Ensembl annotation. This may not be surprising because the NCBI RefSeq annotation is a traditionally conservative annotation that is known to be highly accurate, as it uses an evidence-based approach to annotate genes. However, we also found that the RefSeq-Rsubread annotation yielded more accurate quantification results than the RefSeq-NCBI annotation in almost all the comparisons, which is very surprising. We suspect that this might be due to annotation errors arising from the sequencing data recently utilized in the NCBI RefSeq annotation generation pipeline.
It was reported that sequencing data, including RNA-seq data and epigenome sequencing data, started to be utilized by NCBI for curating RefSeq gene annotations around 2013 [9, 27]. Between March 2015 and July 2020, the number of gene transcripts in the vertebrate mammalian organisms included in the RefSeq database increased significantly from 3.6 million to 7.8 million (https://www.ncbi.nlm.nih.gov/refseq/statistics/), a more than twofold increase in around 5 years. The use of sequencing data for annotation generation is likely to be a significant driver of this rapid expansion of the RefSeq database. It is known that some errors associated with the generation and analysis of sequencing data are difficult to correct, such as sample contamination, sequencing errors, read mapping errors and read assembly errors. When such errors are carried into the annotation process, they can result in incorrect gene annotations being generated and consequently lead to less accurate quantification of the RNA-seq data.

5 CONCLUSION

In conclusion, our findings from this study revealed that the NCBI RefSeq human gene annotations outperformed the Ensembl human gene annotation in the quantification of RNA-seq data. However, we also raised concerns over the recent changes made to the RefSeq database due to the use of sequencing data in the annotation generation process. These changes need to be reviewed and validated so as to ensure that the RefSeq database continues to be a reliable and high-quality gene annotation resource for the research community. Similarly, such a review should be conducted for other gene annotation databases as well. The research findings from this study also have an implication for the quantification of RNA-seq data generated by the recently emerged single-cell sequencing technologies.
As with bulk RNA-seq data, an accurate gene annotation is required for quantifying single-cell RNA-seq data. It is therefore important to understand if and how the annotation choice impacts the quantification accuracy of single-cell RNA-seq data as well.

References

[1] Zhenqiang Su, Paweł P Łabaj, Sheng Li, Jean Thierry-Mieg, Danielle Thierry-Mieg, Wei Shi, Charles Wang, Gary P Schroth, Robert A Setterquist, John F Thompson, et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology, 32(9):903, 2014.
[2] Yunshun Chen, Aaron TL Lun, and Gordon K Smyth. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research, 5:1438, 2016.
[3] Simon Anders, Paul T Pyl, and Wolfgang Huber. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2):166–9, 2015.
[4] Yang Liao, Gordon K Smyth, and Wei Shi. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923–930, 2014.
[5] Yang Liao, Gordon K Smyth, and Wei Shi. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8):e47–e47, 2019.
[6] Steven L. Salzberg. Next-generation genome annotation: we still struggle to get it right. Genome Biology, 20(1):92, 2019.
[7] Mihaela Pertea, Alaina Shumate, Geo Pertea, Ales Varabyou, Florian P Breitwieser, Yu-Chi Chang, Anil K Madugundu, Akhilesh Pandey, and Steven L Salzberg. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biology, 19(1):208, 2018.
[8] Daniel R Zerbino, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, Carla Cummins, Astrid Gall, Carlos García Girón, et al. Ensembl 2018. Nucleic Acids Research, 46(D1):D754–D761, 2018.
[9] Nuala A O’Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1):D733–D745, 2016.
[10] Adam Frankish, Mark Diekhans, Anne-Maud Ferreira, Rory Johnson, Irwin Jungreis, Jane Loveland, Jonathan M Mudge, Cristina Sisu, James Wright, Joel Armstrong, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research, 47(D1):D766–D773, 2019.
[11] Christopher M Lee, Galt P Barber, Jonathan Casper, Hiram Clawson, Mark Diekhans, Jairo N Gonzalez, Angie S Hinrichs, Brian T Lee, Luis R Nassar, Conner C Powell, Brian J Raney, Kate R Rosenbloom, Daniel Schmelter, Matthew L Speir, Ann S Zweig, David Haussler, Maximilian Haeussler, Robert M Kuhn, and W J Kent. UCSC Genome Browser enters 20th year.
Nucleic Acids Research, 48(D1):D756–D761, 2020.
[12] Po-Yen Wu, John H Phan, and May D Wang. Assessing the impact of human genome annotation choice on RNA-seq expression estimates. BMC Bioinformatics, 14(11):S8, 2013.
[13] Shanrong Zhao and Baohong Zhang. A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics, 16(1):97, 2015.
[14] Leming Shi, Gregory Campbell, Wendell D Jones, Fabien Campagne, Zhining Wen, Stephen J Walker, Zhenqiang Su, Tzu-Ming Chu, Federico M Goodsaid, Lajos Pusztai, et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology, 28(8):827–838, 2010.
[15] MAQC Consortium, Leming Shi, Laura H Reid, Wendell D Jones, Richard Shippy, Janet A Warrington, Shawn C Baker, Patrick J Collins, Francoise de Longueville, Ernest S Kawasaki, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology, 24(9):1151–1161, 2006.
[16] Yang Liao and Wei Shi. seqc: RNA-seq data generated from SEQC (MAQC-III) study, 2020. R package version 1.22.0. http://bioconductor.org/packages/release/data/experiment/html/seqc.html.
[17] Marc Carlson. org.Hs.eg.db: Genome wide annotation for Human, 2020. R package version 3.11.4. https://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html.
[18] Matthew E Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7):e47–e47, 2015.
[19] Wolfgang Huber, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods, 12(2):115, 2015.
[20] Yang Liao, Gordon K Smyth, and Wei Shi. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108–e108, 2013.
[21] Charity W Law, Yunshun Chen, Wei Shi, and Gordon K Smyth. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2):R29, 2014.
[22] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621–8, 2008.
[23] Benjamin M Bolstad, Rafael A Irizarry, Magnus Åstrand, and Terence P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.
[24] Mark D Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3):R25, 2010.
[25] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–40, 2010.
[26] Wei Shi, Alicia Oshlack, and Gordon K Smyth. Optimizing the noise versus bias trade-off for Illumina whole genome expression BeadChips. Nucleic Acids Research, 38(22):e204, 2010.
[27] Kim D Pruitt, Garth R Brown, Susan M Hiatt, Françoise Thibaud-Nissen, Alexander Astashyn, Olga Ermolaeva, Catherine M Farrell, Jennifer Hart, Melissa J Landrum, Kelly M McGarvey, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Research, 42(Database issue):D756–D763, 2014.

10_1101-2021_01_07_425801 ---- Fibrinolysis influences SARS-CoV-2 infection in ciliated cells

Fibrinolysis influences SARS-CoV-2 infection in ciliated cells

Yapeng Hou1, Yan Ding1, Hongguang Nie1,*, Hong-Long Ji2

1Department of Stem Cells and Regenerative Medicine, College of Basic Medical Science, China Medical University, Shenyang, Liaoning 110122, China. 2Department of Cellular and Molecular Biology, University of Texas Health Science Center at Tyler, Tyler, TX 75708, USA.

*Address correspondence to hgnie@cmu.edu.cn

The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. This version posted January 8, 2021.
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.07.425801

Abstract

Rapid spread of COVID-19 has caused an unprecedented pandemic worldwide, and an inserted furin site in the SARS-CoV-2 spike protein (S) may account for increased transmissibility. Plasmin, and other host proteases, may cleave the furin site of the SARS-CoV-2 S protein and the γ subunits of epithelial sodium channels (γENaC), resulting in an increment in virus infectivity and channel activity. Given the importance of ENaC in the regulation of airway surface and alveolar fluid homeostasis, whether SARS-CoV-2 shares and strengthens the cleavage network with ENaC proteins at the single-cell level urgently merits consideration. To address this issue, we analyzed single-cell RNA sequence (scRNA-seq) datasets, and found that PLAU (encoding urokinase plasminogen activator), SCNN1G (γENaC), and ACE2 (SARS-CoV-2 receptor) were co-expressed in alveolar epithelial, basal, club, and ciliated epithelial cells. Analysis with the Seurat and DESeq2 R packages showed that the relative expression levels of PLAU, TMPRSS2, and ACE2 were significantly upregulated in severe COVID-19 patients and SARS-CoV-2 infected cell lines. Moreover, the increments in PLAU, FURIN, TMPRSS2, and ACE2 were predominantly observed in different epithelial cells and leukocytes. Accordingly, SARS-CoV-2 may share and strengthen the ENaC fibrinolytic protease network in ACE2 positive airway and alveolar epithelial cells, which may expedite virus infusion into the susceptible cells and bring about an ENaC associated edematous respiratory condition.

Keywords: SARS-CoV-2; plasmin; ENaC; COVID-19; furin
Introduction

SARS-CoV-2 infection leads to COVID-19, with pathogenesis and clinical features similar to those of SARS, and SARS-CoV-2 shares the same receptor, angiotensin-converting enzyme 2 (ACE2), with SARS-CoV to enter host cells (Zhou et al. 2020, Li and Zheng 2020). By comparison, the transmission ability of SARS-CoV-2 is much stronger than that of SARS-CoV, owing to its different affinity for ACE2 (Wrapp and Wang 2020). The fusion capacity of a coronavirus via the spike protein (S protein) determines infectivity (Wrapp and Wang 2020, Kam et al. 2009b). Highly virulent avian and human influenza viruses bearing a furin site (RxxR) in the haemagglutinin have been described (Coutard et al. 2020). Cleavage of the furin site enhances the ability of Ebola, HIV, and influenza viruses to enter host cells (Claas et al. 1998). Consisting of receptor-binding (S1) and fusion (S2) domains, the coronavirus S protein needs to be primed through cleavage at the S1/S2 site and the S2’ site for membrane fusion (Jaimes et al. 2020, Huggins 2020). The newly inserted furin site in the SARS-CoV-2 S protein significantly facilitated membrane fusion, leading to enhanced virulence and infectivity (Xia et al. 2020, Wang, Qiu, et al. 2020).

Plasmin, which is upregulated in populations vulnerable to COVID-19 (Ji et al. 2020), cleaves the furin site in the SARS-CoV S protein (Kam et al. 2009b). However, whether plasmin cleaves the newly inserted furin site in the SARS-CoV-2 S protein remains obscure. Plasmin cleaves the furin site of the human γ subunit of epithelial sodium channels (ENaC), as demonstrated by LC-MS and functional assays (Zhao, Ali, and Nie 2020, Sheng et al. 2006).
Very recently, it has been proposed that the global pandemic of COVID-19 may partially be driven by the targeted mimicry of the ENaC α subunit by SARS-CoV-2 (Gentzsch and Rossier 2020, Muhanna et al. 2020). ENaC are located at the apical side of airway and alveolar cells, acting as a critical system to maintain airway surface and alveolar fluid homeostasis (Ji et al. 2006, Matalon, Bartoszewski, and Collawn 2015). The luminal fluid is required for keeping normal ciliary beating to expel inhaled pathogens, allergens, and pollutants and for migration of immune cells that release pro-inflammatory cytokines and chemokines (Hou et al. 2019a). The plasmin family and ACE2 are expressed in the respiratory epithelium (Nie et al. 2009, Hanukoglu and Hanukoglu 2016, Kam et al. 2009a). However, whether the plasmin system and ENaC are involved in the fusion of SARS-CoV-2 into host cells is unknown.

This study aims to determine whether PLAU, SCNN1G, and ACE2 are co-expressed in airway and lung epithelial cells and whether SARS-CoV-2 infection alters their expression at the single-cell level. We found that these genes, especially PLAU, were significantly upregulated in epithelial cells of patients with severe or moderate COVID-19 and in SARS-CoV-2 infected cell lines, mainly owing to ciliated cells. We conclude that the cells most susceptible to SARS-CoV-2 infection could be the ones co-expressing these genes and sharing plasmin-mediated cleavage.
Results

Furin sites are identified in both virus and host ENaC proteins

A furin site was located in the S protein of SARS-CoV-2 from Arginine-683 to Serine-687 (RRAR|S), and a similar site was also seen in the S proteins of the HCoV-OC43, MERS, and HCoV-HKU1 coronaviruses (Fig. 1A). In addition, the highly conserved RxxR motif was present in the hemagglutinin protein of influenza H3N2, and in Herpes, Ebola, HIV, Dengue, hepatitis B, West Nile, Marburg, Zika, Epstein-Barr, and respiratory syncytial virus (RSV). The furin site (RKRR|E) was found in the gating relief of inhibition by proteolysis (GRIP) domain of the extracellular loop of the mouse, rat, and human ENaC (Fig. 1B). The similarity of these furin sites is 40-80%.

Respiratory cells co-express PLAU, SCNN1G, and ACE2

To identify subpopulations of cells co-expressing PLAU, SCNN1G, and ACE2, we analyzed 11 scRNA-seq datasets using the nferX scRNA-seq platform (https://academia.nferx.com/) (Supplementary Table 1). All three genes were co-expressed in the following cells, ranked by the expression level of PLAU from high to low: club cells, goblet cells, basal cells, AT1 cells, ciliated cells, fibroblasts, mucous cells, deuterosomal cells, and AT2 cells (Fig. 1C), which is supported by previous studies (Sungnak et al. 2020, Wang et al. 2008, Hanukoglu and Hanukoglu 2016). These results suggest that cell populations co-expressing PLAU-ENaC-ACE2 may be more susceptible to SARS-CoV-2 infection than others. In addition, the top ten ranked cell sub-populations expressing PLAU, SCNN1G, or ACE2 alone are listed in Supplementary Table 2. To compare the transcripts of the proteases in different lung epithelial cells, we analyzed the lung dataset from Gene Expression Omnibus (GEO) with Seurat, and the cells were annotated by their specific markers (Supplementary Fig.
1A). The data showed that all these proteases were expressed in AT2 cells, including PLAU, FURIN, PRSS3 (Trypsin), ELANE (Elastase), PRTN3 (Myeloblastin), CELA1 (Elastase-1), CELA2A (Elastase-2A), CTRC (Chymotrypsin-C), TMPRSS4 (Transmembrane protease serine 4), and TMPRSS2 (Transmembrane protease serine 2) (Supplementary Fig. 1B). In AT2 cells, the protease expression levels ranked as follows: TMPRSS2 > FURIN > TMPRSS4 > PLAU > CELA1 > ELANE > PRSS3 > PRTN3 > CTRC > CELA2A. For PLAU, the order from high to low was Basal > Club > Ciliated > AT1 > AT2.

The expression levels of proteases (PLAU, FURIN, TMPRSS2, PLG), ACE2, and SCNN1G in 11 cell types co-expressing ACE2, SCNN1G, and PLAU are compared in Fig. 2. Club cells showed the highest expression level of PLAU, and ACE2, SCNN1G, TMPRSS2, FURIN, and PLG were also expressed at higher levels in club cells than in other cell types. Of note, ciliated cells ranked second for PLAU expression and seventh for ACE2 expression.

Expression levels of PLAU, SCNN1G, and ACE2 in SARS-CoV-2 infection

To detect potential changes in the cell populations that co-express PLAU, SCNN1G, and ACE2, we analyzed the scRNA-seq datasets of bronchoalveolar lavage fluid (BALF) cells, which are mainly composed of epithelial cells and leukocytes. Three groups were studied: 4 healthy controls, 3 patients with moderate COVID-19, and 6 patients with severe COVID-19.
The expression level and the percentage of total cells expressing PLAU and FURIN were significantly upregulated in the severe group compared with controls (P < 0.001), and the expression levels of ACE2, TMPRSS2, SCNN1G, and PLG were also slightly upregulated (Fig. 3A and B).

The expression levels of PLAU, FURIN, TMPRSS2, and ACE2 and the number of cells expressing them are profiled in Fig. 4A. The data showed that these genes were upregulated in COVID-19 patients, and the number of cells expressing these upregulated genes increased in a largely severity-dependent manner. PLAU was significantly elevated in the severe group (P < 0.001), and the other genes also showed an increasing trend (Fig. 4B).

The increases in PLAU (alveolar epithelial cells, basal, and ciliated cells), PLG (basal cells), FURIN (alveolar epithelial cells, basal, and ciliated cells), TMPRSS2 (basal and ciliated cells), SCNN1G (alveolar epithelial cells and basal cells), and ACE2 (alveolar epithelial cells, basal, and club cells) were predominantly observed in different cell types. In particular, a significant increase in PLAU expression was seen in ciliated cells, while the expression of the measured genes declined in goblet cells from COVID-19 patients (Fig. 4C). In addition, similar changes of these genes in leukocytes are shown in Supplementary Fig. 2.

To corroborate the results in COVID-19 patients, we analyzed bulk-seq data of 3 human respiratory epithelial cell lines infected with SARS-CoV-2: A549, Calu-3, and NHBE (Blanco-Melo et al. 2020). The PLAU transcript was significantly upregulated in all three cell lines after SARS-CoV-2 infection (multiplicity of infection = 2) (Fig. 5, P < 0.001). However, TMPRSS2 was only upregulated in infected Calu-3 cells (P < 0.001), consistent with a recent study (Xu et al. 2020). Similar to SARS and MERS infections, SARS-CoV-2 infection also increased the expression level of ACE2 in A549 cells (P < 0.05) (Smith et al. 2020).
Although SARS-CoV-2 did not significantly change the mRNA level of SCNN1G in these cell lines, as reported for influenza virus, more attention should be paid to the post-translational modification of ENaC (Hou et al. 2019b).

Discussion

The novel coronavirus, SARS-CoV-2, was identified as the causative agent for a series of atypical respiratory diseases, and the disease termed COVID-19 was officially declared a pandemic by the World Health Organization on March 11, 2020 (Pollard, Morran, and Nestor-Kalinoski 2020). SARS-CoV-2 has had a great impact on human health worldwide, and its virulence and pathogenicity may be related to the inserted furin site. Whilst the SARS-CoV-2 S2' cleavage site has a similar sequence motif to SARS-CoV and would thus be suitable for cleavage by trypsin-like proteases, insertions of additional arginine residues at the SARS-CoV-2 S1/S2 site (RRAR|S) clearly generate a furin cleavage site (Zhou et al. 2020). Interestingly, this difference has been implicated in the viral transmissibility of SARS-CoV-2 (Anand et al. 2020). Our data support the finding that furin sites (RRAR|S) exist not only in human viruses but also in the -subunit of ENaC, which is highly expressed in alveolar epithelial cells and is a substrate for cleavage by plasmin.

Plasmin has also been reported to cleave furin sites in viral envelope proteins, enhancing the virulence and pathogenicity of viruses (Sidarta-Oliveira et al. 2020).
SARS-CoV-2 has evolved a unique S1/S2 cleavage site, absent in any previously sequenced coronavirus, resulting in the striking mimicry of an identical furin-cleavable peptide on αENaC, a protein critical for the homeostasis of airway surface liquid (Anand et al. 2020). Together, these observations indicate that SARS-CoV-2 infection hijacks the ENaC proteolytic network, which is associated with the edematous respiratory condition (Fig. 6) (Chen et al. 2014, Zhao, Ali, and Nie 2020). Our data showed that the respiratory cells co-expressing the SARS-CoV-2 receptor, ENaC (SCNN1G), and plasmin family members were mainly alveolar type I/II, basal, club, and ciliated cells. PLG (plasminogen) expression in different cell types is not shown because it is too low to be detected in many lung scRNA-seq datasets. Of note, ciliated cells are the predominant contributors to PLAU upregulation in severe COVID-19 patients. As expected, PLAU and TMPRSS2 levels are upregulated in respiratory epithelial cell lines after SARS-CoV-2 infection, supporting the idea that SARS-CoV-2 can facilitate ACE2-mediated viral entry via TMPRSS2 spike glycoprotein priming (Roberts et al. 2020). Enhanced PLAU expression induced by SARS-CoV-2 infection will activate plasminogen, which may facilitate SARS-CoV-2 invasion by cleaving the S protein.

The scRNA-seq data of bronchoalveolar lavage fluid cells from COVID-19 patients do not show a difference in the expression of SCNN1G (ENaC), which is considered to be regulated by plasmin through proteolytic hydrolysis. ENaC activity is determined not only by mRNA/protein expression but also by cell proteases. Once ENaC is biosynthesized and trafficked to the Golgi, it is likely to be modified by an intracellular protease (furin).
After insertion into the plasma membrane, ENaC can be fully proteolytically activated by extracellular proteases (elastase, plasmin, chymotrypsin, and trypsin) (Thibodeau and Butterworth 2013). Intriguingly, the PLG gene also did not show a difference between COVID-19 patients and healthy controls, indicating that hyperfibrinolysis in COVID-19 patients may be induced by enhanced urokinase (Ji et al. 2020). Additional analysis of clinical studies or animal models is urgently needed to further explore the relationship between plasmin, ENaC, and SARS-CoV-2 receptors at the protein level.

An increased incidence of thrombotic events has been reported in COVID-19, and tissue plasminogen activator (tPA) has been tried to treat stroke in COVID-19 patients (Vinayagam and Sattu 2020). We did not analyze the changes of PLAT in BALF cells of COVID-19 patients because tPA (PLAT) is generally expressed in endothelial cells. Similarly, the beneficial effects of plasmin on alveolar fluid clearance and novel mechanisms underlying the cleavage of human ENaCs at multiple sites by plasmin have been described in our recent study (Zhao, Ali, and Nie 2020). New drugs that regulate the uPA/uPA receptor (uPAR) system have been demonstrated to help treat the severe complications of pandemic COVID-19 (D'Alonzo, De Fenza, and Pavone 2020).
Amiloride, a prototypic inhibitor of ENaC, could be an ideal candidate for treating COVID-19 patients, supporting the view that ENaC is a downstream target of plasmin and is involved in luminal fluid absorption in SARS-CoV-2 infection (Adil, Narayanan, and Somanath 2020). Two diametrically opposed therapeutic regimens are in practice to address the complicated coagulopathic changes in COVID-19: fibrinolytic therapies (alteplase, tPA) (Bona et al. 2020, Ly et al. 2020, Wang, Hajizadeh, et al. 2020, Barrett et al. 2020, Christie et al. 2020, Papamichalis et al. 2020, Poor et al. 2020, Arachchillage et al. 2020) and antifibrinolytic therapies (nafamostat and tranexamic acid) (Asakura and Ogawa 2020, Doi et al. 2020, Thierry 2020). In this context, our data provide new and comprehensive information on fibrinolysis-related therapy targeting plasmin(ogen) as a promising approach to combat COVID-19.

Methods

Alignment of furin sites in viral and ENaC proteins

The sequences of ENaC proteins (rat, mouse, and human) and human viruses were acquired from UniProt (https://www.uniprot.org/). The accession numbers were P0DTC2 (for SARS-CoV-2), P04578 (HIV), P03435 (H3N2), A0A3G2XEB3 (Ebola), A0A140AYZ5 (MERS), P03188 (Epstein-Barr), P04488 (Herpes), P17763 (Dengue), P26662 (hepatitis), Q9Q6P4 (West Nile), A0A024B7W1 (Zika), P03420 (respiratory syncytial virus), P35253 (Marburg), P36334 (HCoV-OC43), A0A140H1H1 (HCoV-HKU1), P51170 (human ENaC), Q9WU39 (mouse ENaC), and P37091 (rat ENaC). Alignment was performed using the JalView software (Version: 2.11.1.0). The 3D structures of SARS-CoV-2 S (PDB ID: 6X2A) and ENaC (PDB ID: 6BQN) were downloaded from the Protein Data Bank (http://www.rcsb.org/) and modified.
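As a rough illustration of the motif search behind this comparison, the sketch below scans a protein sequence for the minimal RxxR motif mentioned in the Results and reports the residue that follows each match. It is not the JalView alignment used in this study. The function name `find_rxxr_sites` and the short example peptides are hypothetical, chosen only so that they contain the RRAR|S and RKRR|E sites quoted in the text.

```python
import re

# Minimal furin-like motif from the Results: R-x-x-R, where x is any residue.
# A lookahead is used so that overlapping matches are all reported.
RXXR = re.compile(r"(?=(R[A-Z]{2}R))")


def find_rxxr_sites(sequence):
    """Return (start, motif, next_residue) for every RxxR match.

    start is 1-based; next_residue is the residue right after the motif,
    or "" when the motif ends the sequence.
    """
    seq = sequence.upper()
    sites = []
    for match in RXXR.finditer(seq):
        start = match.start()  # 0-based index of the first arginine
        motif = match.group(1)
        next_residue = seq[start + 4:start + 5]
        sites.append((start + 1, motif, next_residue))
    return sites


# Hypothetical toy peptides (not full UniProt entries), built around the
# two cleavage sites quoted in the text.
toy_spike = "NSPRRARSVAS"  # contains RRAR|S
toy_enac = "TGRKRREAGS"    # contains RKRR|E

for name, peptide in [("spike-like", toy_spike), ("ENaC-like", toy_enac)]:
    for start, motif, nxt in find_rxxr_sites(peptide):
        print(f"{name}: {motif}|{nxt} at position {start}")
```

A regular-expression scan of this kind only flags candidate sites; judging similarity between sites, as reported in the Results, still requires an alignment.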
Co-expression profiles of ENaC, ACE2, and proteases

We performed systematic expression profiling of ACE2 and ENaC across 11 published human single-cell RNA sequencing (scRNA-seq) studies comprising ~0.4 million cells using the nferX Single-Cell platform (https://academia.nferx.com/) (Anand et al. 2020). The mean expression of PLAU, SCNN1G, and ACE2 in a given cell population (mean CP10k) was Z-score normalized (so that the standard deviation = 1 and the mean ~ 0 for all genes) to obtain relative expression profiles across all samples. The expression of PLAU, SCNN1G, and ACE2 in the respiratory system was analyzed and graphed as heatmaps using the R package pheatmap.

Acquisition, filtering, and processing of scRNA-seq data

The datasets downloaded from the Gene Expression Omnibus were filtered for integration. The lung scRNA-seq dataset (8 healthy controls in GSE122960) was filtered by total number of reads (nreads > 1,000), number of detected genes (50 < ngenes < 7,500), and mitochondrial percentage (mito.pc < 0.2). The BALF scRNA-seq dataset was composed of 3 healthy controls, 3 moderate and 6 severe COVID-19 patients in GSE145926, and 1 healthy control in GSM3660650. These datasets were filtered by total number of reads (nreads > 1,000), number of detected genes (20 < ngenes < 6,000), and mitochondrial percentage (mito.pc < 0.1). Finally, a filtered gene-barcode matrix of all samples was integrated with Seurat v3 to remove batch effects across different donors as described previously (Stuart et al. 2019).
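The per-cell quality filters described above can be sketched as follows. This is a minimal illustration in plain Python rather than the Seurat workflow actually used. The names `QC_THRESHOLDS`, `passes_qc`, and `filter_cells` and the toy cell metrics are hypothetical, and mito.pc is assumed to be a fraction between 0 and 1.

```python
# Thresholds quoted in the Methods; the dictionary layout is illustrative.
QC_THRESHOLDS = {
    "lung": {"min_reads": 1000, "min_genes": 50, "max_genes": 7500, "max_mito": 0.2},
    "balf": {"min_reads": 1000, "min_genes": 20, "max_genes": 6000, "max_mito": 0.1},
}


def passes_qc(n_reads, n_genes, mito_frac, dataset):
    """Return True if a cell passes all three filters (strict inequalities)."""
    t = QC_THRESHOLDS[dataset]
    return (
        n_reads > t["min_reads"]
        and t["min_genes"] < n_genes < t["max_genes"]
        and mito_frac < t["max_mito"]
    )


def filter_cells(cells, dataset):
    """Return the barcodes of the cells that pass QC for the given dataset."""
    return [
        c["barcode"]
        for c in cells
        if passes_qc(c["n_reads"], c["n_genes"], c["mito_frac"], dataset)
    ]


# Toy per-cell metrics, made up for illustration only.
toy_cells = [
    {"barcode": "cell_1", "n_reads": 500, "n_genes": 300, "mito_frac": 0.05},
    {"barcode": "cell_2", "n_reads": 2500, "n_genes": 900, "mito_frac": 0.15},
    {"barcode": "cell_3", "n_reads": 40000, "n_genes": 6500, "mito_frac": 0.02},
    {"barcode": "cell_4", "n_reads": 12000, "n_genes": 3000, "mito_frac": 0.30},
    {"barcode": "cell_5", "n_reads": 3000, "n_genes": 1500, "mito_frac": 0.04},
]

print(filter_cells(toy_cells, "lung"))
print(filter_cells(toy_cells, "balf"))
```

Keeping the thresholds in one table makes it easy to see that the BALF filters are stricter on gene count and mitochondrial fraction than the lung filters.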
Dimensionality reduction and clustering

The filtered gene-barcode matrix was first normalized using the 'LogNormalize' method in Seurat v.3 with default parameters. The top 2,000 variable genes were then identified using the 'vst' method in the Seurat FindVariableFeatures function. Principal Component Analysis (PCA) was performed using the top 2,000 variable genes. Then Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) or t-Distributed Stochastic Neighbor Embedding (tSNE) was performed on the top 50 principal components to visualize the epithelial cells. Meanwhile, graph-based clustering was performed on the PCA-reduced data with Seurat v.3. The resolution was set to 0.6 and 0.15 for the lung and BALF datasets, respectively, to obtain a finer result. The markers used for BALF cell annotation are shown in the bubble plot in Supplementary Fig. 3.

Differential gene expression analysis

Differential gene expression in BALF cells among the healthy, moderate, and severe groups was assessed using the Wilcoxon rank-sum test in Seurat v.3 (FindMarkers function). Then, we divided BALF cells into epithelial cells and leukocytes and compared gene expression levels among their subgroups. Both epithelial cells and leukocytes were re-clustered to detect the differences in gene expression of all cell types between healthy controls and severe/moderate COVID-19 patients. Bulk-seq data (GSE147507) were analyzed for differentially expressed genes in respiratory epithelial cell lines using DESeq2 with the Wald test and Benjamini-Hochberg correction (Blanco-Melo et al. 2020, Love, Huber, and Anders 2014). Differences were considered significant at P < 0.05.

Acknowledgment
This study was supported by NSFC 81670010, NIH grants HL87017, HL095435, and HL134828, and AHA Awards AHA14GRNT20130034 and AHA16GRNT30780002. We are grateful to Yunlai Zhou (Yangzhou University) and Congxi Zhang (Gene denovo) for their assistance with bioinformatics.

Conflict of interest

The authors declare no conflicts of interest.

References

Adil, M. S., S. P. Narayanan, and P. R. Somanath. 2020. "Is amiloride a promising cardiovascular medication to persist in the COVID-19 crisis?" Drug Discov Ther no. 14 (5):256-258. doi: 10.5582/ddt.2020.03070.
Anand, P., A. Puranik, M. Aravamudan, and A. J. Venkatakrishnan. 2020. "SARS-CoV-2 strategically mimics proteolytic activation of human ENaC." Elife no. 9:e58603. doi: 10.7554/eLife.58603.
Arachchillage, D. J., A. Stacey, F. Akor, M. Scotz, and M. Laffan. 2020. "Thrombolysis restores perfusion in COVID-19 hypoxia." no. 190 (5):e270-e274. doi: 10.1111/bjh.17050.
Asakura, H., and H. Ogawa. 2020. "Potential of heparin and nafamostat combination therapy for COVID-19." J Thromb Haemost no. 18 (6):1521-1522. doi: 10.1111/jth.14858.
Barrett, C. D., A. Oren-Grinberg, E. Chao, A. H. Moraco, M. J. Martin, S. H. Reddy, A. M. Ilg, R. Jhunjhunwala, M. Uribe, H. B. Moore, E. E. Moore, E. N. Baedorf-Kassis, M. L. Krajewski, D. S. Talmor, S. Shaefi, and M. B. Yaffe. 2020. "Rescue therapy for severe COVID-19-associated acute respiratory distress syndrome with tissue plasminogen activator: A case series." J Trauma Acute Care Surg no. 89 (3):453-457.
doi: 10.1097/ta.0000000000002786.
Blanco-Melo, D., B. E. Nilsson-Payant, W. C. Liu, S. Uhl, D. Hoagland, R. Moller, T. X. Jordan, K. Oishi, M. Panis, D. Sachs, T. T. Wang, R. E. Schwartz, J. K. Lim, R. A. Albrecht, and B. R. tenOever. 2020. "Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19." Cell no. 181 (5):1036-1045 e9. doi: 10.1016/j.cell.2020.04.026.
Bona, R. D., A. Valbusa, G. Malfa, D. R. Giacobbe, P. Ameri, N. Patroniti, C. Robba, V. Gilad, A. Insorsi, M. Bassetti, P. Pelosi, and I. Porto. 2020. "Systemic fibrinolysis for acute pulmonary embolism complicating acute respiratory distress syndrome in severe COVID-19: a case series." Eur Heart J Cardiovasc Pharmacother. doi: 10.1093/ehjcvp/pvaa087.
Chen, Z., R. Zhao, M. Zhao, X. Liang, D. Bhattarai, R. Dhiman, S. Shetty, S. Idell, and H. L. Ji. 2014. "Regulation of epithelial sodium channels in urokinase plasminogen activator deficiency." Am J Physiol Lung Cell Mol Physiol no. 307 (8):L609-17. doi: 10.1152/ajplung.00126.2014.
Christie, D. B., 3rd, H. M. Nemec, A. M. Scott, J. T. Buchanan, C. M. Franklin, A. Ahmed, M. S. Khan, C. W. Callender, E. A. James, A. B. Christie, and D. W. Ashley. 2020. "Early outcomes with utilization of tissue plasminogen activator in COVID-19-associated respiratory distress: A series of five cases." J Trauma Acute Care Surg no. 89 (3):448-452. doi: 10.1097/ta.0000000000002787.
Claas, E. C., A. D. Osterhaus, R. van Beek, J. C. De Jong, G. F. Rimmelzwaan, D. A. Senne, S. Krauss, K. F. Shortridge, and R. G. Webster. 1998. "Human influenza A H5N1 virus related to a highly pathogenic avian influenza virus." Lancet no. 351 (9101):472-7. doi: 10.1016/s0140-6736(97)11212-0.
Coutard, B., C. Valle, X. de Lamballerie, B. Canard, N. G. Seidah, and E. Decroly. 2020. "The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade."
Antiviral Res no. 176:104742. doi: 10.1016/j.antiviral.2020.104742.
D'Alonzo, D., M. De Fenza, and V. Pavone. 2020. "COVID-19 and pneumonia: a role for the uPA/uPAR system." Drug Discov Today no. 25 (8):1528-1534. doi: 10.1016/j.drudis.2020.06.013.
Doi, K., M. Ikeda, N. Hayase, K. Moriya, and N. Morimura. 2020. "Nafamostat mesylate treatment in combination with favipiravir for patients critically ill with Covid-19: a case series." Crit Care no. 24 (1):392. doi: 10.1186/s13054-020-03078-z.
Gentzsch, M., and B. C. Rossier. 2020. "A Pathophysiological Model for COVID-19: Critical Importance of Transepithelial Sodium Transport upon Airway Infection." Function (Oxf) no. 1 (2):zqaa024. doi: 10.1093/function/zqaa024.
Hanukoglu, I., and A. Hanukoglu. 2016. "Epithelial sodium channel (ENaC) family: Phylogeny, structure-function, tissue distribution, and associated inherited diseases." Gene no. 579 (2):95-132. doi: 10.1016/j.gene.2015.12.061.
Hou, Y., Y. Cui, Z. Zhou, H. Liu, H. Zhang, Y. Ding, H. Nie, and H. L. Ji. 2019a. "Upregulation of the WNK4 Signaling Pathway Inhibits Epithelial Sodium Channels of Mouse Tracheal Epithelial Cells After Influenza A Infection." Front Pharmacol no. 10:12. doi: 10.3389/fphar.2019.00012.
Hou, Yapeng, Yong Cui, Zhiyu Zhou, Hongfei Liu, Honglei Zhang, Yan Ding, Hongguang Nie, and Hong-Long Ji. 2019b. "Upregulation of the WNK4 Signaling Pathway Inhibits Epithelial Sodium Channels of Mouse Tracheal Epithelial Cells After Influenza A Infection." Frontiers in pharmacology no. 10:12. doi: 10.3389/fphar.2019.00012.
Huggins, D. J. 2020. "Structural analysis of experimental drugs binding to the SARS-CoV-2 target TMPRSS2." J Mol Graph Model no. 100:107710. doi: 10.1016/j.jmgm.2020.107710.
Jaimes, J. A., N. M. Andre, J. S. Chappie, J. K. Millet, and G. R. Whittaker. 2020. "Phylogenetic Analysis and Structural Modeling of SARS-CoV-2 Spike Protein Reveals an Evolutionary Distinct and Proteolytically Sensitive Activation Loop." J Mol Biol no. 432 (10):3309-3325. doi: 10.1016/j.jmb.2020.04.009.
Ji, H. L., X. F. Su, S. Kedar, J. Li, P. Barbry, P. R. Smith, S. Matalon, and D. J. Benos. 2006. "Delta-subunit confers novel biophysical features to alpha beta gamma-human epithelial sodium channel (ENaC) via a physical interaction." J Biol Chem no. 281 (12):8233-41. doi: 10.1074/jbc.M512293200.
Ji, H. L., R. Zhao, S. Matalon, and M. A. Matthay. 2020. "Elevated Plasmin(ogen) as a Common Risk Factor for COVID-19 Susceptibility." Physiol Rev no. 100 (3):1065-1075. doi: 10.1152/physrev.00013.2020.
Kam, Y. W., Y. Okumura, H. Kido, L. F. Ng, R. Bruzzone, and R. Altmeyer. 2009a. "Cleavage of the SARS coronavirus spike glycoprotein by airway proteases enhances virus entry into human bronchial epithelial cells in vitro." PLoS One no. 4 (11):e7870. doi: 10.1371/journal.pone.0007870.
Kam, Yiu-Wing, Yuushi Okumura, Hiroshi Kido, Lisa F. P. Ng, Roberto Bruzzone, and Ralf Altmeyer. 2009b. "Cleavage of the SARS coronavirus spike glycoprotein by airway proteases enhances virus entry into human bronchial epithelial cells in vitro." PloS one no. 4 (11):e7870-e7870. doi: 10.1371/journal.pone.0007870.
Li, T., and Q. Zheng. 2020. "SARS-CoV-2 spike produced in insect cells elicits high neutralization titres in non-human primates." no. 9 (1):2076-2090. doi: 10.1080/22221751.2020.1821583.
Love, M. I., W. Huber, and S. Anders. 2014. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome Biol no. 15 (12):550. doi: 10.1186/s13059-014-0550-8.
Ly, A., C.
Alessandri, E. Skripkina, A. Meffert, S. Clariot, Q. de Roux, O. Langeron, and N. Mongardon. 2020. "Rescue fibrinolysis in suspected massive pulmonary embolism during SARS-CoV-2 pandemic." Resuscitation no. 152:86-88. doi: 10.1016/j.resuscitation.2020.05.020.
Matalon, S., R. Bartoszewski, and J. F. Collawn. 2015. "Role of epithelial sodium channels in the regulation of lung fluid homeostasis." Am J Physiol Lung Cell Mol Physiol no. 309 (11):L1229-38. doi: 10.1152/ajplung.00319.2015.
Muhanna, D., S. R. Arnipalli, S. B. Kumar, and O. Ziouzenkova. 2020. "Osmotic Adaptation by Na(+)-Dependent Transporters and ACE2: Correlation with Hemostatic Crisis in COVID-19." no. 8 (11). doi: 10.3390/biomedicines8110460.
Nie, H. G., T. Tucker, X. F. Su, T. Na, J. B. Peng, P. R. Smith, S. Idell, and H. L. Ji. 2009. "Expression and regulation of epithelial Na+ channels by nucleotides in pleural mesothelial cells." Am J Respir Cell Mol Biol no. 40 (5):543-54.
Papamichalis, P., A. Papadogoulas, P. Katsiafylloudis, A. L. Skoura, M. Papamichalis, E. Neou, D. Papadopoulos, S. Karagiannis, T. Zafeiridis, D. Babalis, and A. Komnos. 2020. "Combination of thrombolytic and immunosuppressive therapy for coronavirus disease 2019: A case report." Int J Infect Dis no. 97:90-93. doi: 10.1016/j.ijid.2020.05.118.
Pollard, C. A., M. P. Morran, and A. L. Nestor-Kalinoski. 2020. "The COVID-19 Pandemic: A Global Health Crisis." Physiol Genomics. doi: 10.1152/physiolgenomics.00089.2020.
Poor, H. D., C. E. Ventetuolo, T. Tolbert, G. Chun, G. Serrao, A. Zeidman, N. S. Dangayach, J. Olin, R. Kohli-Seth, and C. A. Powell. 2020. "COVID-19 critical illness pathophysiology driven by diffuse pulmonary thrombi and pulmonary endothelial dysfunction responsive to thrombolysis." Clin Transl Med no. 10 (2). doi: 10.1002/ctm2.44.
Roberts, K. A., L. Colley, T. A. Agbaedeng, G. M. Ellison-Hughes, and M. D. Ross. 2020. "Vascular Manifestations of COVID-19 - Thromboembolism and Microvascular Dysfunction." Front Cardiovasc Med no. 7:598400. doi: 10.3389/fcvm.2020.598400.
Sheng, S., M. D. Carattino, J. B. Bruns, R. P. Hughey, and T. R. Kleyman. 2006. "Furin cleavage activates the epithelial Na+ channel by relieving Na+ self-inhibition." Am J Physiol Renal Physiol no. 290 (6):F1488-96. doi: 10.1152/ajprenal.00439.2005.
Sidarta-Oliveira, D., C. P. Jara, A. J. Ferruzzi, M. S. Skaf, W. H. Velander, E. P. Araujo, and L. A. Velloso. 2020. "SARS-CoV-2 receptor is co-expressed with elements of the kinin-kallikrein, renin-angiotensin and coagulation systems in alveolar cells." Sci Rep no. 10 (1):19522. doi: 10.1038/s41598-020-76488-2.
Smith, J. C., E. L. Sausville, V. Girish, M. L. Yuan, A. Vasudevan, K. M. John, and J. M. Sheltzer. 2020. "Cigarette Smoke Exposure and Inflammatory Signaling Increase the Expression of the SARS-CoV-2 Receptor ACE2 in the Respiratory Tract." Dev Cell no. 53 (5):514-529.e3. doi: 10.1016/j.devcel.2020.05.012.
Stuart, T., A. Butler, P. Hoffman, C. Hafemeister, E. Papalexi, W. M. Mauck, 3rd, Y. Hao, M. Stoeckius, P. Smibert, and R. Satija. 2019. "Comprehensive Integration of Single-Cell Data." Cell no. 177 (7):1888-1902 e21. doi: 10.1016/j.cell.2019.05.031.
Sungnak, W., N. Huang, C. Becavin, M. Berg, R. Queen, M. Litvinukova, C. Talavera-Lopez, H. Maatz, D. Reichart, F. Sampaziotis, K. B. Worlock, M. Yoshida, J. L. Barnes, and H. C. A. Lung Biological Network. 2020. "SARS-CoV-2 entry factors are highly expressed in nasal epithelial cells together with innate immune genes." Nat Med no. 26 (5):681-687.
doi: 10.1038/s41591-020-0868-6.
Thibodeau, P. H., and M. B. Butterworth. 2013. "Proteases, cystic fibrosis and the epithelial sodium channel (ENaC)." Cell Tissue Res no. 351 (2):309-23. doi: 10.1007/s00441-012-1439-z.
Thierry, A. R. 2020. "Anti-protease Treatments Targeting Plasmin(ogen) and Neutrophil Elastase May Be Beneficial in Fighting COVID-19." Physiol Rev no. 100 (4):1597-1598. doi: 10.1152/physrev.00019.2020.
Vinayagam, S., and K. Sattu. 2020. "SARS-CoV-2 and coagulation disorders in different organs." Life Sci no. 260:118431. doi: 10.1016/j.lfs.2020.118431.
Wang, I. M., S. Stepaniants, Y. Boie, J. R. Mortimer, B. Kennedy, M. Elliott, S. Hayashi, L. Loy, S. Coulter, S. Cervino, J. Harris, M. Thornton, R. Raubertas, C. Roberts, J. C. Hogg, M. Crackower, G. O'Neill, and P. D. Paré. 2008. "Gene expression profiling in patients with chronic obstructive pulmonary disease and lung cancer." Am J Respir Crit Care Med no. 177 (4):402-11. doi: 10.1164/rccm.200703-390OC.
Wang, J., N. Hajizadeh, E. E. Moore, R. C. McIntyre, P. K. Moore, L. A. Veress, M. B. Yaffe, H. B. Moore, and C. D. Barrett. 2020. "Tissue plasminogen activator (tPA) treatment for COVID-19 associated acute respiratory distress syndrome (ARDS): A case series." no. 18 (7):1752-1755. doi: 10.1111/jth.14828.
Wang, Q., Y. Qiu, J. Y. Li, Z. J. Zhou, C. H. Liao, and X. Y. Ge. 2020. "A Unique Protease Cleavage Site Predicted in the Spike Protein of the Novel Pneumonia Coronavirus (2019-nCoV) Potentially Related to Viral Transmissibility." Virol Sin no. 35 (3):337-339. doi: 10.1007/s12250-020-00212-7.
Wrapp, D., and N. Wang. 2020. "Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation." no. 367 (6483):1260-1263. doi: 10.1126/science.abb2507.
Xia, S., Q. Lan, S. Su, X. Wang, W. Xu, Z. Liu, Y. Zhu, Q. Wang, L. Lu, and S. Jiang. 2020. "The role of furin cleavage site in SARS-CoV-2 spike protein-mediated membrane fusion in the presence or absence of trypsin." Signal Transduct Target Ther no. 5 (1):92. doi: 10.1038/s41392-020-0184-0.
Xu, J., X. Xu, L. Jiang, K. Dua, P. M. Hansbro, and G. Liu. 2020. "SARS-CoV-2 induces transcriptional signatures in human lung epithelial cells that promote lung fibrosis." no. 21 (1):182. doi: 10.1186/s12931-020-01445-6.
Zhao, R., G. Ali, and H. G. Nie. 2020. "Plasmin improves blood-gas barrier function in oedematous lungs by cleaving epithelial sodium channels." Br J Pharmacol no. 177 (13):3091-3106. doi: 10.1111/bph.15038.
Zhou, P., X. L. Yang, X. G. Wang, B. Hu, L. Zhang, W. Zhang, H. R. Si, Y. Zhu, B. Li, C. L. Huang, H. D. Chen, J. Chen, Y. Luo, H. Guo, R. D. Jiang, M. Q. Liu, Y. Chen, X. R. Shen, X. Wang, X. S. Zheng, K. Zhao, Q. J. Chen, F. Deng, L. L. Liu, B. Yan, F. X. Zhan, Y. Y. Wang, G. F. Xiao, and Z. L. Shi. 2020. "A pneumonia outbreak associated with a new coronavirus of probable bat origin." Nature no. 579 (7798):270-273. doi: 10.1038/s41586-020-2012-7.

Figure 1. Targeted molecular mimicry by SARS-CoV-2 of human ENaC and profiling of ACE2-SCNN1G-PLAU/PLAT co-expression. (A) The cartoon shows the S protein of SARS-CoV-2 (PDB ID: 6X2A), highlighted in green. The S1/S2 cleavage site required for the activation of SARS-CoV-2 is enlarged and highlighted in red. Furin/plasmin cleavage sites of common human viruses are shown in a box. (B) The cartoon represents the human ENaC protein (PDB ID: 6BQN), highlighted in green.
The furin/plasmin cleavage site is enlarged and highlighted in red. The cleavage sites of ENaC in other species are shown in a box. (C) Summary of the single-cell transcriptomic co-expression of ACE2, SCNN1G (ENaC), and PLAU. The heatmap depicts the mean relative expression of each gene across the identified cell populations. The cell types are ranked by decreasing expression of PLAU. The box highlights the cell types co-expressing ACE2, SCNN1G (ENaC), and PLAU in the human respiratory system.

Figure 2. Expression of proteases, ENaC, and ACE2 in the human respiratory system. Violin plots showing the expression level of PLAU, PLG, FURIN, TMPRSS2, and SCNN1G on the nferX platform.

Figure 3. Overall expression levels of proteases, ACE2, and SCNN1G in BALF bulk cells of COVID-19 patients. (A) Bubble plot of proteases, ACE2, and SCNN1G in BALFs of COVID-19 patients. The size of the dots indicates the proportion of cells in the respective cell type having a greater-than-zero expression of these genes, while the color indicates the mean expression of these genes. (B) The gene expression levels of proteases, ACE2, and SCNN1G from healthy controls (n = 4), moderate cases (n = 3), and severe cases (n = 6). ***Padj < 0.001 (Wilcoxon test; Padj was calculated using Bonferroni correction).
Figure 4. Transcription levels of proteases, ACE2, and SCNN1G in single epithelial cells of COVID-19 patients. (A) Bubble plot of the SARS-CoV-2 receptor (ACE2) and proteases in BALF epithelial cells of COVID-19 patients. The size of the dots indicates the proportion of cells in the respective cell type with greater-than-zero expression of these genes, while the color indicates the mean expression of these genes. (B) The gene expression levels of selected proteases and ACE2 in epithelial cells from healthy controls (n = 4), moderate cases (n = 3), and severe cases (n = 6). (C) The gene expression levels of selected proteases and ACE2 in different epithelial cell types from healthy controls, moderate cases and severe cases. ***Padj < 0.001 (Wilcoxon test; P values were adjusted using Bonferroni correction). AEC: alveolar epithelial cells.

Figure 5. Changes of proteases, ACE2, and SCNN1G in respiratory cell lines after SARS-CoV-2 infection. Normal human bronchial epithelial (NHBE) and alveolar epithelial (A549, Calu-3) cells were infected with SARS-CoV-2 for 24 h (Infected), and control cells received culture medium only (Mock). The boxplots show the changes of proteases (PLAU, FURIN, and TMPRSS), SCNN1G, and ACE2 in A549, Calu-3, and NHBE cells after SARS-CoV-2 infection. Differentially expressed genes were identified with DESeq2; ***Padj < 0.001, *Padj < 0.05 (Wald test; P values were adjusted using the Benjamini-Hochberg procedure).
Figure 6. SARS-CoV-2 infection hijacks the ENaC proteolytic network. Under physiological conditions, urokinase converts plasminogen to plasmin, which cleaves γENaC, leading to its activation. After SARS-CoV-2 infection, the expression of PLAU (urokinase) is significantly upregulated, which may help other viruses invade by activating plasminogen to cleave the S protein. The green solid line represents the urokinase, plasminogen, and ENaC mRNA transcripts and activation by plasmin under physiological conditions. The red solid line represents the activation process under infection conditions, while the grey dotted line denotes repressive effects.

10_1101-2021_01_08_425855 ---- DeepHBV: A deep learning model to predict hepatitis B virus (HBV) integration sites.

Canbiao Wu1 ¶, Xiaofang Guo2 ¶, Mengyuan Li3 ¶, Xiayu Fu4, Zeliang Hou1, Manman Zhai1,5, Jingxian Shen1, Xiaofan Qiu1, Zifeng Cui3, Hongxian Xie6, Pengmin Qin5, Xuchu Weng1, Zheng Hu3,7*, Jiuxing Liang1*

1 Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, China; Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, China.
2 Department of Medical Oncology of the Eastern Hospital, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
3 Department of Gynecological Oncology, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
4 Department of Thoracic Surgery, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
5 School of Psychology, South China Normal University, Guangzhou, Guangdong, China
6 Generulor Company Bio-X Lab, Guangzhou, Guangdong, China
7 Department of Obstetrics and Gynecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China

*Corresponding author. Email: huzheng1998@163.com (ZH), liangjiuxing@m.scnu.edu.cn (JL)
¶These authors contributed equally to this work.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.08.425855; this version posted January 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Abstract

Hepatitis B virus (HBV) is one of the main causes of viral hepatitis and liver cancer. Previous studies showed that HBV can integrate into the host genome and further promote malignant transformation. In this study, we developed an attention-based deep learning model, DeepHBV, to predict HBV integration sites by learning local genomic features automatically. We trained and tested DeepHBV using HBV integration site data from the dsVIS database. Initially, DeepHBV showed an AUROC of 0.6363 and an AUPR of 0.5471 on the dataset. Adding repeat peaks or TCGA Pan Cancer peaks significantly improved the model performance, with AUROCs of 0.8378 and 0.9430 and AUPRs of 0.7535 and 0.9310, respectively.
On an independent validation dataset of HBV integration sites from VISDB, DeepHBV with HBV integration sequences plus TCGA Pan Cancer peaks (AUROC of 0.7603 and AUPR of 0.6189) performed better than DeepHBV with HBV integration sequences plus repeat peaks (AUROC of 0.6657 and AUPR of 0.5737). Next, we found that transcription factor binding sites (TFBS) were significantly enriched near the genomic positions to which the convolutional neural network paid attention. The binding sites of AR-halfsite, Arnt, Atf1, bHLHE40, bHLHE41, BMAL1, CLOCK, c-Myc, COUP-TFII, E2A, EBF1, Erra and Foxo3 were highlighted by the DeepHBV attention mechanism in both the dsVIS and VISDB datasets, revealing the HBV integration preference. In summary, DeepHBV is a robust and explainable deep learning model not only for the prediction of HBV integration sites but also for further mechanistic study of HBV-induced cancer.

Author summary

Hepatitis B virus (HBV) is one of the main causes of viral hepatitis and liver cancer. Previous studies showed that HBV can integrate into the host genome and further promote malignant transformation. In this study, we developed an attention-based deep learning model, DeepHBV, to predict HBV integration sites by learning local genomic features automatically. The performance of the DeepHBV model improves significantly after adding genomic features, reaching an AUROC of 0.9430 and an AUPR of 0.9310. Furthermore, we identified transcription factor binding sites that are enriched in the regions highlighted by the convolutional neural network.
In summary, DeepHBV is a robust and explainable deep learning model not only for the prediction of HBV integration sites but also for further study of the HBV integration mechanism.

Introduction

HBV is the main cause of viral hepatitis and liver cancer (hepatocellular carcinoma, HCC) [1]. It is a small DNA virus that replicates via an RNA intermediate and can integrate into the host genome [1]. First, HBV attaches to and enters hepatocytes, then transports its nucleocapsid, which contains relaxed circular DNA (rcDNA), to the host nucleus. In the host nucleus, rcDNA is converted into covalently closed circular DNA (cccDNA), which is transcribed into messenger RNAs (mRNA) and pregenomic RNA (pgRNA). Via reverse transcription, pgRNA produces new rcDNA and double-stranded linear DNA (dslDNA), which tend to integrate into the host cell genome [2].

A previous study showed that HBV integration breakpoints are distributed randomly across the whole genome, with a handful of hotspots [3]. For instance, HBV was reported to integrate recurrently into the telomerase reverse transcriptase (TERT) and myeloid/lymphoid or mixed-lineage leukemia 4 (MLL4, also known as KMT2B) genes. The insertional events were also accompanied by altered expression of the integrated gene [2,3,5], indicating important biological impacts on the local genome. Further analysis revealed an association between HBV integration and genomic instability in these insertional events [4].
Moreover, significant enrichment of HBV integration was found near the following genomic features in tumours compared to non-tumour tissue: repetitive regions, fragile sites, CpG islands and telomeres [2]. However, the pattern and mechanism of HBV integration remain to be explored. Many HBV integration sites are distributed throughout the human genome and seem completely random [4,6,7]. Whether the features and patterns of these “random” viral integration events can be learned and extracted remains an open question; answering it would greatly improve the understanding of HBV integration-induced carcinogenesis.

Deep learning performs well in computational biology research, for example in medical image identification [8] and in discovering motifs in protein sequences [9]. The convolutional neural network (CNN) is a central deep learning architecture that learns features directly from training data [10]. Although deep learning performs well in a variety of fields, how it reaches its decisions is hard to explain because of its black-box nature. Therefore, the attention mechanism, which highlights the most influential parts of the input, was introduced to open the “black box” [11,12]. In this study, we developed DeepHBV, an attention-based deep learning model to predict HBV integration sites. The attention mechanism calculates an attention weight for each position and, at the same time, connects the encoder and the decoder.
It highlights the regions on which DeepHBV concentrates and helps identify the patterns that receive attention. DeepHBV can predict HBV integration sites accurately and specifically, and the attention mechanism identified positions of potential biological importance.

Results

DeepHBV effectively predicts HBV integration sites by adding genomic features.

The DeepHBV model structure and the scheme for encoding a 2 kb sample into a binary matrix are described in Fig 1. The DeepHBV model was tested with our HBV integration sites database (http://dsvis.wuhansoftware.com). HBV integration sequences were prepared from HBV integration sites as positive/negative samples following the steps in Methods. We used twice as many negative samples as positive samples to keep the data balanced and to improve the confidence level. The positive samples were divided into 2902 training samples and 1264 testing samples. Correspondingly, we extracted 5804 and 2528 negative samples as the negative training and testing datasets. DeepHINT, an existing deep learning model for predicting HIV integration sites from their surroundings [15], was also evaluated using HBV integration sequences for training and testing. Both models were trained on the same HBV integration training dataset and evaluated on the same testing dataset. DeepHBV with HBV integration sequences showed an AUROC of 0.6363 and an AUPR of 0.5471, while DeepHINT with HBV integration sequences showed an AUROC of 0.6199 and an AUPR of 0.5152 (Fig 2).
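For orientation, the two metrics reported above can be computed from labels and scores as sketched below. This is a minimal illustration in plain Python, not the authors' evaluation code: the function names `auroc` and `average_precision` and the toy labels and scores are assumptions, and average precision is only one common estimator of the area under the PR curve (the text does not say which estimator was used). The last line uses the test-set sizes given above to show the AUPR a random classifier would be expected to reach, which is a useful baseline when reading the reported values.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Tied scores receive their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


# Toy example (hypothetical values, 1 = integration site, 0 = background).
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(auroc(toy_labels, toy_scores), average_precision(toy_labels, toy_scores))

# Expected AUPR of a random classifier on the 1264/2528 test split.
baseline_aupr = 1264 / (1264 + 2528)
print(round(baseline_aupr, 4))
```

Because negatives outnumber positives two to one, the AUPR baseline is about one third rather than one half, whereas the AUROC baseline stays at 0.5.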
The comparison of DeepHBV and DeepHINT is described in the Discussion. Several previous studies showed that HBV integration has a preference for surrounding genomic features such as repeats, histone markers and CpG islands [2,4]. Thus, we added these genomic features to DeepHBV by mixing genomic feature samples with HBV integration sequences as new datasets, then trained and tested the updated DeepHBV models. We downloaded the following genomic features from different datasets [16-18] and grouped them into four subgroups: (1) DNase Clusters, Fragile site, RepeatMasker; (2) CpG islands, GeneHancer; (3) Cons 20 Mammals, TCGA Pan-Cancer; (4) H3K4Me3 ChIP-seq, H3K27ac ChIP-seq (S2 Fig). After obtaining the positions of the genomic features (sources are listed in S2 Table), we extended the positions to 2000 bp and extracted the corresponding sequences from the hg38 reference genome. We defined these sequences as positive genomic feature samples. Then we mixed HBV integration sequences, positive genomic feature samples, and randomly picked negative genomic feature samples (see Methods) and trained the DeepHBV model. Once a subgroup performed well, we re-tested each genomic feature in that subgroup to determine which specific genomic feature significantly affected the model performance (S2 Fig; AUROC and AUPR values are recorded in S3 Table).
From the ROC and PR curves, we found that adding the genomic features repeat (AUROC: 0.8378 and AUPR: 0.7535) or TCGA Pan Cancer (AUROC: 0.9430 and AUPR: 0.9310) to the HBV integration sequences significantly improved the prediction of HBV integration sites compared with DeepHBV trained on HBV integration sequences alone (Fig 2). We also performed the same test on DeepHINT, but did not find a subgroup that substantially improved the model performance (results are recorded in S3 Table). Together, these results show that DeepHBV with HBV integration sequences plus repeat or TCGA Pan Cancer peaks significantly improves the model performance.

Validation of DeepHBV using independent dataset VISDB

Because DeepHBV needs to be applicable to general datasets, we tested the pre-trained DeepHBV models (DeepHBV with HBV integration sequences + repeat peaks and DeepHBV with HBV integration sequences + TCGA Pan Cancer peaks) on the HBV integration sites dataset of another virus integration site (VIS) database, VISDB [19]. The model trained with HBV integration sequences + repeat sequences showed an AUROC of 0.6657 and an AUPR of 0.5737, while the model trained with HBV integration sequences + TCGA Pan Cancer showed an AUROC of 0.7603 and an AUPR of 0.6189. The DeepHBV model with HBV integration sequences + TCGA Pan Cancer performed better than the DeepHBV model with HBV integration sequences + repeat and was more robust on both the testing dataset from dsVIS (AUROC: 0.9430 and AUPR: 0.9310) and the independent testing dataset from VISDB (AUROC: 0.7603 and AUPR: 0.6189).
Thus, we decided to use this model for future studies of HBV integration sites.

Study the preference pattern of HBV integration by conserved sequence elements

DeepHBV can extract features with translation invariance through the pooling operation, which enables DeepHBV to recognise certain patterns even if the features are slightly shifted. Including an attention mechanism in the DeepHBV framework may partly open the deep learning black box by giving an attention weight to each position. Each attention weight represents the computational importance of that position in the DeepHBV judgement. The attention weights in the attention layer were extracted after two de-convolution operations and one de-pooling operation, and the output shape is 667×1. Each score represents the attention weight of a 3 bp region. Positions with higher attention weights may have a greater impact on the pattern recognition of DeepHBV, meaning that these positions may be critical for identifying HBV integration positive samples. We first averaged the fractions of attention scores over all HBV integration sequences and normalized them to the mean of all positions. We then visualised the fractions of attention scores and found peak-valley-peak patterns only in positive samples (Fig 3). We were interested in the positions with higher attention weights in the convolutional neural network.
We found that, in the attention weight distribution of DeepHBV with HBV integration sites + TCGA Pan Cancer, a cluster of attention weights much higher than the other weights often occurred in the positive samples, whereas the model of DeepHBV with HBV integration sites + repeat did not show this pattern (Fig 3). To further explore the pattern behind these positions, we defined the sites with the top 5% highest attention weights as attention intensive sites and the regions of 10 bp near them as attention intensive regions. We mapped these attention intensive sites onto the hg38 reference genome together with genomic features (Fig 4), but the positional relationship between attention intensive sites and genomic features was not clear. These results indicate that there may be other specific patterns closely related to HBV integration preference that the DeepHBV model can recognise.

In deep learning, the convolution and pooling module learns patterns with translation invariance: the network tends to learn domains that occur recurrently among different samples in the same pooling matrix, even if the learned feature is not at the same position in these samples [20,21]. Attention intensive regions are therefore more likely to be conserved, owing to the translation invariance of the convolution and pooling module, and may give hints about the selection preference of HBV integration sites.
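The selection of attention intensive sites and regions described above can be sketched as follows. This is an illustrative reading, not the authors' code: the function name `attention_intensive_regions` and its parameters are hypothetical, each of the 667 weights is assumed to cover three consecutive base pairs of the 2000 bp sample, and "regions of 10 bp near them" is interpreted here as a 10 bp flank on each side, which the text does not state explicitly. Coordinates are 0-based and half-open, relative to the start of the sample.

```python
import math


def attention_intensive_regions(weights, seq_len=2000, bin_size=3,
                                top_frac=0.05, flank=10):
    """Return (start, end) windows around the highest-weighted bins.

    weights  : one attention weight per bin (667 values for a 2 kb sample)
    bin_size : base pairs covered by one weight (3 bp in the text)
    top_frac : fraction of bins treated as attention intensive sites
    flank    : ASSUMED number of bp added on each side of a site
    """
    n_top = max(1, math.ceil(top_frac * len(weights)))
    # Indices of the bins with the largest attention weights.
    top_bins = sorted(range(len(weights)),
                      key=lambda i: weights[i], reverse=True)[:n_top]
    regions = []
    for b in sorted(top_bins):
        start = b * bin_size
        end = min(start + bin_size, seq_len)  # the last bin is truncated
        regions.append((max(0, start - flank), min(seq_len, end + flank)))
    return regions


# Toy weights (hypothetical): two bins stand out from a flat background.
toy_weights = [0.0] * 667
toy_weights[100] = 5.0
toy_weights[666] = 4.0
toy_regions = attention_intensive_regions(toy_weights)
print(len(toy_regions))
```

To place a window on hg38, the genomic start coordinate of the sample would be added to both ends; overlapping windows could also be merged before motif enrichment, a step omitted here for brevity.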
Transcription factor binding site (TFBS) motifs are conserved genomic elements that can be critical to the regulation of downstream genes. Therefore, we tested whether TFBS play important roles in HBV integration preference. We took all HBV integration samples with prediction scores higher than 0.95 from dsVIS and VISDB separately and used HOMER v4.11.1 [22] with its vertebrate transcription factor databases to find local TFBS motifs enriched in attention intensive regions (Table 1). For DeepHBV with HBV integration sequences + TCGA Pan Cancer, the binding sites of AR-halfsite, Arnt, Atf1, bHLHE40, bHLHE41, BMAL1, CLOCK, c-Myc, COUP-TFII, E2A, EBF1, Erra, Foxo3, HEB, HIC1, HIF-1b, LRF, Meis1, MITF, MNT, MyoG, n-Myc, NPAS2, NPAS, Nr5a2, Ptf1a, Snail1, Tbx5, Tbx6, TCF7, TEAD1, TEAD3, TEAD4, TEAD, Tgif1, Tgif2, THRb, USF1, Usf2, Zac1, ZEB1, ZFX, ZNF692 and ZNF711 were enriched in the attention intensive regions of both dsVIS and VISDB sequences. We selected two representative samples for a more intuitive display. Genomic features, HBV integration sites from dsVIS and VISDB, attention intensive sites and TFBS were aligned and shown on the hg38 reference genome (Fig 4). Most attention intensive sites could be mapped to enriched TF motifs. The clusters of high attention weights from the output of DeepHBV with HBV integration sites plus TCGA Pan Cancer corresponded to the binding sites of the tumour suppressor gene HIC1 and of the circadian clock-related elements BMAL1, CLOCK, c-Myc and NPAS2 (Fig 4).
These data provide novel insights into HBV integration site selection preference and reveal biological importance that warrants future experimental confirmation.

Table 1. Enriched TFBS from attention intensive regions of DeepHBV with HBV integration sites + TCGA Pan Cancer peaks.

HOMER known results                HOMER de novo results
Rank  Name             P-value     Rank  Best Match/Details  P-value
1     BMAL1            1E-323      1     TEAD3               1E-2283
2     NPAS             1E-259      2     EBF1                1E-1926
3     CLOCK            1E-165      3     TCF7                1E-958
4     c-Myc            1E-126      4     GRHL2               1E-504
5     ZFX              1E-108      5     Dux                 1E-477
6     Tgif2            1E-75       6     Ptf1a               1E-465
7     MNT              1E-71       7     TEAD                1E-385
8     LRF              1E-62       8     Ahr::Arnt           1E-302
9     Tbx5             1E-62       9     Sox5                1E-245
10    ZNF711           1E-57       10    TEAD                1E-233
11    n-Myc            1E-54       11    Zic2                1E-204
12    ZNF416           1E-52       12    Nr2e3               1E-197
13    USF1             1E-47       13    SOX18               1E-182
14    bHLHE40          1E-45       14    ZBTB14              1E-174
15    Rbpj1            1E-36       15    USF2                1E-153
16    Zac1             1E-35       16    Isl1                1E-142
17    Tgif1            1E-32       17    ZNF264              1E-142
18    ZEB1             1E-30       18    Ascl2               1E-133
19    THRb             1E-29       19    ZNF460              1E-120
20    Ptf1a            1E-29       20    LRF                 1E-117
21    bHLHE41          1E-29       21    ZNF416              1E-117
22    TEAD1            1E-27       22    PKNOX1              1E-103
23    Stat3            1E-24       23    Bcl6b               1E-91
24    Meis1            1E-21       24    Arnt                1E-90
25    c-Myc            1E-21       25    Osr2                1E-88
26    Usf2             1E-20       26    TFAP2A              1E-79
27    NPAS2            1E-17
28    HIC1             1E-17
29    TEAD             1E-17
30    TEAD4            1E-16
31    AR-halfsite      1E-16
32    STAT6            1E-15
33    TCF4             1E-13
34    MITF             1E-13
35    TEAD3            1E-13
36    Atf1             1E-12
37    HIF-1b           1E-11
38    Foxo3            1E-10
39    E2A              1E-09
40    TEAD2            1E-09
41    Mef2a            1E-08
42    ZNF692           1E-07
43    Nkx3.1           1E-07
44    COUP-TFII        1E-07
45    MyoG             1E-07
46    Nkx2.5           1E-06
47    Snail1           1E-05
48    HEB              1E-05
49    Tbx6             1E-05
50    SCRT1            1E-04
51    Nr5a2            1E-04
52    Nanog            1E-03
53    Oct11            1E-03
54    Elk1             1E-03
55    Erra             1E-03
56    Gata6            1E-03
57    BHLHA15          1E-03
58    AMYB             1E-03
59    Nr5a2            1E-03
60    NFkB-p65-Rel     1E-02
61    Zic              1E-02
62    TRPS1            1E-02
63    Hoxa9            1E-02
64    HIF2a            1E-02
65    Isl1             1E-02
66    CEBP:AP1         1E-02
67    EWS:FLI1-fusion  1E-02
68    FOXK1            1E-02
69    ETS              1E-02

Discussion

In this study, we developed an explainable attention-based deep learning model, DeepHBV, to predict HBV integration sites. In the comparison of DeepHBV and DeepHINT on predicting HBV integration sites (S3 Table), DeepHBV outperformed DeepHINT after adding genomic features, owing to a model structure and parameters better suited to recognising the surroundings of HBV integration sites.
We applied two convolution layers (1st layer: 128 convolution kernels of size 8; 2nd layer: 256 convolution kernels of size 6) and one pooling layer (pooling size 3) in DeepHBV, whereas DeepHINT has only one convolution layer (64 convolution kernels of size 6) and one pooling layer (pooling size 3). Adding convolution layers allows higher-dimensional information to be extracted, and adding convolution kernels allows more feature information to be extracted [23].

We trained the DeepHBV model using three strategies: (1) DNA sequences near HBV integration sites (HBV integration sequences), (2) HBV integration sequences + TCGA Pan Cancer peaks, and (3) HBV integration sequences + repeat peaks. Adding either TCGA Pan Cancer or repeat peaks to the HBV integration sequences significantly improved the model performance, and DeepHBV with HBV integration sequences plus TCGA Pan Cancer peaks performed better on the independent test dataset VISDB. However, the attention intensive regions could not be well aligned to these genomic features. We therefore inferred that other features, such as TFBS motifs, may influence DeepHBV in the prediction process, and applied HOMER to recognise TFBS that might be related to HBV-related diseases or cancer development.
We noticed that the attention intensive regions identified by the attention mechanism of DeepHBV with HBV integration sequences + TCGA Pan Cancer were strongly concentrated on the binding sites of the tumour suppressor gene HIC1, the circadian clock-related elements BMAL1, CLOCK, c-Myc and NPAS2, and the transcription factors TEAD and Nr5a2. These DNA-binding proteins are closely related to tumour development [24-30]. For instance, HIC1 is a tumour suppressor gene in hepatocarcinogenesis [24,25]. BMAL1, CLOCK, c-Myc and NPAS2 all participate in the regulation of the circadian clock [26], which is reported to promote HBV-related diseases [27,28]. Consistently, the binding motifs of circadian clock-related elements were also enriched in the attention intensive regions of DeepHBV with HBV integration sequences + repeats, further supporting the results (S4 Table). The other transcription factors identified by DeepHBV are TEAD and Nr5a2. TEAD deregulation affects well-established cancer genes such as BRAF, KRAS, MYC, NF2 and LKB1, and shows a high correlation with clinicopathological parameters in human malignancies [29]. Nr5a2 (also known as liver receptor homolog-1, LRH-1) binds to enhancer II (ENII) of HBV genes and serves as a critical regulator of their expression [30].

In summary, DeepHBV is a robust deep learning model that uses a convolutional neural network to predict HBV integration sites. Our data provide new insight into the preference of HBV integration and into the mechanisms of HBV-induced cancer.
Methods

Data preparation

Detailed step-by-step instructions for DeepHBV are provided in S1 and S2 Notes. To obtain positive training and testing samples for DeepHBV, we extracted the 1000 bp DNA sequence upstream and the 1000 bp DNA sequence downstream of each HBV integration site as the positive dataset. Each sample was denoted as $S = (n_1, n_2, \ldots, n_{2000})$, where $n_i$ represents the nucleotide at position $i$. DeepHBV, as a deep learning network, also requires negative samples that do not contain HBV integration sites as background. The existence of HBV integration hot spots, which contain several integration events within a 30~100 kb range [13], prompted us to select background regions at a sufficient distance from known HBV integration sites. Thus, we discarded regions of 50 kb around known HBV integration sites on the hg38 reference genome and randomly selected 2 kb DNA sequences from the remaining regions as negative samples.

We encoded the extracted DNA sequences using one-hot encoding to make the calculation of distances between features in training and the calculation of similarity more accurate. The original DNA sequences were converted to binary matrices in which each nucleotide is a 4-bit vector and each dimension corresponds to one nucleotide type. A 2000 bp DNA sequence was thus converted into a 2000×4 binary matrix.

Feature extraction

The DeepHBV model first applies the convolution and pooling module to learn and
obtain sequence features around HBV integration sites (S1 Fig). Each binary matrix representing a DNA sequence enters the convolution and pooling module, where the convolution is computed. We employed multiple different convolution kernels in order to obtain different features. With $S = (n_1, n_2, \ldots, n_{2000})$ denoting a specific DNA sequence and $E$ the binary matrix encoded from $S$, the calculation in the convolution layer, $X = \mathrm{conv}(E)$, can be described as:

$$X_{k,i} = \sum_{j=0}^{p-1} \sum_{l=1}^{L} W_{k,j,l} \, E_{l,i+j} \tag{1}$$

where $1 \le k \le d$, $d$ is the number of kernels, $1 \le i \le n - p + 1$, $i$ is the position index, $p$ is the kernel size, $n$ is the input sequence length, $W$ is the kernel weight, and $l$ runs over the $L$ rows of $E$ (the nucleotide types).

The convolutional layer activates the extracted feature vectors using the Rectified Linear Unit (ReLU). ReLU is an activation function in artificial neural networks, described as $f(x) = \max(0, x)$. We applied ReLU to the output matrix of each convolution layer, mapping each element to a sparse matrix. ReLU imitates real neuron activation, which enables the model to fit the data better. We then applied max-pooling to reduce the dimension while retaining as much predictive information as possible. At this point, we obtain the final feature vector $F_c$ from the binary matrix representing the DNA sequence after feature extraction in the convolution and pooling module.

Attention mechanism in DeepHBV model

DeepHBV adds an attention mechanism in order to capture and understand the
position contribution in the abstracted eigenvector $F_c$. The eigenvector entered the attention layer, which calculated a weight value for each dimension in $F_c$. The attention weight represents the contribution level of the convolutional neural network (CNN) at that position. The output of the attention layer, $t_j$, is the contribution score; a larger $t_j$ means a bigger contribution of that position to the prediction of HBV integration sites. All contribution scores were normalized to obtain the dense eigenvector matrix, denoted $F_a$:

$$F_a = \sum_{j=1}^{q} a_j v_j \qquad (2)$$

where

$$a_j = \frac{\exp(t_j)}{\sum_{i=1}^{q} \exp(t_i)} \qquad (3)$$

Here $a_j$ is the normalised score and $v_j$ is the eigenvector at position $j$ of the input eigenmatrix. Each position represents an extracted eigenvector in each convolution kernel. The convolution-pooling module and the attention module need to be combined for model prediction; in other words, the eigenvector $F_c$ and the corresponding importance score $F_a$ work together in the prediction of HBV integration sites. We linked the values in eigenvector $F_c$ and linearly mapped them to a new vector $F_v$:

$$F_v = dense(flatten(F_c)) \qquad (4)$$

In this step, the flatten layer performed the function $flatten()$ to reduce dimension and concatenate data, and the dense layer executed the function $dense()$, which maps the dimension-reduced data to a single value.
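The encoding, convolution-pooling and attention steps described above can be sketched in a few lines of NumPy. This is only an illustrative forward pass under stated assumptions, not the DeepHBV implementation: the kernel count, kernel width, pooling size, the random weights and the scoring vector `u` used to produce the scores $t_j$ are all hypothetical, whereas the real model learns its weights and uses the hyperparameters listed in S1 Table.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) binary matrix.
    Bases outside A/C/G/T (e.g. N) become all-zero rows (an assumption)."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            mat[pos, index[base]] = 1.0
    return mat

def conv1d_relu(E, W):
    """Valid 1-D convolution (Eq. 1) followed by ReLU.
    E: (n, 4) one-hot matrix; W: (d, p, 4) kernels. Returns (n - p + 1, d)."""
    d, p, _ = W.shape
    n = E.shape[0]
    windows = np.stack([E[i:i + p] for i in range(n - p + 1)])  # (n-p+1, p, 4)
    X = np.einsum("ipl,kpl->ik", windows, W)
    return np.maximum(X, 0.0)

def max_pool(X, size):
    """Non-overlapping max pooling along the position axis."""
    q = X.shape[0] // size
    return X[:q * size].reshape(q, size, X.shape[1]).max(axis=1)

def attention_pool(V, t):
    """Eq. 2-3: softmax-normalise the scores t (q,) into weights a,
    then return a and the weighted sum F_a = sum_j a_j * v_j."""
    a = np.exp(t - t.max())  # subtract the max for numerical stability
    a /= a.sum()
    return a, a @ V

rng = np.random.default_rng(0)
seq = "".join(rng.choice(list(NUCLEOTIDES), size=2000))  # synthetic 2 kb sequence
E = one_hot(seq)                               # (2000, 4)
W = rng.normal(size=(8, 6, 4))                 # 8 hypothetical kernels of width 6
F_c = max_pool(conv1d_relu(E, W), size=3)      # (665, 8)
u = rng.normal(size=8)                         # hypothetical scoring vector
a, F_a = attention_pool(F_c, F_c @ u)          # weights (665,), summary (8,)
```

The weights `a` sum to one, so positions with larger scores dominate `F_a`; inspecting `a` along the sequence is the kind of per-position readout the attention analysis relies on.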
The concatenation of $F_v$ and $F_a$ then entered the linear classifier, which calculated the probability that HBV integration occurred within the current sequence:

$$P = sigmoid(concat(F_a, F_v)) \qquad (5)$$

where $P$ is the predicted score, $sigmoid()$ is the activation function acting as the classifier in the final output, and $concat()$ is the concatenation operation. Meanwhile, when the output eigenvector $F_c$ of the convolution-and-pooling module is given as input to the attention mechanism, the weight vector $W$ is obtained:

$$W = att(a_1, a_2, \ldots, a_q) \qquad (6)$$

where $att()$ refers to the attention mechanism, $a_i$ denotes the eigenvector in the $i$-th dimension of the eigenmatrix, and $W$ contains the contribution scores of each position in the eigenmatrix extracted by the convolution-and-pooling module.

DeepHBV model training

After fixing each parameter in DeepHBV (S1 Table), we trained the DeepHBV deep neural network with binary cross-entropy. The loss function of DeepHBV is defined as:

$$loss = -\sum_i \left[ y_i \log(P_i) + (1 - y_i) \log(1 - P_i) \right] \qquad (7)$$

where $y_i$ is the binary label of sequence $i$ (in this dataset, positive samples were labelled 1 and negative samples 0) and $P_i$ is the predicted score for that sequence. The back-propagation algorithm was used during training, and the Nesterov-accelerated adaptive moment estimation (Nadam) gradient descent algorithm was applied to optimise the model parameters.
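As a small worked example of the loss in Eq. 7, the following NumPy sketch evaluates the summed binary cross-entropy for a pair of sequences. The function name, the clipping constant `eps` and the example scores are assumptions made for illustration; Keras' built-in binary cross-entropy averages over samples rather than summing, so values differ by a factor of the batch size.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Summed binary cross-entropy (Eq. 7).
    y_true: binary labels (1 = integration site, 0 = background).
    p_pred: predicted scores in (0, 1); clipped by eps to avoid log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

labels = [1, 0]                                         # one positive, one negative sample
loss_good = binary_cross_entropy(labels, [0.9, 0.1])    # confident and correct
loss_bad = binary_cross_entropy(labels, [0.1, 0.9])     # confident and wrong
print(loss_good, loss_bad)
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalised heavily, which is what drives the gradient updates during training.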
The deep learning model was implemented in Python 3.7 with the Keras library 2.2.4 [14], and trained and tested on three NVIDIA® Tesla V100-PCIE-32G GPUs (NVIDIA Corporation, California, USA). Under these software and hardware settings, DeepHBV takes around 90 min for model training and 30 s for testing.

Data Availability

DeepHBV is available as open-source software and can be downloaded from https://github.com/JiuxingLiang/DeepHBV.git

Reference

1. Liang TJ. Hepatitis B: the virus and disease. Hepatology 2009;49(5 Suppl):S13-21.
2. Tu T, Budzinska MA, Shackel NA et al. HBV DNA Integration: Molecular Mechanisms and Clinical Implications. Viruses 2017;9(4).
3. Sung WK, Zheng H, Li S et al. Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat Genet 2012;44(7):765-9.
4. Zhao LH, Liu X, Yan HX et al. Genomic and oncogenic preference of HBV integration in hepatocellular carcinoma. Nat Commun 2016;7:12992.
5. Ding D, Lou X, Hua D et al. Recurrent targeted genes of hepatitis B virus in the liver cancer genomes identified by a next-generation sequencing-based approach. PLoS Genet 2012;8(12):e1003065.
6. Tu T, Budzinska MA, Vondran FWR et al. Hepatitis B Virus DNA Integration Occurs Early in the Viral Life Cycle in an In Vitro Infection Model via Sodium Taurocholate Cotransporting Polypeptide-Dependent Uptake of Enveloped Virus Particles. J Virol 2018;92(11).
7. Mason WS, Gill US, Litwin S et al.
HBV DNA Integration and Clonal Hepatocyte Expansion in Chronic Hepatitis B Patients Considered Immune Tolerant. Gastroenterology 2016;151(5):986-998 e4.
8. Litjens G, Kooi T, Bejnordi BE et al. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88.
9. Bailey TL, Baker ME, Elkan CP. An artificial intelligence approach to motif discovery in protein sequences: Application to steroid dehydrogenases. The Journal of Steroid Biochemistry and Molecular Biology 1997;62(1):29-44.
10. Yamashita R, Nishio M, Do RKG et al. Convolutional neural networks: an overview and application in radiology. Insights into Imaging 2018;9(4):611-629.
11. Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. Computer Science 2014.
12. Guidotti R, Monreale A, Ruggieri S et al. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv. 2018;51(5):Article 93.
13. Hu Z, Zhu D, Wang W et al. Genome-wide profiling of HPV integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology-mediated integration mechanism. Nat Genet 2015;47(2):158-63.
14. Chollet F et al. Keras. 2015.
15. Hu H, Xiao A, Zhang S et al. DeepHINT: understanding HIV-1 integration via deep learning with attention. Bioinformatics 2019;35(10):1660-1667.
16. Haeussler M, Zweig AS, Tyner C et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res 2019;47(D1):D853-D858.
17. Inoue F, Kircher M, Martin B et al.
A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res 2017;27(1):38-52.
18. Robinson JT, Thorvaldsdottir H, Winckler W et al. Integrative genomics viewer. Nature Biotechnology 2011;29(1):24-26.
19. Tang D, Li B, Xu T et al. VISDB: a manually curated database of viral integration sites in the human genome. Nucleic Acids Res 2019.
20. Zhang W, Itoh K, Tanida J et al. Parallel distributed processing model with local space-invariant interconnections and its optical architecture. Appl Opt 1990;29(32):4790-7.
21. Bruna J, Zaremba W, Szlam A et al. Spectral Networks and Locally Connected Networks on Graphs. Computer Science 2013.
22. Heinz S, Benner C, Spann N et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Molecular Cell 2010;38(4):576-589.
23. Seide F, Gang L, Dong Y. Conversational speech transcription using context-dependent deep neural networks. 2012.
24. Taniguchi K, Roberts LR, Aderca IN et al. Mutational spectrum of beta-catenin, AXIN1, and AXIN2 in hepatocellular carcinomas and hepatoblastomas. Oncogene 2002;21(31):4863-71.
25. Zheng J, Xiong D, Sun X et al. Signification of Hypermethylated in Cancer 1 (HIC1) as Tumor Suppressor Gene in Tumor Progression. Cancer Microenviron 2012;5(3):285-93.
26. Paibomesai MI, Moghadam HK, Ferguson MM et al.
Clock genes and their genomic distributions in three species of salmonid fishes: Associations with genes regulating sexual maturation and cell cycling. BMC Res Notes 2010;3:215.
27. Fekry B, Ribas-Latre A, Baumgartner C et al. Incompatibility of the circadian protein BMAL1 and HNF4alpha in hepatocellular carcinoma. Nat Commun 2018;9(1):4349.
28. Mukherji A, Bailey SM, Staels B et al. The circadian clock and liver function in health and disease. J Hepatol 2019;71(1):200-211.
29. Huh HD, Kim DH, Jeong HS et al. Regulation of TEAD Transcription Factors in Cancer Biology. Cells 2019;8(6).
30. Cai YN, Zhou Q, Kong YY et al. LRH-1/hB1F and HNF1 synergistically up-regulate hepatitis B virus gene transcription and DNA replication. Cell Research 2003;13(6):451-458.

Figure legends

Figure 1. The deep learning framework applied in DeepHBV. (a) Scheme of encoding a 2 kb DNA sequence into a binary matrix using one-hot code; (b) A brief flowchart of the DeepHBV structure; matrix shapes are given in brackets, and a detailed flowchart is shown in S1 Fig.

Figure 2. Evaluation of DeepHBV and DeepHINT model prediction performance on the test dataset. (a) Receiver-operating characteristic (ROC) curves and (b) precision-recall (PR) curves, respectively.
“DeepHBV with HBV integration sequences” refers to the DeepHBV model with only HBV integration sequences as input; “DeepHINT with HBV integration sequences” refers to the DeepHINT model with only HBV integration sequences as input; “DeepHBV with HBV integration sequences + repeat” refers to DeepHBV with HBV integration sequences and repeat sequences as input; “DeepHBV with HBV integration sequences + TCGA Pan Cancer” refers to DeepHBV with HBV integration sequences and TCGA Pan Cancer sequences as input; “DeepHBV with HBV integration sequences + repeat + (test) VISDB” refers to DeepHBV using HBV integration sequences and repeat sequences for training and VISDB as an independent test dataset; “DeepHBV with HBV integration sequences + TCGA Pan Cancer + (test) VISDB” refers to DeepHBV using HBV integration sequences and TCGA Pan Cancer sequences for training and VISDB as an independent test dataset.

Figure 3. The attention weight distribution analysed by DeepHBV with HBV integration sequences + genomic features. (a) DeepHBV with HBV integration sequences + TCGA Pan Cancer peaks; (b) DeepHBV with HBV integration sequences + repeat peaks. The left graphs show the fractions of attention weight, which were averaged among all samples and normalized to the average of all positions; each index represents a 3 bp region due to the multiple convolution and pooling operations. The graphs on the right show representative attention weight distributions of positive and negative samples.

Figure 4.
Attention intensive regions highlighted essential local genomic features for predicting HBV integration sites. Representative examples show the positional relationship between the attention intensive sites and several genomic features using the DeepHBV with HBV integration sequences + TCGA Pan Cancer model on (a) chr5:1,294,063-1,296,063 (hg38) and (b) chr5:1,291,277-1,293,277 (hg38). Each of these two sequences contains HBV integration sites from both dsVIS and VISDB. Enriched DNA-binding proteins were detected by HOMER in the attention intensive regions output by DeepHBV; we then applied FIMO [1] to locate the enriched motifs and labelled them on the attention intensive regions. The UCSC genome browser [2] and Matplotlib [3] were used for visualisation. “HPV integration site” refers to the sites selected from our unpublished database used as testing samples. “Attention Intensive Sites” denotes the sites with the top 5% attention weight. “RepeatMasker”, “TCGA Pan Cancer”, “DNase Clusters”, “Con20mammals”, “GeneHancer”, “Layered H3K27ac” and “Layered H3K36me3” are genomic features.

References

1. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics 2011;27(7):1017-8.
2. Haeussler M, Zweig AS, Tyner C et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res 2019;47(D1):D853-D858.
3. Hunter JD. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 2007;9(3):90-95.
Supporting information

S1 Fig. DeepHBV framework. Each part represents a layer in the neural network, and $n \times n$ stands for the output dimension, which is explained in S2 Note. Two consecutive convolution layers were used to extract features; max-pooling layers reduce the dimension while preserving the predictive information in the feature matrix; the dropout layer randomly drops some results to prevent over-fitting; the flatten layer reduces the dimensions and connects them; the dense layer maps the output of the previous layer to a specific value; the attention layer and the attention flatten layer assign a weight score to each dimension of the feature matrix; the concatenate layer concatenates the captured features and the importance scores of those features from the convolution module and the attention module. The prediction output gives the final probability of HBV integration.

S2 Fig. Prediction performance on the HBV integration dataset with different types of genomic features added. Character 1 and character 3 outperformed the DeepHBV model, with a significant increase in AUPR and AUROC, indicating that DeepHBV can capture genomic features from character 1 and character 3 effectively. We therefore analysed each single item in character groups 1 and 3 further, and found that Repeats and TCGA Pan Cancer are the genomic features that DeepHBV can capture and that significantly improved model performance.
DeepHBV with HBV integration sequences + repeats reached an AUROC of 0.8378 and an AUPR of 0.7535, while DeepHBV with HBV integration sequences + TCGA Pan Cancer reached an AUROC of 0.9430 and an AUPR of 0.9310.

S1 Table. The parameters for the deep neural network used in DeepHBV.

S2 Table. Genomic features and sources. (Access date: November 16th, 2019)

S3 Table. Comparison of DeepHBV and DeepHINT result record.

S4 Table. Enriched TFBS from attention intensive regions of DeepHBV with HBV integration sites + repeat peaks.

S1 Note. DeepHBV framework. The DeepHBV neural network structure design and the hyperparameters involved in DeepHBV are noted.

S2 Note. Mathematical matters of DeepHBV. This part explains 8 mathematical matters of DeepHBV (i.e. encoding DNA sequences, convolution layers, the max-pooling layer, dropout layer, attention layer, concatenate layer, linear classifier and optimisation algorithm).
10_1101-2021_01_08_425885 ---- Computing the Riemannian curvature of image patch and single-cell RNA sequencing data manifolds using extrinsic differential geometry

Duluxan Sritharan∗1,2, Shu Wang∗1,3, and Sahand Hormoz†2,4,5

1Harvard Graduate Program in Biophysics, Harvard University, Cambridge, MA, USA
2Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
3Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
4Department of Systems Biology, Harvard Medical School, Boston, MA, USA
5Broad Institute of MIT and Harvard, Cambridge, MA, USA

Abstract

Most high-dimensional datasets are thought to be inherently low-dimensional, that is, datapoints are constrained to lie on a low-dimensional manifold embedded in a high-dimensional ambient space. Here we study the viability of two approaches from differential geometry to estimate the Riemannian curvature of these low-dimensional manifolds. The intrinsic approach relates curvature to the Laplace-Beltrami operator using the heat-trace expansion, and is agnostic to how a manifold is embedded in a high-dimensional space. The extrinsic approach relates the ambient coordinates of a manifold’s embedding to its curvature using the Second Fundamental Form and the Gauss-Codazzi equation. Keeping in mind practical constraints of real-world datasets, like small sample sizes and measurement noise, we found that, even for simple, low-dimensional toy manifolds, estimating curvature is only feasible when the extrinsic approach is used.
To test the applicability of the extrinsic approach to real-world data, we computed the curvature of a well-studied manifold of image patches, and recapitulated its topological classification as a Klein bottle. Lastly, we applied the approach to study single-cell transcriptomic sequencing (scRNAseq) datasets of blood, gastrulation, and brain cells, revealing for the first time the intrinsic curvature of scRNAseq manifolds.

∗Equal contribution
†To whom correspondence should be addressed (Sahand_Hormoz@hms.harvard.edu)

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.08.425885; this version posted January 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1 Introduction

High-dimensional biological datasets have become prevalent in recent decades because of new technologies such as high-throughput scRNAseq [1, 2, 3], mass cytometry [4, 5] and multiplex imaging [6, 7]. Interpretation and visualization of such high-dimensional datasets have been challenging however, prompting the development of tools for non-linear projection of datapoints onto 2 or 3 dimensions [8]. These tools, such as IsoMAP [9], t-SNE [10] and UMAP [11], appeal to the ansatz that datapoints in a high-dimensional ambient space are constrained to lie on a low-dimensional manifold. Unfortunately, determining the geometry of a low-dimensional manifold from these visualizations is difficult, since many geometric properties are lost after projecting onto 2 or 3 dimensions. For example, the cartographic projections used in an atlas to flatten Earth’s curved surface tear apart continuous neighborhoods and non-uniformly stretch distances.
Fortunately, topology and differential geometry provide a wealth of concepts to characterize a manifold’s shape directly without confounding projections. In particular, homology [12, 13] categorizes a manifold according to the number of holes it contains, and the dimensionality of each hole (whereas for example, the hole in a hollow sphere does not survive projection onto a 2-dimensional plane). Similarly, metrics [14] and geodesics [9] determine shortest-distance paths between pairs of points on a manifold without any distortion from a projection (whereas for example, most atlases exaggerate distances at the poles). Curvature [15] is a local manifold property that quantifies the extent to which a manifold deviates from the tangent plane at each point p. Projecting a manifold onto a plane for visualization destroys this property by definition. Recent methods have emerged for estimating homology [16, 17], metrics [14] and geodesics [18] from noisy, sampled data, with accompanying statistical guarantees [18, 19, 20]. These methods have been applied to analyze images [21, 22] and biological datasets [23, 24]. However, estimating curvature has received less attention although it is fundamental to quantifying geometry. Curvature arises from two sources. On the one hand, a manifold itself can be curved, resulting in Riemannian or intrinsic curvature. A sphere has intrinsic curvature because it cannot be flattened so that all geodesics on its surface correspond to straight lines on a Euclidean plane (see Figure 1A). On the other hand, the embedding of a manifold in an ambient space can give rise to extrinsic curvature, a property that is not inherent to the manifold itself. For example, a scroll has extrinsic curvature because it is formed by rolling a piece of parchment, but the parchment itself is not inherently curved (see Figure 1B). It is important to note that both types of curvature scale inversely with the global length scale (L) associated with a manifold. 
It is for this reason that a marble (L ≈ 1 cm) is visibly round, but the Earth (L ≈ 10,000 km) is still mistaken by some to be flat. Since intrinsic curvature is an inherent property of a manifold, while extrinsic curvature is incidental to an embedding, we will restrict our attention to the former.

[Figure 1 panels: (A) Intrinsic (Riemannian) Curvature; (B) Extrinsic Curvature; (C) Intrinsic Differential Geometry; (D) Extrinsic Differential Geometry, $z = \pm\sqrt{1 - x^2 - y^2}$]

Figure 1: Riemannian curvature is an intrinsic property of a manifold while extrinsic curvature depends on the embedding. (A) (Left) $N = 10^4$ points uniformly sampled from the 2-dimensional hollow unit sphere, $S^2$, embedded in the 3-dimensional ambient space $\mathbb{R}^3$, colored according to the z-coordinate. $S^2$ has Riemannian or intrinsic curvature because there is no projection onto 2-dimensional Euclidean space that preserves geodesic (shortest-path) distances. (Right) For example, a stereographic projection using the point z = (0, 0, 1) and the plane z = 0 introduces distortions since the geodesic distance between any pair of points in the lower hemisphere is (non-uniformly) larger than the Euclidean distance in this projection. (B) (Left) $N = 10^4$ points uniformly sampled from a scroll, which is also a 2-dimensional manifold embedded in $\mathbb{R}^3$. The scroll has extrinsic curvature because it curls away from the tangent plane at any point. (Right) However, it does not have intrinsic curvature, because it can be projected onto 2-dimensional Euclidean space in a way that preserves geodesic distances, by unfurling. (C) Intrinsic differential geometry treats manifolds as self-contained objects that can be described using only intrinsic coordinates, which do not depend on any embedding or ambient space. One possible set of intrinsic coordinates for $S^2$ are polar coordinates, where $\theta_1$ and $\theta_2$ are the azimuthal and elevation angles respectively. While this representation superficially resembles the unfurled scroll in (B), distances in this plane are non-Euclidean. Any line segment along $\theta_2 = \pm\frac{\pi}{2}$ has zero length for example. (D) Extrinsic differential geometry defines manifolds in the coordinate system of the ambient space, which requires a privileged vantage point off the manifold itself. Both intrinsic and extrinsic differential geometry can be used to compute intrinsic curvature, whereas only extrinsic differential geometry can be used to compute extrinsic curvature (as indicated by the black arrows).

A precise description of intrinsic curvature is provided by the Riemannian curvature tensor, $R^{l}_{kij}(p)$. For a given basis $\{v\}$, this tensor quantifies how much a vector initially pointing in direction $v_k$ is displaced in direction $v_l$ after parallel transport around an infinitesimal parallelogram defined by directions $v_i$ and $v_j$. The simplest intrinsic curvature descriptor is scalar curvature, $S(p)$, which is formed by contracting $R^{l}_{kij}(p)$ to a scalar quantity, as its name suggests.
When $S(p)$ is greater (less) than 0, the sum of the angles of a triangle formed by connecting three points near $p$ by geodesics is greater (less) than π. Likewise, when $S(p)$ is greater (less) than 0, a small ball centred at $p$ has a smaller (larger) volume than a ball of the same radius in Euclidean space. We furnish toy examples in the main text to provide stronger intuition for this quantity. In theory, intrinsic curvature can be equivalently computed using tools from either one of the two branches of differential geometry. Intrinsic differential geometry makes no recourse to an external vantage point off a manifold, just as the polygonal characters in Edwin Abbott’s classic Flatland [25] were confined to traversing in $\mathbb{R}^2$, and found the notion of $\mathbb{R}^3$ unfathomable. In this branch, a manifold is therefore represented in intrinsic coordinates, which are agnostic to any ambient space or embedding. A hollow sphere represented in polar coordinates and k-nearest neighbor (kNN) graph representations of a dataset, for instance, are in this spirit (see Figure 1C). Conversely, in extrinsic differential geometry, a manifold is treated as a surface embedded in an ambient space, and is represented in ambient coordinates (see Figure 1D). The surface of an organ is parameterized this way, for example, in a surgical robot suturing an incision. In this work, we explore two approaches for estimating intrinsic curvature based on these twin views, keeping in mind practical limitations of real-world datasets, which may be comprised of a relatively small number of noisy measurements. The first approach uses the Laplace-Beltrami operator, which is well-studied in previous applications of differential geometry to data analysis [14, 26, 27, 28, 29], and is theoretically appealing as an intrinsic quantity that is embedding-invariant.
However, we find that this approach cannot accurately estimate even average scalar curvature on the simplest of low-dimensional toy manifolds for small sample sizes, despite the history and ubiquity of the Laplace-Beltrami operator in geometric data analysis. Meanwhile, the second approach uses the Second Fundamental Form and the Gauss-Codazzi equation [15], identities that rely on information from the ambient space. We find that this extrinsic approach is not only more robust to small sample sizes and noise, but permits computation of the full Riemannian curvature tensor, though we focus on the scalar curvature for simplicity. Using these insights, we developed a software package to compute the scalar curvature (and associated uncertainty) at each sampled point on a manifold, and applied this tool to investigate the curvature of image and scRNAseq datasets.

2 Results

2.1 Estimators of the Laplace-Beltrami Operator Yield Inaccurate Scalar Curvatures

Intrinsic differential geometry treats a d-dimensional manifold, $M$, as a self-contained object and is agnostic to how $M$ may be represented in ambient coordinates due to any particular embedding (see Figure 1C). Conceptually, this is accomplished by only considering $M$ as a collection of local, overlapping neighborhoods. The geometry of these neighborhoods is encoded using tools such as the Laplace-Beltrami operator, $\Delta_M$, which captures diffusion dynamics across neighborhoods.
For most practical applications, we do not have direct access to M but instead to a finite number (N) of points sampled from M. For these cases, estimators of $\Delta_M$ are used instead. These estimators are well-studied [14, 26, 27, 28, 29], and the convergence rates of some have been characterized [30].

The scalar curvature averaged across M has a well-known connection to $\Delta_M$ via the heat-trace expansion [27, 31], which relates the eigenvalues, $\lambda_k$, of $\Delta_M$ to the geometry of M:

$$Z(t) \equiv \sum_{k=1}^{\infty} e^{-\lambda_k t} = (4\pi t)^{-d/2} \left( \sum_{i=0}^{n} c_i t^{i/2} + o\!\left(t^{(n+1)/2}\right) \right), \qquad \lambda_k \le \lambda_{k+1} \tag{1}$$

The first few coefficients, $c_i$, are given by [27]:

$$c_0 = \int_M dM, \qquad c_1 = -\frac{\sqrt{\pi}}{2} \int_{\partial M} d(\partial M), \qquad c_2 = \frac{1}{6} \int_M S \, dM - \frac{1}{6} \int_{\partial M} J \, d(\partial M) \tag{2}$$

where $\partial M$ is the boundary of the manifold and J is the mean curvature on $\partial M$. Recall that S is the point-wise scalar curvature. By inspection, $c_0$ is the volume, $c_1$ is proportional to the area, and $c_2$ is directly related to the average scalar curvature.

We reasoned that if the average scalar curvature cannot be accurately computed for a manifold with constant scalar curvature using these relations, then computing the point-wise scalar curvature for more complex manifolds is intractable. To investigate this, we considered the 2-dimensional hollow unit sphere, $S^2$, for which the true scalar curvature is S(p) = 2 ∀p ∈ M, and uniformly sampled $N = 10^4$ points to mirror the typical size of current scRNAseq datasets (see Figure 1A; Methods Section 4.4.1.1). Since common estimators of $\Delta_M$ only yield as many eigenvalues as datapoints (N), we cannot compute the infinite set of eigenvalues needed in Equation 1. Therefore, we introduced a truncated series with m eigenvalues, $z_m(x)$, where we have substituted $x = \sqrt{t}$ and divided through by the prefactor in the RHS of Equation 1 to isolate the $c_i$, following the approach in [27]:

$$z_m(x) = (4\pi)^{d/2} x^d \sum_{k=1}^{m} e^{-\lambda_k x^2} \tag{3}$$

The scalar curvature can then be approximated by fitting the truncated series, $z_m(x)$, to a second-order polynomial, $p_2(x)$, over intervals of small x:

$$z_m(x) \approx p_2(x), \qquad \text{where } p_2(x) = c_0 + c_1 x + c_2 x^2 \tag{4}$$

We estimated $\Delta_M$ using the N sampled points (see Methods Section 4.2.5), substituted the eigenvalues of the estimate into Equation 3, and numerically fit $z_m(x)$ to $p_2(x)$ (see Figure S1A-G; Methods Section 4.2.1). We obtained the scalar curvature by inspecting the resulting $c_2$ coefficient, and compared the result to the true value of 2. We found that the scalar curvature was always over-estimated (S > 3) regardless of m, the number of eigenvalues used in the truncated series (see Methods Section 4.2.3), or the choice of estimator for $\Delta_M$ (see Methods Section 4.2.5). We identified the poor convergence of the estimated eigenvalues of $\Delta_M$ as the source of error (see Methods Section 4.2.4) and found that at least $N \approx 10^7$ points are required to reduce the error to ±0.5, so that S ≈ 2.5 (see Figure S1H).

Therefore, despite the prevalence of the Laplace-Beltrami operator in geometric data analysis, our example shows that an intrinsic approach relying on the operator is not practical for computing scalar curvatures. Even for noise-free datapoints uniformly sampled from $S^2$, the sample size needed to compute average scalar curvature accurate to ±0.5 is several orders of magnitude greater than what is typically feasible in current scRNAseq experiments. Noise and non-uniform sampling would confound the issue further.
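As a sanity check on Equations 1 to 4, the fitting step can be tried with the exact spectrum of the unit 2-sphere in place of estimated eigenvalues. The sketch below is an illustration only and is not the authors' pipeline: it assumes the known eigenvalues l(l+1) with multiplicity 2l+1, and the cutoff `l_max` and the fitting window for x are hypothetical choices. For a closed manifold, Equation 2 gives the average scalar curvature as $6 c_2 / c_0$.

```python
import numpy as np

def truncated_heat_trace(x, l_max=500, d=2):
    """z_m(x) from Equation 3, using the exact spectrum of the unit 2-sphere.

    Assumption: eigenvalues l(l+1) with multiplicity 2l+1, l = 0..l_max.
    """
    l = np.arange(l_max + 1)
    eigenvalues = l * (l + 1.0)
    multiplicity = 2.0 * l + 1.0
    x = np.asarray(x, dtype=float)
    # Sum over eigenvalues for every x at once (rows: x values, columns: l).
    trace = np.sum(multiplicity * np.exp(-eigenvalues * x[:, None] ** 2), axis=1)
    return (4.0 * np.pi) ** (d / 2) * x ** d * trace

# Hypothetical fitting window of small x values.
x = np.linspace(0.05, 0.2, 200)
# np.polyfit returns the highest-order coefficient first.
c2, c1, c0 = np.polyfit(x, truncated_heat_trace(x), 2)

# No boundary on the sphere, so c2 = (1/6) * integral of S and c0 = volume.
avg_scalar_curvature = 6.0 * c2 / c0
print(c0, avg_scalar_curvature)  # c0 should be near 4*pi, the average near 2
```

With exact eigenvalues the fit recovers a value close to 2, which is consistent with the text's conclusion that the difficulty lies in the convergence of the estimated eigenvalues rather than in the expansion itself.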
Most importantly, we would eventually like to compute local values of S(p) ∀p ∈ M, but this approach failed to correctly recover even average scalar curvature, which one might have expected to be feasible. To find an alternative approach, we next considered tools from extrinsic differential geometry.

2.2 Curvature Can Be Computed Accurately Using the Second Fundamental Form

In extrinsic differential geometry, a manifold is described in the coordinates of the ambient space in which it is embedded, usually $\mathbb{R}^n$ (see Figure 1D). Since the shape of the sphere in Figure 1A is visually unambiguous to the eye (thanks to its extrinsic view from a vantage point off the manifold), we reasoned that an extrinsic approach would be more fruitful.

A d-dimensional manifold, M, embedded in $\mathbb{R}^n$ can be described at each point p in terms of a d-dimensional tangent space, $T_M(p)$, and an (n − d)-dimensional normal space, $N_M(p)$, as shown in Figure 2A. Given orthonormal bases for $T_M(p)$ and $N_M(p)$, points in the neighborhood of p can be expressed as $Y = [t_1, ..., t_d, n_1, ..., n_{n-d}]$ where $t_i$ is Y's coordinate along the i-th basis vector of $T_M(p)$ and $n_k$ is Y's coordinate along the k-th basis vector of $N_M(p)$. The $n_k$ can then be locally approximated as functions of the $t_i$, i.e. $n_k \approx f_k(t_1, ..., t_d)$, as shown in Figure 2B. The Riemannian curvature of M is related to the quadratic terms in the Taylor expansion of each $f_k$ with respect to the $t_i$.
Specifically, the Second Fundamental Form of M, $h^k_{ij}$, gives the second-order coefficient relating each $f_k$ to the quadratic term $t_i t_j$ [32]:

$$h^k_{ij}(p) = \left. \frac{\partial^2 f_k}{\partial t_i \, \partial t_j} \right|_p \tag{5}$$

The Riemannian curvature tensor is related to the Second Fundamental Form according to the Gauss-Codazzi equation [15]:

$$R_{ijkl} = \left( h^{\alpha}_{jk} h^{\beta}_{il} - h^{\beta}_{ji} h^{\alpha}_{kl} \right) g_{\alpha\beta} \tag{6}$$

where $g_{\alpha\beta}$ is the metric of the ambient space, which we take to be the usual Euclidean metric $\delta_{\alpha,\beta}$ going forward. The scalar curvature can be obtained by contracting the Riemannian curvature tensor:

$$S = \sum_{i,j} R_{ijij} \tag{7}$$

This suggests a conceptually simple procedure to estimate the scalar curvature of a data manifold at each point p: (i) estimate $T_M(p)$ and $N_M(p)$, (ii) determine $h^k_{ij}(p)$ in local coordinates, (iii) compute S using Equations 6 and 7.

We developed a computational tool that provides an implementation of this procedure. Briefly, given a set of datapoints $\{X\} \subset \mathbb{R}^n$ and manifold dimension d, a neighborhood around each point p is selected to be the n-dimensional ball centred on p of radius r encompassing $N_p(r)$ points (see Methods Section 4.3.2). For each point p, Principal Component Analysis (PCA) [33] is performed on the $N_p(r)$ points in its neighborhood, and the first d (last n − d) Principal Components (PCs) accounting for the most (least) variance are taken as an orthonormal basis for $T_M(p)$ ($N_M(p)$). The normal coordinates, $n_k$, of the $N_p(r)$ points in each neighborhood are fit by regression to a quadratic model in terms of the tangent coordinates, $t_i$, to obtain $h^k_{ij}(p)$ with associated uncertainties (see Figure 2B; Methods Section 4.3.1).

The choice of r(p) is an important one since it sets the length scale at which curvature is computed for point p (see Methods Section 4.3.5). Our tool allows interrogation of curvature at any length scale of interest by allowing the user to manually set r(p), a feature we use to inspect real-world datasets later in the paper. However, since the local geometry of the manifold may be non-trivial and unknown a priori, we also provide the ability to set r(p) according to statistical rather than geometric principles. Specifically, our tool algorithmically chooses r at each p so that the uncertainty in $h^k_{ij}(p)$ from regression is less than a user-specified global parameter, $\sigma_h$ (see Methods Section 4.3.2). Since a larger number of points reduces the uncertainty in regression, a smaller $\sigma_h$ requires a larger r(p) ∀p ∈ M. This strategy of setting $\sigma_h$ therefore allows neighborhood sizes to dynamically vary over the manifold based on the local density of the data, which means the algorithm can gracefully handle non-uniform sampling of the manifold. The choice of $\sigma_h$ will depend on the global length scale, L, of the datapoints (see Methods Section 4.3.5), the average density of sampled points, and of course, the desired uncertainty in the estimates of $h^k_{ij}$.

These uncertainties are in turn used to compute a standard error, $\sigma_S$, accompanying the scalar curvature estimate at each point, using standard error propagation formulas (see Methods Section 4.3.4). We specify $\sigma_h$ instead of $\sigma_S$ as the global parameter for choosing neighborhood sizes, since the latter depends non-linearly on the values of $h^k_{ij}(p)$, which makes determining r(p) more difficult.
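The three-step procedure described above (local PCA, quadratic regression, contraction) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions and not the authors' implementation: the function name `scalar_curvature_at` is hypothetical, the radius is fixed by hand rather than chosen from $\sigma_h$, no uncertainties are propagated, and step (iii) uses the standard contracted Gauss equation for a Euclidean ambient space, $S = \sum_k \left[ (\mathrm{tr}\, h^k)^2 - \lVert h^k \rVert_F^2 \right]$.

```python
import numpy as np

def scalar_curvature_at(points, p, d, radius):
    """Sketch: scalar curvature at p from a local quadratic fit.

    points: (N, n) samples; p: (n,) query point; d: manifold dimension;
    radius: neighborhood radius (hypothetical fixed value, not set via sigma_h).
    """
    nbrs = points[np.linalg.norm(points - p, axis=1) <= radius]
    # (i) Local PCA: rows of vt are principal directions, largest variance first.
    _, _, vt = np.linalg.svd(nbrs - nbrs.mean(axis=0), full_matrices=False)
    tangent, normal = vt[:d], vt[d:]
    # Coordinates of the neighbors relative to p in the two bases.
    t = (nbrs - p) @ tangent.T
    nrm = (nbrs - p) @ normal.T
    # (ii) Regress each normal coordinate on 1, t_i and t_i * t_j (i <= j).
    iu, ju = np.triu_indices(d)
    design = np.hstack([np.ones((len(t), 1)), t, t[:, iu] * t[:, ju]])
    coef, *_ = np.linalg.lstsq(design, nrm, rcond=None)
    quad = coef[1 + d:]  # one column of quadratic coefficients per normal direction
    # (iii) Contracted Gauss equation with a Euclidean ambient metric.
    S = 0.0
    for k in range(nrm.shape[1]):
        h = np.zeros((d, d))
        h[iu, ju] = quad[:, k]
        h = h + h.T  # Hessian: 2*c_ii on the diagonal, c_ij off the diagonal
        S += np.trace(h) ** 2 - np.sum(h * h)
    return S

# Toy check on the unit 2-sphere, where the true scalar curvature is 2.
rng = np.random.default_rng(0)
sphere = rng.normal(size=(10_000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
S_hat = scalar_curvature_at(sphere, sphere[0], d=2, radius=0.2)
print(S_hat)  # expected to be close to 2 for noise-free samples
```

A radius of 0.2 gives roughly a hundred neighbors for this sample size, comparable to the neighborhood sizes reported for the sphere in the text.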
Our algorithm also computes a goodness-of-fit (GOF) p-value at each p by comparing residuals from regression against a normal distribution to quantify how well the normal coordinates are fit by a quadratic function (see Methods Section 4.3.3). We tested this p-value at significance level α = 0.05, declaring fits to be poor when the residuals are significantly non-Gaussian. The p-value can be disregarded if the neighborhood size is manually specified to be larger than a length scale for which a quadratic fit is appropriate. However, when $\sigma_h$ is specified instead, a uniform distribution of these p-values over M indicates that the desired uncertainty results in neighborhoods that are well-approximated using quadratic regression. We adopted this heuristic when choosing $\sigma_h$ for the datasets studied in this paper (see Methods Sections 4.4.3, 4.5.7 and 4.6.4). The software is available at https://gitlab.com/hormozlab/ManifoldCurvature.

Figure 2: Scalar curvature is accurately estimated using the Second Fundamental Form and the Gauss-Codazzi equation. (A) A hypothetical manifold (shown in grey) from which datapoints are sampled (shown as colored dots). The manifold at any given point p (shown in red) can be decomposed into a tangent space $T_M(p)$ (the cyan plane) and a normal space $N_M(p)$ (the cyan line). Points in the neighborhood around p (shown in green) can be expressed in terms of orthonormal bases for $T_M(p)$ and $N_M(p)$ (see (B) below). (B) The set of points in the neighborhood of p (shown as green dots in (A)) are represented here in local tangent ($t_1$, $t_2$) and normal ($n_1$) coordinates, corresponding to orthonormal bases for $T_M(p)$ and $N_M(p)$ respectively. Coloring corresponds to magnitude in the normal direction. The normal coordinates ($n_1$) can be locally approximated as a quadratic function (the translucent surface) of the tangent coordinates ($t_1$, $t_2$), according to the Second Fundamental Form, $h^k_{ij}$. (C) Scalar curvatures computed using the extrinsic approach for $N = 10^4$ points uniformly sampled from the 2-dimensional hollow unit sphere, $S^2$. The true value is 2 at all points on the manifold. See Methods Section 4.4.1.1. (D) Scalar curvatures (S) computed in (C) are plotted against their associated standard errors ($\sigma_S$). Points enclosed by the red lines have a 95% confidence interval (CI), computed as $S \pm 2\sigma_S$, containing the true value of 2. (E) As in (C) but for $N = 10^4$ points uniformly sampled from a one-sheet hyperboloid, $H^2_2$, which is also a 2-dimensional manifold. Due to the radial symmetry of the manifold, scalar curvature varies only along the z-direction. See Methods Section 4.4.1.2. (F) Scalar curvatures (black) computed in (E) with their associated 95% CIs (shown in grey) plotted as a function of the z-coordinates of the datapoints. The true value is shown as a dashed red line. (G) As in (C) but for $N = 10^4$ points uniformly sampled from a 2-dimensional ring torus, $T^2$. $T^2$ is constructed by revolving a circle parameterized by θ, oriented perpendicular to the xy-plane, through an angle φ around the z-axis. The scalar curvature only depends on the value of θ. See Methods Section 4.4.1.3. (H) Scalar curvatures computed in (G) with their associated 95% CIs plotted as a function of the θ values of the datapoints. Colors as in (F). (I) Distribution of computed scalar curvatures for $N = 10^4$ points uniformly sampled from the d-dimensional unit hypersphere, $S^d$, for d = 2, 3, 5, 7. As with $S^2$, these manifolds are isotropic and have constant scalar curvature. The true values are shown as dashed red lines. See Methods Section 4.4.1.1.

We first applied our algorithm to compute scalar curvatures for the same $N = 10^4$ points uniformly sampled from $S^2$ for which the intrinsic approach failed (see Figure 2C; Methods Section 4.4.1.1). The algorithm yielded scalar curvature estimates at each point with mean error −0.17 (computed by averaging the difference between the point-wise scalar curvature estimates and the ground truth value of 2 across all points) using neighborhoods that only contained $N_p(r) \approx 10^2$ points. This is already superior to the intrinsic approach, which failed to compute even average scalar curvature accurate to ±1 for the same sample size. The non-zero value of the mean error indicates that our estimator is biased. The values of $h^k_{ij}$ are not biased because they are obtained using regression. Even so, the components of the Riemannian curvature tensor, $R_{ijkl}$, may still be biased because they are non-linear functions of $h^k_{ij}$. Note that for $S^2$, this bias is the same across all datapoints (because of the isotropic nature of the manifold) and therefore results in a systematic under-estimation of scalar curvature (see Figure 2C; Methods Section 4.3.4). We also computed 95% confidence intervals (CI) for our estimates as $S \pm 2\sigma_S$, and despite the mean error, 73% of points still reported a 95% CI containing the true value of 2 (see Figure 2D).
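The coverage statistic quoted above (the percentage of points whose 95% CI, $S \pm 2\sigma_S$, contains the true value) is simple to compute once point-wise estimates and standard errors are available. The helper below is a hypothetical sketch, and the three input values are made-up toy numbers rather than results from the paper.

```python
import numpy as np

def ci_coverage(S, sigma_S, truth, z=2.0):
    """Fraction of points whose interval S +/- z*sigma_S contains `truth`.

    `truth` may be a scalar (constant curvature, as on the sphere) or an
    array with one ground-truth value per point.
    """
    S = np.asarray(S, dtype=float)
    sigma_S = np.asarray(sigma_S, dtype=float)
    inside = np.abs(S - truth) <= z * sigma_S
    return inside.mean()

# Toy inputs (not from the paper): two of the three intervals contain 2.0.
coverage = ci_coverage([1.9, 2.5, 2.0], [0.1, 0.2, 0.05], truth=2.0)
print(coverage)
```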
We next tested our algorithm on a 2-dimensional manifold with negative scalar curvature, by uniformly sampling $N = 10^4$ points from the one-sheet hyperboloid, $H^2_2$ (see Figure 2E; Methods Section 4.4.1.2). Here, 71% of points reported a 95% CI containing the true scalar curvature (see Figure 2F).

Lastly, we considered the 2-dimensional ring torus, $T^2$ (see Figure 2G; Methods Section 4.4.1.3). As a manifold with regions of positive, zero, and negative scalar curvature, $T^2$ is a useful toy model for understanding more complex 2-dimensional manifolds and gaining intuition for higher-dimensional manifolds. In 2 dimensions, regions of a manifold with positive scalar curvature (θ = 0, 2π in Figure 2H) are dome-shaped, regions with zero scalar curvature (θ = π/2, 3π/2 in Figure 2H) are planar, and regions with negative scalar curvature (θ = π in Figure 2H) are saddle-shaped. We applied our tool to $N = 10^4$ points uniformly sampled from $T^2$ and found that 88% of points reported a 95% CI containing the true scalar curvature (see Figure 2H).

To test the applicability of our algorithm to higher-dimensional manifolds, we uniformly sampled $N = 10^4$ points from unit hyperspheres, $S^d$, and found that 90%, 84% and 78% of points reported a 95% CI containing the true scalar curvature for d = 3, 5 and 7 respectively (see Figure 2I; Methods Section 4.4.1.1). The number of terms, $h^k_{ij}$, in the Second Fundamental Form grows as $d^2$. For larger d, a greater number of datapoints and hence larger neighborhoods are needed for regression, but these are no longer well-approximated by quadratic fits according to our GOF measure. More generally, higher-dimensional manifolds require a higher density of data to estimate scalar curvatures accurately.
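For reference, the ground-truth curvatures of two of the toy manifolds above have simple closed forms. The sketch below uses the textbook results that a unit hypersphere $S^d$ has scalar curvature d(d − 1), and that a ring torus has scalar curvature $2\cos\theta / \left(r (R + r\cos\theta)\right)$, i.e. twice its Gaussian curvature. The radii `R` and `r` are hypothetical defaults, since the values used in the paper are given in its Methods and not in this section.

```python
import numpy as np

def hypersphere_scalar_curvature(d):
    """Scalar curvature of the unit d-sphere: d * (d - 1)."""
    return d * (d - 1)

def torus_scalar_curvature(theta, R=2.0, r=1.0):
    """Scalar curvature of a ring torus as a function of the tube angle theta.

    R is the distance from the z-axis to the centre of the tube and r is the
    tube radius (both hypothetical values here). theta = 0 is the outer equator.
    """
    theta = np.asarray(theta, dtype=float)
    return 2.0 * np.cos(theta) / (r * (R + r * np.cos(theta)))

print([hypersphere_scalar_curvature(d) for d in (2, 3, 5, 7)])
# Positive on the outside, zero at the top and bottom, negative on the inside.
print(torus_scalar_curvature([0.0, np.pi / 2, np.pi, 3 * np.pi / 2]))
```

The sign pattern matches the description of dome-shaped, planar and saddle-shaped regions at θ = 0, θ = π/2 or 3π/2, and θ = π respectively.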
We additionally characterized how our algorithm performed when datapoints were non-uniformly sampled (see Figure S2A; Methods Section 4.4.2.1) or corrupted by observational noise (see Figure S2B; Methods Section 4.4.2.2), when the dimension of the ambient space was large (see Figure S2C; Methods Section 4.4.2.3), and when the specified manifold dimension differed from the ground truth (see Figure S2D; Methods Section 4.4.2.4). We found that the algorithm is robust to non-uniform sampling, large ambient dimension and small observational noise, and provides signatures indicating when the manifold dimension may be mis-specified. However, when the noise scale is large, the resulting manifold is no longer trivially related to the noise-free manifold, consistent with existing literature [34, 35, 36, 37], so that scalar curvature cannot be accurately computed. Lastly, we note that since the full Riemannian curvature tensor is computed as an intermediate step in our algorithm, more intricate geometric features in the data can also be analyzed using our tool, though we defer such investigation to future studies.

Taken together, these examples demonstrate the utility of the algorithm in recovering curvature with specified uncertainties for manifolds with positive and/or negative scalar curvature. Next, we tested our algorithm on real-world data.

2.3 Curvature of Image Patch Manifold is Consistent with a Noisy Klein Bottle

Pixel intensity values in images of natural scenes are not independently or uniformly distributed.
Understanding the statistics of such images is important for designing compression algorithms [38] and for addressing challenges in the field of computer vision such as segmentation [39]. Lee et al. discovered that 3x3-pixel patches extracted from greyscale images of natural scenes, whose pixels have high contrast (i.e. the differences between the intensity values of adjacent pixels in a patch are large), are not uniformly distributed in $\mathbb{R}^9$, but are instead concentrated on a low-dimensional manifold [40]. This is because high-contrast regions in a natural scene usually correspond to the edges of objects in the scene. High-contrast image patches consequently tend to consist of gradients and not simply random speckle. Subsequent work using topological data analysis revealed that after appropriate normalization (which takes image patches from $\mathbb{R}^9$ to $S^7 \subset \mathbb{R}^8$, so that the global length scale is L = 1; see Methods Section 4.5.2), dense regions of high-contrast image patches have the same homology as a 2-dimensional manifold called a Klein bottle [21].

A Klein bottle, $K^2$, is a canonical manifold typically introduced in the context of orientability, where it is often visualized in $\mathbb{R}^3$ (as shown in Figure 3A) to highlight that it is non-orientable. From a topological perspective, $K^2$ is a manifold parameterized by θ, φ ∈ [0, 2π] as shown in Figure 3B, in which vertical edges are defined to be θ = 0 and θ = π, and horizontal edges are defined to be φ = 0 and φ = 2π. To make a closed surface, the vertical (horizontal) edges are glued together according to the red (blue) arrows in Figure 3B. $K^2$ is therefore 2π-periodic in φ, since a point corresponding to θ on the bottom horizontal edge (φ = 0) is the same as the point corresponding to θ on the top horizontal edge (φ = 2π). Similarly, a point corresponding to φ on the left vertical edge (θ = 0) is the same as the point corresponding to 2π − φ on the right vertical edge (θ = π).
In short, points on $K^2$ obey the similarity relation (θ, φ) ∼ (θ + π, 2π − φ). $K^2$ captures the dominant features in high-contrast image patches because θ can be treated as a parameter controlling rotation and φ as a parameter controlling the relative contribution of linear vs. quadratic gradients (see Figure 3B).

An embedding of $K^2$ into $\mathbb{R}^9$ with an analytical form, $k^0$, was proposed by Carlsson et al. in [21] to model image patches (see Equation 31 in Methods Section 4.5.3). This embedding takes points from (θ, φ) into image patches in $\mathbb{R}^9$ as shown in Figure 3B. For example, θ = 0 (θ = π/2) corresponds to patches with vertical (horizontal) stripes and φ = π/2, 3π/2 (φ = 0, π) corresponds to patches with linear (quadratic) gradients. As θ increases, stripes in the image patches are rotated clockwise. As φ increases, image patches oscillate between having quadratic and linear gradients. Importantly, the image patches constructed by this embedding obey the same similarity relation (θ, φ) ∼ (θ + π, 2π − φ) topologically required of a Klein bottle.

Whereas Carlsson et al. studied the global topology of image patches using this embedding, here we study their local geometry instead. First, we analytically calculated the scalar curvature of $k^0$ as a function of (θ, φ) as shown in Figure 3C (see Methods Section 4.1). Next, we used our algorithm to compute the scalar curvature on a data manifold of $N \approx 4.2 \times 10^5$ high-contrast 3x3-pixel image patches randomly sampled from the same van Hateren dataset used to propose $k^0$ (see Methods Section 4.5.2).
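The similarity relation can be checked numerically on a simple polynomial family of patches. The exact form of the embedding (Equation 31 in the Methods) is not reproduced in this section, so the function below is an assumption based on the description above: a 3x3 patch obtained by evaluating cos(φ)·u² + sin(φ)·u with u = x·cos(θ) + y·sin(θ) on a grid, without any normalization. The name `klein_patch` is hypothetical.

```python
import numpy as np

def klein_patch(theta, phi):
    """3x3 patch from cos(phi)*u**2 + sin(phi)*u, with u = a*x + b*y.

    Assumed form for illustration only; any normalization applied in the
    paper's embedding is omitted.
    """
    a, b = np.cos(theta), np.sin(theta)
    grid = np.array([-1.0, 0.0, 1.0])
    x, y = np.meshgrid(grid, grid)  # x varies along columns, y along rows
    u = a * x + b * y
    return np.cos(phi) * u ** 2 + np.sin(phi) * u

theta, phi = 0.3, 1.1
same_patch = np.allclose(klein_patch(theta, phi),
                         klein_patch(theta + np.pi, 2 * np.pi - phi))
print(same_patch)  # the two parameter pairs give the same patch

# theta = 0 with phi = pi/2 gives a purely linear gradient in x, so every
# row is identical and the patch consists of vertical stripes.
linear_vertical = klein_patch(0.0, np.pi / 2)
```

Under this assumed form, shifting θ by π flips the sign of u and replacing φ by 2π − φ flips the sign of the linear coefficient, so the two changes cancel and the patch is unchanged.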
We picked $\sigma_h$ so that the distribution of GOF p-values was flat, and fixed this value for all subsequent simulations (see Methods Section 4.5.7). To visualize the results, we associated each image patch to its closest point on $k^0$ (see Methods Section 4.5.4), and plotted the scalar curvatures on the resulting $(\theta_0, \phi_0)$ coordinates (see Figure 3D). Most image patches map to φ = π/2, 3π/2 or θ = 0, π/2 because linear gradients (of any orientation) and quadratic gradients that are vertically or horizontally oriented are the dominant features in the data as previously reported [21, 40].

The scalar curvatures computed for the image patches did not match the analytical scalar curvature of $k^0$ (cf. Figures 3C and 3D). To identify the cause of this discrepancy, we first validated our algorithm by computing scalar curvatures on the set of $N \approx 4.2 \times 10^5$ $(\theta_0, \phi_0)$ points on $k^0$ associated with the image patches (see Figure 3E); we found close agreement with the analytical calculation (75% of points reported a 95% CI containing the true scalar curvature). Next, observing that the neighborhood sizes used for computing the scalar curvature of image patches were larger than those used for computing the scalar curvature of the associated $(\theta_0, \phi_0)$ points (cf. Figures S3A and S3B), we recomputed the scalar curvatures of these $(\theta_0, \phi_0)$ points, but now with the same neighborhood sizes used for the image patches. The results agreed with the analytical calculation, but still did not match the scalar curvatures computed for the image patches (see Figure S3C).

Having ruled out these two possibilities, we hypothesized that the discrepancy was caused by fluctuations in the positions of the image patches with respect to the $(\theta_0, \phi_0)$ points on the $k^0$ manifold (real image patches are noisy and the Klein bottle embedding is only an idealization).
We found that adding isotropic Gaussian noise of increasing magnitude in $\mathbb{R}^9$ to the set of $(\theta_0, \phi_0)$ points on $k^0$ indeed resulted in scalar curvatures that resemble the data (see Figure 3F; Methods Section 4.5.6). The best agreement between the scalar curvatures of the image patches and the noisy $(\theta_0, \phi_0)$ points was achieved when the magnitude of noise was σ = 0.03. Notably, in this case, the median Euclidean distance of the noisy $(\theta_0, \phi_0)$ points to $k^0$ was 0.132, which is comparable to 0.148, the median Euclidean distance of the image patches to $k^0$ (see Figure 3G). Furthermore, the neighborhood sizes chosen by our algorithm when σ = 0.03 (see Figure S3A) matched those chosen for the image patches (see Figure S3B).

Figure 3: Scalar curvature computed for image patches is consistent with that of a Klein bottle with added isotropic Gaussian noise. (A) The Klein bottle, $K^2$, is a 2-dimensional manifold shown here in $\mathbb{R}^3$. (B) $k^0$ is an analytical embedding given by Carlsson et al. in [21] relating parameter values θ, φ ∈ [0, 2π] to 3x3-pixel patches of greyscale images (see Equation 31 in Methods Section 4.5.3). θ controls the rotation of stripes in the image patches and φ determines the relative contribution of linear vs. quadratic gradients. Importantly, as shown in the figure, this embedding has boundary conditions consistent with the topology of a Klein bottle (depicted by the blue/red arrows). In particular, the embedding produces image patches that obey the similarity relation (θ, φ) ∼ (θ + π, 2π − φ). Adapted from Figure 6 of [21]. (C) The analytical scalar curvature of $k^0$ (derived as described in Methods Section 4.1). (D) Scalar curvatures computed for $N \approx 4.2 \times 10^5$ high-contrast 3x3-pixel patches sampled from the greyscale images in the van Hateren dataset [41] are plotted here as a function of $(\theta_0, \phi_0)$, the parameter values of the closest point on $k^0$ associated with each image patch (see Methods Section 4.5.4). (E) Scalar curvatures computed for the set of $N \approx 4.2 \times 10^5$ closest points on $k^0$ associated with the image patches. Note the close correspondence with Figure 3C, indicating that our algorithm correctly recapitulates the analytical scalar curvature. (F) As in (E), but after adding isotropic Gaussian noise in $\mathbb{R}^9$ to the set of closest points on $k^0$ (see Methods Section 4.5.6). Left to right corresponds to increasing levels of noise, σ = 0.007, 0.01, 0.03. (G) The distribution of Euclidean distances in $\mathbb{R}^8$ between each image patch and its closest point on $k^0$ is shown in blue. The distribution of distances to $k^0$ after adding Gaussian noise to these closest points on $k^0$ is also shown. (H) $k^1$ is the analytical embedding from θ, φ ∈ [0, 2π] to $\mathbb{R}^9$ that minimizes the sum of Euclidean distances from the image patches to the closest point on the embedding (see Methods Section 4.5.5). Each of the $N \approx 4.2 \times 10^5$ image patches was associated to its closest point on $k^1$, given by parameter values $(\theta_1, \phi_1)$ (see Methods Section 4.5.4). Scalar curvatures computed on this set of $N \approx 4.2 \times 10^5$ points on $k^1$ are shown. (I) The same scalar curvatures computed for the image patches and visualized on $(\theta_0, \phi_0)$ coordinates in (D), are shown here plotted on $(\theta_1, \phi_1)$ coordinates. (J) Scalar curvatures computed for a densely sampled manifold consisting of the full set of $N \approx 1.3 \times 10^8$ high-contrast 3x3-pixel image patches in the van Hateren image dataset (see Methods Section 4.5.2), visualized on $(\theta_1, \phi_1)$ coordinates.

To find an embedding of the Klein bottle that might better explain the scalar curvature of the image patches without needing to add noise, we incorporated higher-order terms to $k^0$ (see Methods Section 4.5.3). The coefficients for the higher-order terms were determined by fitting the data, resulting in a new embedding, which we refer to as $k^1$ (see Methods Section 4.5.5). The median Euclidean distance of the image patches to $k^1$ was 0.115 versus 0.148 to $k^0$. As was done for $k^0$, we associated each image patch to its closest point $(\theta_1, \phi_1)$ on $k^1$, and used our algorithm to compute the scalar curvature of these $(\theta_1, \phi_1)$ points (see Figure 3H). Despite the reduction in the median Euclidean distance of image patches to the embedding, the scalar curvature of $k^1$ was even less similar to that of the image patches (visualized in Figure 3I on these new $(\theta_1, \phi_1)$ coordinates for $k^1$) than was the scalar curvature of $k^0$; the range of scalar curvature values for $k^1$ was much larger than for either the image patches or $k^0$, and the scalar curvature fluctuates on smaller length scales.

Lastly, we reasoned that there might be fine-scale scalar curvature fluctuations in the image patches that are masked by the larger neighborhood sizes used to compute scalar curvature for the image patches (see Figure S3B) relative to $k^1$ (see Figure S3D).
To decrease the neighborhood sizes chosen by the algorithm for the same $\sigma_h$, we augmented the image patch dataset using the full set of $N \approx 1.3 \times 10^8$ datapoints from the van Hateren dataset (see Methods Section 4.5.2). This resulted in neighborhood sizes comparable to those determined for $k^1$ (cf. Figures S3D and S3E), but failed to recapitulate the fine-scale scalar curvature fluctuations observed in $k^1$ (see Figure 3J). As a sanity check, we confirmed that the scalar curvature of the augmented image patch dataset matched that of the original image patch dataset, when computed using the same neighborhood sizes as the latter (see Figure S3F). Therefore, including higher-order terms in the embedding does not yield scalar curvatures that better agree with the data. Taken together, our analysis of curvature suggests that the image patch dataset can be best modelled by adding noise to the simplest embedding, $k^0$.

Having applied our algorithm on real-world manifold-valued data that is well-modelled by an analytical embedding, we next turned our attention to scRNAseq datasets, which are generally regarded as low-dimensional manifolds and have no known analytical form.

2.4 scRNAseq Datasets have Non-Trivial Intrinsic Curvature

In scRNAseq datasets, each datapoint corresponds to a cell, and each coordinate to the abundance of a different gene. Here we consider the data manifold after basic preprocessing and linear dimensionality reduction using PCA (see Methods Section 4.6.1).
Since many common analyses in the field such as clustering, visualization, and inference of cell differentiation trajectories are performed in this reduced space, it is natural to compute curvature in this space as well. We set the ambient dimension, $n$, to be the number of PCs needed to explain 80% of the variance. The manifold dimension, $d$, for scRNAseq datasets is not well-defined and needs to be chosen heuristically. As a simple heuristic, we specified $d$ as the number of PCs needed to explain 80% of the variance in the ambient space, i.e. 64% of the original variance (we show later that our computations are relatively insensitive to the choice of $d$).

We considered three datasets. The first consists of $N \approx 10^4$ peripheral blood mononuclear cells (PBMCs) collected from a healthy human donor [42]. The second is a gastrulation dataset comprising $N \approx 1.2 \times 10^5$ cells pooled from 9 embryonic mice sacrificed at 6-hour intervals from embryonic day 6.5 to 8.5 [43]. The final dataset is a benchmark in the field consisting of $N \approx 1.3 \times 10^6$ brain cells pooled from 2 embryonic mice sacrificed at embryonic day 18 [44]. Refer to Figures S4A, S5A and S6A for cell type annotations for the three datasets. The PBMC dataset is characteristic of the sample size of current scRNAseq data. The other two are larger than most scRNAseq datasets, and we included these to verify whether geometric features seen in the first dataset can be reproduced for more densely sampled manifolds. For the PBMC, gastrulation and brain datasets, the ambient (manifold) dimensions were determined to be 8, 11 and 9 (3, 3 and 5) respectively, according to the aforementioned heuristic (see Methods Section 4.6.4). For all three datasets, the global length scale happened to be $L \approx 20$ (see Methods Section 4.3.5). As before, we picked $\sigma_h$ for each dataset according to the distribution of GOF p-values (see Figures S4B, S5B and S6B; Methods Section 4.6.4).
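The heuristic for choosing $n$ and $d$ described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names and the example variance spectrum are hypothetical, and the per-PC variances are assumed to be available already and sorted in decreasing order.

```python
from itertools import accumulate


def components_for_variance(variances, fraction):
    """Smallest number of leading PCs whose summed variance reaches
    `fraction` of the total variance in `variances`."""
    total = sum(variances)
    for k, cumulative in enumerate(accumulate(variances), start=1):
        if cumulative >= fraction * total:
            return k
    return len(variances)


def choose_dimensions(pc_variances, fraction=0.8):
    """Return (n, d): n PCs explain `fraction` of all variance, and d PCs
    explain `fraction` of the variance retained in those n PCs."""
    n = components_for_variance(pc_variances, fraction)
    d = components_for_variance(pc_variances[:n], fraction)
    return n, d


# Hypothetical variance spectrum (arbitrary units), for illustration only.
pc_variances = [40, 20, 10, 8, 6, 5, 4, 3, 2, 2]
n, d = choose_dimensions(pc_variances)
print(n, d)  # prints: 5 3
```

In this toy spectrum the first 5 PCs carry 84% of the variance, and the first 3 of those carry 70 of the 84 retained units, which is above the 80% threshold of the ambient variance.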
We visualized the computed scalar curvatures on standard plots employed in the field (UMAP and t-SNE; shown in Figure 4A,D,G) and observed non-trivial scalar curvature for all three datasets. We found statistically significant correlations between the scalar curvature reported by each point and its kNN for $k \le 250$ ($\rho_{\mathrm{Pearson}}$ = 0.58, 0.18 and 0.38 for the PBMC, gastrulation and brain datasets respectively at $k = 250$, $p < 10^{-6}$; see Figures S4C, S5C and S6C), indicating that our algorithm yields scalar curvatures that vary continuously over the data manifolds. By plotting scalar curvatures against their standard errors, $\sigma_S$, we verified that regions with non-zero scalar curvature are statistically significant (see Figure 4B,E,H). As a consistency check, we confirmed that the percentage of points with 95% CIs containing the scalar curvatures reported by their respective kNNs (i) decayed with increasing $k$ for $k \le 250$, and (ii) was significantly larger than expected by chance (67%, 72% and 61% for the PBMC, gastrulation and brain datasets respectively at $k = 250$, $p < 0.001$; see Figures S4D, S5D and S6D; Methods Section 4.6.3.1).

To rule out the possibility that localization of non-zero scalar curvature in certain regions of the UMAP/t-SNE plots is an artifact caused by other features of the data that are also localized, we considered several factors.
First, we plotted the GOF p-value at each point on UMAP/t-SNE coordinates and noted that poor GOFs were not localized on the data manifolds, let alone to regions of non-zero scalar curvature (see Figures S4B, S5B and S6B). Therefore, the computed scalar curvatures are not due to poor fits. Next, we plotted the neighborhood size, $r(p)$, used for fitting and observed that in some regions, non-zero scalar curvatures seemed to correspond to small $r$ (see Figures S4E, S5E and S6E). Since $\sigma_h$ is fixed, these regions necessarily have a larger number of neighbors $N_p(r)$ and are hence more dense (see Figures S4F, S5F and S6F). To rule out the possibility that the non-zero scalar curvatures were an artifact of smaller neighborhood size, we recomputed the scalar curvature at three fixed neighborhood sizes (see Figure 4C,F,I), corresponding to the 25th, 50th and 75th percentile values of $r(p)$ which arose from setting $\sigma_h$ (see Figures S4E, S5E and S6E). In general, the scalar curvatures decreased in magnitude when neighborhood sizes increased. However, regions which had statistically significant non-zero scalar curvatures (zero falls outside of the 95% CI) using variable neighborhood sizes also had non-zero scalar curvatures for all three fixed neighborhood sizes. Additionally, statistically significant non-zero scalar curvature also emerged on other parts of the manifolds when using small fixed neighborhood sizes. These regions are therefore curved at small length scales but do not have a sufficient density of points to resolve curvature to the desired uncertainty $\sigma_h$ (see Methods Section 4.3.5). This is analogous to the image patch dataset, for which we could resolve scalar curvatures of larger magnitude at a smaller length scale when the dataset was augmented with enough points to attain smaller neighborhood sizes for a fixed $\sigma_h$. We also checked how computed scalar curvatures changed with density in a toy model with zero scalar curvature.
Importantly, we did not observe the artifactual appearance of statistically significant non-zero scalar curvature, for either variable neighborhood sizes chosen by the algorithm to achieve $\sigma_h$, or for fixed neighborhood sizes (see Figure S2A; Methods Section 4.4.2.1). Taken together, although higher density allows us to resolve statistically significant non-zero scalar curvatures in scRNAseq data, these computed scalar curvatures are not an artifact of the smaller neighborhood sizes used in regions with higher density.

To ensure that the computed scalar curvatures were not sensitively dependent on the heuristically chosen manifold dimension, $d$, we also recomputed scalar curvatures for $d - 1$ and $d + 1$ and observed similar qualitative results (see Figures S4G, S5G and S6G). Lastly, we verified that the computed scalar curvatures were not correlated with the number of transcripts in each cell (see Figures S4H, S5H and S6H).

Figure 4: scRNAseq datasets have localized regions of non-zero scalar curvature. (A) Scalar curvatures were computed for a scRNAseq dataset with $N \approx 10^4$ peripheral blood mononuclear cells (PBMCs) collected from a healthy human donor. The ambient ($n$) and manifold ($d$) dimensions were specified to be 8 and 3 respectively and variable neighborhood sizes were chosen by setting $\sigma_h$ (see Methods Section 4.6.4). The scalar curvatures are shown here overlaid onto UMAP coordinates, after smoothing the values over $k = 250$ nearest neighbors in the ambient space.
(B) Scatter plot of (unsmoothed) scalar curvatures, $S$, and associated standard errors, $\sigma_S$, for each datapoint in the PBMC dataset. Points enclosed by the red lines reported a 95% CI ($S \pm 2\sigma_S$) including 0. (C) As in (A) but with scalar curvatures computed using a fixed neighborhood size, $r$, for all datapoints. The value of $r$ was set to be the 25th, 50th and 75th percentile values (left to right) of the neighborhood sizes used in (A) (see Figure S4E). Points for which a neighborhood of size $r$ does not include enough neighbors for regression are not shown. (D-F) As in (A-C) for a mouse gastrulation dataset with $N \approx 1.2 \times 10^5$, $d = 3$ and $n = 11$. (G-I) As in (A-C) for a mouse brain dataset with $N \approx 1.3 \times 10^6$, $d = 5$ and $n = 9$, plotted on t-SNE coordinates.

To confirm the robustness of our results to sampling, we randomly discarded $f$% of points in the ambient space determined for each dataset, and recomputed scalar curvatures using the same values of $n$, $d$ and $r(p)$ used for the original dataset. We found that a statistically significant percentage of downsampled points (82% for the PBMC dataset with $f = 75$, 78% for the gastrulation dataset with $f = 75$, and 76% for the brain dataset with $f = 50$; $p < 0.001$) had a 95% CI containing the scalar curvature reported by the same point for the original dataset (see Figures S4I, S5I and S6I; Methods Section 4.6.3.2).
This suggests that if the datasets were more highly sampled, and scalar curvatures were recomputed using the same neighborhood sizes, they would be reliably contained within the currently reported 95% CIs. Unlike the two other datasets, the brain dataset could not be downsampled to $f = 75$ while still having at least 75% of points report 95% CIs containing the originally reported scalar curvatures, despite having the most points. This might be because the brain dataset has a larger manifold dimension according to our heuristic and therefore requires a greater number of terms, $h^k_{ij}$, to be estimated in the Second Fundamental Form.

For the PBMC dataset, we additionally downsampled the single-cell count matrix by discarding $f$% of transcripts at random and preprocessing the same way. We recomputed scalar curvatures for this downsampled dataset with the same $n$, $d$ and $r(p)$ values used for the original dataset. Here too, we found that when $f = 50$ ($f = 75$), 70% (65%) of the downsampled points had a 95% CI containing the originally reported scalar curvature ($p < 0.001$, see Figure S4J; Methods Section 4.6.3.3). Therefore, the computed scalar curvature is robust to changes in capture efficiency and sequencing depth. Taken together, our computational analysis reveals non-trivial intrinsic geometry in scRNAseq data.

3 Discussion

In this study, we explored two approaches to computing the curvature of data manifolds using tools from twin branches of differential geometry. Despite the prevalence of the Laplace-Beltrami operator in geometric data analysis [14, 26, 27, 28, 29], an intrinsic approach to computing scalar curvature relying on this operator's eigenvalues was determined to be infeasible for sample sizes of $N \approx 10^4$ typical of current scRNAseq datasets.
Although methods such as MAGIC [45] and diffusion pseudotime [46] apply the Laplace-Beltrami operator to smooth scRNAseq data and infer cell differentiation trajectories respectively, using information intrinsic to the manifold, our results suggest that the embedding of the manifold in the ambient space provides valuable information necessary for estimating the intrinsic curvature. This observation is perhaps implicit in recent tools for estimating the Laplace-Beltrami operator, which first use moving local least-squares to approximate a surface, thereby incorporating information from the ambient space [29].

Certainly, we found that an extrinsic approach in which the embedding is retained, and curvature is determined by local quadratic fitting of datapoints in ambient coordinates, is feasible given the sample size and degree of noise in real-world datasets. To obtain the scalar curvature of data manifolds, our algorithm first computes the full Riemannian curvature tensor. For other applications, this tensor can be used to compute other geometric quantities, such as Ricci curvature, or may itself be of interest. More generally, we focused on intrinsic curvature because we were interested in geometric properties of the manifolds independent of their embeddings. However, the Second Fundamental Form used in our approach to compute the intrinsic curvature can be used to obtain all the information about the extrinsic curvature as well.
Indeed, $h^k_{ij}(p)$ exactly quantifies the extent to which the manifold deviates in the $k$th normal direction from the $ij$-tangent plane at point $p$.

A key limitation of our algorithm is that the manifold dimension must be specified by the user. We also assumed that the manifold dimension is the same at every point in a dataset. Extending the algorithm to determine the manifold dimension from the data itself, potentially in a position-dependent manner, may prove useful. In addition, there is no inherently correct length scale over which curvature should be computed for a data manifold. Our algorithm chooses a length scale that varies from one part of the data manifold to another according to the density of points, and is tuned to achieve a user-specified level of uncertainty in the computed curvature. For some applications, it might be more sensible to fix a desired length scale for computing the curvature.

As a demonstration of our algorithm, we computed the scalar curvature of image patches, and found that it was consistent with that of a Klein bottle. This observation further validates the claim by Carlsson et al., who showed that image patches have the topology of a Klein bottle [21]. Unlike the Klein bottle parameterization of image patches, however, no definitive analytical form has been established for scRNAseq datasets. Recent work has suggested the use of hyperbolic geometry to model branching cell differentiation trajectories [47] and specific manifolds have been proposed to model reaction networks [48], which may be applicable to scRNAseq data. These proposed manifolds can be validated or improved using knowledge of the intrinsic geometry of scRNAseq datasets. Finally, incorporating information about curvature may provide a more principled approach for developing dimensionality reduction and visualization tools.
4 Methods

4.1 Differential Geometry of Theoretical Manifolds

Here we briefly discuss how to compute the scalar curvature of, and sample from, theoretical manifolds given a parameterization. For a $d$-dimensional manifold, $M$, with intrinsic coordinates $\{x_1, \dots, x_d\}$ and embedding in $\mathbb{R}^n$ given by $f(x_1, \dots, x_d)$, the metric is:

$$g_{ij} = \frac{\partial f^{T}}{\partial x_i}\frac{\partial f}{\partial x_j} \tag{8}$$

The scalar curvature of $M$ can then be derived analytically in intrinsic coordinates in terms of the metric as

$$S = g^{ij}\left(\Gamma^{k}_{ij,k} - \Gamma^{k}_{ik,j} + \Gamma^{l}_{ij}\Gamma^{k}_{kl} - \Gamma^{l}_{ik}\Gamma^{k}_{jl}\right) \tag{9}$$

where the $\Gamma^{i}_{jk}$ are Christoffel symbols given by

$$\Gamma^{i}_{jk} = \frac{g^{il}}{2}\left(\frac{\partial g_{lj}}{\partial x_k} + \frac{\partial g_{lk}}{\partial x_j} - \frac{\partial g_{jk}}{\partial x_l}\right) \tag{10}$$

and $\Gamma^{i}_{jk,l} = \frac{\partial \Gamma^{i}_{jk}}{\partial x_l}$.

To draw points from $M$ with $a_i \le x_i \le b_i$ so that the embedded manifold is uniformly sampled in $\mathbb{R}^n$, we use rejection sampling. For paired random variables $x \sim \mathrm{Uniform}(a, b)$ and $y \sim \mathrm{Uniform}(0, \max \sqrt{\det g})$, we retain $x$ as a sample point if $y \le \sqrt{\det g}\,\big|_{x}$, so that points are accepted with probability proportional to the local volume element.

4.2 Details of Intrinsic Approach to Curvature Estimation

Here we explain how we used Equations 2-4 on the simplest of toy manifolds, the noise-free 2-dimensional hollow unit sphere, $S^2$, to obtain an estimate of the average scalar curvature. The true scalar curvature is $S(p) = 2 \;\forall p \in M$. For the remainder of this section, we adopt the convention that symbols with overbars are estimates of the corresponding unaccented quantities.
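The rejection-sampling step of Section 4.1 can be illustrated on the unit sphere. The sketch below is a minimal example under stated assumptions, not the authors' implementation: it assumes the standard spherical parameterization $f(\theta, \phi) = (\sin\theta\cos\phi, \sin\theta\sin\phi, \cos\theta)$, for which $\sqrt{\det g} = \sin\theta$ with maximum 1, and the function name `sample_sphere` is hypothetical.

```python
import math
import random


def sample_sphere(num_points, rng):
    """Rejection-sample points uniformly (with respect to surface area) from
    the unit sphere S^2 in R^3, using intrinsic coordinates (theta, phi)."""
    max_sqrt_det_g = 1.0  # max of sqrt(det g) = sin(theta) on [0, pi]
    points = []
    while len(points) < num_points:
        theta = rng.uniform(0.0, math.pi)       # x ~ Uniform(a, b)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        y = rng.uniform(0.0, max_sqrt_det_g)    # y ~ Uniform(0, max sqrt(det g))
        if y <= math.sin(theta):                # accept in proportion to the volume element
            points.append((
                math.sin(theta) * math.cos(phi),
                math.sin(theta) * math.sin(phi),
                math.cos(theta),
            ))
    return points


rng = random.Random(0)
pts = sample_sphere(20000, rng)
z = [p[2] for p in pts]
mean_z = sum(z) / len(z)
var_z = sum((v - mean_z) ** 2 for v in z) / len(z)
print(mean_z, var_z)  # for a uniform sphere, z is uniform on [-1, 1]: mean 0, variance 1/3
```

Checking the distribution of the height coordinate is a convenient sanity test here, since uniform sampling of the sphere implies that $z = \cos\theta$ is uniformly distributed on $[-1, 1]$.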
4.2.1 Approach for $S^2$

Our approach mirrors the treatment in [27], in which heat-traces are fit over various intervals $[x_1, x_2]$ with $x_1 \ge 0$, to quadratic polynomials $p_2(x) = \bar{c}_0 + \bar{c}_1 x + \bar{c}_2 x^2$ to estimate the geometric quantities in Equation 2. Here, we constrained the form of $p_2(x)$ for fitting by assuming that (i) the manifold is boundary-less (so that $c_1 = \bar{c}_1 = 0$ and the second boundary term for $c_2$ vanishes), (ii) the volume is known (so that $c_0 = \bar{c}_0 = 4\pi$), and (iii) the scalar curvature is constant (so that $c_2 = \frac{2\pi}{3} S$), yielding $p_2(x) = 4\pi + \frac{2\pi}{3}\bar{S}x^2$. These are strong assumptions that will not hold for an arbitrary manifold, which already precludes this as a generic procedure. Nonetheless, we proceeded for $S^2$ to see if even with this privileged information, the scalar curvature could be estimated accurately. We declared an estimate to be accurate on the interval $[x_1, x_2]$ if $\bar{S}$ has error within $\pm 0.5$, i.e. $\bar{S} \in [1.5, 2.5]$. All quadratic fits were performed in MATLAB using the lsqnonlin function ('StepTolerance'=1e-3, 'FunctionTolerance'=1e-6).

First, we evaluated $z_m(x)$ using analytical eigenvalues for $S^2$ given by $\lambda_{(\ell-1)^2+1}, \dots, \lambda_{\ell^2} = \ell(\ell - 1)$, $\ell > 0$, and let $D_m$ be the collection of all intervals for which fits to $p_2(x)$ yielded accurate $\bar{S}$. $D_m$ corresponds to intervals where Equation 4 is accurate to our desired tolerance when the eigenvalues are known exactly.
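To make the quantities in this subsection concrete, the sketch below evaluates a truncated heat-trace series from the analytical eigenvalues of $S^2$ and compares it with the constrained quadratic $p_2(x)$ at the true curvature $S = 2$. It is a hedged illustration only: Equation 3 is not reproduced in this section, so the code assumes it has the same form as Equation 11, $z_m(x) = (4\pi)^{d/2} x^d \sum_{k=1}^{m} e^{-\lambda_k x^2}$, and the function names are hypothetical.

```python
import math


def sphere_eigenvalues(m):
    """First m Laplace-Beltrami eigenvalues of the unit S^2:
    l(l - 1) with multiplicity 2l - 1 for l = 1, 2, ..."""
    eigenvalues = []
    l = 1
    while len(eigenvalues) < m:
        eigenvalues.extend([l * (l - 1)] * (2 * l - 1))
        l += 1
    return eigenvalues[:m]


def z_m(x, eigenvalues, d=2):
    """Truncated series (4 pi)^(d/2) x^d sum_k exp(-lambda_k x^2)
    (assumed form of Equation 3)."""
    return (4 * math.pi) ** (d / 2) * x ** d * sum(
        math.exp(-lam * x * x) for lam in eigenvalues
    )


def p2(x, S):
    """Constrained quadratic for S^2: 4 pi + (2 pi / 3) S x^2."""
    return 4 * math.pi + (2 * math.pi / 3) * S * x * x


eigs_49 = sphere_eigenvalues(49)      # l = 1..7, so eigenvalues 37..49 all equal 42
eigs_many = sphere_eigenvalues(2500)  # l = 1..50, effectively converged for x >= 0.25

for x in (0.05, 0.25, 0.5, 1.0):
    print(x, z_m(x, eigs_49), z_m(x, eigs_many), p2(x, 2.0))
```

At moderate $x$ the series tracks $p_2(x)$ closely, while at small $x$ a truncated series with few terms falls far below $4\pi$, which is consistent with the lower bound on $x_1$ for intervals in $D_m$ described in the following subsections.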
Next, we uniformly sampled $N = 10^4$ points from $S^2$ (see Figure 1A; Methods Section 4.4.1.1), estimated $\Delta_M$ by $\bar{\Delta}_M$ using the random walk Graph Laplacian with Gaussian kernel (see Equation 15 in Methods Section 4.2.5), and computed empirical eigenvalues, $\bar{\lambda}_k$, from $\bar{\Delta}_M$. We selected $N = 10^4$ as it is the same order of magnitude as the sample size of current scRNAseq experiments, and is sufficient to identify $M$ as $S^2$ by eye (see Figure 1A). We verified whether estimates $\bar{z}_m(x)$, obtained by evaluating Equation 3 using $\bar{\lambda}_k$, when fit as described above to $p_2(x)$ over intervals in $D_m$, recapitulated the accurate $\bar{S}$ obtained using $z_m(x)$. We restricted our attention to $D_m$ for calculations using empirical eigenvalues, since it is only over intervals in $D_m$ that it is even theoretically possible to compute scalar curvature to the desired accuracy. Below, we report our findings for different $m$.

4.2.2 Infinite Series

We first applied this approach to the ideal case in Equation 3, where infinitely many analytical eigenvalues are available. We computed $z_\infty(x)$ (shown as a black line in Figure S1A) and obtained $\bar{S}$ by fitting $p_2(x)$ over various intervals as described above. Figure S1B shows that $D_\infty$ comprises intervals with $0 \le x_1 < x_2 \lesssim 1.15$. For $x_2 \gtrsim 1.15$, errors from neglecting higher-order terms $o(x^3)$ in Equation 4 dominate. Since $z_m(x)$ converges from $\infty$, $x_2 \lesssim 1.15$ necessarily holds for any interval in $D_m \;\forall m$.

4.2.3 Truncated Series

We next considered $z_m(x)$ for $m < N$, since in practice, we will only have access to as many eigenvalues as datapoints ($N$). We computed $z_{1000}(x)$ using Equation 3 (shown as a solid blue line in Figure S1A), and obtained $\bar{S}$ by fitting $p_2(x)$ (see Figure S1C). Intervals in $D_{1000}$ roughly satisfy $0.25 \lesssim x_1 < x_2 \lesssim 1.15$. However, we found that $\bar{z}_{1000}(x)$ (shown as a dashed blue line in Figure S1A) deviated markedly from $z_{1000}(x)$ in the rough interval $[0.1, 0.75]$, which has significant overlap with $D_{1000}$.
Consequently, when we fit $p_2(x)$ to $\bar{z}_{1000}(x)$ on $D_{1000}$, the resulting $\bar{S}$ was not accurate for any interval in $D_{1000}$ (see Figure S1D).

Note that this inaccuracy was not a consequence of not using all $N$ available eigenvalues. While picking $m = N$ would reduce the lower bound on valid intervals in $D_m$ (since $z_m(x)$ converges from $\infty$), it is exactly for small $x_1$ that $\bar{S}$ obtained from $\bar{z}_{1000}(x)$ is already over-estimated as shown in Figure S1D. Since $z_{m_2}(x) > z_{m_1}(x) \;\forall x, m_2 > m_1$, using a truncated series with a larger $m$ would simply exaggerate the difference between $z_m(x)$ and $\bar{z}_m(x)$ for small $x$ and cause scalar curvatures estimated using the latter to be further over-estimated.

Following this line of thought, we reasoned that picking fewer eigenvalues may ameliorate the issue. We selected $m = 49$ (instead of a round number like $m = 50$, so that all eigenvalues of a given multiplicity are included) and repeated this analysis for the same set of $N = 10^4$ points. $z_{49}(x)$ is shown as a solid red line in Figure S1A and the intervals over which fits to $p_2(x)$ yield accurate $\bar{S}$, $D_{49}$, are shown in Figure S1E. While $\bar{z}_{49}(x)$ (shown as a dashed red line in Figure S1A) has a much smaller deviation from $z_{49}(x)$ than $\bar{z}_{1000}(x)$ did from $z_{1000}(x)$, once again no estimate of $\bar{S}$ obtained from fits of $\bar{z}_{49}(x)$ to $p_2(x)$ on $D_{49}$ was sufficiently accurate (see Figure S1F).

4.2.4 Eigenvalue Convergence

We refrained from reducing $m$ further to improve agreement between $z_m(x)$ and $\bar{z}_m(x)$ after noting that the size of the intervals in $D_m$ shrinks with $m$.
Though we may have a better chance of computing accurate $\bar{S}$ with $\bar{z}_m(x)$ on $D_m$ for smaller $m$, recall that in practice we will not have $D_m$ available to us since the analytical eigenvalues will be unknown. Therefore, we simply shift the problem to one of choosing an interval that will yield an accurate $\bar{S}$, from a shrinking pool of intervals that could even theoretically yield an accurate estimate.

Instead, we compared the estimated $\bar{\lambda}_k$ with their true values, $\lambda_k$, and observed that the former consistently under-estimate the latter (see Figure S1H). Furthermore, we found that the fractional error grows with $k$, exceeding 60% for $k = 37, \dots, 49$. Therefore, $\bar{z}_{49}(x)$ will only be accurate if $N$ is large enough to limit the fractional error.

To determine the required tolerance on the fractional error, we constructed a truncated series analogous to Equation 3, but with eigenvalues interpolated between the analytical eigenvalues and the empirical eigenvalues determined for $N = 10000$, according to a parameter $f$:

$$\tilde{z}_m(x; f) = (4\pi)^{d/2} x^d \sum_{k=1}^{m} e^{-\tilde{\lambda}_k(f) x^2}, \qquad \tilde{\lambda}_k(f) = \lambda_k + f\left(\bar{\lambda}_k - \lambda_k\right) \tag{11}$$

$f$ signifies that the fractional error of the interpolated eigenvalues is reduced by $1 - f$ relative to the empirical eigenvalues determined for $N = 10000$. We found that $f \le 0.23$ is needed so that $\tilde{z}_{49}(x; f)$ (shown as a green line in Figure S1A) fit to $p_2(x)$ yields accurate $\bar{S}$ on half the intervals in $D_{49}$ (see Figure S1G).
Given that the fractional error in estimating $\lambda_{37}, \dots, \lambda_{49}$ by $\bar{\lambda}_{37}, \dots, \bar{\lambda}_{49}$ is 60% when $N = 10000$, how large does $N$ have to be to reduce this fractional error to $60\% \times 0.23 \approx 14\%$? A convergence rate for the fractional error is given in Theorem 1 of [30]. For 2-dimensional manifolds:

$$\frac{\left|\bar{\lambda}_k - \lambda_k\right|}{\lambda_k} = O\left(\frac{(\log N)^{3/8}}{N^{1/4}}\right) \tag{12}$$

Assuming that the big-O bound is sharp at $N = 10^4$ for $k = 37, \dots, 49$ (i.e. the prefactor is given by $0.6 \times \frac{10000^{1/4}}{\log(10000)^{3/8}} \approx 2.61$), we extrapolated that at least $N = 10^7$ datapoints are needed to reduce the fractional error to 14% (see Figure S1H). Equation 12 also applies to empirical eigenvalues of $\bar{\Delta}_M$ constructed from weighted kNN and $r$-neighborhood kernels instead of Gaussian kernels (see Methods Section 4.2.5). However, the prefactor in Equation 12 is actually worse for these estimators since their empirical eigenvalues have larger fractional errors at $N = 10000$ (see Figure S1H), so that even larger $N$ would be required to attain the desired fractional error. Lastly, note that while we had analytical eigenvalues available with which to ascertain $m = 49$ as suitable, the naive approach of simply using all eigenvalues available ($m = N$) would require sample sizes that are even larger by several more orders of magnitude.

4.2.5 Estimating the Laplace-Beltrami Operator from Data

For $N$ points, $\{X_i\} \in \mathbb{R}^n$, sampled from $M$, we estimated $\Delta_M$ by normalizing the weight matrix $W$ (see below) using the random walk normalization [30, 49]. $\bar{\Delta}_M$ constructed using this normalization converges to $\Delta_M$ when samples are drawn uniformly from the embedding of $M$ in $\mathbb{R}^n$, as was done in our analysis.

$$\bar{\Delta}_M = \frac{4}{\epsilon}\left(I_N - D^{-1}W\right), \qquad D = \mathrm{diag}\{W\mathbf{1}\} \tag{13}$$
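Returning to the extrapolation at the end of Section 4.2.4, the arithmetic can be checked with a short script. This is a sketch under the same assumption stated in the text (the bound in Equation 12 is treated as sharp, with natural logarithms), and the variable names are illustrative rather than taken from the authors' code.

```python
import math


def convergence_rate(N):
    """Rate from Equation 12 for 2-dimensional manifolds: (log N)^(3/8) / N^(1/4)."""
    return math.log(N) ** (3 / 8) / N ** (1 / 4)


fractional_error_at_1e4 = 0.60            # observed for k = 37, ..., 49 at N = 10^4
prefactor = fractional_error_at_1e4 / convergence_rate(10 ** 4)
target_error = fractional_error_at_1e4 * 0.23  # about 14%

# Increase N by factors of 10 until the extrapolated error meets the target.
N = 10 ** 4
while prefactor * convergence_rate(N) > target_error:
    N *= 10

print(round(prefactor, 2))  # prints: 2.61
print(N)                    # prints: 10000000
```

Under these assumptions the extrapolated fractional error is roughly 22% at $N = 10^6$ and roughly 13% at $N = 10^7$, in line with the order of magnitude quoted in the text.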
Here, $I_N$ is the $N \times N$ identity matrix, $\mathbf{1} \in \mathbb{R}^N$ is a vector of ones and the kernel width, $\epsilon$, is set to match that used in Theorem 1 of [30]:

$$\epsilon = \frac{(\log N)^{3/8}}{N^{1/4}} \tag{14}$$

Throughout our analysis, we used $W = W_g$, the weight matrix with entries given by a Gaussian kernel:

$$[W_g]_{i,j} = \exp\left(-\|X_i - X_j\|_2^2/\epsilon\right) - \delta_{i,j} \tag{15}$$

To check whether other estimators had more benign prefactors for eigenvalue convergence (see Figure S1H), we also considered the weighted kNN kernel, $W_{kNN}$, and the $r$-neighborhood kernel, $W_r$, with $r = \epsilon$ [50]:

$$[W_{kNN}]_{i,j} = [W_g]_{i,j}\left[\mathbb{1}_{kNN(j)}(i) \;\mathrm{OR}\; \mathbb{1}_{kNN(i)}(j)\right], \qquad [W_r]_{i,j} = \mathbb{1}_{B_{X_i}(r)}(X_j) - \delta_{i,j} \tag{16}$$

$kNN(i)$ is the set of indices of the $k$-nearest neighbors of point $i$ in $\mathbb{R}^n$, $B_{X_i}(r)$ is the $n$-dimensional ball of radius $r$ centred at $X_i$, and $\mathbb{1}_A(x)$ is the indicator function for $x \in A$.

4.3 Details of Extrinsic Approach to Curvature Estimation

4.3.1 Quadratic Regression on Local Neighborhoods of Data

Here we describe the regression model for computing the coefficients of the Second Fundamental Form, $h^k_{ij}$, at a particular point $p$. As described in the main text, after performing PCA on a neighborhood of $N_p$ points around $p$ in $\mathbb{R}^n$, each point in the neighborhood can be described in terms of $d$ tangent coordinates, $t_i$, and $n - d$ normal coordinates, $n_k$. We defer discussion of how the neighborhood is selected to Methods Section 4.3.2. The $n_k$ are treated as dependent variables that can be modelled as quadratic functions of the $t_i$, which are taken to be independent variables. See Equation 17 below. Linear terms are excluded since they ought to have zero coefficients in the tangent basis. Constant terms, $C_k$, are included to account for affine shifts.
Since $h^k_{ij} = h^k_{ji}$ according to Equation 5, in practice we only consider $t_i t_j$ and $h^k_{ij}$ for $j \ge i$ so that $\mathbf{t}$ and $\mathbf{h}$ in Equation 17 have linearly independent columns, though we write the full form here for simplicity.

$$\mathbf{n} = \mathbf{t}\mathbf{h} + \mathbf{E} \tag{17}$$

with

$$\mathbf{n} = \begin{pmatrix} n^{(1)}_1 & \cdots & n^{(1)}_{n-d} \\ \vdots & \ddots & \vdots \\ n^{(N_p)}_1 & \cdots & n^{(N_p)}_{n-d} \end{pmatrix}$$

$$\mathbf{t} = \begin{pmatrix} 1 & t^{(1)}_1 t^{(1)}_1 & \cdots & t^{(1)}_1 t^{(1)}_d & t^{(1)}_2 t^{(1)}_1 & \cdots & t^{(1)}_d t^{(1)}_d \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 1 & t^{(N_p)}_1 t^{(N_p)}_1 & \cdots & t^{(N_p)}_1 t^{(N_p)}_d & t^{(N_p)}_2 t^{(N_p)}_1 & \cdots & t^{(N_p)}_d t^{(N_p)}_d \end{pmatrix}$$

$$\mathbf{h} = \begin{pmatrix} C_1 & h^{1}_{1,1} & \cdots & h^{1}_{1,d} & h^{1}_{2,1} & \cdots & h^{1}_{d,d} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ C_{n-d} & h^{n-d}_{1,1} & \cdots & h^{n-d}_{1,d} & h^{n-d}_{2,1} & \cdots & h^{n-d}_{d,d} \end{pmatrix}^{T}$$

$$\mathbf{E} = \begin{pmatrix} \varepsilon^{(1)}_1 & \cdots & \varepsilon^{(1)}_{n-d} \\ \vdots & \ddots & \vdots \\ \varepsilon^{(N_p)}_1 & \cdots & \varepsilon^{(N_p)}_{n-d} \end{pmatrix} = \begin{pmatrix} \varepsilon^{(1)T} \\ \vdots \\ \varepsilon^{(N_p)T} \end{pmatrix}$$

Regression yields the following least-squares solution:

$$\hat{\mathbf{h}} = (\mathbf{t}^T\mathbf{t})^{-1}\mathbf{t}^T\mathbf{n}, \qquad \Sigma_\varepsilon = \frac{(\mathbf{n} - \mathbf{t}\hat{\mathbf{h}})^T(\mathbf{n} - \mathbf{t}\hat{\mathbf{h}})}{N_p}, \qquad \Sigma_h = \Sigma_\varepsilon \otimes (\mathbf{t}^T\mathbf{t})^{-1} \tag{18}$$

where $\hat{\mathbf{h}}$ is the matrix of estimates of the Second Fundamental Form, $\Sigma_\varepsilon$ is the estimated covariance structure of the residuals so that $\varepsilon^{(i)} \sim N(0, \Sigma_\varepsilon)$, and $\Sigma_h$ is the covariance matrix for $\hat{\mathbf{h}}$. $\otimes$ denotes the Kronecker product. We used the mvregress function in MATLAB to perform this regression in our code. When datapoints are sampled exactly from an analytical manifold, $\Sigma_\varepsilon$ measures the contribution of higher-order terms. In the limit of infinite sampling and infinitesimally small neighborhoods, $\Sigma_\varepsilon \to 0$.
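The regression in Equations 17 and 18 can be illustrated on a small patch of the unit sphere around its north pole. The sketch below is not the authors' MATLAB code: it uses only the Python standard library, skips the local PCA step by taking the tangent and normal coordinates as known, restricts to $d = 2$ with one normal direction, and assumes the model $n_1 = C_1 + \sum_{i,j} h^1_{ij} t_i t_j$, so that the fitted coefficient of the single cross-term column $t_1 t_2$ equals $h^1_{12} + h^1_{21} = 2h^1_{12}$. All function names are hypothetical.

```python
import math
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_quadratic(tangent, normal):
    """Least-squares fit of normal ~ C + b11 t1^2 + b12 t1 t2 + b22 t2^2 via the
    normal equations (t^T t) h = t^T n; also returns the residual variance."""
    rows = [[1.0, t1 * t1, t1 * t2, t2 * t2] for t1, t2 in tangent]
    k = 4
    tTt = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    tTn = [sum(r[i] * y for r, y in zip(rows, normal)) for i in range(k)]
    coef = solve_linear(tTt, tTn)
    residuals = [y - sum(c * v for c, v in zip(coef, r)) for r, y in zip(rows, normal)]
    sigma_eps = sum(e * e for e in residuals) / len(rows)
    return coef, sigma_eps


# Noise-free points near the north pole; the normal coordinate is the drop below
# the tangent plane, 1 - sqrt(1 - t1^2 - t2^2) = (t1^2 + t2^2)/2 + higher-order terms.
rng = random.Random(1)
radius = 0.1
tangent, normal = [], []
while len(tangent) < 200:
    t1, t2 = rng.uniform(-radius, radius), rng.uniform(-radius, radius)
    if t1 * t1 + t2 * t2 <= radius * radius:
        tangent.append((t1, t2))
        normal.append(1.0 - math.sqrt(1.0 - t1 * t1 - t2 * t2))

(C, b11, b12, b22), sigma_eps = fit_quadratic(tangent, normal)
h11, h12, h22 = b11, b12 / 2.0, b22
print(h11, h12, h22, sigma_eps)  # h11 and h22 close to 1/2, h12 close to 0
```

Because the data are noise-free, the small residual variance here comes entirely from the neglected higher-order terms, which matches the interpretation of $\Sigma_\varepsilon$ given above.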
When observational noise is present (discussed in Methods Section 4.4.2.2), $\Sigma_\varepsilon$ also depends on the magnitude of the noise ($\sigma$ in Equation 28).

4.3.2 Selecting Local Neighborhoods for Regression

Here we describe the procedure for selecting a neighborhood around each point $p$ for computing the Second Fundamental Form. We adopt the simplest approach of selecting the neighborhood to be a ball of radius $r$ centred at $p$, $B_p(r)$.

If $r(p)$ is not specified, we set it according to statistical rather than geometric principles, since the geometry of the manifold may be non-trivial and unknown a priori. Specifically, we set $r(p)$ so that the elements in the covariance matrix, $\Sigma_h$, are upper-bounded by $\sigma_h^2$, the square of the specified target uncertainty. The largest elements in $\Sigma_h$ are the variance terms on the main diagonal, corresponding to the squares of the standard errors, $\sigma_{h^k_{ij}}$, for the coefficients $h^k_{ij}$. By inspection of Equation 18:

$$\sigma^2_{h^k_{ij}} = \left[\mathrm{diag}\,\Sigma_\varepsilon\right]_k \left[\mathrm{diag}\,(\mathbf{t}^T\mathbf{t})^{-1}\right]_{(ij)} \tag{19}$$

where $[\mathrm{diag}\,\Sigma_\varepsilon]_k$ is the diagonal entry of $\Sigma_\varepsilon$ corresponding to the $k$th normal direction and $[\mathrm{diag}\,(\mathbf{t}^T\mathbf{t})^{-1}]_{(ij)}$ is the diagonal entry in $(\mathbf{t}^T\mathbf{t})^{-1}$ for which the corresponding entry in $\mathbf{t}^T\mathbf{t}$ is $\sim \sum_l (t^{(l)}_i t^{(l)}_j)^2$. Increasing $r(p)$ monotonically increases both $N_p(r)$, the number of points in $B_p(r)$, and the average magnitude of elements in $\mathbf{t}$, both of which reduce $\sigma_{h^k_{ij}}$.
To avoid sweeping $r(p)$ to find the minimum value such that $\max \sigma_{h^k_{ij}} < \sigma_h$, which is computationally expensive, for each point we instead model the dependence of $N_p(r)$ on $r$ as

$$N_p(r) \sim r^{d'} \tag{20}$$

so that

$$\sigma^2_{h^k_{ij}} \sim \frac{1}{r^{d'+4}} \tag{21}$$

To determine $d'$, $N_p(r)$ is counted at 10 log-spaced distances, $r_i$, and a line is fit to the $(\log r_i, \log N_p(r_i))$ pairs for $i \in \{2, \dots, 8\}$. $r_1$ is set to be the distance from $p$ to the $\left(\frac{d(d+1)}{2} + 1\right)$-closest point to $p$ (the minimum number of points needed for regression). $r_{10}$ is set to be the distance from $p$ to the furthest point from $p$. To solve for $r$, we first guess $r_g = r_1$, perform regression on the set of points in $B_p(r_g)$ and assign $\sigma_g^2$ to be the largest diagonal entry in $\Sigma_h$. If $\left|\frac{\sigma_g}{\sigma_h} - 1\right|$ is within a desired tolerance, we set $r = r_g$, or else we update $r_g$ as shown and iterate to convergence.

$$r_g \leftarrow r_g \left(\frac{\sigma_g}{\sigma_h}\right)^{\frac{2}{d'+4}} \tag{22}$$

For large datasets, we speed up computation by only selecting $r$ in this manner for a subset of $N_{calib} \le N$ randomly selected calibration points. All datapoints in the Voronoi cell of each calibration point are then assigned the same $r$ as the calibration point. Unless otherwise specified, $N_{calib} = N$.

4.3.3 Goodness-of-Fit Test for Quadratic Regression

For a fixed density of points, there is a fundamental trade-off between reducing uncertainty in the $h^k_{ij}$ and the validity of approximating local neighborhoods with quadratic fits. To reduce $\sigma_h$, more points must be included in the fit, but a larger neighborhood may not be well-modelled by only quadratic terms.
Conversely, $\frac{d(d+1)}{2} + 1$ points are sufficient to perform the regression, but there is then large uncertainty in the estimate of $h^k_{ij}$. Since our approach is to choose a neighborhood size to achieve a target $\sigma_h$, we include a companion goodness-of-fit (GOF) statistic measuring how well the neighborhood is fit by a quadratic. Namely, we use Mardia's test on the residuals from regression ($\varepsilon^{(i)}$ in Equation 17), which yields a p-value for the null hypothesis that the residuals are normally distributed [51]. When the p-values are small, the quadratic regression model is unlikely to be valid. In this case, curvatures computed using the resulting $h^k_{ij}$ may be suspect regardless of the tightness of the error bars, and the user may want to consider increasing $\sigma_h$ to reduce the neighborhood size. However, the poor GOF may not be of concern if the length scale of interest is larger than the fluctuations in the manifold which give rise to the non-Gaussian residuals (see Methods Section 4.3.5). Note that Mardia's test is relatively weak since it may yield false negatives for heteroskedastic residuals. This GOF measure is therefore only provided as a computationally cheap consistency check. Ideally, the density of sampled points is sufficiently high to (i) permit small $\sigma_h$ and (ii) produce GOF p-values that are uniformly distributed (consistent with the null model) and spatially uncorrelated.

4.3.4 Standard Error and Bias of Scalar Curvature Estimate

Here we discuss how we compute the standard error, $\sigma_S$, of the estimate for $S$ and note sources of estimator bias. Since the Riemannian curvature tensor in Equation 6 is a bilinear form and the tensor contraction in Equation 7 is a straightforward sum, $\sigma_S$ can be computed using simple error propagation formulas in terms of the uncertainties from regression. Specifically, the standard error we report is the first-order approximation to the second moment of a function of random variables:

$$\sigma_S = \sqrt{J^T \Sigma_h J} \tag{23}$$

where $J = \left.\frac{\partial S}{\partial h^k_{ij}}\right|_{\hat{h}}$.
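As a worked illustration of Equation 23, the sketch below propagates coefficient uncertainties through the 2-sphere expression $S = 8(h^1_{11}h^1_{22} - h^1_{12}h^1_{21})$ quoted in this section. It makes two assumptions that are not taken from the source: the symmetric off-diagonal coefficients are treated as a single parameter, and $\Sigma_h$ is taken to be diagonal with every standard error equal to $\sigma_h$. The Jacobian is evaluated by central differences purely for convenience.

```python
import math


def scalar_curvature_sphere(h):
    """S = 8 (h11 * h22 - h12^2) for a 2-manifold in R^3, with h = (h11, h12, h22)."""
    h11, h12, h22 = h
    return 8.0 * (h11 * h22 - h12 * h12)


def numerical_jacobian(func, h, step=1e-6):
    """Central-difference gradient of func with respect to each coefficient in h."""
    grad = []
    for idx in range(len(h)):
        up = list(h)
        down = list(h)
        up[idx] += step
        down[idx] -= step
        grad.append((func(up) - func(down)) / (2.0 * step))
    return grad


def propagate_error(jacobian, covariance):
    """sigma_S = sqrt(J^T Sigma_h J) for a covariance matrix given as a list of rows."""
    total = 0.0
    for a, ja in enumerate(jacobian):
        for b, jb in enumerate(jacobian):
            total += ja * covariance[a][b] * jb
    return math.sqrt(total)


h_hat = [0.5, 0.0, 0.5]          # analytical coefficients for the unit 2-sphere
sigma_h = 0.017                  # target uncertainty used for S^2 in Section 4.4.3
covariance = [[sigma_h ** 2 if a == b else 0.0 for b in range(3)] for a in range(3)]

jacobian = numerical_jacobian(scalar_curvature_sphere, h_hat)
sigma_S = propagate_error(jacobian, covariance)
```

Under these assumptions the Jacobian is $(4, 0, 4)$, so $\sigma_S = \sqrt{32}\,\sigma_h$, which is roughly 5% of the true scalar curvature of 2.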
It is important to note that our estimate for $S$ is biased and not normally distributed. First, the $h^k_{ij}$ are only normally distributed when the residuals ($\varepsilon^{(i)}$ in Equation 17) themselves are normally distributed. Second, even when the $h^k_{ij}$ are normally distributed, our estimate of $S$ will not be, due to its bilinear dependence on $h^k_{ij}$. Lastly, estimates for $S$ can be biased in a manifold-dependent and even position-dependent way. For instance, the analytical scalar curvature of $S^2$ embedded in $\mathbb{R}^3$ is given by $S = 8(h^1_{11}h^1_{22} - h^1_{12}h^1_{21})$, with $h^1_{11} = h^1_{22} = 1/2$ and $h^1_{12} = h^1_{21} = 0$. Numerically however, the symmetric off-diagonal terms will never be exactly 0, so $S$ will be systematically under-estimated. This is apparent in the left tail of the blue histogram in Figure 2I. In our experience, adding isotropic noise of small magnitude tends to remove the skew, presumably because then the residuals more closely match the regression assumptions (see for example Figure S2B, where the left tail disappears for $\sigma = 0.001$). Furthermore, in our examples, we observed that computed scalar curvatures were less biased when the ambient and/or manifold dimensions were large. We speculate that this is because the increased number of terms (with alternating signs) in Equations 6 and 7 leads to cancellation of errors, which is likely why the accuracy of computed scalar curvatures was higher for $S^3$, $S^5$ and $S^7$ than $S^2$, and the distribution of scalar curvatures less skewed (see Figure 2I).
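The under-estimation described above for $S^2$ can be reproduced with a small Monte Carlo experiment. This is a sketch under assumptions that are ours rather than the authors': the three coefficients are drawn as independent Gaussians around their analytical values, and the standard deviation is set to a deliberately large 0.1 so that the effect is visible. With those assumptions the off-diagonal term contributes $-8\,\mathrm{E}[(h^1_{12})^2] = -8\sigma^2$ to the mean.

```python
import random
import statistics


def sample_curvatures(sigma, n_samples, seed=0):
    """Draw noisy coefficients around the unit 2-sphere values and return the resulting S values."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        h11 = rng.gauss(0.5, sigma)
        h22 = rng.gauss(0.5, sigma)
        h12 = rng.gauss(0.0, sigma)  # symmetric off-diagonal term, h12 = h21
        values.append(8.0 * (h11 * h22 - h12 * h12))
    return values


sigma = 0.1
curvatures = sample_curvatures(sigma, n_samples=100_000)
mean_curvature = statistics.fmean(curvatures)
expected_mean = 2.0 - 8.0 * sigma ** 2  # the squared off-diagonal term lowers the mean
```

The mean falls below the true value of 2 even though each coefficient is unbiased, because the squared off-diagonal term can only subtract from $S$.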
4.3.5 Note on Length Scales

Here we make three remarks regarding length scales relevant both for considering curvature theoretically and for applying our algorithm. First, note that scalar curvature has units of inverse length squared. Therefore, scaling all the coordinates of the points on a manifold by a factor $L$ changes the scalar curvature at all points by $L^{-2}$. Thus, it is always important to contextualize the scalar curvature in terms of the global length scale associated with the manifold. For example, the scalar curvature of $S^d$ with radius $R$ is $S_d(p) = \frac{d(d-1)}{R^2} \; \forall p \in M$ (here $L = R$). In the case of the toy models shown in Figure 2, the global length scale is $L \approx 1$ (see Methods Section 4.4.1). For the image patch dataset, a normalization is applied which places all patches on $S^7$ (see Methods Section 4.5.2), so that the global scale is again $L = 1$. For scRNAseq data, we computed scalar curvature on the datapoints after preprocessing (see Methods Section 4.6.1), without imposing any additional scaling correction to achieve a standardized global length scale. Since other custom analyses also use these same boilerplate preprocessing steps, computing scalar curvatures in the context of the global length scale of the preprocessed data is sensible. For all three scRNAseq datasets, the global length scale happened to be $L \approx 20$ (see Methods Section 4.6.4).

Second, since $h^k_{ij}$ is a dimensionful quantity (which scales as $L^{-1}$), to keep the ratio of $\sigma_S$ to $S$ fixed when all coordinates are scaled by $L$, $\sigma_h$ needs to be scaled by $L^{-1}$.

Lastly, we note that our choice of $\sigma_h$ sets local length scales that are statistically rather than geometrically informed: neighborhoods are chosen to upper bound the uncertainty in estimates obtained from regression. This length scale can also be understood in terms of a bias-variance trade-off. Large length scales reduce variance but may introduce a bias if the resulting neighborhoods are larger than features on the manifold. This manifests as poor GOFs and can be corrected by finer sampling. However, for manifolds with features at different length scales (such as a golf ball, which can be treated as dimples superimposed on $S^2$), neighborhoods chosen by this heuristic can also be much smaller than the feature of interest, so that fine-scale curvature fluctuations are detected (dimples) while coarser features are neglected ($S^2$). Regardless, we default to this statistical approach because in general, the length scale of relevant features on a data manifold will not be uniform across the manifold or known a priori. However, we also provide the ability to manually set position-dependent $r(p)$ in the software to facilitate ad hoc computation of curvatures at any length scale of interest.

4.4 Details of Toy Manifold Curvature Computations

4.4.1 Analytical Forms

Here we provide analytical forms for the toy manifolds shown in Figures 2 and S2.

4.4.1.1 Hypersphere

The $d$-dimensional unit hypersphere, $S^d$, has intrinsic coordinates $\theta_1 \in [0, 2\pi]$, $\theta_2, ..., \theta_d \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$ and ambient coordinates in $\mathbb{R}^{d+1}$ given by:

$$x_i = \begin{cases} \prod_{j=1}^{d} \cos\theta_j, & i = 1 \\ \sin\theta_{i-1} \prod_{j=i}^{d} \cos\theta_j, & 1 < i \le d + 1 \end{cases} \tag{24}$$

Using the relations in Methods Section 4.1, the scalar curvature is given by $S_d(p) = d(d-1) \; \forall p \in M$.
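The coordinate map of Equation 24 can be written out directly. The following is a minimal sketch (the function names are ours, not the authors') that evaluates the ambient coordinates for a given set of intrinsic angles and confirms that the result lies on the unit sphere; the empty product for $i = d + 1$ is taken to be 1.

```python
import math


def hypersphere_embedding(thetas):
    """Map intrinsic angles (theta_1, ..., theta_d) to ambient coordinates in R^(d+1), per Equation 24."""
    d = len(thetas)
    # i = 1: product of all cosines.
    coords = [math.prod(math.cos(t) for t in thetas)]
    # 1 < i <= d + 1: sin(theta_{i-1}) times the product of cos(theta_j) for j = i, ..., d.
    for i in range(2, d + 2):
        tail = math.prod(math.cos(t) for t in thetas[i - 1:])
        coords.append(math.sin(thetas[i - 2]) * tail)
    return coords


def sphere_scalar_curvature(d, radius=1.0):
    """Analytical scalar curvature d (d - 1) / R^2 of a d-sphere of radius R."""
    return d * (d - 1) / radius ** 2


point = hypersphere_embedding([0.3, -0.7, 1.1])   # a point on S^3 in R^4
squared_norm = sum(c * c for c in point)
```

For $d = 2$ this reduces to the familiar longitude-latitude parameterization $(\cos\theta_1\cos\theta_2, \sin\theta_1\cos\theta_2, \sin\theta_2)$.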
To draw uniform samples from $S^d$, instead of applying rejection sampling on these intrinsic coordinates as described in Methods Section 4.1, it is more straightforward to let $x_i \sim \mathcal{N}(0, 1)$ and scale the resulting vector $(x_1, ..., x_{d+1})$ to have unit norm.

4.4.1.2 One-Sheet Hyperboloid

The one-sheet hyperboloid, $H^2_2$, has intrinsic coordinates $\theta \in [0, 2\pi]$, $u \in \mathbb{R}$ and ambient coordinates in $\mathbb{R}^3$ given by:

$$\begin{aligned} x &= a \cos\theta \sqrt{u^2 + 1} \\ y &= b \sin\theta \sqrt{u^2 + 1} \\ z &= c u \end{aligned} \tag{25}$$

For Figure 2E,F, we used $a = b = 2$ and $c = 1$. Using the relations in Methods Section 4.1, the scalar curvature is given by $S(z) = -\frac{2}{(5z^2+1)^2}$. To avoid edge effects in the $z$-direction, we constrained $u \in [-2, 2]$, and sampled points as described in Methods Section 4.1 until a subset of $N = 10^4$ had $u \in [-1, 1]$. Scalar curvature was computed and visualized for these $N = 10^4$ points.

4.4.1.3 Ring Torus

The 2-dimensional ring torus, $T^2$, has intrinsic coordinates $\theta, \phi \in [0, 2\pi]$ and ambient coordinates in $\mathbb{R}^3$ given by:

$$\begin{aligned} x &= (R + r \cos\theta) \cos\phi \\ y &= (R + r \cos\theta) \sin\phi \\ z &= r \sin\theta \end{aligned} \tag{26}$$

For Figure 2G,H, we used $R = 2.5$ and $r = 0.5$. Using the relations in Methods Section 4.1, the scalar curvature is given by $S(\theta) = \frac{8\cos\theta}{5 + \cos\theta}$.

4.4.1.4 Hypercube

The $m$-dimensional cube of side length $r$, $D^m_r$, has intrinsic coordinates $z_1, ..., z_m \in [-r/2, r/2]$, and ambient coordinates in $\mathbb{R}^n$ for $n \ge m$ given by:

$$x_i = \begin{cases} z_i, & 1 \le i \le m \\ 0, & m < i \le n \end{cases} \tag{27}$$

Using the relations in Methods Section 4.1, the scalar curvature is given by $S(p) = 0 \; \forall p \in M$.
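The sampling shortcut for the hypersphere and the closed-form curvatures quoted above for the hyperboloid and torus are simple enough to sketch directly. The function names below are illustrative only, and the curvature formulas are hard-coded for the specific parameter values stated in the text ($a = b = 2$, $c = 1$ and $R = 2.5$, $r = 0.5$).

```python
import math
import random


def sample_unit_hypersphere(d, rng):
    """Draw one uniform sample from S^d in R^(d+1) by normalizing a standard Gaussian vector."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(d + 1)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def hyperboloid_scalar_curvature(z):
    """S(z) = -2 / (5 z^2 + 1)^2 for the one-sheet hyperboloid with a = b = 2, c = 1."""
    return -2.0 / (5.0 * z * z + 1.0) ** 2


def torus_scalar_curvature(theta):
    """S(theta) = 8 cos(theta) / (5 + cos(theta)) for the ring torus with R = 2.5, r = 0.5."""
    return 8.0 * math.cos(theta) / (5.0 + math.cos(theta))


rng = random.Random(0)
sample = sample_unit_hypersphere(7, rng)          # one point on S^7 in R^8
sample_norm = math.sqrt(sum(v * v for v in sample))
```

The torus values bracket the behaviour seen on that surface: positive curvature on the outer equator ($\theta = 0$), negative on the inner equator ($\theta = \pi$), and zero at $\theta = \pi/2$.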
4.4.2 Practical Issues for Curvature Estimation on Real-World Datasets

For real-world data, small sample size is only one of the potential confounders for accurately estimating curvature. Here, we report how our algorithm fares when four other real-world confounders are applied to toy manifolds: non-uniform sampling, observational noise, large ambient dimension $n$, and uncertainty in the manifold dimension $d$.

4.4.2.1 Non-Uniform Sampling

We expect our approach to handle non-uniform sampling of the manifold gracefully: smaller (larger) neighborhoods will be used on densely (sparsely) sampled portions of the manifold to encapsulate the number of points needed to achieve $\sigma_h$. To computationally verify the robustness of our tool to non-uniform sampling, we constructed a toy model to roughly match the $(n, d, L)$ parameters for the scRNAseq datasets explored in the paper, for which non-zero scalar curvatures seemed to appear at smaller length scales/higher densities. Specifically, we wanted to verify that non-zero scalar curvatures do not appear artifactually at specific length scales due to sharp changes in the local density of points sampled from a flat manifold. To this end, we formed a dataset with a sparse periphery and dense core by uniformly sampling $N_1 = 10^4$ points from $D^3_{10}$ to establish a background density equal to 10 points per unit volume, and $N_2 = 10^3$ points from $D^3_1$ to create a core density roughly equal to $10^3$ points per unit volume (see Methods Section 4.4.1.4).
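The construction of this sparse-periphery, dense-core dataset can be sketched in a few lines. The helper names and the fixed random seed are illustrative assumptions; only the sample sizes and cube side lengths come from the text. Both cubes are centred on the origin, following the coordinate ranges given for $D^m_r$ in Methods Section 4.4.1.4.

```python
import random


def sample_cube(n_points, side, dim, rng):
    """Uniformly sample n_points from a dim-dimensional cube of the given side, centred on the origin."""
    half = side / 2.0
    return [[rng.uniform(-half, half) for _ in range(dim)] for _ in range(n_points)]


def build_dense_core_dataset(n_periphery=10_000, n_core=1_000, seed=0):
    """Background points from the side-10 cube plus extra points from the side-1 cube."""
    rng = random.Random(seed)
    periphery = sample_cube(n_periphery, side=10.0, dim=3, rng=rng)
    core = sample_cube(n_core, side=1.0, dim=3, rng=rng)
    return periphery + core


def count_in_unit_core(points):
    """Number of points falling inside the central unit cube."""
    return sum(1 for p in points if all(abs(c) <= 0.5 for c in p))


points = build_dense_core_dataset()
core_count = count_in_unit_core(points)      # also the core density, since the core has unit volume
periphery_density = 10_000 / 10.0 ** 3       # background points per unit volume
```

The core count slightly exceeds $10^3$ because a few of the background points also land in the central unit cube, which is why the text describes the core density as only roughly $10^3$ points per unit volume.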
We embedded these points in $\mathbb{R}^{11}$ by adding isotropic Gaussian noise with $\sigma = 0.01$ to the eight normal directions, for all datapoints. We computed scalar curvature on this dataset for a fixed $\sigma_h$ (see Methods Section 4.4.3) and found no significant deviation from the true value of zero in either the sparse or dense regions (see Figure S2A). We next computed scalar curvatures at three fixed length scales corresponding to the 5th, 50th and 95th percentile $r$ values obtained using the specified $\sigma_h$ ($r = 0.54$, 0.90 and 1.22 respectively) and again saw no deviation from zero scalar curvature for points in either the sparse or dense region (see Figure S2A). We repeated this analysis for $N_2 = 10^4$ and again saw no deviation from zero scalar curvature, regardless of whether variable neighborhood sizes or fixed length scales ($r = 0.37$, 0.63 and 1.42, corresponding to the same percentiles) were used (see Figure S2A).

4.4.2.2 Observational Noise

Every ambient coordinate can be considered a measured observable with its own observational noise. Assuming each observable is distorted by independent, isotropic Gaussian noise with variance $\sigma^2$ (sometimes referred to as convolutional noise [37]), datapoints $X \in \mathbb{R}^n$ sampled from an embedded manifold $M$ are modelled by:

$$X = x + \mathcal{N}(0, \sigma^2 I), \quad x \in M \tag{28}$$

To study the sensitivity of our algorithm to noise, we uniformly sampled $N = 10^4$ datapoints from $S^2 \subset \mathbb{R}^3$, added convolutional noise with $\sigma$ ranging over several orders of magnitude, and estimated scalar curvatures using a fixed $\sigma_h$ (see Methods Section 4.4.3). For small $\sigma$, the distribution of scalar curvatures was centred on the true value of 2, but once $\sigma$ became large ($\approx 10\%$ of the radius of $S^2$), the estimated scalar curvatures approached 0 (see Figure S2B). Noise in the regression context does not change the expectation value of any estimated parameter. The apparent flattening that is observed therefore indicates that $X$ (obtained by convolving $M$ with noise) has a geometry that is not trivially related to $M$. Certainly for $\sigma \approx 1$, $X$ does not even preserve the topology of $M$ as $S^2$. From a practical perspective, it suffices to say that small convolutional noise can be handled by simple quadratic regression, while large convolutional noise obfuscates the original manifold.

These observations are consistent with literature defining a manifold's reach [34, 35], a noise scale beyond which noisy samples cannot be uniquely associated to a point on the noise-free manifold. When $\sigma$ exceeds the manifold's reach, the relationship between the empirical density of sampled points and the original manifold is non-trivial even for a relatively forgiving model of manifold-orthogonal noise. The ridge manifold [36, 37] of an empirical density has also been defined as an alternative to the unwieldy task of deconvolving noisy samples to recover a noise-free manifold. This definition avoids the notion of a noise-free manifold altogether and instead defines manifolds as ridges, contours along which the empirical density of points is maximized.

4.4.2.3 Large Ambient Dimension

A high-dimensional dataset may have an ambient space comprising tens of thousands of observables, i.e. $n$ is very large. Meanwhile, the underlying manifold dimension, $d$, may be small. Since convolutional noise occurs in $n$ dimensions, will a low-dimensional manifold still be discernible?
To explore this, we uniformly sampled $N = 10^4$ datapoints from $S^2 \subset \mathbb{R}^3$, embedded these points in $\mathbb{R}^n$ for a range of $n$ up to 100, and added convolutional noise of magnitude $\sigma = 0.01$, 0.03 and 0.05 in the $n$-dimensional ambient space. We computed curvatures for all combinations of $n$ and $\sigma$ using a fixed $\sigma_h$ (see Methods Section 4.4.3). As $n$ or $\sigma$ increased, the algorithmically chosen neighborhood sizes, $r(p)$, expanded to include enough datapoints to maintain the desired $\sigma_h$. The distribution of estimated scalar curvatures (shown in Figure S2C) is centred on the true value of 2 for $n < 80$ and $\sigma \le 0.05$. However, we observed that $r$ was far less sensitive to changes in $n$ than changes in $\sigma$. For example, increasing $n$ from 3 to 100 at $\sigma = 0.01$ and tripling $\sigma$ from 0.01 to 0.03 at $n = 3$ required a comparable increase in $r$ (see Figure S2C). Therefore, consistent with the results of Methods Section 4.4.2.2, as long as the noise scale $\sigma$ is small, a large ambient dimension $n$ is not a confounder. Practically however, to reduce computational overhead and avoid the large-$n$-and-$\sigma$ case, it is still helpful to reduce the ambient dimension by projecting datapoints to an affine subspace containing the manifold (e.g. by PCA). Such a transformation does not change the intrinsic curvature.

4.4.2.4 Choice of Manifold Dimension

The last practical consideration is accurate selection of the manifold dimension, $d$, which we have so far assumed to be known.
There is no consensus on the definition of $d$ for a dataset, so various disciplines have devised different heuristics to determine $d$ in a data-driven fashion [52]. From the regression perspective, any $d > 0$ corresponds to a well-defined regression problem. The choice of $d$ merely determines how local coordinates are partitioned into independent (tangent) and dependent (normal) variables. However, in our algorithm we noticed that some choices of $d$ result in excessively large $r(p)$ for a fixed $\sigma_h$. We explored this further using two toy manifolds and discovered a signature indicating that the specified manifold dimension may be incorrect. The manifolds considered were $S^3 \subset \mathbb{R}^5$ convolved with isotropic Gaussian noise with $\sigma = 0.01$, and $S^2 \times S^2 \subset \mathbb{R}^6$, for which $d^*$, the true manifold dimension, is $d^* = 3$ and $d^* = 4$ respectively. We uniformly sampled $N = 10^4$ points from each manifold and estimated scalar curvatures by holding $\sigma_h$ fixed for different $d$ (see Methods Section 4.4.3). For both manifolds, the average neighborhood size, $r$, was much larger for $d > d^*$ and $d < d^*$ than for $d = d^*$ (see Figure S2D). In the case of $S^3$, for $d < d^*$, the average neighborhood size was even larger than the global length scale, $L$, of the manifold. Since neighborhood sizes are chosen to achieve a target $\sigma_h$, manually decreasing $r(p)$ is counter-productive and simply increases the uncertainty from regression above $\sigma_h$.

The large neighborhood sizes that emerged for both $d > d^*$ and $d < d^*$ can be understood in terms of the mis-assignment of normal vectors to the tangent space, or vice versa. According to Equation 19, $\sigma_{h^k_{ij}}$ increases with large variation in the normal direction ($[\operatorname{diag} \Sigma_\varepsilon]_k$), or with small variation in the tangent direction ($[\operatorname{diag} (t't)^{-1}]_{(ij)}$). When we choose $d > d^*$, we mis-attribute a normal direction with small variation $[\operatorname{diag} \Sigma_\varepsilon]_k$ as an independent variable, whereas variation along the true tangent space is $\gg [\operatorname{diag} \Sigma_\varepsilon]_k$.
$r$ must therefore be increased to compensate for the lack of variation along this direction mis-classified as tangent. When $d < d^*$, we have spuriously assigned a tangent direction with large variation to be a normal direction. Since this spurious normal coordinate cannot be well-approximated as a function of tangent coordinates from which it is linearly independent, the perceived noise scale ($[\operatorname{diag} \Sigma_\varepsilon]_k$) is exaggerated so that a larger neighborhood is needed to attain $\sigma_h$.

This suggests a crude, operational definition of what constitutes an incorrect choice of $d$. When $\sigma_{h^k_{ij}}$ is large relative to the uncertainty in other coefficients, there is either too little variation along the $i$th and $j$th tangent directions, or too much variation along the $k$th normal direction. In the former case, the $i$th or $j$th tangent direction might be more appropriately classified as a normal direction ($d$ is too large and should be decreased), while in the latter case, the $k$th normal direction might be more appropriately classified as a tangent direction ($d$ is too small and should be increased). When this criterion is applied point-wise, there may be a different acceptable choice of $d$ for different parts of the manifold.
When this criterion is generalized over the entire manifold, a $\sigma_h$ yielding a flat distribution of GOF p-values when the manifold dimension is specified to be $d$ will also yield a flat distribution for $d + 1$ but not necessarily for $d - 1$: if residuals in $n - d$ dimensions are well-modelled by a multivariate Gaussian, so too will residuals in $n - d - 1$ dimensions, but not necessarily residuals in $n - d + 1$ dimensions (see Figure S2D). Our observations are consistent with manifolds in the literature that admit multiple possible manifold dimensions (like the helix manifold in [36]), which could generally arise from non-isotropic noise or non-uniform sampling.

4.4.3 Parameters for Curvature Estimation

For each manifold in Figure 2, we chose $\sigma_h$ so that the fraction of points with GOF p-value $\le \alpha = 0.05$ most closely matched the null model of normally distributed residuals consistent with neighborhood sizes well-approximated by quadratic regression (see Section 4.3.3). $\sigma_h = (0.017, 0.020, 0.028, 0.055, 0.022, 0.030)$ for $(S^2, S^3, S^5, S^7, H^2_2, T^2)$ resulted in $(7.4, 3.5, 1.4, 2.8, 7.4, 4.0)\%$ of points having GOF p-values $\le \alpha = 0.05$. Theoretically, $\max |h^k_{ij}| = (0.5, 2, 2.5)$ for $(S^d, H^2_2, T^2)$, so our choices for $\sigma_h$ result in small fractional errors in all cases. For Figure S2A, we set $\sigma_h = (0.02, 0.01)$ for $N_2 = (10^3, 10^4)$ respectively, which resulted in $(1.6, 2.5)\%$ of points having GOF p-values $\le \alpha = 0.05$. For all other panels in Figure S2, where we were interested in ascertaining the sensitivity to different confounders instead of minimizing uncertainty per se, we used a fixed value of $\sigma_h = 0.05$. This choice resulted in neighborhoods small enough to be well-approximated by quadratic regression, manifesting as a roughly uniform distribution of GOF p-values in all cases.

4.5 Details of Image Patch Dataset and Klein Bottle Manifolds

4.5.1 Notation and Preliminaries

First we introduce some notation needed to describe the image patch dataset.
We refer readers to [21, 40] for a more detailed exposition. Let $P$ be the space of all bivariate polynomials $p : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ with $p \in P$, $h : P \to \mathbb{R}^9$ the vectorization operator given by $h(p) = [p(-1, 1), p(-1, 0), p(-1, -1), p(0, 1), p(0, 0), p(0, -1), p(1, 1), p(1, 0), p(1, -1)]^T$, $u : \mathbb{R}^m \to S^{m-1}$ the normalization operator given by $u(v) = \frac{v}{\|v\|_2}$, and $c : \mathbb{R}^9 \to \mathbb{R}^8$ the projection operator given by $c(y) = \Lambda A^T y$, where $A = [e_1 \ldots e_8]$, $\Lambda = \operatorname{diag}\left\{\frac{1}{\|e_1\|_2^2}, ..., \frac{1}{\|e_8\|_2^2}\right\}$, and $\{e_i\}$ are vectorized basis vectors for the 2-dimensional discrete cosine transform (DCT) applied to 3x3 patches:

$$\begin{aligned} e_1 &= [1, 0, -1, 1, 0, -1, 1, 0, -1]^T / \sqrt{6} \\ e_2 &= [1, 1, 1, 0, 0, 0, -1, -1, -1]^T / \sqrt{6} \\ e_3 &= [1, -2, 1, 1, -2, 1, 1, -2, 1]^T / \sqrt{54} \\ e_4 &= [1, 1, 1, -2, -2, -2, 1, 1, 1]^T / \sqrt{54} \\ e_5 &= [1, 0, -1, 0, 0, 0, -1, 0, 1]^T / \sqrt{8} \\ e_6 &= [1, 0, -1, -2, 0, 2, 1, 0, -1]^T / \sqrt{48} \\ e_7 &= [1, -2, 1, 0, 0, 0, -1, 2, -1]^T / \sqrt{48} \\ e_8 &= [1, -2, 1, -2, 4, -2, 1, -2, 1]^T / \sqrt{216} \end{aligned} \tag{29}$$

By inspection, $e_1$ is the basis vector for patches with horizontal stripes and linear gradients, $e_2$ for patches with vertical stripes and linear gradients, $e_3$ for patches with horizontal stripes and quadratic gradients, $e_4$ for patches with vertical stripes and quadratic gradients, and $e_5$ for diagonally-oriented patches with quadratic gradients. All the patches produced by the embedding $k^0$ in Equation 31 below and visualized in Figure 3B can be written as a linear combination of these 5 basis vectors. Next, note that the components in each $e_i$ sum to 0, so that the projection operator, $c$, additionally serves to remove the mean.
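The properties of the DCT basis stated above can be checked numerically. The sketch below copies the eight vectors from Equation 29 and implements the projection $c(y) = \Lambda A^T y$; the helper names are ours, and the mutual orthogonality of the $e_i$ is something the code verifies rather than a claim made in the text.

```python
import math

# Integer patterns and squared normalization constants copied from Equation 29.
RAW_BASIS = [
    ([1, 0, -1, 1, 0, -1, 1, 0, -1], 6),
    ([1, 1, 1, 0, 0, 0, -1, -1, -1], 6),
    ([1, -2, 1, 1, -2, 1, 1, -2, 1], 54),
    ([1, 1, 1, -2, -2, -2, 1, 1, 1], 54),
    ([1, 0, -1, 0, 0, 0, -1, 0, 1], 8),
    ([1, 0, -1, -2, 0, 2, 1, 0, -1], 48),
    ([1, -2, 1, 0, 0, 0, -1, 2, -1], 48),
    ([1, -2, 1, -2, 4, -2, 1, -2, 1], 216),
]
BASIS = [[x / math.sqrt(scale) for x in pattern] for pattern, scale in RAW_BASIS]


def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(y):
    """c(y) = Lambda A^T y, with Lambda = diag(1 / ||e_i||_2^2)."""
    return [dot(e, y) / dot(e, e) for e in BASIS]


constant_patch = [3.0] * 9                       # a patch with no contrast
shifted_e3 = [value + 3.0 for value in BASIS[2]]  # e_3 plus a constant offset

projection_of_constant = project(constant_patch)
projection_of_shifted = project(shifted_e3)
```

Because every basis vector sums to zero, the constant patch projects to the zero vector and the offset added to $e_3$ is discarded, leaving the third unit vector of $\mathbb{R}^8$.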
Finally, observe that the vector norm formed under $D = A\Lambda^2 A^T$ (referred to hereafter as the D-norm following [40]) measures the contrast in a 3x3 patch since

$$\|v\|_D = \sqrt{v^T D v} = \frac{1}{2}\sqrt{\sum_i \sum_{j \sim i} (v_i - v_j)^2} \tag{30}$$

where $j \sim i$ refers to all vertical and horizontal neighbors, $j$, of a pixel $i$ in the preimage of $v$ under $h$. The $e_i$ are normalized so that $\|e_i\|_D = 1$.

4.5.2 Image Dataset

We used the same van Hateren IML dataset [41] consisting of 4167 greyscale images of size 1532x1020 pixels studied by Carlsson et al. in [21] and followed the same preprocessing steps used there. In short, we applied a log1p transformation to all pixel values and randomly sampled $5 \times 10^3$ (possibly overlapping) 3x3 patches from each image. We indexed the pixels in each patch using standard Cartesian coordinates with the middle pixel as the origin, so that log-transformed pixel values are given by $p(x, y)$, $x \in \{-1, 0, 1\}$, $y \in \{-1, 0, 1\}$. We then applied $h$ to vectorize each patch $p$, and retained the high-contrast patches comprising the top quintile of D-norms for each image, resulting in $N \approx 4.2 \times 10^6$ datapoints. Next, we normalized these high-contrast vectorized patches using the composition $u \circ c$, resulting in a set of datapoints on $S^7 \subset \mathbb{R}^8$. We determined the density of these datapoints in $\mathbb{R}^8$ using the kNN density estimator with $k = 100$, and retained the densest decile, which yielded $N \approx 4.2 \times 10^5$ datapoints.
This dense subset of high-contrast normalized patches was found using topological data analysis in [21] to be a Klein bottle, $K^2 \subset S^7$, and is studied in Figures 3D,I and S3B.

To generate the augmented image patch dataset used in Figures 3J and S3E,F, we first considered all $N \approx 1.3 \times 10^9$ vectorized high-contrast patches in the van Hateren IML dataset using the same procedure described above (each of the 4167 images yields $1530 \times 1018$ patches, of which the top 20% by D-norm are retained per image). These were normalized by $u \circ c$ as before to place them on $S^7 \subset \mathbb{R}^8$. We again wanted to retain the densest decile of points, since only these have the topology of a Klein bottle. Mirroring the approach in [21], where the $k$ used in the kNN estimator was scaled with sample size, $k = 10^2$ used for $N \approx 4.2 \times 10^6$ corresponds to $k = 10^2 \times \frac{1.3 \times 10^9}{4.2 \times 10^6} \approx 3 \times 10^4$ for $N \approx 1.3 \times 10^9$. Computing $k \approx 3 \times 10^4$ neighbors for all $N \approx 1.3 \times 10^9$ points is prohibitive however. To determine a reasonable smaller value of $k$, we randomly selected $2 \times 10^4$ points from the set of $N \approx 1.3 \times 10^9$ on which to compare estimators and found that 90% of points in the densest decile as computed with $k = 3 \times 10^4$ also appeared in the densest decile computed using $k = 6 \times 10^2$. We therefore used the latter value for density estimation and retained the $N \approx 1.3 \times 10^8$ datapoints comprising the densest decile.

4.5.3 Parametric Family of Klein Bottle Embeddings

Let $\theta, \phi \in [0, 2\pi]$. Bivariate polynomials parameterized by $(\theta, \phi)$, $k_{\theta,\phi} \in K_{\theta,\phi} \subset P$, that satisfy $k_{\theta,\phi} = k_{\theta+\pi, 2\pi-\phi}$ form a Klein bottle, $K^2$: the $(\theta, \phi) \sim (\theta + \pi, 2\pi - \phi)$ similarity relation results in edges being glued together in the manner definitional of a Klein bottle's topology (shown in Figure 3B).
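The similarity relation can be checked numerically for the candidate embedding $k^0$ given in Equation 31. The sketch below evaluates $k^0$ on the 3x3 pixel grid, ordered as in the vectorization operator $h$ of Section 4.5.1, and compares the patch at $(\theta, \phi)$ with the patch at $(\theta + \pi, 2\pi - \phi)$. The function names and the test angles are arbitrary choices made for illustration.

```python
import math


def k0(theta, phi, x, y):
    """Candidate Klein bottle polynomial of Equation 31 evaluated at pixel (x, y)."""
    ramp = x * math.cos(theta) + y * math.sin(theta)
    return math.cos(phi) * ramp ** 2 + math.sin(phi) * ramp


def patch(theta, phi):
    """Vectorized 3x3 patch: x runs over (-1, 0, 1) and, for each x, y runs over (1, 0, -1)."""
    return [k0(theta, phi, x, y) for x in (-1, 0, 1) for y in (1, 0, -1)]


theta, phi = 0.4, 1.3
original = patch(theta, phi)
identified = patch(theta + math.pi, 2.0 * math.pi - phi)
max_difference = max(abs(a - b) for a, b in zip(original, identified))
```

The two patches agree to rounding error: shifting $\theta$ by $\pi$ flips the sign of the linear ramp, which the quadratic term ignores and which the sign change of $\sin\phi$ under $\phi \to 2\pi - \phi$ cancels in the linear term.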
The candidate Klein bottle embedding supplied in [21] to model image patch data satisfies the similarity relation $\forall x, y$:

$$k^0 \equiv k^0_{\theta,\phi}(x, y) = \cos\phi \, [x \cos\theta + y \sin\theta]^2 + \sin\phi \, [x \cos\theta + y \sin\theta] \tag{31}$$

Note that any $k_{\theta,\phi} \in K_{\theta,\phi}$ can be decomposed as:

$$k_{\theta,\phi} = C + \kappa_\theta + \kappa_\phi + \kappa_{\theta,\phi} \tag{32}$$

where $\kappa_\theta = \kappa_{\theta+\pi}$, $\kappa_\phi = \kappa_{2\pi-\phi}$ and $\kappa_{\theta,\phi} = \kappa_{\theta+\pi, 2\pi-\phi}$. The first three terms can be understood as constant, $\theta$-dependent and $\phi$-dependent phases respectively. We sought an embedding of the Klein bottle for which the sum of Euclidean distances from each image patch to its closest point on the embedding is minimized. To accomplish this, we constructed a parametric family of models for each of the four terms in Equation 32. The first three of these are most conveniently expressed directly in the DCT basis.

$$\begin{aligned} (c \circ h)(C) &= N_C \sum_{i=1}^{8} \mu_i e_i \\ (c \circ h)(\kappa_\theta) &= \sum_{i=1}^{8} \left[ \sum_{\substack{j=2 \\ j \text{ even}}}^{N_\theta} \beta_{i,j} \cos(j\theta) + \gamma_{i,j} \sin(j\theta) \right] e_i \\ (c \circ h)(\kappa_\phi) &= \sum_{i=1}^{8} \left[ \sum_{j=1}^{N_\phi} \zeta_{i,j} \cos(j\phi) \right] e_i \end{aligned} \tag{33}$$

$N_C$ is a Boolean variable, and $N_\theta$ and $N_\phi$ control the number of terms in the inner sum for $(c \circ h)(\kappa_\theta)$ and $(c \circ h)(\kappa_\phi)$ respectively. The expression for $(c \circ h)(\kappa_\theta)$ only includes even coefficients for $\theta$ so that the similarity relation $(\theta) \sim (\theta + \pi)$ is satisfied. The expression for $(c \circ h)(\kappa_\phi)$ only includes cosine terms so that the similarity relation $(\phi) \sim (2\pi - \phi)$ is satisfied. For $\kappa_{\theta,\phi}$, we refrained from writing a Fourier series-like expansion because we wanted to preserve the interpretation of $\theta$ and $\phi$ as parameters controlling the orientation and gradient respectively [21].
Instead, we devised the following form, which we motivate further below:

κθ,φ(x,y) = Mφ∑ l=1 cosl(φ) s+t≤Mθ∑ 0≤s,t≤Mθ 1 0 even and t odd −2 √ 6 3 e2 − 4 √ 3 3 e7, if t > 0 even and s odd 2 √ 6 3 (e3 + e4 + e8) , if s > 0 even and t > 0 even (35)

Note that the first inner sum in Equation 34 is a linear combination of basis vectors encoding purely quadratic gradients ($e_3$, $e_4$, $e_5$ and $e_8$), weighted by even trigonometric functions of $\theta$. The prefactors on this inner sum are functions that are even in $\phi$. This inner sum and its prefactor therefore jointly satisfy the similarity relation $(\theta, \phi) \sim (\theta + \pi, 2\pi - \phi)$ by independently satisfying $(\theta) \sim (\theta + \pi)$ and $(\phi) \sim (2\pi - \phi)$. Meanwhile, the second inner sum in Equation 34 is a linear combination of basis vectors containing linear gradients ($e_1$, $e_2$, $e_6$ and $e_7$), weighted by odd trigonometric functions of $\theta$. The prefactors on this inner sum are functions that are odd in $\phi$. This inner sum and its prefactor therefore jointly satisfy the similarity relation $(\theta, \phi) \sim (\theta + \pi, 2\pi - \phi)$ by independently satisfying $(\theta) \sim -(\theta + \pi)$ and $(\phi) \sim -(2\pi - \phi)$. Since the trigonometric functions of $\theta$ are coupled to $(x, y)$, $\theta$ controls the rotation of stripes in the image patches, just as in $k^0$. Similarly, since the prefactors on the inner sums are functions of $\phi$, $\phi$ controls the relative contribution of quadratic gradients ($e_3$, $e_4$, $e_5$ and $e_8$ in the first inner sum) and linear gradients ($e_1$, $e_2$, $e_6$ and $e_7$ in the second inner sum). Lastly, the boundary conditions for $\theta$ and $\phi$ in this parameterization of $\kappa_{\theta,\phi}$ yield patches with vertical (horizontal) stripes when $\theta = 0$ ($\theta = \frac{\pi}{2}$), and linear (quadratic) gradients when $\phi = \frac{\pi}{2}, \frac{3\pi}{2}$ ($\phi = 0, \pi$), just as in $k^0$.

A Klein bottle embedding belonging to this parametric family, $k^\alpha_{\theta,\phi} \in K_{\theta,\phi}$, can therefore be specified in terms of a vector $F = [N_C, N_\theta, N_\phi, M_\theta, M_\phi]$ defining its functional form, and a corresponding coefficient vector $\alpha = [\mu_i, ..., \beta_i, ..., \gamma_i, ..., \zeta_i, ..., \eta_i, ..., \vartheta_i]$.
In this parametric family of Klein bottle embeddings, $k^0$ corresponds to $F = [0, 0, 0, 2, 1]$ with $\alpha = [\eta_{1,2,0}, \eta_{1,1,1}, \eta_{1,0,2}, \vartheta_{1,0,1}, \vartheta_{1,1,0}] = [1, 2, 1, 1, 1]$. Note that since curvatures are only computed on the embedding after normalization, α is only meaningfully defined up to a multiplicative constant.

4.5.4 Associating Image Patches to a Klein Bottle Embedding

For a given Klein bottle embedding, $k^\alpha_{\theta,\phi} \in K_{\theta,\phi}$, we associated each datapoint $v_i$ (already vectorized and normalized by $u\circ c\circ h$) to the closest point on $k^\alpha_{\theta,\phi}$ by minimizing the Euclidean distance in $\mathbb{R}^8$:

$$(\hat\theta_i, \hat\phi_i) = \arg\min_{\theta,\phi} \left\| (u\circ c\circ h)\left(k^\alpha_{\theta,\phi}\right) - v_i \right\|_2^2 \tag{36}$$

We solved this minimization using the lsqnonlin function (‘StepTolerance’=1e-3, ‘FunctionTolerance’=1e-6) in MATLAB, supplying initial conditions corresponding to analytical values for a point on $k^0$: θ̂i = arctan e1 Tvi −e2Tvi ( vi∈(u◦c◦h)(k0) = arctan sin φ̂i sin θ̂i sin φ̂i cos θ̂i ) φ̂i = arctan √ (e1Tvi)2 + (e2Tvi)2 (e3Tvi) + e4Tvi  vi∈(u◦c◦h)(k0)= arctan √ sin2 φ̂i cos2 φ̂i   (37)

We constrained solutions to $\hat\theta_i \in [0, \pi]$ and $\hat\phi_i \in [0, 2\pi]$ according to the (θ,φ) similarity relation.

4.5.5 Optimal Klein Bottle Embedding

Let $k^{\hat\alpha}_{\theta,\phi} \in K_{\theta,\phi}$ be the Klein bottle embedding that minimizes the sum of Euclidean distances in $\mathbb{R}^8$ between each image patch and the closest point on the embedding. To determine $k^{\hat\alpha}_{\theta,\phi}$ given a functional form F, we initialized the coefficient vector $\hat\alpha$ to have zero entries everywhere except for the values used in $k^0$.
We then iterated between optimizing for $(\hat\theta_i, \hat\phi_i)$ according to Equation 36 and for $\hat\alpha$ as shown below using least-squares, until convergence:

$$\hat\alpha = \arg\min_{\alpha} \sum_i \left\| (u\circ c\circ h)\left(k^\alpha_{\hat\theta_i,\hat\phi_i}\right) - v_i \right\|_2^2 \tag{38}$$

$k^1 \equiv k^{\hat\alpha}_{\theta,\phi}$ is the optimized Klein bottle embedding corresponding to $F = [1, 10, 10, 20, 10]$, for which results are shown in Figures 3H and S3D.

4.5.6 Noisy Klein Bottle Embeddings

The set of $N \approx 4.2\times10^5$ image patches was associated to $k^0$ according to the procedure described in Methods Section 4.5.4, yielding $(\hat\theta_i, \hat\phi_i)$ values. Isotropic Gaussian noise of magnitude $s\sigma$ was added element-wise in $\mathbb{R}^9$ (prior to normalization by $u\circ c$) to $h(k^0_{\hat\theta_i,\hat\phi_i})$, where $s = \mathrm{median}_i\{\|h(k^0_{\hat\theta_i,\hat\phi_i})\|_2\} \approx 2.451$. Figures 3F,G and S3A correspond to noise with σ = 0.007, 0.01 and 0.03.

4.5.7 Parameters for Curvature Estimation

For all scalar curvature computations on image patch datasets and Klein bottle embeddings, we set d = 2 and $N_{calib} = 10^4$. Unless the neighborhoods were manually specified, we used $\sigma_h = 0.1$, which yielded a flat distribution of GOF p-values (2.5% of points reported GOF p-values ≤ α = 0.05) for the set of $N \approx 4.2\times10^5$ points on $k^0$ closest to the image patches (shown in Figure 3E).

4.6 Details of scRNASeq Datasets

The PBMC dataset provided by 10x Genomics comprises N = 10194 PBMCs collected from a healthy donor [42]. The mouse gastrulation dataset consists of N = 116312 cells collected at nine 6-hour intervals from embryonic day 6.5 to 8.5 [43].
The mouse brain dataset is a benchmark from 10x Genomics consisting of N = 1306127 cells collected from the cortex, hippocampus and ventricular zone of two embryonic mice sacrificed at embryonic day 18 [44].

4.6.1 Preprocessing

For the PBMC dataset, we applied standard preprocessing steps using Seurat v3.1.2 [53] with default function arguments, to extract PC projections and UMAP coordinates ourselves. Specifically, we removed cells where the percentage of transcripts corresponding to mitochondrial genes exceeded 15%, or which had fewer than 500 transcripts. This reduced the number of cells from 10194 to 9385. On this filtered set, we normalized the data (NormalizeData(normalization.method=‘LogNormalize’, scale.factor=10000)), retained the 2000 most variable genes (FindVariableFeatures(selection.method=‘vst’, nfeatures=2000)), and scaled the data (ScaleData). Next, we performed linear dimensionality reduction using PCA down to 50 dimensions (RunPCA(npcs=50)) and generated UMAP coordinates for visualization (RunUMAP(dims = 1:30)). For the gastrulation (brain) dataset, we did not preprocess the data ourselves but instead directly used the 50 (20) PC projections and UMAP (t-SNE) visualization coordinates provided with the dataset. Please refer to [43, 44] for additional details.

4.6.2 Cell Type Annotations

For the PBMC dataset, the AddModuleScore(ctrl=5) function was used to compute the per-cell average expression of marker genes corresponding to seven different cell types [54]. To prepare Figure S4A, each cell was assigned the cell type for which its average marker gene expression was the highest. Cell type annotations for the gastrulation dataset (see Figure S5A) were sourced from Figure 1C of [43]. Cell type annotations for the brain dataset (see Figure S6A) are predicted labels sourced from [55].

4.6.3 Statistical Tests

Here we describe the statistical tests applied to scalar curvatures computed for the scRNAseq datasets.

4.6.3.1 Spatial Precision of Errorbars

Let m be the fraction of datapoints with 95% CIs containing the scalar curvatures reported by their respective kNNs. To check whether m was significantly larger than chance, we used a permutation test. We randomly assigned the kNN of each datapoint to be one of the N datapoints in the dataset and computed m. We repeated the procedure T = 1000 times to generate an empirical distribution of m for the null model of random neighbors. The reported p-value for each k is the fraction of the T trials for which m was greater than the value computed for the data. See Figures S4D, S5D and S6D.

4.6.3.2 Sensitivity to Cell Downsampling

To check the sensitivity of the computed scalar curvatures to the average density of cells, we discarded f% of cells at random from the ambient space computed using the original set of N datapoints, and recomputed scalar curvatures using the same ambient dimension, manifold dimension and neighborhood sizes as for the original dataset (see Methods Section 4.6.4). Let m be the fraction of downsampled datapoints with 95% CIs containing the scalar curvatures originally reported. Since the CIs grow as f increases, we checked whether m was significantly larger than chance by using a permutation test. We randomly paired each of the 95% CIs computed after downsampling to one of the scalar curvatures reported by the downsampled points for the original dataset, and computed m.
We repeated the procedure T = 1000 times to generate an empirical distribution of m for the null model. The reported p-value for each f is the fraction of the T trials for which m was greater than the value computed for the data. See Figures S4I, S5I and S6I.

4.6.3.3 Sensitivity to Transcript Downsampling

To check the sensitivity of the computed scalar curvatures to the capture efficiency and sequencing depth of the data, we discarded f% of transcripts at random from the single-cell count matrix for the PBMC dataset, then performed the same preprocessing steps described in Methods Section 4.6.1. We recomputed scalar curvatures using the same ambient dimension, manifold dimension and neighborhood sizes as for the original dataset (see Methods Section 4.6.4). Let m be the fraction of datapoints with 95% CIs containing the scalar curvatures originally reported. To check whether m was significantly larger than chance, we used a permutation test. We randomly paired each of the 95% CIs computed after downsampling transcripts to one of the scalar curvatures computed for the original dataset, and computed m. We repeated the procedure T = 1000 times to generate an empirical distribution of m for the null model. The reported p-value for each f is the fraction of the T trials for which m was greater than the value computed for the data. See Figure S4J.

4.6.4 Parameters for Curvature Estimation

Let the variance explained by the ith PC be given by $\sigma_i^2$ and the cumulative fractional variance of the first m PCs by $c_m = \frac{\sum_{i=1}^{m} \sigma_i^2}{\sum_i \sigma_i^2}$.
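The dimension-selection rule used in this section (the ambient dimension n is the largest m with $c_m \le 0.8$, the manifold dimension d is the largest m with $c_m \le 0.64$, and the global length scale is $L = 3\sigma_d$) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `select_dimensions` is hypothetical, and the sketch assumes the total variance in the denominator of $c_m$ is the sum over the PCs supplied.

```python
import numpy as np


def select_dimensions(pc_variances, ambient_threshold=0.8, manifold_threshold=0.64):
    """Return (n, d, L) from per-PC variances sorted in decreasing order.

    Assumption for this sketch: the total variance is the sum of the
    variances passed in (e.g. the PCs retained after preprocessing).
    """
    variances = np.asarray(pc_variances, dtype=float)
    # c_m for m = 1..len(variances); non-decreasing in m
    cumulative = np.cumsum(variances) / variances.sum()

    def largest_m(threshold):
        eligible = np.flatnonzero(cumulative <= threshold)
        if eligible.size == 0:
            raise ValueError("the first PC already exceeds the threshold")
        return int(eligible[-1]) + 1  # convert 0-based index to PC count

    n = largest_m(ambient_threshold)
    d = largest_m(manifold_threshold)
    # L = 3 * sigma_d, the standard deviation along the d-th PC
    length_scale = float(3.0 * np.sqrt(variances[d - 1]))
    return n, d, length_scale


if __name__ == "__main__":
    # toy variances: c_m = 0.4, 0.7, 0.9, 1.0
    print(select_dimensions([4.0, 3.0, 2.0, 1.0]))
```

Because $c_m$ is non-decreasing in m, taking the last index that satisfies the threshold is equivalent to maximizing $c_m$ subject to the constraint.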
For each dataset, we selected the ambient dimension as $n = \arg\max_m \{c_m \mid c_m \le 0.8\}$, the manifold dimension as $d = \arg\max_m \{c_m \mid c_m \le 0.64\}$, and considered the global length scale to be $L = 3\sigma_d$. (n, d, L) = (8, 3, 18.3), (11, 3, 19.1) and (9, 5, 24.9) for the PBMC, gastrulation and brain datasets respectively. For the three datasets, we computed scalar curvatures for manifold dimensions d − 1, d and d + 1. It was not always possible to select $\sigma_h$ for each dataset and manifold dimension so that the distribution of GOF p-values was flat, according to our usual heuristic. For consistency, we therefore picked $\sigma_h$ so that 1/3 of points had GOF p-values ≤ α = 0.05. For manifold dimension (d − 1, d, d + 1), $\sigma_h$ = (0.031, 0.041, 0.045), (0.036, 0.044, 0.053) and (0.034, 0.050, 0.055) for the PBMC, gastrulation and brain datasets respectively.

5 Acknowledgements

DS was funded in part by the Natural Sciences and Engineering Research Council of Canada (NSERC PGSD2-517131-2018). SW was supported by NCI U54-CA225088 and NIH NIGMS T32 GM008313. DS and SH acknowledge funding from NIH NIGMS R00GM118910, U19 Systems Immunology Pilot Project Grant at Harvard University, and the Harvard University William F. Milton Fund. The authors would like to thank Peter Kharchenko and Allon Klein for helpful discussions. Portions of this research were conducted on the O2 High Performance Compute Cluster, supported by the Research Computing Group, at Harvard Medical School. See http://rc.hms.harvard.edu for more information.

6 Data and Code Availability

The van Hateren IML dataset is available at http://bethgelab.org/datasets/vanhateren and was loaded according to the instructions there. The PBMC dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/Parent_NGSC3_DI_PBMC. The gastrulation dataset can be retrieved using instructions found at https://github.com/MarioniLab/EmbryoTimecourse2018.
The brain dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons. The software package described here to compute scalar curvature is available at https://gitlab.com/hormozlab/ManifoldCurvature. All code and instructions to reproduce the numerics and figures in this study will be made available upon publication.

7 Supplementary Figures

Figure S1: The scalar curvature of $S^2$ is poorly estimated using the Laplace-Beltrami operator. (A) The heat-trace with m terms ($z_m(x)$ in Equation 3) is shown for m = ∞ (black), m = 1000 (solid blue) and m = 49 (solid red), when evaluated with analytical eigenvalues for $S^2$. Empirical eigenvalues were obtained by uniformly sampling $N = 10^4$ points from $S^2$ (see Figure 1A; Methods Section 4.4.1.1) and estimating the Laplace-Beltrami (LB) operator using Equations 13-15. The heat-trace evaluated using these empirical eigenvalues, $\hat{z}_m$, is shown for m = 1000 (dashed blue) and m = 49 (dashed red). The heat-trace evaluated using eigenvalues obtained by interpolating between the analytical and empirical values ($\tilde{z}_m(x; f)$ in Equation 11) is shown for m = 49 and f = 0.23 (solid green). f signifies that the fractional error of the interpolated eigenvalues is reduced by 1 − f relative to the empirical eigenvalues. f = 0 corresponds to the analytical eigenvalues while f = 1 corresponds to the empirical eigenvalues. The white region bounded by $[x_1, x_2]$ indicates a candidate interval over which to fit a heat-trace to a quadratic in order to extract an estimate for the scalar curvature (see Equations 2-4; Methods Section 4.2.1). On the one hand, since the knee of $z_m(x)$ shifts to the left as m increases (i.e. $z_m(x)$ converges from ∞), larger m results in more intervals for which $z_m(x)$ well-approximates $z_\infty(x)$ and will therefore yield accurate scalar curvature estimates. On the other hand, the empirical heat-trace $\hat{z}_m(x)$ becomes a worse estimator for $z_m(x)$ as m increases. (B) Scalar curvatures estimated by fitting $z_\infty(x)$ to a quadratic over different intervals $[x_1, x_2]$ as defined in (A).
Scalar curvatures are shown in color for intervals yielding accurate estimates (S ∈ [1.5, 2.5]). This colored region corresponds to $D_\infty$. (C) As in (B) but with estimates obtained by fitting a quadratic to $z_{1000}(x)$. The colored region corresponds to $D_{1000}$. By inspection, $D_{1000} \subset D_\infty$. (D) Scalar curvatures estimated by fitting the empirical heat-trace $\hat{z}_{1000}(x)$ to a quadratic over each interval in $D_{1000}$. Though $D_{1000}$ was constructed using only intervals which yielded an accurate scalar curvature estimate when analytical eigenvalues were used in the heat-trace, no interval in $D_{1000}$ yields an accurate scalar curvature estimate when the same number of empirical eigenvalues are used in the heat-trace instead. (E) As in (B) but with estimates obtained by fitting a quadratic to $z_{49}(x)$. The colored region corresponds to $D_{49}$. By inspection, $D_{49} \subset D_{1000}$. (F) As in (D) but with estimates obtained by fitting the empirical heat-trace $\hat{z}_{49}(x)$ to a quadratic over each interval in $D_{49}$. No estimate is accurate, just as in (D). (G) As in (F) but with estimates obtained by fitting $\tilde{z}_{49}(x; f = 0.23)$ to a quadratic over each interval in $D_{49}$. f = 0.23 was chosen so that half the intervals in $D_{49}$ yield an accurate scalar curvature estimate. (H) (Left) The fractional error in the first 49 empirical eigenvalues of the LB estimator from (A) is shown in red. This operator was computed using the Gaussian kernel ($W_g$ in Equation 15). Eigenvalues 37-49 have a fractional error of 60%. The fractional error of the eigenvalues of LB estimators computed on the same $N = 10^4$ points but using the weighted kNN and r-neighborhood kernels ($W_{kNN}$ and $W_r$ respectively in Equation 16) is also plotted. Positive error indicates under-estimation. (Right) Projected fractional error for eigenvalues 37-49 of the LB estimator with Gaussian kernel computed using a larger sample size (N). The projection is based on the convergence rate given in Theorem 1 of [30], assuming that the big-O bound is sharp at $N = 10^4$ for eigenvalues 37-49.
The dashed green line corresponds to the 14% fractional error needed for scalar curvatures to be accurately estimated for half the intervals in $D_{49}$. This corresponds to f = 0.23 in (G) since 60% × f = 14%. For the LB estimator computed using the Gaussian kernel, achieving this fractional error requires $N \approx 10^7$. Since LB estimators computed using the other kernels have the same convergence rate but larger fractional error at $N = 10^4$, these estimators would require even larger N to achieve the desired 14% fractional error.

Figure S2: Sensitivity of algorithm to real-world confounders. (A) (Left) A dataset with a sparse periphery and a dense core was formed by uniformly sampling $N_1 = 10^4$ points from the 3-dimensional cube of side-length 10, $D^3_{10}$, and $N_2 = 10^3$ points from the 3-dimensional cube of side-length 1, $D^3_1$ (see Methods Section 4.4.1.4). These points were embedded in $\mathbb{R}^{11}$ and padded with isotropic Gaussian noise of magnitude σ = 0.01 in the 8 normal directions. Scalar curvatures (S) were computed on this dataset of $N_1 + N_2$ points by setting $\sigma_h$ and are plotted against their standard errors ($\sigma_S$) in the leftmost panel. Curvature computations were also performed at fixed length scales corresponding to the 5, 50 and 95%-ile values for neighborhood size (left to right) used in the leftmost panel (r = 0.54, 0.90 and 1.22 respectively). Here, points for which the chosen r led to neighborhoods with insufficient points for regression are not shown.
For large length scales, all points in the dense region are able to report curvatures but are crowded into the apex of the plots. The $N_1$ ($N_2$) sparse (dense) points are shown in blue (green). Points enclosed by the red lines have 95% CIs including the true value of zero. The right four panels show analogous results when $N_2 = 10^4$. Here the 5, 50 and 95%-ile values for neighborhood size are r = 0.37, 0.63 and 1.42 respectively. See Methods Section 4.4.2.1. (B) Distribution of scalar curvatures computed for $N = 10^4$ points uniformly sampled from $S^2 \subset \mathbb{R}^3$ and convolved with isotropic Gaussian noise of magnitude σ in $\mathbb{R}^3$. Noise confounds accurate scalar curvature computation when σ is roughly 10% of the sphere’s radius. The deviation of the estimated scalar curvatures from the true value of 2 (shown as a dashed red line) for σ ≥ 0.1 reflects the nontrivial geometry of a manifold convolved with noise. See Methods Section 4.4.2.2. (C) (Left) $N = 10^4$ points were uniformly sampled from $S^2$ and embedded in $\mathbb{R}^n$. Isotropic Gaussian noise of magnitude σ was applied to each of the n ambient dimensions. Scalar curvatures computed by keeping $\sigma_h$ fixed for all n and σ recapitulated the true value of 2 (shown as dashed red lines) for n ≤ 80 and σ ≤ 0.05. (Right) The neighborhood size (r) necessary to attain $\sigma_h$ is less sensitive to changes in n than changes in σ. See Methods Section 4.4.2.3. (D) $N = 10^4$ points were uniformly sampled from (left) $S^3 \subset \mathbb{R}^5$ convolved with isotropic Gaussian noise in the ambient space with σ = 0.01 and (right) $S^2 \times S^2 \subset \mathbb{R}^6$. To investigate the effects of choosing the manifold dimension, d, differently than the true value, $d^*$, $\sigma_h$ was kept fixed, and scalar curvatures were computed for $d = d^* - 1$ (cyan), $d = d^* + 1$ (magenta) and $d = d^*$ (green). The panels show the distribution of (left to right) scalar curvatures (S), standard errors ($\sigma_S$) and GOF p-values.
The true value of the scalar curvature (at $d = d^*$) is constant across both manifolds and shown as a dashed red line. The average neighborhood size (r averaged over all points) is much larger for both $d = d^* - 1$ and $d = d^* + 1$ than for $d = d^*$ as shown in the legend. For the same $\sigma_h$, $d = d^* - 1$ also leads to a more skewed distribution of GOF p-values relative to $d = d^*$, while the distribution for $d = d^* + 1$ is still flat. See Methods Section 4.4.2.4.

Figure S3: Additional details of the image patch dataset and Klein bottle embeddings (related to Figure 3). (A) To compute scalar curvatures for Figure 3E, each image patch was associated to the $(\theta_0, \phi_0)$ coordinates of the closest point on $k^0$. Here we select a handful of these associated points on $k^0$ (shown in black) and visualize how neighborhoods chosen in $\mathbb{R}^8$ to compute scalar curvatures for Figure 3E appear in $(\theta_0, \phi_0)$ coordinates (shown in red). When noise of increasing magnitude, σ, is added to the set of closest points on $k^0$ (see Methods Section 4.5.6), the neighborhood size at each point grows until $\sigma_h$ is attained. (B) As in (A), but showing neighborhoods used in computing the scalar curvatures in Figure 3D for the image patch dataset. Note the close correspondence in neighborhood size with σ = 0.03 in (A). (C) Scalar curvatures computed for the set of closest points $(\theta_0, \phi_0)$ on $k^0$ as in Figure 3E, but using the same neighborhood sizes determined for the image patch dataset shown in Figure 3D, some of which are visualized in (B).
(D) As in (A) but showing neighborhoods used in computing the scalar curvatures in Figure 3H for the set of closest points on $k^1$. Neighborhoods are visualized on $(\theta_0, \phi_0)$ coordinates instead of $(\theta_1, \phi_1)$ coordinates for ease of comparison. (E) As in (B) but showing neighborhoods used in computing the scalar curvatures in Figure 3J for the augmented image patch dataset. (F) Scalar curvatures computed for the augmented image patch dataset with $N \approx 1.3 \times 10^8$ points as in Figure 3J, but using the same neighborhood sizes determined for the original image patch dataset with $N \approx 4.2 \times 10^5$ shown in Figure 3D and (B). Note the close correspondence with Figure 3D.

Figure S4: Additional details of the PBMC scRNAseq dataset (related to Figure 4). (A) Cell types overlaid onto UMAP coordinates and sorted in decreasing order of abundance in the legend. Cells were annotated as described in Methods Section 4.6.2.
(B) A goodness-of-fit p-value was computed for each point by applying Mardia’s test to the residuals obtained from fitting the neighborhood around the point to a quadratic function (see Methods Section 4.3.3). These p-values are visualized on UMAP coordinates corresponding to each point (left) and their empirical distribution is shown using a histogram (right). Small p-values suggest that the residuals are non-normal so that approximating local neighborhoods as quadratic may not be valid. (C) Pearson correlation between the scalar curvature reported by each point and its kth-nearest neighbor (kNN) for different k (shown in blue). The red bar shows the mean and standard deviation of the Pearson correlation when neighbors are chosen randomly over 1000 trials (*p < $10^{-6}$). (D) The percentage of points with 95% CIs containing the scalar curvatures reported by their respective kNNs (shown in blue). The red bar shows the mean and standard deviation of this percentage when neighbors are chosen randomly over 1000 trials (*p < 0.001; see Methods Section 4.6.3.1). (E) The neighborhood size (r) used for computing scalar curvature at each point, overlaid onto UMAP coordinates (left) and a corresponding histogram of the empirical distribution (right). The dashed red lines correspond to the 25, 50, and 75%-ile values of r(p) used for computing scalar curvatures at fixed neighborhood sizes for Figure 4C. See Methods Section 4.3.2. (F) The number of points in each neighborhood (corresponding to the neighborhood sizes in (E)) overlaid onto UMAP coordinates (left) and a corresponding histogram of the empirical distribution (middle). (Right) The set of neighbors used for computing scalar curvature (purple) is visualized on UMAP coordinates for a handful of points (black). (G) Scalar curvatures were computed for manifold dimension d − 1 (left) and d + 1 (right). They are plotted here on UMAP coordinates after smoothing over the same set of k = 250 neighbors used in Figure 4A.
See Methods Section 4.6.4. (H) The total number of transcripts observed in each cell overlaid onto UMAP coordinates. (I) Scalar curvatures were computed after downsampling the number of cells in the ambient space by a factor of 2 (left) and 4 (middle), using the same ambient dimension, manifold dimension and neighborhood sizes determined for the original dataset. They are plotted here on UMAP coordinates after smoothing over the same set of neighbors (which survive downsampling) used in Figure 4A. (Right) The percentage of points in the downsampled datasets with a 95% CI containing the originally reported scalar curvature (blue), and likewise for a negative control obtained by randomly pairing 95% CIs and originally reported scalar curvatures for points in the downsampled dataset (red). Errorbars for the negative control are the standard deviation of this percentage over 1000 trials with different random pairings (*p < 0.001; see Methods Section 4.6.3.2). (J) Scalar curvatures were computed after downsampling the number of transcripts by a factor of 2 (left) and 4 (middle), using the same ambient dimension, manifold dimension and neighborhood sizes determined for the original dataset. They are plotted here on UMAP coordinates after smoothing over the same set of k = 250 neighbors used in Figure 4A. (Right) The percentage of points in the downsampled datasets with a 95% CI containing the originally reported scalar curvature (blue), and likewise for a negative control obtained by randomly pairing 95% CIs and originally reported scalar curvatures for points in the downsampled dataset (red). Errorbars for the negative control are the standard deviation of this percentage over 1000 trials with different random pairings (*p < 0.001; see Methods Section 4.6.3.3).

Figure S5: Additional details of the gastrulation scRNAseq dataset (related to Figure 4). Panels as in Figure S4.

Figure S6: Additional details of the brain scRNAseq dataset (related to Figure 4). Panels as in Figure S4 but with t-SNE instead of UMAP plots.

References

[1] A. M. Klein, L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. A. Weitz, and M. W. Kirschner. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161(5):1187–1201, 2015.
[2] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck, J. J. Trombetta, D. A. Weitz, J. R. Sanes, A. K. Shalek, A. Regev, and S. A. McCarroll. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.
Cell, 161(5):1202–1214, 2015.
[3] G. X. Y. Zheng, J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, T. D. Wheeler, G. P. McDermott, J. Zhu, M. T. Gregory, J. Shuga, L. Montesclaros, J. G. Underwood, D. A. Masquelier, S. Y. Nishimura, M. Schnall-Levin, P. W. Wyatt, C. M. Hindson, R. Bharadwaj, A. Wong, K. D. Ness, L. W. Beppu, H. J. Deeg, C. McFarland, K. R. Loeb, W. J. Valente, N. G. Ericson, E. A. Stevens, J. P. Radich, T. S. Mikkelsen, B. J. Hindson, and J. H. Bielas. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1):1–12, 2017.
[4] D. R. Bandura, V. I. Baranov, O. I. Ornatsky, A. Antonov, R. Kinach, X. Lou, S. Pavlov, S. Vorobiev, J. E. Dick, and S. D. Tanner. Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. Analytical Chemistry, 81(16):6813–6822, 2009.
[5] C. Giesen, H. A. O. Wang, D. Schapiro, N. Zivanovic, A. Jacobs, B. Hattendorf, P. J. Schüffler, D. Grolimund, J. M. Buhmann, S. Brandt, Z. Varga, P. J. Wild, D. Günther, and B. Bodenmiller. Highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. Nature Methods, 11(4):417–422, 2014.
[6] J-R. Lin, M. Fallahi-Sichani, J-Y. Chen, and P. K. Sorger. Cyclic immunofluorescence (CycIF), a highly multiplexed method for single-cell imaging. Current Protocols in Chemical Biology, 8(4):251–264, 2016.
[7] J-R. Lin, B. Izar, S. Wang, C. Yapp, S. Mei, P. M. Shah, S. Santagata, and P. K. Sorger. Highly multiplexed immunofluorescence imaging of human tissues and tumors using t-CyCIF and conventional optical microscopes. eLife, 7, 2018.
[8] L. H. Nguyen and S. Holmes. Ten quick tips for effective dimensionality reduction. PLoS Computational Biology, 15(6):e1006907, 2019.
52 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425885doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425885 http://creativecommons.org/licenses/by-nc-nd/4.0/ [9] J. B. Tenenbaum. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. [10] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008. [11] E. Becht, L. McInnes, J. Healy, C-A. Dutertre, I. W. H. Kwok, L. G. Ng, F. Ginhoux, and E. W. Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38–44, 2018. [12] A. Hatcher. Algebraic Topology. Cambridge University Press, 2001. [13] R. Ghrist. Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society, 45(01):61–76, 2007. [14] D. Perrault-Joncas and M. Meilâ. Non-linear dimensionality reduction: Riemannian metric estimation and the problem of geometric discovery. arXiv, 2013. [15] J. M. Lee. Riemannian Manifolds: An Introduction to Curvature (Graduate Texts in Mathematics). Springer, 1997. [16] A. Zomorodian and G. Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2004. [17] G. Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009. [18] M. Bernstein, V. De Silva, J. C. Langford, and J. B. Tenenbaum. Graph approximations to geodesics on embedded manifolds. Technical report, Department of Psychology, Stanford University, 2000. [19] F. Chazal, M. Glisse, C. Labruère, and B. Michel. Convergence rates for persistence diagram estimation in topological data analysis. 
Journal of Machine Learning Research, 16(1):3603–3635, 2015. [20] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13(1):1263–1291, 2012. [21] G. Carlsson, T. Ishkhanov, V. De Silva, and A. Zomorodian. On the local behavior of spaces of natural images. International Journal of Computer Vision, 76(1):1–12, 2008. [22] P. Lawson, A. B. Sholl, J. Q. Brown, B. T. Fasy, and C. Wenk. Persistent homology for the quantitative evaluation of architectural features in prostate cancer histology. Scientific Reports, 9(1):1–15, 2019. 53 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425885doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425885 http://creativecommons.org/licenses/by-nc-nd/4.0/ [23] J. M. Chan, G. Carlsson, and R. Rabadan. Topology of viral evolution. Proceedings of the National Academy of Sciences, 110(46):18566–18571, 2013. [24] P. G. Cámara, A. J. Levine, and R. Rabadán. Inference of ancestral recombination graphs through topological data analysis. PLoS Computational Biology, 12(8):e1005071, 2016. [25] E. Abbott. Flatland: A Romance of Many Dimensions. Princeton University Press, 1991. [26] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14:585–591, 2001. [27] M. Reuter, F-E. Wolter, and N. Peinecke. Laplace–Beltrami spectra as ‘Shape-DNA’ of surfaces and solids. Computer-Aided Design, 38(4):342–366, 2006. [28] M. Belkin, J. Sun, and Y. Wang. Constructing Laplace operator from point clouds in Rd. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1031–1040, 2009. 
[29] J. Liang, R. Lai, T. W. Wong, and H. Zhao. Geometric understanding of point clouds using Laplace- Beltrami operator. In IEEE Conference on Computer Vision and Pattern Recognition, pages 214–221, 2012. [30] N. G. Trillos, M. Gerlach, M. Hein, and D. Slepčev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace–Beltrami operator. Foundations of Computational Mathematics, 20(4):827–887, 2020. [31] H. P. McKean Jr. and I. M. Singer. Curvature and the eigenvalues of the Laplacian. Journal of Differential Geometry, 1(1-2):43–69, 1967. [32] B. Andrews. Lectures on Differential Geometry. https://maths-people.anu.edu.au/~andrews/DG. Australian National University. [33] I. T. Jolliffe and J. Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016. [34] H. Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418–418, 1959. [35] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419–441, 2008. 54 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425885doi: bioRxiv preprint https://maths-people.anu.edu.au/~andrews/DG https://doi.org/10.1101/2021.01.08.425885 http://creativecommons.org/licenses/by-nc-nd/4.0/ [36] U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12:1249–1286, 2011. [37] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. 
The Annals of Statistics, 42(4):1511–1545, 2014. [38] R. W. Buccigrossi and E. P. Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Transactions on Image Processing, 8(12):1688–1701, 1999. [39] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. International Journal of Computer Vision, 43(1):7–27, 2001. [40] A. B. Lee, K. S. Pedersen, and D. Mumford. The nonlinear statistics of high-contrast patches in natural images. International Journal of Computer Vision, 54(1-2):83–103, 2003. [41] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings: Biological Sciences, 265(1394):359–366, 1998. [42] 10x Genomics. PBMCs from a Healthy Donor: Whole Transcriptome Analysis. https://support. 10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/Parent_NGSC3_DI_PBMC, 2020. [43] B. Pijuan-Sala, J. A. Griffiths, C. Guibentif, T. W. Hiscock, W. Jawaid, F. J. Calero-Nieto, C. Mulas, X. Ibarra-Soria, R. C. V. Tyser, D. L. L. Ho, W. Reik, S. Srinivas, B. D. Simons, J. Nichols, J. C. Marioni, and B. Göttgens. A single-cell molecular map of mouse gastrulation and early organogenesis. Nature, 566(7745):490–495, 2019. [44] 10x Genomics. 1.3 Million Brain Cells from E18 Mice. https://support.10xgenomics.com/ single-cell-gene-expression/datasets/1.3.0/1M_neurons, 2017. [45] D. van Dijk, R. Sharma, J. Nainys, K. Yim, P. Kathail, A. J. Carr, C. Burdziak, K. R. Moon, C. L. Chaffer, D. Pattabiraman, B. Bierie, L. Mazutis, G. Wolf, S. Krishnaswamy, and D. Pe’er. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729, 2018. [46] L. Haghverdi, M. Büttner, F. A. Wolf, F. Buettner, and F. J. Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nature Methods, 13(10):845–848, 2016. [47] A. Klimovskaia, D. Lopez-Paz, L. Bottou, and M. Nickel. 
Poincaré maps for analyzing complex hierar- chies in single-cell data. Nature Communications, 11(1):1–9, 2020. 55 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425885doi: bioRxiv preprint https://support.10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/Parent_NGSC3_DI_PBMC https://support.10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/Parent_NGSC3_DI_PBMC https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons https://doi.org/10.1101/2021.01.08.425885 http://creativecommons.org/licenses/by-nc-nd/4.0/ [48] S. Wang, J-R. Lin, E. D. Sontag, and P. K. Sorger. Inferring reaction network structure from single-cell, multiplex data, using toric systems theory. PLoS Computational Biology, 15(12):e1007311, 2019. [49] M. Hein, J-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(48):1325–1370, 2007. [50] D. Ting, L. Huang, and M. Jordan. An analysis of the convergence of graph Laplacians. arXiv, 2011. [51] K. V. Mardia. Measures of multivariate skewness and kurtosis with applications. Biometrika, 57(3):519– 530, 1970. [52] P. Campadelli, E. Casiraghi, C. Ceruti, and A. Rozza. Intrinsic dimension estimation: Relevant tech- niques and a benchmark framework. Mathematical Problems in Engineering, 2015:1–21, 2015. [53] A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5):411–420, 2018. [54] Y. Hu, M. Ranganathan, C. Shu, X. Liang, S. Ganesh, A. 
Osafo-Addo, C. Yan, X. Zhang, B. E. Aouizerat, J. H. Krystal, D. C. D’Souza, and K. Xu. Single-cell transcriptome mapping identifies common and cell-type specific genes affected by acute delta9-tetrahydrocannabinol in humans. Scientific Reports, 10(1):1–14, 2020. [55] K. Xie, Y. Huang, F. Zeng, Z. Liu, and T. Chen. scAIDE: clustering of large-scale single-cell RNA-seq data reveals putative and rare cell types. NAR Genomics and Bioinformatics, 2(4), 2020. 56 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425885doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425885 http://creativecommons.org/licenses/by-nc-nd/4.0/ Introduction Results Estimators of the Laplace-Beltrami Operator Yield Inaccurate Scalar Curvatures Curvature Can Be Computed Accurately Using the Second Fundamental Form Curvature of Image Patch Manifold is Consistent with a Noisy Klein Bottle scRNAseq Datasets have Non-Trivial Intrinsic Curvature Discussion Methods Differential Geometry of Theoretical Manifolds Details of Intrinsic Approach to Curvature Estimation Approach for S2 Infinite Series Truncated Series Eigenvalue Convergence Estimating the Laplace-Beltrami Operator from Data Details of Extrinsic Approach to Curvature Estimation Quadratic Regression on Local Neighborhoods of Data Selecting Local Neighborhoods for Regression Goodness-of-Fit Test for Quadratic Regression Standard Error and Bias of Scalar Curvature Estimate Note on Length Scales Details of Toy Manifold Curvature Computations Analytical Forms Hypersphere One-Sheet Hyperboloid Ring Torus Hypercube Practical Issues for Curvature Estimation on Real-World Datasets Non-Uniform Sampling Observational Noise Large Ambient Dimension Choice of Manifold Dimension 
10_1101-2021_01_06_425581 ---- 27015732

Periodicity in the embryo: emergence of order in space, diffusion of order in time

Bradly Alicea 1,2, Ujjwal Singh 1,3

Keywords: Periodicity, Dynamical Systems, C. elegans, Zebrafish, Developmental Biology, Modeling and Simulation

Abstract

Does embryonic development exhibit characteristic temporal features? This is quite apparent in evolution, where evolutionary change has been shown to occur in bursts of activity. Using two animal models (the nematode Caenorhabditis elegans and the Zebrafish Danio rerio) and simulated data, we demonstrate that temporal heterogeneity exists in embryogenesis at the cellular level, and may have functional consequences. Cell proliferation and division data from cell tracking are analyzed to characterize specific features in each model species. Simulated data are then used to understand what role this variation might play in producing phenotypic variation in the adult phenotype. This goes beyond a molecular characterization of developmental regulation to provide a quantitative result at the phenotypic scale of complexity.

Introduction

While the case for the effects of "tempo and mode" [1] has been made for the evolutionary process, a similar relationship between phenotypic change, time, and space may also exist in development.
One obvious approach to this question is to examine the expression and sequence variation of genes associated with cell cycle and developmental patterning [2]. However, there is a potentially more compelling top-down explanation. We will use two model organisms to demonstrate how periodicity becomes less synchronized over developmental time and space. In the case of the nematode Caenorhabditis elegans, a comparison of embryonic and postembryonic cells (developmental and terminally-differentiated cell birth times acquired from [3]) reveals two general patterns. For the Zebrafish (Danio rerio), comparisons within and between embryogenesis stages based on measurements of cell nuclei in the animal hemisphere [4] reveal patterns at multiple scales.

One of the most notable signatures is burstiness [5, 6], or a large number of events occurring in a short period of time. These bursts can be either periodic or aperiodic, and these statistical features define the temporal nature of development, potentially in a universal manner across species. Based on two species and a computational model, we predict that periodic changes in the frequency of new cells over developmental time represent cell proliferation without functional distinction. We also analyze the intervals between bursts in cell division (and cell differentiation in the case of C. elegans). These bursts are derived from both time-series segmentation and decomposition in the frequency domain. We show that these results consistently point to great temporal variation at the cellular level, which may play a role in shaping morphogenesis. In addition, these changes in frequency and periodicity over time result in spatial variation (Supplemental Figure 1).

1 OpenWorm Foundation, Boston, MA USA. balicea@openworm.org
2 Orthogonal Research and Education Laboratory, Champaign, IL USA.
3 IIIT Delhi, Delhi, India.

bioRxiv preprint, this version posted January 8, 2021; doi: https://doi.org/10.1101/2021.01.06.425581. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

To characterize spatial variation, we utilize embryo networks [7]. Embryo networks are complex networks based on the relative proximity of cells as they divide and migrate during the developmental process. The resulting network topologies provide information not only about spatial variation, but also about cellular interactions and other signaling connections [8, 9]. The existence of network structure in the form of modules or regions of dense connectivity can reveal a great deal about the unfolding of lineage trees in time.

Returning to the first prediction, we can create computational summaries of cell division events, called numeric embryos, to model the proliferation of cells over time. Numeric embryos can be used to model the distribution of branching events in a lineage tree over time, independent of cell identity or spatial context. Approximating this distribution provides us with a periodic time-series that tells us something about the speed of embryogenesis: how quickly different underlying distributions of cell division can produce a phenotype with many undifferentiated cells. The rate at which developmental cells are produced could affect the rate of overall development, as we will see in an example from Zebrafish.

Finally, we predict that the emergence and subsequent changes in spatiotemporal periodicity at the cellular level lead to regulatory phase transitions.
For example, there is a one-to-one correspondence between cell division and waves of differentiation after the syncytial stage in Drosophila melanogaster [10]. In a similar fashion, amphibians exhibit a decay of synchrony of division [11, 12] that corresponds to differentiation wave activity [13]. Based on data analysis, modeling, and literature review, we anticipate that further investigation could uncover whether, in regulating embryos, mitosis and cell differentiation are correlated. In interpreting the data, we discuss the potential applicability of Holtzer’s quantal mitosis hypothesis [14, 15] as it relates to the process of differentiation relative to the proliferation of developmental (undifferentiated) cells.

Methods

A summary of the methods could be given here for smooth reading and interest. All materials are located on Github: https://github.com/Orthogonal-Research-Lab/Periodicity-in-the-Embryo. This repository includes processed data, supplemental materials, and associated code.

Secondary Datasets

The C. elegans and D. rerio data sets were acquired from the Systems Science of Biological Dynamics (SSBD) database (http://ssbd.qbic.riken.jp/). The C. elegans (nematode) data [16] are based on cell tracking of the nucleus (PMID:16477039). The D. rerio (Zebrafish) data [17] are likewise based on cell tracking of the nucleus (PMID:18845710). The cell tracking data are used to determine the total number of new cells (cell birth time) present at a particular time step. For the C. elegans data, cell births correspond to minutes of developmental time, and windows of size five (5 minutes of developmental time) are used for the time-series plots and histograms. Since lineage trees and the nature of developmental
cell identification are different in Zebrafish, cell births correspond to the number of observed cells at discrete points in developmental time. Windows representing a certain number of cells in the embryo observed at a given sampling point are used instead of directly converting this process to minutes of developmental time.

Zebrafish Developmental Stages

Estimates and calculations of D. rerio developmental stages are derived from [18] and the ZFIN Zebrafish Developmental Staging Series web resource (https://zfin.org/zf_info/zfbook/stages/). Where applicable, embryo stages are approximated from the number of cells observed at any given point in developmental time.

Peak-finding method

For both the C. elegans and D. rerio data, a peak-finding method is used to evaluate periodicity and to generate data points representing distinct bursts of cell birth. Briefly, local peaks in the cell division series are discovered by finding the highest value around the peak over an interval of 10 data points. The data are then visually inspected to ensure that local maximal fluctuations were not selected. Using this segmentation method, we are able to define intervals between peaks in a way that allows the aperiodic regions of our series to be compared to the highly periodic regions. The peak-finding results are supplemented by a Fast Fourier Transform (FFT) analysis of cell divisions in the C. elegans embryo (Supplemental Figure 2), cell differentiation events in the C.
elegans embryo (Supplemental Figure 3), and the time series for cell divisions in the Zebrafish embryo (Supplemental Figure 4). The power spectra largely confirm the nature of our interval and peak analysis. While the analysis of Zebrafish reveals a power spectrum at a single scale, the C. elegans embryo reveals a power spectrum of multiple time scales for both cell divisions and differentiations.

Embryo Networks

The full methodology for constructing and evaluating embryo networks can be found in [7]. Briefly, embryo networks are complex networks constructed from the locations of cells in an embryo. Nodes are centroids representing cell nuclei, and edges represent the spatial (Euclidean) distance between cells in a three- (static) or four- (dynamic) dimensional graph. All nuclei are plotted in embryo space, which is a coordinate system normalized to the center point between all cell locations in a complete embryo. For example, an edge of length 1.0 represents two centroids at opposite edges of the embryo space. A distance threshold is then derived from the length of the edge: in this paper, a distance threshold of 0.05 is used, excluding all but the cell nuclei in very close proximity to each other.

Numeric Embryo

Numeric embryos are statistical summaries of the type of information acquired from our secondary datasets, but in a more generic manner. Numeric embryos are based on generated pseudo-data and are meant to capture the structure of hypothetical developmental scenarios. All analyses of our pseudo-data were conducted using SciLab 6.1 (Paris, France). Each numeric embryo consists of one or more vectors describing rounds of cell division in the embryo. Briefly, each minute of developmental time is represented by either a zero or a positive non-zero value. For
purposes of temporal comparison, all non-zero values are thresholded to one. To generate cell division intervals of different sizes, we start with a uniform distribution (division events occur every n minutes) and then compare this with a distribution generated using the grand function in SciLab. For the Poisson distribution, we use λ = 0.1 (except where otherwise noted), while for the Binomial distribution, we use parameters N = 1.0 and p = 0.5. This produces intervals that are variable over developmental time.

Results

Our analysis will proceed from C. elegans to Zebrafish, to a comparison of the two species, then to a network analysis, and finally to a simulation of cell division in development. First, we plot the developmental cell division dynamics in C. elegans and Zebrafish in Figures 1 and 3, respectively, and cell differentiation in C. elegans in Figure 1. We then examine the intervals between cell division events (C. elegans) and the relative frequency of birth rates across development (Zebrafish) in Figures 2 and 4, respectively. Focusing on the peaks (maxima of bursts of cell births) shown in Figures 1 and 3, Figure 5 shows the distribution of intervals between peak values for C. elegans and Zebrafish. Figure 6 helps us extend this finding from temporal dynamics to connectivity between cells and spatial distributions of newly-born cells. We conclude with an investigation of how the intervals found between cell divisions can be modeled using various statistical distributions, shown in Figure 7. These simulations (called numeric embryos) can reveal properties related to the speed of development, particularly the linear and nonlinear accumulation of cells.
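As an illustration of the peak-to-peak interval analysis summarized above, the following Python sketch finds local peaks in a cell-birth series and measures the intervals between them. It is not the authors' code: the function names `find_peaks` and `peak_intervals`, the reading of "an interval of 10 data points" as five points on either side of a candidate, and the toy `births` series are all assumptions made for this example.

```python
def find_peaks(series, window=10):
    """Return indices that hold the unique maximum of their local window.

    Assumption: `window` is split evenly, so each candidate is compared
    with up to window // 2 points on either side.
    """
    half = window // 2
    peaks = []
    for i, value in enumerate(series):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        neighborhood = series[lo:hi]
        # Require a strictly positive, unique maximum so that flat runs of
        # zeros (no births) are never reported as bursts.
        is_max = value == max(neighborhood)
        if value > 0 and is_max and neighborhood.count(value) == 1:
            peaks.append(i)
    return peaks


def peak_intervals(peaks):
    """Return the distances between consecutive peak indices."""
    return [later - earlier for earlier, later in zip(peaks, peaks[1:])]


# Toy series of new cells per time step (hypothetical values, not real data).
births = [0, 1, 5, 1] + [0] * 7 + [2, 8, 2] + [0] * 9 + [1, 6, 1, 0]
peaks = find_peaks(births)
intervals = peak_intervals(peaks)
print(peaks, intervals)
```

A histogram of `intervals` would then correspond in spirit to the interval distributions discussed for Figure 5; the visual inspection step described in the Methods is not reproduced here.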
Caenorhabditis elegans Example

To understand the temporal nature of cell division and differentiation, we start by looking at patterns in C. elegans development over time. Figure 1 shows a time series of such events from zygote to adulthood. We are particularly interested in potential spikes or bursts of events in a short period of time. Figure 1 shows the fluctuations in embryonic division (Figure 1, top) and differentiation (Figure 1, bottom) events. Differentiation events occurring after 1000 minutes of developmental time (postembryonic development) occur in a long series of bursts, likely corresponding to the differentiation of seam cells. This can be contrasted with the burstiness that occurs in embryonic development, which is similar to the burstiness of division events.

Figure 2 shows the intervals between cell division events across embryonic development in C. elegans. This plot confirms an exponential distribution with a long tail, presumably representing intervals in postembryonic development. Yet this plot is also sparse, yielding only 12 distinct intervals of cell division throughout all of C. elegans development. This is likely due to the deterministic nature of C. elegans development along with the relatively small number of cells. Supplemental Figures 2 and 3 reveal the power spectrum for cell division and cell differentiation in C. elegans, respectively. To compare, contrast, and understand these trends further, we now turn to the embryonic development of the Zebrafish (D. rerio).

Figure 1.
Developmental cell births in the nematode C. elegans. Cell divisions occur according to developmental time (minutes). The timeline ranges from fertilized egg (zygote) to adulthood. Embryonic division events (blue), differentiation events (red).

Zebrafish

In Figure 3 (top), we observe six regular bursts of cell division, followed by aperiodic cell division behavior. This transition in periodicity is observed after the embryo reaches 1529 cells in size (Figure 3, bottom). We do not observe this in C. elegans embryos, which may have to do with the more regulative nature of Zebrafish embryogenesis [19]. Changes in periodicity may also have to do with the establishment of spatial differentiation beyond the axial variability observed in C. elegans. To better understand the nature of periodicity in Zebrafish, we examined the distribution of intervals between birth times.

Figure 4 and Supplemental Figure 4 confirm the bursty nature of cell division in Zebrafish, in that most sampling time points feature only a few cell births, while a small number of sampling time points feature a large number of cells born. For example, a large majority of sampling time points feature fewer than 25 new cells per time point. By contrast, there are also single sampling points where over 70 cells are born at a single time. In terms of the power spectrum shown in Supplemental Figure 4, there is a very high amplitude at very low frequencies, perhaps related to the significant noise and aperiodicity in the later part of the time-series shown in Figure 2.

Figure 2.
The interval between cell division events across embryonic development in C. elegans.

Considering the cell divisions for the first period of Zebrafish embryogenesis, we conduct an interval analysis for each oscillation of the data shown in Figure 5 for C. elegans (top) and D. rerio (bottom). These are measured from peak to peak as described in the Methods. For the C. elegans data (Figure 5, top), our analysis yields a roughly unimodal distribution, with a mean peak interval of 3-5 minutes. In pre-hatch C. elegans embryogenesis, there are many quick bursts of cell division, as confirmed in Figure 1 (top). This results in bursty behavior that is regular and perhaps even periodic. By contrast, an analysis of our Zebrafish data yields three interval groups (Figure 5, bottom): the greatest number of oscillations occurs at a period of 2-5 minutes, while a smaller number of oscillations occur with periods of 16-19 minutes. There is also a longer 22-minute interval between oscillations. This is consistent with the shift from periodic bursts to aperiodic but still bursty behavior later in Zebrafish development shown in Figure 3. This multimodal distribution of peaks points to a more complex process at play, something that might be better understood by investigating morphogenesis as a spatial process.

Embryo Networks: an example from Zebrafish

Another way to identify the consequences of bursts in cell division timing and other non-uniform temporal phenomena is to utilize embryo networks. An embryo network was constructed (Figure 6, Top) for cells born during our sampling time
points of D. rerio embryogenesis. The resulting circular graph demonstrates a high degree of modularity, but only across part of the graph.

Figure 3. Cell births in Zebrafish embryos during embryogenesis up to the Gastrula stage. Instead of developmental time, relative developmental progress is plotted as all cells observed in the embryo at each sampling time point. For Figure 3, bottom: Periodic region (red), Aperiodic region (unshaded).

A three-dimensional plot (Figure 6, Bottom) demonstrating the position of each cell born during these stages of development shows that the highest degrees of connectivity are clustered in the center of the embryo, while cells that are disconnected based on our connectivity threshold exist on the edges of the embryo. Importantly, it appears that cells are more densely clustered toward the center of the embryo at the earliest stages of development. These dense clusters are likely the product of cell division fluctuations shown in Figures 3 and 4.

Figure 4. Relative frequency of birth rate across developmental time in D. rerio. Histogram demonstrates the distribution of cells born during a single sampling time point.

Numeric Embryo Experiments

A numeric embryo (or perhaps more accurately a numeric one) allows us to understand the fundamental features of cell division events relative to the efficiency of their timing.
Is one timing scheme superior to another? We know that in real (biological) lineage trees cell divisions do not occur at a completely regular rate. Are there advantages in one particular statistical signature over another, particularly when comparing it to an artificial (regular) scheme? Table 1 shows a summary of how this simulation is constructed.

Table 1. An example of our numeric simulation, with variable and sample values. We use the uniform distribution as the basis for Poisson noise, which helps to execute things a bit faster on average. Compare this to uniform division times such as a division event occurring once every 20 units of time. Generated Poisson Interval represents the size of the interval between division events, while Division Interval represents when the event occurs in developmental time.

Developmental Time Unit   Division Time (AU)   Generated Poisson Interval   Division Interval
0                         0                    0                            0
1                         1                    1                            0
2                         0                    2                            2
3                         1                    3                            0

Our timing data can be modeled as branches of a binary tree which are generated every n units of developmental time. The intervals between n_1, n_2, n_3, ..., n_t are determined by a probability distribution, which can be uniform (every branching event occurring at completely regular intervals), or a Poisson distribution (where branching events are distributed in an exponential fashion).

Figure 5. Interval size of peaks in cell division for all developmental cells in C. elegans (top) and first 206 minutes of Zebrafish (bottom). C.
elegans sampling time points correspond to most of the pre-hatch developmental period (660 minutes post-fertilization), while the Zebrafish sampling time points correspond roughly to the period between the Zygote and the oblong/sphere stages of the Blastula.

Figure 6. TOP: an embryo network for the D. rerio embryo at the 239-cell stage (all cells born during the zygote and cleavage stages), with 920 edges. The edge threshold is an embryo distance of 0.05. BOTTOM: Cells in developmental location color-coded by status in the network. WHITE: all cells not above the threshold. RED: all source cells with at least one edge to another cell. BLUE: all destination cells with at least one edge to another cell. Red and blue are equivocal. BLACK: all cells with more than eight edges to other cells.

The graphs in Figure 7 tell us that, by modeling division events using a Poisson distribution, we can achieve the same number of divisions in fewer developmental time units.
Figure 7 (top) shows a Uniform distribution of division events, while Figure 7 (bottom) shows the Uniform case as compared to other distributions (Exponential, Poisson, and Binomial). The Poisson distribution yields the “fastest” time relative to the number of divisions produced. By contrast, the Binomial distribution yields the lowest number of divisions (hence is the slowest method examined). However, none of these methods produces orders-of-magnitude differences in division rate, which is what would be expected from a bursty signature.

Discussion

In this paper, we examine the periodicity of cell proliferation and division using three model systems: Zebrafish (Danio rerio), Nematode (Caenorhabditis elegans), and a simulated embryo. When we refer to periodicity in development, we mean events that reoccur over time: regular pulses of cell proliferation events in a short period of time. This leads us to propose a principle of development based on timing. There can also be a spatial component of developmental periodicity. This includes signatures of time-independent spatial periodicity such as tilings and other repeatable patterns across space.

Interpretation of Figures

We interpret Figures 1 and 3 in a number of ways. The first is by looking at components of variation over time. We measure this in terms of the interval between cell birth times in C. elegans (Figure 2) and the frequency of cell birth rates in Zebrafish (Figure 4). We also focus on intervals between other features in the time-series such as peaks for both species in Figure 5. In investigating peak intervals, we discover a similar distribution of cell division events between species in Figures 2 and 4, but a difference between species when looking at specific time-series features (Figure 5). The reason for this is clear: features such as peaks (magnitude) have a different underlying mechanism than events such as cell division.
While both are linked to the lineage tree, magnitude differences are linked to the synchronization of cell division due to deterministic timing. With deterministic timing, synchronized cell divisions produce a lot of cells at any one point in developmental time, but little fluctuation between time points. In the case of stochastic timing, a lot of cells can be produced with a great degree of fluctuation between time points.

There are a number of ways to interpret the embryo network and 3-D plot shown in Figure 6. One interpretation is that in Zebrafish, the phenotype is built from the inside out, with densely-packed cells representing fledgling anatomical structures such as the notochord and heart. These clusters may be linked to rounds of cell division (occurring in temporal bursts), while cell divisions occurring during the inter-burst intervals may contribute to cells at the outer edge of the embryo, perhaps representing the ectoderm layer [20, 21]. In this way, temporal bursts of cell division lead to a spatial hierarchy of cell differentiation.

Figure 7. Comparison of cumulative cell division events and the speed of division generated by a numeric embryo. Top: Uniform only (blue). Bottom: Uniform (blue), Exponential (orange), Poisson (gray), and Binomial (yellow).

This spatial hierarchy involves a number of evolutionary and biophysical constraints that have been demonstrated in a number of experimental settings. For example, physical confinement affects the overall axial alignment and geometry of an embryo [22].
This includes our Zebrafish embryo network. Other types of fishes (Astyanax, see [23]) exhibit morphological changes in neural crest cell proliferation based on evolutionary changes due to ecological constraints. In C. elegans, asymmetrical cells (or daughter cells with significantly different volumes) result from physical constraints and compose 40% of C. elegans developmental cell divisions [24, 25]. Asymmetric cell divisions set up key cell-cell interactions [24] that are highlighted by the edges of embryo networks. Finally, by comparing nematic alignment of liquid crystals to spindles of mitotic cells, phase transitions in actively dividing cells are found to result from the timing of centrosome separation [26].

Figure 7 provides an introduction to the numeric embryo concept. In this Figure, we focus exclusively on the timing component of lineage trees. This is essentially a version of the developmental time series shown for Zebrafish and C. elegans, but with the temporal fluctuations smoothed out. These fluctuations are replaced with a cumulative sum of all cell division events occurring over a certain period of time. It is also apparent that comparisons between different distributions do not yield an appreciable difference in developmental speed (or the accumulation of x cells over a certain period of time). In Figure 7, all simulations were run for 60 iterations.
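The cumulative-sum construction behind a numeric embryo can be illustrated with a short sketch. This is not the authors' simulation code: the function names `poisson_sample` and `cumulative_divisions`, the choice to draw the number of divisions per time unit from a Poisson distribution with rate `lam`, and the fixed random seed are all assumptions made for illustration. The regular scheme places exactly one division every `period` time units.

```python
import math
import random
from itertools import accumulate


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def cumulative_divisions(n_steps, scheme, rng=None, period=1, lam=1.0):
    """Cumulative number of division events after each of n_steps time units.

    scheme="uniform": one division every `period` time units (regular timing).
    scheme="poisson": divisions per time unit ~ Poisson(lam) (assumed model).
    """
    rng = rng or random.Random()
    if scheme == "uniform":
        per_step = [1 if (step + 1) % period == 0 else 0 for step in range(n_steps)]
    elif scheme == "poisson":
        per_step = [poisson_sample(lam, rng) for _ in range(n_steps)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return list(accumulate(per_step))


rng = random.Random(42)  # fixed seed so repeated runs give the same curves
uniform_curve = cumulative_divisions(60, "uniform", period=1)
poisson_curves = {
    lam: cumulative_divisions(60, "poisson", rng, lam=lam)
    for lam in (0.1, 0.5, 1.0)
}

print("uniform total:", uniform_curve[-1])
for lam, curve in poisson_curves.items():
    print(f"poisson lam={lam} total:", curve[-1])
```

Plotting each curve against the time index would give a cumulative plot of the same general kind as Figure 7; under this assumed model, a larger `lam` produces more divisions per time unit and therefore a steeper curve.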
To explore the potential of the Poisson distribution further, we investigate how this distribution approximates cumulative cell division (as was done in Figure 7) for three values of λ (0.1, 0.5, and 1.0). The results of this experiment are shown in Supplemental Figure 5. As this parameter value is increased, the number of cells per developmental time point increases while the interval between cell divisions decreases. While the function derived from λ = 0.1 is always slowest, the functions derived from λ = 0.5 and λ = 1.0 are similar for the first 20 timepoints, then diverge to reveal that λ = 1.0 clearly results in both faster cell divisions and a larger number of total cells after 200 iterations.

Broader Questions

We can ask what it means when embryogenetic systems exhibit multiple pulses of cell proliferation from division events. In particular, the intervals between pulses provide information about the generative mechanisms behind production of the embryo. Our inquiry is well suited to quantitative interpretation, especially in terms of characterizing "bursty" behaviors. These bursty behaviors are non-normally distributed generative processes [27] that describe the tempo and mode of development. While tempo and mode is generally an evolutionary phenomenon, these concepts also yield a model of developmental regulation that is explicitly temporal. Our results also suggest that developmental regulation is not simply a molecular mechanism. Our network analysis also demonstrates a connection between the spatiotemporal dynamics of cell division, cell differentiation, and a systems-level view of timing. For example, we have found that structure and timing of interactions shape embryo network coherence signaling [28], which in turn is an indicator of diffusion between developmental cells that share network connections.
While it is not discussed in this paper, gene expression fluctuations and stochastic noise in gene expression drive heterogeneity in division timing and even timing of differentiation [29, 30]. In particular, a focus on the molecular biology of the cell cycle across groups of developmental cells [31, 32] can provide more information about how fluctuations work in general at the single-cell level. Yet single cells acting in synchrony (or in the aggregate) define the patterns observed in our empirical data. One way to generalize our results to a broader cross-species context is to examine related phenomena such as mitotic bookmarking [33], in which heritable regulatory information is transmitted from mother to daughter cells in a cell lineage. Our approach is also quite valuable [see 34] for understanding this particular scale of the biological organism. To understand these results more fully in the context of groups of cells producing mean behaviors, we can appeal to the quantal mitosis hypothesis. Quantal mitosis involves changes in gene expression, in which cell fate depends upon mitosis. This is also a gene expression-related memory mechanism that is widespread in development [33]. In cases of an observed wave or peak in cell divisions at a certain point in developmental time, mitosis provides an opportunity to change gene expression [35], and ultimately serves as a collective signal for changes in cell fate [36].
Finally, the way in which we decompose the spatiotemporal dynamics of the embryo might be useful as a supplement to reaction-diffusion models of morphogenesis [37]. Future work will involve extending this type of analysis to other species, in addition to developing our numerical models to include explicitly spatial phenomena.

Acknowledgements

We would like to thank members of the DevoWorm group for their support and feedback, particularly Susan Crawford-Young. Thanks also go to the OpenWorm Foundation for their institutional support.

Supplemental Figures

Supplemental Figure 1. Example of an embryo network from the 16-cell C. elegans embryo built using cell tracking data. Data shown in the context of a cartoon showing the anterior end of the embryo. Different colored edges represent cells born at different generations of the lineage tree (levels).

Supplemental Figure 2. Frequency-domain plot of cell division event frequencies in C. elegans embryo. All events greater than an amplitude of 200 shown in red, while all events greater than an amplitude of 800 shown in blue.

Supplemental Figure 3. Frequency-domain plot of cell differentiation event frequencies in C. elegans embryo. All events greater than an amplitude of 300 shown in red, while all events greater than an amplitude of 600 shown in blue.
Supplemental Figure 4. Frequency-domain plot of cell division event frequencies in Zebrafish embryo.

Supplemental Figure 5. Comparison of cumulative cell division events and the speed of division generated by a numeric embryo for the Poisson distribution at three different values of λ. Blue: λ = 0.1, Black: λ = 0.5, Red: λ = 1.0.

References

[1] Simpson, G.G. (1944). Tempo and Mode in Evolution. Columbia University Press, New York.
[2] Ogura, Y. & Sasakura, Y. (2016). Developmental Control of Cell-Cycle Compensation Provides a Switch for Patterned Mitosis at the Onset of Chordate Neurulation. Developmental Cell, 37(2), P148-161. doi:10.1016/j.devcel.2016.03.013
[3] Bhatla, N. (2011). An Interactive Visualization of the C. elegans Cell Lineage. WormWeb, wormweb.org/celllineage
[4] Keller, P.J., Schmidt, A.D., Wittbrodt, J., & Stelzer, E.H.K. (2008). Reconstruction of Zebrafish Early Embryonic Development by Scanned Light Sheet Microscopy. Science, 322(5904), 1065-1069. doi:10.1126/science.1162493.
[5] Barabasi, A.L. (2005). The origin of bursts and heavy tails in human dynamics. Nature, 435(7039), 207-211.
[6] Abney, D.H., Dale, R., Louwerse, M.M., and Kello, C.T. (2018). The Bursts and Lulls of Multimodal Interaction. Cognitive Science, 42(2), 1297-1316.
[7] Alicea, B. and Gordon, R. (2018).
Cell Differentiation Processes as Spatial Networks: identifying four-dimensional structure in embryogenesis. BioSystems, 173, 235-246.
[8] Alicea, B. (2018). The Emergent Connectome in Caenorhabditis elegans Embryogenesis. BioSystems, 173, 247-255.
[9] Alicea, B. (2020). Raising the Connectome: the emergence of neuronal activity and behavior in C. elegans. Frontiers in Cellular Neuroscience, doi:10.3389/fncel.2020.524791.
[10] Foe, V.E. & Alberts, B.M. (1983). Studies of nuclear and cytoplasmic behaviour during the five mitotic cycles that precede gastrulation in Drosophila embryogenesis. Journal of Cell Science, 61, 31-70.
[11] Boterenbrood, E.C., Narraway, J.M. & Hara, K. (1983). Duration of cleavage cycles and asymmetry in the direction of cleavage waves prior to gastrulation in Xenopus laevis. Roux's Archives of Developmental Biology, 192(5), 216-221.
[12] Boterenbrood, E.C. & Narraway, J.M. (1986). The direction of cleavage waves and the regional variation in the duration of cleavage cycles on the dorsal side of the Xenopus laevis blastula. Roux's Archives of Developmental Biology, 195, 484-488.
[13] Gordon, N.K. & Gordon, R. (2016). Embryogenesis Explained. World Scientific Publishing, Singapore.
[14] Holtzer, H., Rubinstein, N., Fellini, S., Yeoh, G., Chi, J., Birnbaum, J. & Okayama, M. (1975). Lineages, quantal cell cycles, and the generation of cell diversity. Quarterly Reviews in Biophysics, 8(4), 523-557.
[15] Holtzer, H., Biehl, J., Antin, P., Tokunaka, S., Sasse, J., Pacifici, M. & Holtzer, S. (1983).
Quantal and proliferative cell cycles: how lineages generate cell diversity and maintain fidelity. Progress in Clinical Biological Research, 134, 213-227.
[16] Bao, Z., Murray, J.I., Boyle, T., Ooi, S.L., Sandel, M.J., and Waterston, R.H. (2006). Automated cell lineage tracing in Caenorhabditis elegans. PNAS, 103(8), 2707-2712.
[17] Keller, P.J., Schmidt, A.D., Wittbrodt, J., and Stelzer, E.H.K. (2008). Reconstruction of Zebrafish early embryonic development by scanned light sheet microscopy. Science, 322(5904), 1065-1069.
[18] Kimmel, C.B., Ballard, W.W., Kimmel, S.R., Ullmann, B., and Schilling, T.F. (1995). Stages of Embryonic Development of the Zebrafish. Developmental Dynamics, 203, 253-310.
[19] Raible, D.W. and Eisen, J.S. (1996). Regulative Interactions in Zebrafish Neural Crest. Development, 122, 501-507.
[20] Menon, T., Borbora, A.S., Kumar, R., and Nair, S. (2020). Dynamic optima in cell sizes during early development enable normal gastrulation in Zebrafish embryos. Developmental Biology, 468(1-2), 26-40.
[21] Shah, G., Thierbach, K., Schmid, B., Waschke, J., Reade, A., Hlawitschka, M., Roeder, I., Scherf, N., and Huisken, J. (2019). Multi-scale imaging and analysis identify pan-embryo cell dynamics of germ layer formation in Zebrafish. Nature Communications, 10, 5753.
[22] Desmaison, A., Guillaume, L., Triclin, S., Weiss, P., Ducommun, B., and Lobjois, V. (2018). Impact of physical confinement on nuclei geometry and cell division dynamics in 3D spheroids. Scientific Reports, 8, 8785. doi:10.1038/s41598-018-27060-6.
[23] Yoshizawa, M., Hixon, E., and Jeffery, W.R. (2018). Neural Crest Transplantation Reveals Key Roles in the Evolution of Cavefish Development. Integrative and Comparative Biology, 58(3), 411-420.
[24] Fickentscher, R. and Weiss, M. (2017). Physical determinants of asymmetric cell divisions in the early development of Caenorhabditis elegans. Scientific Reports, 7, 9369. doi:10.1038/s41598-017-09690-4.
[25] Alicea, B. and Gordon, R. (2016). Quantifying Mosaic Development: Towards an Evo-Devo Postmodern Synthesis of the Evolution of Development Via Differentiation Trees of Embryos [invited]. Biology, 5(3), 33.
[26] Leoni, M., Manyuhina, O.V., Bowick, M.J., and Marchetti, M.C. (2017). Defect driven shapes in nematic droplets: analogies with cell division. Soft Matter, 13, 1257-1266. doi:10.1039/C6SM02584F
[27] Bono, R., Blanca, M.J., Arnau, J., and Gómez-Benito, J. (2017). Non-normal Distributions Commonly Used in Health, Education, and Social Sciences: A Systematic Review. Frontiers in Psychology, 8, 1602. doi:10.3389/fpsyg.2017.01602
[28] Akbarpour, M. and Jackson, M. (2018). Diffusion in networks and the virtue of burstiness. PNAS, 115(30), E6996-E7004.
[29] Ben-Moshe, S. and Itzkovitz, S. (2016). Bursting through the cell cycle. eLife, 5, e14953.
[30] Wang, H., Yuan, Z., Liu, P., and Zhou, T. (2015). Division time-based amplifiers for stochastic gene expression. Molecular Biosystems, 11(9), 2417-2428. doi:10.1039/c5mb00391a.
[31] Csikasz-Nagy, A. (2009). Computational systems biology of the cell cycle. Briefings in Bioinformatics, 10(4), 424-434. doi:10.1093/bib/bbp005.
[32] Dangarh, P., Pandey, N., and Vinod, P.K. (2020). Modeling the Control of Meiotic Cell Divisions: Entry, Progression, and Exit. Biophysical Journal, 119(5), 1015-1024. doi:10.1016/j.bpj.2020.07.017.
[33] Festuccia, N., Gonzalez, I., Owens, N., and Navarro, P. (2017). Mitotic bookmarking in development and stem cells. Development, 144, 3633-3645.
[34] Alfieri, R., Merelli, I., Mosca, E., and Milanesi, L. (2007). A data integration approach for cell cycle analysis oriented to model simulation in systems biology. BMC Systems Biology, 1, 35. doi:10.1186/1752-0509-1-35.
[35] Halley-Stott, R.P., Jullien, J., Pasque, V., and Gurdon, J. (2014). Mitosis Gives a Brief Window of Opportunity for a Change in Gene Transcription. PLoS Biology, 12(7), e1001914. https://doi.org/10.1371/journal.pbio.1001914
[36] Perez-Carrasco, R., Beentjes, C. and Grima, R. (2020). Effects of cell cycle variability on lineage and population measurements of messenger RNA abundance. Journal of the Royal Society Interface, 17, 20200360.
[37] Green, J.B.A. and Sharpe, J. (2015). Positional information and reaction-diffusion: two big ideas in developmental biology combine. Development, 142, 1203-1211. doi:10.1242/dev.114991.
10_1101-2021_01_08_425967 ---- Partition Quantitative Assessment (PQA): A quantitative methodology to assess the embedded noise in clustered omics and systems biology data

Camacho-Hernández, Diego A.1,2†, Nieto-Caballero, Victor E.1,2†, León-Burguete, José E.1,2, and Freyre-González, Julio A.1,*

1 Regulatory Systems Biology Research Group, Laboratory of Systems and Synthetic Biology and 2 Undergraduate Program in Genomic Sciences, Center for Genomic Sciences, Universidad Nacional Autónoma de México (UNAM), Morelos, Mexico.
† These authors contributed equally to this work.
* Corresponding author: jfreyre@ccg.unam.mx

Abstract: Identifying groups that share common features among datasets through clustering analysis is a typical problem in many fields of science, particularly in post-omics and systems biology research. In this respect, quantifying how well a measure can cluster or organize intrinsic groups is important, since there is currently no statistical evaluation of how ordered the resulting clustered vector is, or of how much noise is embedded in it.
Much of the literature focuses on how well the clustering algorithm orders the data, with several external and internal statistical measures; but no measure has been developed to statistically quantify the noise in an arranged vector after a clustering algorithm has been applied, i.e., how much of the clustering is due to randomness. Here, we present a quantitative methodology, based on autocorrelation, to assess this problem.

Keywords: omics data; hierarchical clustering; noise quantification.

1. Introduction

A common task in today’s research is the identification of specific markers as predictors of a classification yielded by clustering analysis of the data. For instance, this approach is particularly useful after high-throughput experiments to compare gene expression or methylation profiles among different cell lines [1]. This task is also proving useful in the nascent field of single-cell sequencing, where clustering cells is an important step toward further classification or serves as a qualifying metric of the sequencing process [2]. In the widely used gene expression assays, the vector of profiles for each marker across different cell lines is recorded using hierarchical clustering algorithms. These algorithms yield a dendrogram and a heat map representing the vector of marker profiles, illustrating the arrangement of the clusters. To assess how well the clustering is segregating different cell lines, a class stating the desired partitioning of each cell line is provided a posteriori. Then, a simple visual inspection of the vector of classes is used to estimate whether the clustering is providing a good partition. Such a partition vector is colored according to the classification that each item is associated with, and it is expected that similar items will be contiguous, so the formed groups are assessed qualitatively on the biological background of each item.
This procedure should not be confused with “supervised clustering”, which provides a vector of classes stating the desired partitioning a priori. This is then used to guide the clustering algorithms by allowing the learning of the metric distances that optimize the partitioning [3]. Additionally, it may be confused with the metric assessment of the clustering algorithms, especially with the external cluster evaluation. For this, various metrics have been developed to qualify the clustering algorithm itself, such as intrinsic and extrinsic measures. These metrics are used for clustering algorithm validation. The extrinsic validation compares the clustering to a goal to say whether it is a good clustering or not. The internal validation compares the elements within the cluster and their differences [4]. PQA involves characteristics of both kinds of validation, by using both the crafted goal standard and the yielded signal itself (clustered vector). However, PQA gathers these elements not to qualify the clustering algorithm itself but to quantify the noise embedded in the cluster; this noise may be due to the intrinsic metric or marker used to order the data set.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.08.425967; this version posted January 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

A possible caveat of the qualitative assessment discussed above is that humans tend to perceive meaningful patterns within random data, leading to a cognitive bias known as apophenia [5].
While interpreting the partitions obtained from unsupervised clustering analysis, researchers attempt to visually assess how close the classifications are to each other, and may find patterns that are not well supported by the data. This effect arises because the adjacency between items may suggest the dissimilarity distance between the dendrogram leaves. Unfortunately, as far as we know, there is no method to quantitatively assess the quality of the groups of classifications obtained from the clustering or, at least, no attempt to quantify whether a certain configuration or order of the items may be due to randomness. This is a serious caveat, since the introduction of noise can lead to false conclusions or misleading results. Furthermore, purging this noise can lead to more efficient descriptions of markers and their associated phenomena, accelerating progress in many fields.

In statistics, serial correlation (SC) is a term used to describe the relationship between observations of the same variable over specific periods. It was originally used in engineering to determine how a signal, for instance a radio wave, varies with itself over time. Later, SC was adapted to econometrics to analyze economic data over time, principally to predict stock prices, and, in other fields, to model-independent random variables [6]. We applied SC to quantify how well a posterior classification is grouped, using only the results of an unsupervised clustering analysis. Thus, we propose a novel relative score, PQA, to address the subjectivity of visual inspection and to statistically quantify how much noise is embedded in the results of a clustering analysis.

2. Methodology

2.1. Assigning numeric labels to classifications

A vector denoting the putative similarities among the variables in a study is usually obtained after a clustering analysis. Each variable is classified to generate a vector of profiles (VP).
Such a vector of classifications is usually translated into a vector of colors, in which each color represents a classification. It is common to inspect this vector to find groups that make sense according to the analyzed data. For the method presented in this work, the VP may be as simple as a vector of strings or numbers that represent the input.

Figure 1. The pipeline of the PQA methodology.

Whatever the representation of the classifications may be, it is necessary to transform them into a vector of numeric labels, in which each number represents a classification, in order to calculate the SC. To accomplish this, we assign the first numeric label (number 1) to the first item in the vector, which usually lies at one of the vector’s extremes. Then, if the classification of the next item is different from the previous one, the next number in the sequence is assigned, and so on. This way of labeling ensures that changes in the SC values are due to the order of the numbers, that is to say, the grouping of the classifications resulting from the clustering, and are not an artifact of the labeling itself (Figure 1).

2.2. PQA score

Because the order of the VP can be interpreted as the grouping of the classifications, we measure how well identical classifications are held together in the VP through an SC shifted by one position.
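The labeling rule described in Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `numeric_labels` is hypothetical, and the sketch assumes that a classification that reappears later in the vector reuses the number it received at its first appearance.

```python
def numeric_labels(classes):
    """Map each classification to an integer in order of first appearance.

    Assumption (not stated explicitly in the text): a classification that
    shows up again later keeps the label it was given the first time.
    """
    label_of = {}  # classification -> numeric label
    labels = []
    for c in classes:
        if c not in label_of:
            # A classification not seen before gets the next number, starting at 1.
            label_of[c] = len(label_of) + 1
        labels.append(label_of[c])
    return labels


# Toy vector of classifications read from one end of a dendrogram to the other.
vp = ["tumor", "tumor", "normal", "tumor", "metastasis", "normal"]
print(numeric_labels(vp))
```

Under this scheme the first item always receives label 1, so a vector whose classifications are perfectly contiguous is already in ascending order.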
The SC shifted by one position is defined as the Pearson product-moment correlation between the VP without its first item and the VP without its last item (Equation 1), where $x_i$ is the item at the i-th position of the ordered vector, $n$ is the length of $x$, and $\rho_x$ is the resulting SC:

$$\rho_x = \frac{\sum_{i=1}^{n-1}\left(x_{i+1}-\bar{x}_{2:n}\right)\left(x_i-\bar{x}_{1:n-1}\right)}{\sqrt{\sum_{i=2}^{n}\left(x_i-\bar{x}_{2:n}\right)^2}\,\sqrt{\sum_{i=1}^{n-1}\left(x_i-\bar{x}_{1:n-1}\right)^2}}, \qquad \bar{x}_{2:n}=\frac{1}{n-1}\sum_{j=2}^{n}x_j, \quad \bar{x}_{1:n-1}=\frac{1}{n-1}\sum_{j=1}^{n-1}x_j \tag{1}$$

We then define the PQA as the SC of the VP after removing background noise, normalized by the SC of the perfectly grouped partition (defined as the VP sorted in ascending order). Thus, the more similar the VP is to its sorted vector, the higher the score (Equation 2), where $\rho_x$ is the SC of the VP, $\overline{\rho_{\mathrm{Rand}_x}}$ is the mean SC of one thousand randomizations of the VP, and $\rho_{\mathrm{Perfect}_x}$ is the SC of the VP sorted in ascending order:

$$\mathrm{PQA}_x = \frac{\rho_x - \overline{\rho_{\mathrm{Rand}_x}}}{\rho_{\mathrm{Perfect}_x}} \tag{2}$$

2.3. Background-noise correlation factor in the PQA score

To compute the background-noise correlation factor in the PQA score definition, we sample indexes of the VP and then swap the corresponding items. This background correction aims to remove inherent noise in the data, even though the score may still be subject to noise from the chosen clustering algorithm or to discrepancies in the posterior classification.

2.4. Statistical significance of the PQA score

To quantify the statistical significance of the PQA score, we calculate a Z-score (Equation 3):

$$z_x = \frac{\mathrm{PQA}_x - \overline{\mathrm{PQA}_{\mathrm{Rand}}}}{\mathrm{SD}_{\mathrm{PQA}_{\mathrm{Rand}}}} \tag{3}$$

where $\mathrm{PQA}_x$ is the PQA score of the VP, $\overline{\mathrm{PQA}_{\mathrm{Rand}}}$ is the mean of the PQA scores of one thousand randomizations of the VP, and $\mathrm{SD}_{\mathrm{PQA}_{\mathrm{Rand}}}$ is their standard deviation. These randomizations generate a solid random background against which the real signal is compared. The number of randomizations does not depend on the size of the VP. Note that there are two randomization processes: one generates the population of random vectors whose PQA scores are used to calculate the Z-score, and the other represents the noise term in Equation 2.

2.5. Defining noise proportions

To quantify the noise embedded in the VP, we calculate Z-scores from the distribution of PQA values of randomized vectors, which are obtained by scrambling the vector. This Z-score is then interpolated to retrieve the estimated noise in the VP cluster.

2.6. Effect of the length and number of partitions of the vector on the Z-score distributions

Since we want to compare the PQA with the noise, we randomized the VP 1000 times. We chose to describe the behavior of the Z-score for different percentages of noise and numbers of partitions. For this, we synthetically crafted vectors in which both the number of elements and the number of classifications ranged from 0 to 100. The Z-scores were computed from the crafted vectors using the formulas described above.

3. Results and Discussion

3.1. Effects of permuted numeric labels on the partition

We wondered whether the proposed assignment of numeric labels alters the SC calculations as little as possible, so we analyzed how the SC changes over synthetic partitions with permuted labels. We began by generating synthetic partitions in ascending and descending order, increasing both the number of classifications and the number of items up to 100.
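The quantities defined in Sections 2.2 to 2.4 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, shuffled vectors keep their numeric labels without relabeling, a constant vector is given an SC of 0, and every randomized vector is scored against one shared background mean instead of a nested set of shuffles.

```python
import random
from statistics import mean, stdev


def serial_correlation(x):
    """Lag-one SC: Pearson correlation of x[1:] against x[:-1] (Equation 1)."""
    a, b = x[1:], x[:-1]
    mean_a, mean_b = mean(a), mean(b)
    num = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    den = (sum((ai - mean_a) ** 2 for ai in a)
           * sum((bi - mean_b) ** 2 for bi in b)) ** 0.5
    return num / den if den else 0.0  # assumption: constant vector -> 0


def shuffled_scs(x, n_rand, rng):
    """SC values of n_rand random shuffles of x."""
    values = []
    for _ in range(n_rand):
        s = list(x)
        rng.shuffle(s)
        values.append(serial_correlation(s))
    return values


def pqa_and_z(x, n_rand=1000, seed=0):
    """Return (PQA score, Z-score) for a vector of numeric labels."""
    rng = random.Random(seed)
    rho_perfect = serial_correlation(sorted(x))  # VP sorted in ascending order
    if rho_perfect == 0:
        raise ValueError("need at least two classifications and three items")
    rho_rand_mean = mean(shuffled_scs(x, n_rand, rng))  # background in Eq. 2

    def pqa(rho):
        return (rho - rho_rand_mean) / rho_perfect

    pqa_x = pqa(serial_correlation(x))
    # Second, independent randomization: population of random PQA scores (Eq. 3).
    pqa_rand = [pqa(r) for r in shuffled_scs(x, n_rand, rng)]
    z = (pqa_x - mean(pqa_rand)) / stdev(pqa_rand)
    return pqa_x, z


# A perfectly grouped toy vector: three classifications of ten items each.
grouped = [1] * 10 + [2] * 10 + [3] * 10
score, z = pqa_and_z(grouped)
print(round(score, 2), round(z, 1))
```

For a perfectly grouped vector the SC equals the SC of the sorted vector, so the PQA score is close to 1; it can slightly exceed 1 because the mean SC of random shuffles is typically a small negative number.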
In these synthetic partitions, the number of items belonging to each classification was kept constant. Because trying all possible permutations of each vector would be infeasible, we created a subset of 1000 permutations of each vector and then calculated the mean SC (Figure 1, see Methodology). We observed that the mean SC was high when the number of items in the VP was greater than or equal to twice the number of classifications; nevertheless, we obtained the highest SC when the numeric labels were assigned in sequential order, either ascending or descending (Figure 2).

Figure 2. Z-scores of the PQA scores from partitions varying in the number of classifications and the length of the partition.

3.2. Length of partitions as a proxy of the number of classifications

We wondered whether the number of classifications and the length of the VP may change the statistical significance of the PQA score, because the fewer the items in the VP, the greater the chance that items are grouped in any given order. We tested this effect by calculating a Z-score from ordered synthetic partitions, increasing both the number of classifications and the number of items up to 100. We also kept the number of classifications constant for the sake of this analysis. We noticed that only the length of the partition has a true effect on the Z-score, which is not the case for the number of classifications. We observed that every partition shorter than 13 items could be considered pure noise; here, however, we apply a Z-score cutoff of greater than 3 (p-value of 0.002).
We also observed Z-score values still greater than 2 with lengths of 12, 11, and 10, but lower than 2 with lengths between 2 and 9 (Figure 2). If we were more flexible, we could have set the length cutoff at those values without losing statistical significance, since a Z-score of 2 corresponds roughly to a p-value of 0.05. The results of this analysis were expected intuitively, because the probability of an item occupying a position in the VP increases as the number of items does the same.

3.3. Proof of concept: Quantifying real noise

After a literature review, we noticed that some datasets were subject to visual inspection in their respective papers, so we applied our method to quantify the proportion of noise embedded in those datasets and to test whether they may lead to apophenia. We chose two datasets from the literature for two main reasons: first, the data should have a number of items well above our length threshold for Z-score significance (>13) and, second, we wanted contrasting orderings of the partitions, with one dataset that looks very disordered and another that looks somewhat ordered, to compare the noise proportions. Lastly, we assessed the behavior of the metric on highly ordered data, which also satisfies the threshold mentioned above.

3.3.1. Cancer methylation signatures

The first dataset consists of methylation profiles of 242 different cancerous and non-cancerous samples [7] (Figure 3). Although the classifications look very sparse and the groups are split into many subgroups distributed along the data’s VP, we detected 25.1% noise and a PQA score
of 0.53 (Figure 4; Z-score of 8.2, p-value of 9.6 × 10^-17). Both numbers imply that, even though there may be disorder in the VP, there is neither a very high noise proportion nor a high PQA score. These results suggest that, as with any other statistical test, the larger the number of items in the partition, the more diluted the effect of disorder in the VP and the greater the statistical significance, as shown in the analysis of the number of items and classifications. Although the authors concluded that their clustering results made sense given their molecular and biological background, as well as their perspectives on the analyzed profiles, they assessed the grouping only by visual inspection and concluded that it was well done. However, understanding the noise in the cluster can help in the pursuit of better markers, since it could narrow the search space in these kinds of studies.

Figure 3. Visual representation of clustered data used to assess the method. (a) Dataset from Jie Shen et al. (b) Dataset from Toyooka et al.

3.3.2. Distribution of microRNAs in cancer

The second dataset consists of 103 expression profiles of microRNAs from three classes of samples: invasive breast cancer, ductal carcinoma in situ (DCIS), and healthy controls (Figure 3) [8]. The authors visually identified three clusters, though selecting the right cutting height threshold is difficult. In addition, one of the clusters is a mix of classes in different proportions, leading the authors to conclude, arguably, that the DCIS and control sample profiles are not different. For this dataset, the PQA score and the proportion of noise are 0.62 and 30.2%, respectively (Figure 4; Z-score of 6.2, p-value of 3.9 × 10^-10), providing a quantitative assessment to support the grouping that the authors claimed.
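The noise proportions reported above come from interpolating an observed Z-score on a curve of Z-scores obtained at known percentages of randomized items (Section 2.5, Figure 4). A minimal sketch of that last step is given below. The function name, the use of linear interpolation, and the toy calibration curve are all assumptions made for illustration, since the text does not specify the interpolation scheme.

```python
def estimate_noise(z_observed, curve):
    """Linearly interpolate the noise percentage for an observed Z-score.

    curve: list of (noise_percent, z_score) pairs, assumed to be sorted by
    noise_percent with z_score decreasing as noise increases.
    """
    # Clamp values that fall outside the calibrated range.
    if z_observed >= curve[0][1]:
        return curve[0][0]
    if z_observed <= curve[-1][1]:
        return curve[-1][0]
    for (p0, z0), (p1, z1) in zip(curve, curve[1:]):
        if z1 <= z_observed <= z0:
            if z0 == z1:
                return p0
            frac = (z0 - z_observed) / (z0 - z1)
            return p0 + frac * (p1 - p0)
    raise ValueError("curve must be sorted by noise with decreasing Z-scores")


# Toy calibration curve (hypothetical numbers, not taken from the paper).
toy_curve = [(0.0, 12.0), (25.0, 8.0), (50.0, 4.0), (100.0, 0.0)]
print(estimate_noise(6.0, toy_curve))
```

In practice the calibration curve would be built by randomizing increasing fractions of a sorted vector of the same length and composition as the VP and computing a Z-score at each fraction.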
Compared with the methylation profiles discussed above, a partition that appears even less fuzzy has an even higher noise ratio, which supports the idea that visual inspection can lead to misleading results.

Figure 4. Z-score distribution by percentage of randomized items. (a) Dataset from Jie Shen et al. (b) Dataset from Toyooka et al. The red dots represent the Z-score interpolation of the corresponding data sets.

3.3.3. Comparison of genetic regulatory networks with theoretical models

Finally, to assess the PQA methodology using systems biology data, we clustered 210 networks according to their pairwise dissimilarity [9]. First, 42 curated biological networks were retrieved from Abasy Atlas (v2.2) [10]. For each biological network, we then constructed four networks, each according to a theoretical model (Barabasi-Albert, Erdos-Renyi, Scale-free, and Hierarchical-modular). We estimated the parameters of each theoretical model from the properties of the corresponding biological network. The models used reproduce one or more intrinsic characteristics of the biological networks, such as power-law distributions, hubs, scale-free degrees, and hierarchical modular structure [11]. Visual inspection suggested that the classification yielded a highly ordered VP, distinguishing the networks according to their nature (Figure 5). The PQA score for this VP is 0.92 (p-value = 2.5 × 10^-40, Z-score = 13.2) and the proportion of noise is 5.8% (Figure 6).
In contrast to the previous examples, here we obtained a highly ordered clustering and a very low proportion of noise, which suggests that although the models recapitulate some of the properties of genetic regulatory networks, none of them alone is sufficient to capture their structural properties.

Figure 5. Cluster analysis of distance among gene regulatory networks and theoretical network models. The abbreviations and colors used in the posterior classification are as follows: Barabasi-Albert (BA, red), Erdos-Renyi (ER, blue), Scale-free (SF, green), Hierarchical modularity (HM, purple), and biological networks (Bi, orange).

Figure 6. Z-score distribution by percentage of randomized items of VP from genetic regulatory networks. The red dot represents the Z-score interpolation of the actual data set.

4. Conclusions

In this work, we presented a novel method to quantify the proportion of noise embedded in the grouping of the classes associated with the elements in hierarchical clustering. We proposed a relative score derived from an SC of the VP from the dendrogram of any clustering analysis, and calculated Z-statistics as well as an interpolation to deliver an estimate of the noise in the VP. We explained how the method is formulated and showed the tests we made to systematically refine it. We additionally provided a proof of concept using clustering data from two works that we think clearly represent overfitting by apophenia. We also added an example from network biology in which clustered networks are separated by intrinsic characteristics.
Although in this work we focused on examples where hierarchical clustering is performed, this framework can be applied to any partition algorithm in which the elements are identified and a vector of their order can be acquired. We concluded that the clustered sets of biological data have a high measure of noise, despite looking well grouped. We showed what minimum number of classifications should be considered in this sort of clustering analysis to obtain a significant reduction of noise. On the other hand, we permuted the labels of the associated classes and concluded that the effect is negligible. We showed that randomness still plays an important role by biasing the results, though it may not be evident through visual inspection. The PQA could be used as a benchmark to test which clustering algorithm is appropriate for the analyzed dataset, by minimizing the noise proportion, and to guide omics experimental designs. Nevertheless, a word of caution: the PQA score alone can be subject to subjectivity if not used properly, since it depends on the characteristics of the analyzed data. Thus, the PQA score is intended as a quantification of noise in clustered data and should be used with discretion.

Author Contributions: Conceptualization, J.A.F.G.; methodology, J.A.F.G.; software, D.A.C.H., V.E.N.C., and J.A.F.G.; validation, D.A.C.H., V.E.N.C., and J.A.F.G.; formal analysis, D.A.C.H., V.E.N.C., and J.A.F.G.; investigation, D.A.C.H., V.E.N.C., J.R.L.B., and J.A.F.G.; resources, J.A.F.G.; data curation, D.A.C.H., V.E.N.C., and J.E.L.B.; writing (original draft preparation), D.A.C.H., V.E.N.C., J.E.L.B., and J.A.F.G.; writing (review and editing), D.A.C.H., V.E.N.C., and J.A.F.G.; visualization, D.A.C.H., V.E.N.C., J.E.L.B., and J.A.F.G.; supervision, J.A.F.G.; project administration, J.A.F.G.; funding acquisition, J.A.F.G. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (PAPIIT-UNAM) [IN205918 to J.A.F.G.].

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Kang, S., Kim, B., Park, S.-B., et al. 2013. Stage-specific methylome screen identifies that NEFL is downregulated by promoter hypermethylation in breast cancer. International Journal of Oncology 43(5), pp. 1659–1665, doi:10.3892/ijo.2013.2094.
2. Kiselev, V. Y., Andrews, T. S., & Hemberg, M. (2019). Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics, 20(5), 273-282, doi:10.1038/s41576-018-0088-9.
3. Al-Harbi, S.H. and Rayward-Smith, V.J. 2006. Adapting k-means for supervised clustering. Applied Intelligence 24(3), pp. 219–226, doi:10.1007/s10489-006-8513-8.
4. Hassani, M., & Seidl, T. (2017). Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam Journal of Computer Science, 4(3), 171-183, doi:10.1007/s40595-016-0086-9.
5. Fyfe, S., Williams, C., Mason, O.J. and Pickup, G.J. 2008. Apophenia, theory of mind and schizotypy: perceiving meaning and intentionality in randomness. Cortex 44(10), pp. 1316–1325, doi:10.1016/j.cortex.2007.07.009.
6. Getmansky, M., Lo, A.W. and Makarov, I. 2004. An econometric model of serial correlation and illiquidity in hedge fund returns. Journal of Financial Economics 74(3), pp. 529–609, doi:10.1016/j.jfineco.2004.04.001.
7. Shen, J., Hu, Q., Schrauder, M., et al. 2014. Circulating miR-148b and miR-133a as biomarkers for breast cancer detection. Oncotarget 5(14), pp. 5284–5294, doi:10.18632/oncotarget.2014.
8. Toyooka, S., Toyooka, K. O., Maruyama, R., Virmani, A. K., Girard, L., Miyajima, K., ... & Brambilla, E. (2001). DNA Methylation Profiles of Lung Tumors. Molecular Cancer Therapeutics, 1(1), 61-67.
9. Schieber, T. A., Carpi, L., Díaz-Guilera, A., Pardalos, P. M., Masoller, C., & Ravetti, M. G. (2017). Quantification of network structural dissimilarities. Nature Communications, 8(1), 1-10.
10. Escorcia-Rodríguez, J. M., Tauch, A., & Freyre-González, J. A. (2020). Abasy Atlas v2.2: The most comprehensive and up-to-date inventory of meta-curated, historical, bacterial regulatory networks, their completeness and system-level characterization. Computational and Structural Biotechnology Journal, doi:10.1016/j.csbj.2020.05.015.
11. Barabasi, A. L., & Oltvai, Z. N. (2004). Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5(2), 101-113, doi:10.1038/nrg1272.
10_1101-2021_01_08_425918 ---- 94119894

A global cancer data integrator reveals principles of synthetic lethality, sex disparity and immunotherapy.

Christopher Yogodzinski1,2,#*, Abolfazl Arab1-3, Justin R. Pritchard4, Hani Goodarzi1-3, Luke A. Gilbert1,2,5*

1 Department of Urology, University of California, San Francisco, San Francisco, CA, USA
2 Helen Diller Family Comprehensive Cancer Center, San Francisco, San Francisco, CA, USA
3 Department of Biochemistry and Biophysics, University of California, San Francisco, CA, USA
4 Department of Biomedical Engineering, Pennsylvania State University, University Park, PA
5 Department of Cellular & Molecular Pharmacology, University of California, San Francisco, CA, USA
# Current Address: University of North Carolina Chapel Hill School of Medicine, Chapel Hill, NC, USA
* Corresponding authors
Correspondence: cyogodzi@unc.edu (C.Y.), luke.gilbert@ucsf.edu (L.A.G.)

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.08.425918; this version posted January 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract

Advances in cancer biology are increasingly dependent on integration of heterogeneous datasets. Large scale efforts have systematically mapped many aspects of cancer cell biology; however, it remains challenging for individual scientists to effectively integrate and understand this data. We have developed a new data retrieval and indexing framework that allows us to integrate publicly available data from different sources and to combine publicly available data with new or bespoke datasets. Beyond a database search, our approach empowered testable hypotheses of new synthetic lethal gene pairs, genes associated with sex disparity, and immunotherapy targets in cancer. Our approach is straightforward to implement, well documented, and continuously updated, which should enable individual users to take full advantage of efforts to map cancer cell biology.

Introduction

Large scale but often independent efforts have mapped phenotypic characteristics of more than one thousand human cancer cell lines. Despite this, static lists of univariate data generally cannot identify the underlying molecular mechanisms driving a complex phenotype. We hypothesized that a global cancer data integrator that could incorporate many types of publicly available data, including functional genomics, whole genome sequencing, exome sequencing, RNA expression data, protein mass spectrometry, DNA methylation profiling, ChIP-seq, ATAC-seq, and metabolomics data, would enable us to link disease features to gene products 1–15. We set out to build a resource that enables cross platform correlation analysis of multi-omic data, as this analysis is in and of itself a high-resolution phenotype. Multi-omic analysis of
functional genomics data with genomic, metabolomic or transcriptomic profiling can link cell state or specific signaling pathways to gene function 2,3,13,15–18. Lastly, co-essentiality profiling across large panels of cell lines has revealed protein complexes and co-essential modules that can assign function to uncharacterized genes 19.

Problematically, in many cases publicly available data are poorly integrated when considering information on all genes across different types of data, and the existing data portals are inflexible. For example, lists of genes cannot be queried against groups of cell lines stratified by mutation status or disease subtype. Furthermore, one cannot integrate new data derived from individual labs or other consortia.

We created the Cancer Data Integrator (CanDI), which is a series of Python modules designed to seamlessly integrate genomic, functional genomic, RNA, protein and metabolomic data into one ecosystem. Our Python framework operates like a relational database without the overhead of running MySQL or Postgres and enables individual users to easily query this vast dataset and add new data in flexible ways. This was achieved by unifying the indices of these datasets via index tables that are automatically accessed through CanDI’s biologically relevant Python classes. We highlight the utility of CanDI through four types of analysis to demonstrate how complex queries can reveal previously unknown molecular mechanisms in synthetic lethality, sex disparity and immunotherapy. These data nominate new small molecule and immunotherapy anti-cancer strategies in KRAS-mutant colon, lung and pancreatic cancers.

Results

CanDI is a global cancer data integrator.
We set out to integrate three types of data by creating programmatic and biologically relevant abstractions that allow for flexible cross referencing across all datasets. Data from the Cancer Cell Line Encyclopedia (CCLE) for RNA expression, DNA mutation, DNA copy number and chromosome fusions across more than 1000 cancer cell lines was integrated into our database with the functional genomics data from the Cancer Dependency Map (DepMap) (Fig. 1a,b and Supplementary Fig. 1) 1,12,20. We also integrated protein-protein interaction data from the CORUM database along with three additional distinct protein localization databases 4,7,11,21. By default, CanDI accesses the most recent release of data from DepMap, although users can also specify both the release and the data type that is accessed. The key advantage of this approach is that CanDI enables one to easily input user defined queries with multi-tiered conditional logic into this large integrated dataset to analyze gene function, gene expression, protein localization and protein-protein interactions.

CanDI identifies genes that are conditionally essential in BRCA-mutant ovarian cancer.

The concept that loss-of-function tumor suppressor gene mutations can render cancer cells critically reliant on the function of a second gene is known as synthetic lethality. Despite the promise of synthetic lethality, it has been challenging to predict or identify genes that are synthetic lethal with commonly mutated tumor suppressor genes. While there are many underlying reasons for this challenge, we reasoned that data integration through CanDI could identify synthetic lethal interactions missed by others.
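The idea of a shared index combined with multi-tiered conditional queries can be illustrated with a small, self-contained Python sketch. It does not use or reproduce CanDI's actual API: the tables, cell line IDs, gene names, scores, and function names below are hypothetical placeholders chosen only to show the pattern.

```python
from statistics import mean

# Hypothetical index table: one record per cell line, keyed by a shared ID.
CELL_LINES = {
    "LINE_1": {"tissue": "ovary", "mutated": {"BRCA1"}},
    "LINE_2": {"tissue": "ovary", "mutated": set()},
    "LINE_3": {"tissue": "breast", "mutated": {"BRCA2"}},
    "LINE_4": {"tissue": "ovary", "mutated": {"BRCA2", "TP53"}},
}

# Hypothetical essentiality table keyed by the same IDs (toy scores).
ESSENTIALITY = {
    "LINE_1": {"GENE_A": 1.0, "GENE_B": 0.0},
    "LINE_2": {"GENE_A": 0.0, "GENE_B": 0.0},
    "LINE_3": {"GENE_A": 0.0, "GENE_B": 1.0},
    "LINE_4": {"GENE_A": 0.5, "GENE_B": 0.0},
}


def select_lines(tissue, mutated_any=None, wildtype_all=None):
    """Return IDs of lines in `tissue` that pass both mutation filters."""
    selected = []
    for line_id, info in sorted(CELL_LINES.items()):
        if info["tissue"] != tissue:
            continue
        if mutated_any and not (info["mutated"] & set(mutated_any)):
            continue  # needs a mutation in at least one listed gene
        if wildtype_all and (info["mutated"] & set(wildtype_all)):
            continue  # must be wild type for every listed gene
        selected.append(line_id)
    return selected


def mean_essentiality(line_ids, gene):
    """Average a gene's toy essentiality score over the selected lines."""
    return mean(ESSENTIALITY[line_id][gene] for line_id in line_ids)


brca = ["BRCA1", "BRCA2"]
mutant = select_lines("ovary", mutated_any=brca)
wildtype = select_lines("ovary", wildtype_all=brca)
print(mutant, mean_essentiality(mutant, "GENE_A"))
print(wildtype, mean_essentiality(wildtype, "GENE_A"))
```

Because both tables share the same cell line IDs, a filter expressed on one table (tissue and mutation status) can be applied directly to another (essentiality), which is the kind of cross referencing the index tables are described as providing.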
A paradigmatic example of synthetic lethality emerged from the study of DNA damage repair (DDR)22. Somatic mutations in the DNA double-strand break (DSB) repair genes, BRCA1/2, create an increased dependence on DNA single strand break (SSB) repair. This dependence can be exploited through small molecule inhibition of PARP1 mediated SSB repair. Inhibition of PARP provides significant clinical responses in advanced breast and ovarian cancer patients, but they ultimately progress22. Thus, new synthetic lethal associations with BRCA1/2 are a potential path towards therapeutic development for PARP inhibitor refractory patients.

To illustrate the flexibility of CanDI to mine context specific synthetic sick lethal (SSL) genetic relationships, we hypothesized that the genes that modulate response to a PARP1 inhibitor might be enriched for genes that are selectively essential for proliferation or survival of BRCA1/2-mutant cancer cells. To test this hypothesis, we integrated the results of an existing CRISPR screen that identified genes that modulate response to the PARP inhibitor olaparib23. We then tested whether any of these genes are differentially essential for cell proliferation or survival in ovarian cancer and in breast cancer cell models that are either BRCA1/2 proficient or deficient (Fig. 1c,d). This query revealed that the Fanconi Anemia pathway is selectively essential in BRCA1/2-mutated ovarian cancer models but not in BRCA1/2-wildtype ovarian cancer, BRCA1/2-mutated breast cancer or BRCA1/2-wildtype breast cancer models (Fig. 1e and Supplementary Table 1).
To our knowledge, an SSL phenotype between FANCM and BRCA1/2 has never been reported, although a recent paper nominated a role for FANCM and BRCA1 in telomere maintenance24. Importantly, FANCM is a helicase/translocase and is thus considered to be a druggable target for cancer therapy25. Clinical genomics data support this SSL hypothesis, although this remains to be tested in ovarian cancer patient samples26. Because the DepMap currently only allows single genes to be queried and does not enable users to easily stratify cell lines by mutation, such an analysis would normally take a user several days to complete manually. Our approach enabled this analysis to be completed on a desktop computer in less than two hours, including the visualization of the data presented here (Fig. 1e).

Figure 1. (A) A schematic showing human cell models integrated by CanDI. (B) A schematic illustrating types of data integrated by CanDI. (C) A cartoon of a genome-scale CRISPRi screen to identify genes that modulate response to PARP inhibition by olaparib. (D) A schematic depicting data feature inputs parsed by CanDI. (E) Essentiality of Fanconi Anemia genes in ovarian and breast cancer cell lines separated by BRCA mutation status. A Bayes Factor score of gene essentiality is displayed as a heat map. N=4 BRCA1/2-mutant ovarian cancer, N=27 BRCA1/2-wildtype ovarian cancer, N=5 BRCA1/2-mutant breast cancer, N=19 BRCA1/2-wildtype breast cancer.

Conditional genetic essentiality in KRAS- and EGFR-mutant NSCLC cells.

Beyond tumor suppressor genes (TSGs), many common driver oncogenes such as KRAS G12D are currently undruggable, which motivates the search for oncogene-specific conditional genetic dependencies.
We reasoned that CanDI enables us to rapidly search functional genomics data for genes that are conditionally essential in lung cancer cells driven by KRAS and EGFR mutations. We stratified non-small cell lung cancer (NSCLC) cell models by EGFR and KRAS mutations and then looked at the average gene essentiality for all genes within each of these 4 subtypes of NSCLC. We observed that KRAS is conditionally self-essential in KRAS-mutant cell models but that no other genes are conditionally essential in KRAS-mutant, EGFR-mutant, KRAS-wildtype or EGFR-wildtype cell models (Fig. 2a,b and Supplementary Table 2). This finding demonstrates that very few, if any, genes are synthetic lethal with KRAS or EGFR in KRAS- and EGFR-mutant lung cancer cell lines. It may be that these experiments are underpowered, or it may be that when the genetic dependencies of diverse cell lines representing a disease subtype are averaged across a single variable (e.g. a KRAS mutation), very few common synthetic lethal phenotypes are observed27. CanDI provides potential solutions for both of these hypotheses.

CanDI enables a global analysis of conditional essentiality in cancer.

It is thought that data aggregation across vast landscapes of unknown covariates does not necessarily increase the statistical power to identify rare associations27. Thus, the value of global analyses of aggregated cancer data sometimes lies in systematically subsetting the data on key covariates after aggregation. This has been observed in driver gene identification28.
Inspired by our analysis of TSG and oncogene conditional essentiality above, we next used CanDI to identify genes that are conditionally essential in the context of several hundred cancer driver mutations. We first grouped driver mutations by type (e.g. nonsense or missense) for each driver gene. For this analysis, we selected several thousand genes that are in the 85-90th percentile of essentiality within the DepMap data and therefore conditionally essential, meaning these genes are required for cell growth or survival in a subset of cell lines. Importantly, it is not known why these several thousand genes are conditionally essential. We then tested whether each of these conditionally essential genes has a significant association with individual driver mutations. This analytic approach does not weight the number of cell models representing each driver mutation, nor does it give information on phenotype effect sizes. Our analysis nominates a large number of conditionally dependent genetic relationships with both TSGs and oncogenes (Fig. 2c,d and Supplementary Table 3). A number of the conditional genetic dependencies identified in our independent variable analysis above are represented by a limited number of cell models, so further investigation is needed to validate them. Nevertheless, these data further suggest that averaging genetic dependencies across diverse cell lines with un-modeled covariates obscures conditional SSL relationships. To further investigate this hypothesis, we analyzed these same conditional genetic relationships with a second analytic approach that weights the number of cell models representing each driver mutation.
We observed a limited number of conditional genetic dependencies that largely consist of oncogene self-essential dependencies, as previously highlighted for KRAS-mutant cell lines (Fig. 2e-g and Supplementary Table 4)13,29. Thus, analysis that averages each conditional phenotype across diverse panels of cell lines with unknown covariates masks interesting conditional genetic dependencies.

Figure 2. (A) Average gene essentiality for KRAS and EGFR in groups of NSCLC cell lines stratified by KRAS mutation status or by both KRAS and EGFR mutation status. N=38 for KRAS-wildtype shown in blue, N=19 for KRAS-mutant shown in blue. N=30 for KRAS-wildtype EGFR-wildtype shown in grey and N=16 for KRAS-mutant EGFR-wildtype shown in grey. Gene essentiality is an averaged Bayes Factor score for each group of cell lines. (B) Average gene essentiality for KRAS and EGFR in groups of NSCLC cell lines stratified by EGFR mutation status or by both EGFR and KRAS mutation status. N=46 for EGFR-wildtype shown in blue, N=11 for EGFR-mutant shown in blue. N=30 for EGFR-wildtype KRAS-wildtype shown in grey and N=8 for EGFR-mutant KRAS-wildtype shown in grey. Gene essentiality is an averaged Bayes Factor score for each group of cell lines. (C) P-values from chi-squared tests of gene essentiality and nonsense mutations. (D) P-values from chi-squared tests of gene essentiality and missense mutations.
(E) A scatter plot showing effect size of the change in gene essentiality with select missense mutations and the -Log10(P-value) of each essentiality/mutation pair. (F) A scatter plot showing effect size of the change in gene essentiality with select nonsense mutations and the -Log10(P-value) of each essentiality/mutation pair. (G) A scatter plot showing effect size of the change in gene essentiality with all mutations and the -Log10(P-value) of each essentiality/mutation pair.

CanDI reveals female and male context-specific essential genes in colon, lung and pancreatic cancer.

Cancer functional genomics data are often analyzed without consideration for fundamental biological properties such as the sex of the tumor from which each cell line is derived. It is well established that biological sex influences cancer predisposition, cancer progression and response to therapy30. We hypothesized that individual genes may be differentially essential across male and female cell lines. To our knowledge, this hypothesis has never been tested in an unbiased, large-scale manner. To maximize our statistical power to identify such differences, we chose to test this hypothesis in a disease setting with a large number of relatively homogeneous cell lines and fewer unknown covariates. Using CanDI, we stratified all KRAS-mutant NSCLC, pancreatic adenocarcinoma (PDAC), and colorectal cancer (CRC) cell lines by sex and then tested for conditional gene essentiality. This analysis identified a number of genes that are differentially essential in male or female KRAS-mutant NSCLC, PDAC and CRC models (Fig. 3a-f and Supplementary Table 5).
The genes that we identify are not common across all three disease types, suggesting, as one might expect, that the biology of the tumor also in part determines gene essentiality. To test whether differential essentiality is associated with differential expression (e.g. essential genes encoded on the Y chromosome), we first used CanDI to identify genes that are differentially expressed between male and female cell lines within each disease31. We then plotted the set of differentially essential genes against the differentially expressed genes in KRAS-mutant NSCLC, PDAC and CRC models (Fig. 3a,c,e and Supplementary Table 6) and found little overlap between these gene lists. A number of genes that are more essential in male cells, such as AHCYL1, ENO1, GPI and PKM, regulate cellular metabolism. This finding is consistent with previous literature on sex and metabolism32. Our analysis demonstrates that stratifying groups of heterogeneous cancer models by three variables, in this case tumor type, KRAS mutation status and sex, reveals differentially essential genes. CanDI enables biologically principled stratification of data in the CCLE and DepMap by any feature associated with a group of cell models. This stratification allows us to identify genes associated with sex, which is not possible with other covariates included.

Figure 3. (A) Differential gene expression and differential gene essentiality in male and female CRC cell lines. N=7 male cell lines and N=3 female cell lines. (B) The distribution of Bayes factor gene essentiality scores in male and female CRC cell lines.
The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. (C) Differential gene expression and differential gene essentiality in male and female NSCLC cell lines. N=9 male cell lines and N=5 female cell lines. (D) The distribution of Bayes factor gene essentiality scores in male and female NSCLC cell lines. The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. (E) Differential gene expression and differential gene essentiality in male and female PDAC cell lines. N=13 male cell lines and N=5 female cell lines. (F) The distribution of Bayes factor gene essentiality scores in male and female PDAC cell lines. The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines.

CanDI enables rapid integration of external datasets to reveal new immunotherapy targets.

An emerging challenge in cancer biology is how to robustly integrate larger “resource” datasets like CCLE with the vast amount of published data from individual laboratories. For example, a big challenge in antibody discovery is identifying specific surface markers on cancer cells. To approach these questions, we used CanDI's ability to rapidly take in new datasets, such as raw RNA-seq count data from a separate study of interest, and then normalize and integrate these data with the CCLE, DepMap and protein localization databases described above.
Specifically, we rapidly integrated an RNA-seq expression dataset that measured the set of transcribed genes in primary lung bronchial epithelial cells from 4 donors33. Classes within CanDI enable rapid application of DESeq2 to assess the differential expression between outside datasets and the CCLE. We used this feature to identify genes that are differentially expressed between primary lung bronchial epithelial cells and KRAS-mutant NSCLC, EGFR-mutant NSCLC or all NSCLC models in CCLE. We then used CanDI to identify genes that are upregulated in cancer cells over normal lung bronchial epithelial cells and whose protein products are localized to the cell membrane. This analysis of KRAS-mutant, EGFR-mutant and pan-NSCLC models generated highly similar lists of differentially expressed surface proteins (Fig. 4a-f and Supplementary Table 7). Notably, overexpression of several of these genes, such as CD151 and CD44, has been observed in lung cancer and is associated with poor prognosis34–36. These proteins represent potential new immunotherapy targets in KRAS-driven NSCLC.

Figure 4. (A) A graph showing genes that are upregulated in KRAS-mutant NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations.
(B) A scatter plot showing gene expression for genes that encode cell surface proteins in KRAS-mutant NSCLC cell lines and primary human bronchial epithelial cells. N=46 for KRAS-mutant NSCLC cell lines and N=4 for primary human bronchial epithelial cells. (C) A graph showing genes that are upregulated in EGFR-mutant NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (D) A scatter plot showing gene expression for genes that encode cell surface proteins in EGFR-mutant NSCLC cell lines and primary human bronchial epithelial cells. N=21 for EGFR-mutant NSCLC cell lines and N=4 for primary human bronchial epithelial cells. (E) A graph showing genes that are upregulated in NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (F) A scatter plot showing gene expression for genes that encode cell surface proteins in NSCLC cell lines and primary human bronchial epithelial cells. N=141 for NSCLC cell lines and N=4 for primary human bronchial epithelial cells.

Discussion

Data integration is a critical requirement in biology research in the era of genomics and functional genomics. Large-scale efforts such as the CCLE have revealed genomic features of more than 1000 cell line models. To our knowledge, these data have not previously been integrated with functional genomics data in a manner that lets individual users enter batched queries stratified by disease subtype or mutation status. This is not just a small improvement in functionality; rather, it is an enabling format that makes possible the types of conditional genomics analyses that drive discovery. Moreover, it fills a fundamental gap in the cancer research community by integrating large-scale projects with investigator-initiated studies. Our data framework enables biologists without specialized expertise in bioinformatics to use the full spectrum of data in the CCLE and DepMap in a higher-throughput and more precise manner. Using CanDI, we identified genes that are selectively essential in male versus female KRAS-mutant NSCLC, PDAC and CRC models. To our knowledge, such an analysis has never been performed to begin to query the biological basis of sex disparity in cancer or cancer therapy. We illustrate another feature of our framework by analyzing a list of hit genes nominated by a bespoke CRISPR drug screen for gene essentiality in BRCA1/2-wildtype and BRCA1/2-mutated breast and ovarian cancer. In a third application, we analyzed the principle of synthetic lethality for 17427 genes in 19 KRAS-mutant and 11 EGFR-mutant NSCLC models. We then used CanDI to globally identify genes that are conditionally essential in the context of common cancer driver mutations. Finally, we nominated 12 potential new immunotherapy targets in KRAS-mutant, EGFR-mutant and pan-NSCLC models by using CanDI to identify genes that are differentially expressed between normal bronchial epithelial cells and NSCLC models and whose protein products are localized at the plasma membrane. Our data reveal a wealth of new hypotheses that can be rapidly generated from publicly available cancer data. By sharing data flows and use cases with a CanDI community, we illustrate the ways in which individual research groups can interact with massive cancer genomics projects without reinventing tools or relying upon DepMap tool releases.
We anticipate that CanDI will be widely used in cell biology, immunology and cancer research.

Methods

CanDI

The CanDI data integrator is available at https://github.com/Yogiski/CanDI.

CanDI Module Structure

The CanDI data integrator is a Python library built on top of pandas that specializes in integrating the publicly available data from The Cancer Dependency Map (DepMap Release: 2019 Quarter 3)12, The Cancer Cell Line Encyclopedia (CCLE Release: 2019 Quarter 3)1, The Pooled In-Vitro CRISPR Knockout Essentiality Screens Database (PICKLES Library: Avana 2018 Quarter 4)20, The Comprehensive Resource of Mammalian Protein Complexes (CORUM)8 and protein localization data from The Cell Atlas4, The Map of the Cell11, and The In Silico Surfaceome7,21. Data from DepMap and CCLE used in the following analyses are from the 2019Q3 release. Data from PICKLES are from the 2018 Quarter 4 release of DepMap using the Avana library. Access to all datasets is controlled via a Python class called Data. Upon import, the Data class reads the config file established during installation, defines unique paths to each dataset, and automatically loads the cell line index table and the gene index table. Installation of CanDI, configuration, and data retrieval are handled by a manager class that is accessed indirectly through installation scripts and the Data class. Interactions with these data are controlled through a parent Entity class and several handlers. The biologically relevant abstraction classes (Gene, CellLine, Cancer, Organelle, GeneCluster, CellLineCluster) inherit their methods from Entity.
Entity methods are wrappers for hidden data handler classes that perform specific transformations, such as data indexing and high-throughput filtering.

Differential Expression

In all cases where it is mentioned, differential expression was evaluated using the DESeq2 R package (Release 3.10)31. Significance was defined as an adjusted p-value of less than 0.01.

Differential Essentiality

Essentiality scores are taken from the PICKLES database (Avana 2018Q4). To reduce the number of hypotheses posed during this analysis, the mutual information of gene essentiality was calculated using the mutual information metric from the Python package scikit-learn (version 0.22.0). Genes with mutual information scores greater than one standard deviation above the median were removed from consideration. Differential essentiality was evaluated by performing a Mann-Whitney U test between two groups on every gene that passed the mutual information filter. Significance was defined as a p-value of less than 0.01. The magnitude of differential essentiality of a given gene was shown as the difference in mean Bayes factors between two groups of cell lines.

Protein Localization Confidence

Protein localization data were assembled from The Cell Atlas4, The Map of the Cell11, and The In Silico Surfaceome7,21. Confidence annotations were taken from the supplemental data of each paper, put on a numeric scale from 0 to 4, and summed across all three papers to give a total confidence score for each localization annotation of every gene.
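The scoring scheme described above can be illustrated with a minimal sketch. It assumes each source's annotations have already been mapped onto the 0 to 4 scale; the gene names, scores, variable names and the `total_confidence` helper are hypothetical and are not part of CanDI.

```python
# Toy illustration of summing per-source localization confidence scores.
# All gene names and scores below are invented for demonstration only.

# Hypothetical annotations per source: gene -> {localization: score on a 0-4 scale}
cell_atlas = {"GENE_A": {"plasma membrane": 4}, "GENE_B": {"cytosol": 3}}
map_of_cell = {"GENE_A": {"plasma membrane": 3}, "GENE_B": {"plasma membrane": 1}}
surfaceome = {"GENE_A": {"plasma membrane": 2}}


def total_confidence(*sources):
    """Sum 0-4 confidence scores per (gene, localization) pair across sources."""
    totals = {}
    for source in sources:
        for gene, annotations in source.items():
            for localization, score in annotations.items():
                if not 0 <= score <= 4:
                    raise ValueError(f"score out of range for {gene}: {score}")
                key = (gene, localization)
                totals[key] = totals.get(key, 0) + score
    return totals


totals = total_confidence(cell_atlas, map_of_cell, surfaceome)
for (gene, localization), score in sorted(totals.items()):
    print(gene, localization, score)
```

Summed this way, a score can range from 0 to 12 across the three sources, which gives context for a cutoff such as the plasma membrane threshold of six applied in the immunotherapy target analysis.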
The analysis shown in Figure 4 represents a gene list that was further manually curated to remove the genes that are localized to the intracellular space at the cell membrane, revealing cell surface protein targets that are highly expressed in NSCLC models over normal lung bronchial epithelial cells4,7,11,21.

DepMap Creative Commons License

When an individual user runs CanDI, they are downloading DepMap data and thus are agreeing to a CC Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).

Synthetic Lethality of Fanconi Anemia Genes in Ovarian and Breast Cancer Models

We made a list of the top 50 gene hits that confer sensitivity to PARP inhibition in HeLa cells23. Using CanDI, the essentiality scores of these top hits were visualized across all ovarian cancer cell models in PICKLES (Avana 2018Q4). FANCA and FANCE showed selective essentiality in the BRCA1/2-mutant ovarian cancer cell lines. Following this observation, CanDI was used to gather the gene essentiality for all FANC genes in the Fanconi anemia pathway. CanDI was then used to visualize these data across all ovarian and breast cancer cell lines, sorted by BRCA1/2 mutation status.

Synthetic Lethality in KRAS and EGFR mutant Cell Lines

CanDI was leveraged to bin NSCLC cell lines present in both CCLE (Release: 2019Q3) and PICKLES (Avana 2018Q4) into 8 groups: KRAS-mutant and KRAS-wildtype cell lines, each with and without EGFR mutants removed, and EGFR-mutant and EGFR-wildtype cell lines, each with and without KRAS mutants removed. The mean essentiality score for every gene in the genome was calculated for every group of cell lines.
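The grouping and per-group averaging described above can be sketched in a few lines. This is an illustration under stated assumptions, not CanDI's implementation: the gene names, cell line names, Bayes factors and helper functions are invented, and the sign convention for the mutant versus wild-type comparison (mutant mean minus wild-type mean) is an assumption.

```python
from statistics import mean

# Hypothetical Bayes factor essentiality scores: gene -> {cell line: score}.
# Gene names, cell line names and values are invented for illustration.
essentiality = {
    "GENE_X": {"line1": 40.0, "line2": 30.0, "line3": -10.0, "line4": -20.0},
    "GENE_Y": {"line1": 5.0, "line2": 7.0, "line3": 6.0, "line4": 6.0},
}
mutant_lines = ["line1", "line2"]      # e.g. lines carrying the driver mutation
wildtype_lines = ["line3", "line4"]    # e.g. lines without it


def group_mean(scores, lines):
    """Mean essentiality of one gene over one group of cell lines."""
    return mean(scores[line] for line in lines)


def mean_shift(scores, mutant, wildtype):
    """Mutant-group mean minus wild-type-group mean (assumed sign convention)."""
    return group_mean(scores, mutant) - group_mean(scores, wildtype)


shifts = {
    gene: mean_shift(scores, mutant_lines, wildtype_lines)
    for gene, scores in essentiality.items()
}
for gene, shift in shifts.items():
    print(gene, shift)
```

In this toy data, GENE_X is far more essential in the mutant group than in the wild-type group, while GENE_Y shows no difference between groups.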
Synthetic lethality score per gene is defined as the change in mean essentiality from the mutant groups to the wild type groups.

Pan Cancer Synthetic Lethality Analysis

A set of 299 core oncogenes and tumor suppressor driver mutations was chosen for analysis37. To test the effect of these genes' mutations on gene essentiality, CanDI was leveraged to split them into two groups: a nonsense mutation group containing genes annotated as tumor suppressors (N=153) and a missense mutation group containing genes annotated as oncogenes with specific driver protein changes (N=53). CanDI was then used to collect a core set of genes with highly variable essentiality. To do this, the Bayes factors from the PICKLES database (Avana 2018Q4) were converted to binary numeric variables. Bayes factors over 5 were assigned a 1 (essential) and Bayes factors under 5 were assigned a 0 (non-essential). Genes were then sorted by their variance across cell lines, and genes between the 85th and 95th percentile were used for this analysis (N=2340). To determine a short list of genes to follow up on, chi-squared tests were applied to the 95940 gene pairs in the missense group and the 603720 gene pairs in the tumor suppressor group. Three new groups were formed for further analysis: the first consisted of the significant gene/mutation pairs from the oncogenic group, the second consisted of the significant gene/mutation pairs from the tumor suppressor group, and the third was a combination of the significant pairs from both groups with no discrimination on the type of mutations considered.
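The screening steps above (binarizing Bayes factors, keeping genes in a variance band, and testing each gene/mutation pair) can be sketched with the Python standard library alone. This is a hedged illustration rather than the code used for the paper: the function names and toy data are hypothetical, a Bayes factor of exactly 5 is assumed to count as non-essential, population variance and a rank-based percentile band are assumed, and the test is a plain Pearson chi-squared test on a 2x2 table without continuity correction, which the text does not specify.

```python
import math
from statistics import pvariance

BF_THRESHOLD = 5  # cutoff from the text; handling of exactly 5 is an assumption


def binarize(bayes_factors):
    """Map Bayes factors to 1 (essential, BF > 5) or 0 (non-essential)."""
    return [1 if bf > BF_THRESHOLD else 0 for bf in bayes_factors]


def variance_band(binary_by_gene, low_pct=85, high_pct=95):
    """Return genes whose variance rank lies between the two percentiles."""
    ranked = sorted(binary_by_gene, key=lambda g: pvariance(binary_by_gene[g]))
    n = len(ranked)
    return ranked[n * low_pct // 100 : n * high_pct // 100]


def chi2_2x2(essential, mutated):
    """Pearson chi-squared test of independence for two binary vectors."""
    n = len(essential)
    table = [[0, 0], [0, 0]]  # rows: mutation status, columns: essentiality
    for e, m in zip(essential, mutated):
        table[m][e] += 1
    rows = [sum(r) for r in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    if 0 in rows or 0 in cols:
        raise ValueError("chi-squared test is undefined for a constant variable")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data: one gene's Bayes factors across eight cell lines, four of them mutant.
bayes_factors = [12, 9, 8, 20, -3, 1, 0, 2]
mutated = [1, 1, 1, 1, 0, 0, 0, 0]
stat, p_value = chi2_2x2(binarize(bayes_factors), mutated)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")

# Toy variance band: gene_k is essential in k of 40 cell lines.
toy = {f"gene_{k}": [1] * k + [0] * (40 - k) for k in range(1, 21)}
selected = variance_band(toy)
print(selected)
```

In practice a library routine such as a contingency-table test from a statistics package would normally replace the hand-written `chi2_2x2`; it is written out here only to keep the sketch self-contained.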
These groups were further analyzed for differential essentiality via the Mann-Whitney method described above, and the Cohen's d effect size was calculated to measure the extent of the phenotype.

Differential Expression and Essentiality of Male and Female KRAS-driven Cancers

We used CanDI to gather all cell lines that are present in both PICKLES (Avana 2018Q4) and CCLE (Release 2019Q3). CanDI was then leveraged to put these cell lines into the following tissue groups: KRAS-mutant Colon/Colorectal, PDAC, and NSCLC. Each tissue group was then split into male and female sub-groups. Differential expression was analyzed by applying the methods described above to raw RNA-seq count data from CCLE (Release: 2019Q3). Genes with adjusted p-values less than 0.01 were considered significantly differentially expressed. Differential essentiality was analyzed using the methods described above on the previously described sex sub-groups for each tissue type. Genes with p-values less than 0.01 were considered significantly differentially essential between male and female cell models. For each tissue type, the distributions of the top 7 significantly differentially essential genes were highlighted in comparison with the bottom 3 as a negative control.

Differential expression of benign and malignant cancer cell lines

We downloaded human bronchial epithelial (HBE) RNA-seq data from Gillen et al. via the European Nucleotide Archive to use as a benign lung tissue model33. This four-sample data set contains gene expression data for primary HBE cells cultured from three different donors and for NHBE cells (Lonza CC-2541, a mixture of HBE and human tracheal epithelial cells).
We then used CanDI to put NSCLC models into three different groups: KRAS mutant, EGFR mutant, and all cell lines. For our benign model, raw counts were quantified via kallisto38. Raw counts for our malignant cell lines were queried via CanDI. DESeq2 was then applied to evaluate the differential expression between our normal lung tissue model and our three malignant lung tissue groups. The results from DESeq2 were then filtered by significance (adjusted p-value < 0.01). To filter for potential immunotherapy targets, we removed all genes not annotated as being localized to the plasma membrane and all genes with localization confidence scores lower than six. Genes that were obviously mis-annotated as surface proteins were also manually removed.

Supplementary Figure/Table Legends

Supplementary Figure 1. An object-oriented schema diagram showing the core structure of the CanDI software.
Supplementary Table 1. A table containing raw PICKLES Bayes factors displayed in the heat map of Fig. 1e.
Supplementary Table 2. A table containing mean PICKLES Bayes factors for each series displayed in Fig. 2a,b.
Supplementary Table 3. A table containing the data for all chi-squared tests performed to generate Fig. 2c,d.
Supplementary Table 4. A table containing the data for scatter plots shown in Fig. 2e,f,g.
Supplementary Table 5.
A table containing the data from the differential essentiality analysis for all three tissues in Fig. 3a-f. Supplementary Table 6. A table containing the data from the differential expression analysis for all three tissues in Fig. 3a,c,e. Supplementary Table 7. A table containing the differential expression analysis data merged with the location data for all three tissues shown in Fig. 4. Acknowledgements We thank everyone in the Gilbert lab for helpful comments and discussion. LAG is supported by K99/R00 CA204602 and DP2 CA239597 as well as the Goldberg-Benioff Endowed Professorship in Prostate Cancer Translational Biology. Conflicts of Interest None Bibliography 1. Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019). 2. Li, H. et al. The landscape of cancer cell line metabolism. Nat. Med. 25, 850–860 (2019). 3. Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564-576.e16 (2017). (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 4. Thul, P. J. et al. A subcellular map of the human proteome. Science 356, (2017). 5. Cancer Cell Line Encyclopedia Consortium & Genomics of Drug Sensitivity in Cancer Consortium. Pharmacogenomic agreement between two cancer cell line data sets. Nature 528, 84–87 (2015). 6. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012). 7. Bausch-Fluck, D. et al. The in silico human surfaceome. PNAS 115, E10988–E10997 (2018). 8. Giurgiu, M. et al. CORUM: the comprehensive resource of mammalian protein complexes- 2019. Nucleic Acids Res. 47, D559–D563 (2019). 9. Nusinow, D. P. et al. 
Quantitative Proteomics of the Cancer Cell Line Encyclopedia. Cell 180, 387-402.e16 (2020).
10. Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017).
11. Itzhak, D. N., Tyanova, S., Cox, J. & Borner, G. H. Global, quantitative and dynamic mapping of protein subcellular localization. Elife 5, (2016).
12. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017).
13. Behan, F. M. et al. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature 568, 511–516 (2019).
14. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096–1101 (2015).
15. Hart, T. et al. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell 163, 1515–1526 (2015).
16. Wang, T. et al. Gene Essentiality Profiling Reveals Gene Networks and Synthetic Lethal Interactions with Oncogenic Ras. Cell 168, 890-903.e15 (2017).
17. Chan, E. M. et al. WRN helicase is a synthetic lethal target in microsatellite unstable cancers. Nature 568, 551–556 (2019).
18. Adamson, B. et al. A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response. Cell 167, 1867-1882.e21 (2016).
19. Wainberg, M. et al. A genome-wide almanac of co-essential modules assigns function to uncharacterized genes. http://biorxiv.org/lookup/doi/10.1101/827071 (2019) doi:10.1101/827071.
20. Lenoir, W. F., Lim, T. L. & Hart, T.
PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res 46, D776–D780 (2018).
21. Bausch-Fluck, D. et al. A Mass Spectrometric-Derived Cell Surface Protein Atlas. PLoS One 10, (2015).
22. O’Connor, M. J. Targeting the DNA Damage Response in Cancer. Mol. Cell 60, 547–560 (2015).
23. Zimmermann, M. et al. CRISPR screens identify genomic ribonucleotides as a source of PARP-trapping lesions. Nature 559, 285–289 (2018).
24. Pan, X. et al. FANCM, BRCA1, and BLM cooperatively resolve the replication stress at the ALT telomeres. PNAS 114, E5940–E5949 (2017).
25. Lou, K., Gilbert, L. A. & Shokat, K. M. A Bounty of New Challenging Targets in Oncology for Chemical Discovery. Biochemistry 58, 3328–3330 (2019).
26. Narayan, G. et al. Promoter Hypermethylation of FANCF: Disruption of Fanconi Anemia-BRCA Pathway in Cervical Cancer. Cancer Res 64, 2994–2997 (2004).
27. Ideker, T., Dutkowski, J. & Hood, L. Boosting signal-to-noise in complex biology: prior knowledge is power. Cell 144, 860–863 (2011).
28. Chang, M. T. et al. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nat. Biotechnol. 34, 155–163 (2016).
29. Lou, K. et al. KRASG12C inhibition produces a driver-limited state revealing collateral dependencies. Sci Signal 12, (2019).
30. Cancer Disparities - National Cancer Institute. https://www.cancer.gov/about-cancer/understanding/disparities (2016).
31. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).
32. Rubin, J. B. et al. Sex differences in cancer mechanisms.
Biol Sex Differ 11, (2020).
33. Gillen, A. E. et al. Molecular characterization of gene regulatory networks in primary human tracheal and bronchial epithelial cells. J. Cyst. Fibros. 17, 444–453 (2018).
34. Mj, K. et al. Prognostic Significance of CD151 Overexpression in Non-Small Cell Lung Cancer. Lung cancer (Amsterdam, Netherlands) vol. 81 https://pubmed.ncbi.nlm.nih.gov/23570797/ (2013).
35. Ko, Y. H. et al. Prognostic significance of CD44s expression in resected non-small cell lung cancer. BMC Cancer 11, 340 (2011).
36. Penno, M. B. et al. Expression of CD44 in Human Lung Tumors. Cancer Res 54, 1381–1387 (1994).
37. Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173, 371-385.e18 (2018).
38. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34, 525–527 (2016).

[Figure, panels A-E: overview of CanDI (Cancer Data Integrator), which integrates cellular genomics, functional genomics, transcriptomics, and proteomics data such as essentiality and mutation; example cell lines grouped by tissue (breast, cervical, and ovarian cancer); and schematic of a genome-scale CRISPR screen in a HeLa cell line (lentiviral transduction of an sgRNA library, T0 sample, olaparib-treated versus untreated arms, and sgRNA abundance counted by deep sequencing to measure gene/drug phenotypes).]

[Figure, panels A-F: differential essentiality (Δ average BF) versus differential expression (log2(FC)) between male and female cell lines in colon, lung, and pancreas, with violin plots of Bayes factors for the labeled top-hit genes, negative controls, and the essential gene threshold.]

[Figure, panels A-F: log2(fold change) versus -log10(Q value), and log2(TPM + 1) expression of the labeled genes in benign bronchial versus malignant cell lines, for the KRAS mutant, EGFR mutant, and all lung cancer groups; points are colored by location confidence (6 to 10).]

[Figure, panels A-G: gene essentiality (average BF) in KRAS mutant versus KRAS wild-type cell lines and in EGFR mutant versus EGFR wild-type cell lines, with the essential gene threshold marked; essentiality/mutation effect sizes and P-values for nonsense, missense, and all mutations in oncogenes, tumor suppressor genes, and context-specific genes.]

[Supplementary figure, panel A: image not recoverable from the text extraction.]
10_1101-2021_01_08_426008 ---- AncestralClust: Clustering of Divergent Nucleotide Sequences by Ancestral Sequence Reconstruction using Phylogenetic Trees

AncestralClust: Clustering of Divergent Nucleotide Sequences by Ancestral Sequence Reconstruction using Phylogenetic Trees

Lenore Pipes 1,∗ and Rasmus Nielsen 1,2,3∗

1Department of Integrative Biology, University of California-Berkeley, Berkeley, 94707, USA, 2Department of Statistics, University of California-Berkeley, Berkeley, CA 94707, USA, and 3Globe Institute, University of Copenhagen, 1350 København K, Denmark

∗To whom correspondence should be addressed.

Abstract

Motivation: Clustering is a fundamental task in the analysis of nucleotide sequences. Despite the exponential increase in the size of sequence databases of homologous genes, few methods exist to cluster divergent sequences. Traditional clustering methods have mostly focused on optimizing high speed clustering of highly similar sequences. We develop a phylogenetic clustering method which infers ancestral sequences for a set of initial clusters and then uses a greedy algorithm to cluster sequences.

Results: We describe a clustering program AncestralClust, which is developed for clustering divergent sequences. We compare this method with other state-of-the-art clustering methods using datasets of homologous sequences from different species. We show that, in divergent datasets, AncestralClust has higher accuracy and more even cluster sizes than current popular methods.

Availability and implementation: AncestralClust is an Open Source program available at https://github.com/lpipes/ancestralclust

Contact: lpipes@berkeley.edu

Supplementary information: Supplementary figures and table are available online.
1 Introduction

Traditional clustering methods such as UCLUST (Edgar, 2010), CD-HIT (Fu et al., 2012), and DNACLUST (Ghodsi et al., 2011) use hierarchical or greedy algorithms that rely on user input of a sequence identity threshold. These methods were developed for high speed clustering of large numbers of highly similar sequences (Ghodsi et al., 2011; Li et al., 2001; Edgar, 2010). They are generally considered unreliable for identity thresholds <75%, either because of the poor quality of alignments at low identities (Zou et al., 2018) or because the performance of the threshold used to count short words drops dramatically at low identities (Huang et al., 2010). At low identities, these methods produce uneven clusters in which the majority of sequences are contained in only a few clusters (Chen et al., 2018), and the high variance in cluster sizes reduces the utility of the clustering step for many practical purposes.

Clustering of divergent sequences is a fundamental step in genomics analysis because it allows for an early divide-and-conquer strategy that will significantly increase the speed of downstream analyses (Zheng et al., 2018). It is also a frequent request of users of at least one clustering method (Huang et al., 2010). Currently, there are no clustering methods that can accurately cluster large taxonomically divergent metabarcoding reference databases, such as the Barcode of Life database (Ratnasingham and Hebert, 2007), into relatively even clusters. Only a few other methods, such as SpClust (Matar et al., 2019) and TreeCluster (Balaban et al., 2019), exist for clustering potentially divergent sequences. SpClust creates clusters using Laplacian Eigenmaps and a Gaussian Mixture Model applied to a similarity matrix calculated on all input sequences. While this approach is highly accurate, the calculation of an all-to-all similarity matrix is a computationally exhaustive step.
TreeCluster uses user-specified constraints for splitting a phylogenetic tree into clusters. However, TreeCluster requires an input tree and thus can also be prohibitively slow for large numbers of sequences where a phylogenetic tree is difficult to estimate reliably. With the increasing size of reference databases (Schoch et al., 2020), there is a need for new computationally efficient methods that can cluster divergent sequences. Here we present AncestralClust, which was specifically developed for clustering divergent metabarcoding reference sequences into clusters of relatively even size.

bioRxiv preprint, this version posted January 9, 2021; doi: https://doi.org/10.1101/2021.01.08.426008. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

2 Methods

To cluster divergent sequences, we developed AncestralClust, which is written in C (Figure 1). First, k random sequences are chosen and aligned pairwise using the wavefront algorithm (Marco-Sola et al., 2020). A Jukes-Cantor distance matrix is constructed from the alignments and a neighbor-joining phylogenetic tree is constructed from the matrix. The Jukes-Cantor model is chosen for computational speed; more complex models could in principle be used to potentially increase accuracy, at the cost of increased computational time. The C − 1 longest branches in the tree are then cut to yield C clusters. These subtrees comprise the initial starting clusters. The sequences in each starting cluster are aligned in a multiple sequence alignment using kalign3 (Lassmann, 2020).
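The Jukes-Cantor distance used for the distance matrix can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the AncestralClust C implementation: the function name `jukes_cantor_distance`, the choice to skip gaps and ambiguous bases, and the return of infinity for saturated pairs are all hypothetical choices made for the example. It applies the standard Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) p), where p is the proportion of mismatched sites in a pairwise alignment.

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance between two already-aligned sequences.

    Assumption for this sketch: sites containing a gap or an ambiguous
    base in either sequence are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in valid or b not in valid:
            continue  # gap or ambiguous base: not informative here
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable sites in the alignment")

    p = mismatches / compared  # observed proportion of differing sites
    if p >= 0.75:
        # The correction is undefined at saturation; infinity is an
        # assumption of this sketch, not a documented behavior.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # One mismatch in ten sites: p = 0.1
    print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"), 4))
```

Filling a k × k matrix with this function over all pairs of the k initial sequences would give the input to neighbor joining; the pairwise alignments themselves are assumed to come from a separate aligner.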
The ancestral sequence at the root of the tree of each cluster is estimated by taking the nucleotide with the maximum posterior probability at each site, using standard programming algorithms from phylogenetics (see e.g., Yang, 2014). The ancestral sequences are used as the representative sequence for each cluster. Next, the rest of the sequences are assigned to a cluster based on the shortest nucleotide distance from the wavefront alignment between the sequence and the C ancestral sequences. If the shortest distance to any of the C ancestral sequences is larger than the average distance between clusters, the sequence is saved for the next iteration. We iterate this process until all sequences are assigned to a cluster. In each iteration after the first, a branch in the phylogenetic tree is cut if it is longer than the average length of the branches cut in the first iteration. In practice, only one or two iterations are needed for most data sets if k is defined to be sufficiently large.

We compared AncestralClust to five other state-of-the-art clustering methods: UCLUST (Edgar, 2010), meshclust2 (James and Girgis, 2018), DNACLUST (Ghodsi et al., 2011), CD-HIT (Fu et al., 2012), and SpClust (Matar et al., 2019). We used a variety of measurements to assess the accuracy and evenness of the clustering. We calculated two traditional measures of accuracy, purity and normalized mutual information (NMI), used in Bonder et al. (2012). The purity of clusters is calculated as:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \tag{1}$$

where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_j\}$ is the set of taxonomic classes, and $N$ is the total number of sequences. NMI is calculated as:

$$\mathrm{NMI}(\Omega, C) = \frac{I(\Omega, C)}{[H(\Omega) + H(C)]/2} \tag{2}$$

where $I(\Omega, C)$ is the mutual information gain and $H$ is the entropy function.
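Equations 1 and 2 can be made concrete with a small sketch. The Python below is an illustration only, assuming that cluster assignments and taxonomic classes are given as two parallel lists of labels; the function names are hypothetical, natural logarithms are used (the NMI ratio does not depend on the base), and returning 0.0 when both entropies are zero is a convention assumed for this example rather than something stated in the text.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Eq. 1: fraction of sequences in the majority class of their cluster."""
    per_cluster = {}
    for k, c in zip(clusters, classes):
        per_cluster.setdefault(k, Counter())[c] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(clusters)


def entropy(labels):
    """Shannon entropy H of a labeling (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(clusters, classes):
    """Mutual information I between cluster labels and class labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    n_cluster = Counter(clusters)
    n_class = Counter(classes)
    return sum(
        (n_kj / n) * math.log(n * n_kj / (n_cluster[k] * n_class[j]))
        for (k, j), n_kj in joint.items()
    )


def nmi(clusters, classes):
    """Eq. 2: I(clusters, classes) / ([H(clusters) + H(classes)] / 2)."""
    denom = (entropy(clusters) + entropy(classes)) / 2.0
    if denom == 0.0:
        return 0.0  # assumed convention when both labelings are constant
    return mutual_information(clusters, classes) / denom


if __name__ == "__main__":
    clusters = [0, 0, 0, 1]
    classes = ["a", "a", "b", "b"]
    print(purity(clusters, classes), nmi(clusters, classes))
```

A clustering that reproduces the classes exactly gives purity 1 and NMI 1, while a clustering that is independent of the classes gives an NMI of 0.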
To measure the evenness of the clusters, we used the coefficient of variation, which is calculated as:

$$\mathrm{CV} = \frac{\sqrt{\sum_{i=1}^{j} (n_i - m)^2 / j}}{m} \tag{3}$$

where $n_i$ is the number of sequences in cluster $i$, $j$ is the total number of clusters, and $m$ is the mean size of the clusters. We also used a taxonomic incompatibility measure to assess the accuracy of the clusters. Let a,b be a pair of species found in cluster i. Incompatibility at a given taxonomic rank is calculated by first identifying the number of times a and b exist in clusters other than cluster i. The total incompatibility is calculated by summing over all pairs of sequences (a,b) and all i.

Both NMI and taxonomic incompatibility are very sensitive to the number of clusters and also to unevenness of cluster sizes. To allow fair comparison when the number of clusters and the evenness of cluster sizes vary, we therefore calculate the relative NMI and relative incompatibility. These measures are calculated by scaling them relative to their expected values under random assignments given the number of clusters and the cluster sizes. We estimated relative NMI by dividing the raw NMI score by the average NMI of 10 clusterings in which sequences have been assigned at random with equal probability to clusters, such that the cluster sizes are the same as the cluster sizes produced in the original clustering. The same procedure was used to convert the taxonomic incompatibility measure into relative incompatibility.

3 Results

To first assess performance of clustering methods on divergent nucleotide sequences, we used 100 random samples of 10,000 sequences from three metabarcode reference databases (16S, 18S, and Cytochrome Oxidase I (COI)) from the CALeDNA project (Meyer et al., 2019). We chose to compare our method on this dataset against UCLUST because it is the most widely used clustering program and it performs better than CD-HIT on low identity thresholds (Chen et al., 2018).
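The coefficient of variation in Equation 3 and the scaling used for the relative measures in the Methods can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names, the `score_fn` parameter (any metric that takes cluster labels and class labels, such as NMI or incompatibility), and the fixed random seed are hypothetical. Random assignment with the original cluster sizes is implemented here by shuffling the cluster labels, which is one way to keep the sizes fixed.

```python
import math
import random


def coefficient_of_variation(cluster_sizes):
    """Eq. 3: population standard deviation of cluster sizes over their mean."""
    j = len(cluster_sizes)
    m = sum(cluster_sizes) / j
    variance = sum((n_i - m) ** 2 for n_i in cluster_sizes) / j
    return math.sqrt(variance) / m


def relative_score(clusters, classes, score_fn, n_random=10, seed=0):
    """Raw score divided by its mean over random clusterings of equal sizes.

    Shuffling the cluster labels reassigns sequences at random while
    leaving every cluster size unchanged.
    """
    rng = random.Random(seed)
    raw = score_fn(clusters, classes)
    shuffled = list(clusters)
    total = 0.0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        total += score_fn(shuffled, classes)
    expected = total / n_random
    if expected == 0.0:
        return float("nan")  # ratio undefined; assumed convention here
    return raw / expected


if __name__ == "__main__":
    # Perfectly even clusters have CV 0; sizes 1 and 3 give CV 0.5.
    print(coefficient_of_variation([10, 10, 10]))
    print(coefficient_of_variation([1, 3]))
```

Under this scaling a value near 1 means the clustering scores no better than a random assignment with the same cluster sizes. For NMI, larger ratios indicate better agreement with the taxonomy, whereas for incompatibility smaller ratios are better.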
We first compared AncestralClust against UCLUST using relative NMI and Coefficient of Variation (Figure 2). We used k = 300 random initial sequences, which is 3% of the total number of sequences in each sample and C = 16 cuts in the initial phylogenetic tree. Notice that the relative NMI tends to be higher with a lower coefficient of variation for AncestralClust across all barcodes. This suggests that, for these divergent eDNA sequences, AncestralClust provides clusterings that are more even in size and that are more consistent with conventional taxonomic assignment.

As a second measure of accuracy, we measured relative incompatibility and coefficient of variation using AncestralClust and UCLUST for the same datasets under the same running conditions. Notice in Figure 3 that AncestralClust tends to create balanced clusters with lower relative taxonomic incompatibilities compared to UCLUST at all taxonomic levels. Similar results are seen for metabarcode 18S (Fig S1). However, for metabarcode 16S (Fig S2), AncestralClust performs noticeably better than UCLUST at the species, genus, and family levels, but at the order, class, and phylum levels it performs either the same or worse. Also, at the species, genus, and family levels, it is apparent that as the UCLUST clusters approach a lower coefficient of variation, the relative incompatibility increases dramatically.
Next, we analyzed two datasets with different properties: one dataset of diverse species from the same gene and another dataset of homologous genes from species of the same phyla. In the first dataset, we expect the sequences to cluster according to species. In the second dataset, we expect the sequences to cluster according to different genes. We compared AncestralClust to four commonly used clustering programs (UCLUST, meshclust2, CD-HIT2, and DNACLUST) and one clustering program designed for divergent sequences, SpClust. The first dataset contained 13,043 sequences from the COI CALeDNA database from 11 divergent species that were from 7 different phyla and 11 different classes, and the second dataset contained sequences from 6 different genes from taxonomically similar species.

First, we compared all methods using the 13,043 COI sequences from the 11 different species (Table 1). We expect these sequences to form 11 different clusters, each including all the sequences from one species. We chose identity thresholds to enforce the expected number of clusters for each method. We were unable to form 11 clusters using CD-HIT because the program does not allow clustering of sequences with identity thresholds < 80% at default parameters. For SpClust, we used the three precision modes available for the method. In this analysis, AncestralClust achieved a perfect clustering (the purity was 1 and the relative incompatibility was 0), although it was the second slowest method and had the second lowest memory requirements. UCLUST was one of the fastest methods and used the least amount of memory but had the second lowest purity and the third highest relative NMI value. meshclust2 had no incompatibilities and the second highest purity and relative NMI values but was the third slowest method. DNACLUST had the most uneven clusters, the second lowest relative NMI value, and the highest relative incompatibility.
SpClust only identified one cluster, with a computational time of ~2 days. In comparison, AncestralClust took ~5 minutes and UCLUST used < 1 second.

Next, we analyzed ’genomic set 1’ from Matar et al. (2019), which consists of 39 sequences from 6 homologous genes (FCER1G, S100A1, S100A6, S100A8, S100A12, and SH3BGRL3; Table 2). We expect these sequences to form 6 clusters. We varied the identity thresholds for UCLUST and meshclust2 using thresholds 0.4, 0.6, and 0.8. For CD-HIT, we used the lowest identity threshold available on default parameters, which is 0.8. We were unable to use DNACLUST for this analysis because it cannot handle sequences longer than 4500bp (the average sequence length was 2,387.9bp and the longest sequence was 5,363bp). Since this dataset contained 6 different genes, we calculated relative NMI using genes as the classes and did not use incompatibility as an accuracy measure. Only AncestralClust, UCLUST, and meshclust2 produced the expected number of clusters, and among the methods that created the expected number of clusters, AncestralClust had the highest purity value. AncestralClust was the second slowest method and had the highest memory requirements, which is due to the wavefront alignment algorithm, which is O(s^2) in memory requirements, where s is the alignment score. Since alignments were performed using 6 different genes that were longer than 1.5kb, this resulted in a high value of s. SpClust had the highest relative NMI in all precision modes and the same purity as AncestralClust in its moderate and maximum precision modes, but failed to produce the expected number of clusters.

4 Conclusions

We developed a phylogenetic-based clustering method, AncestralClust, specifically to cluster divergent metabarcode sequences. We performed a comparative study between AncestralClust and widely used clustering programs such as UCLUST, CD-HIT, DNACLUST, meshclust2, and for divergent sequences, SpClust.
UCLUST and DNACLUST are substantially faster than AncestralClust and should be the preferred method if computational speed is the main concern. However, AncestralClust tends to form clusters of more even size with lower taxonomic incompatibility and higher NMI than other methods, for the relatively divergent sequences analyzed here. We recommend the use of AncestralClust when sequences are divergent, especially if a relatively even clustering is also desirable, for example for various divide-and-conquer approaches where computational speed of downstream analyses increases faster than linearly with cluster size.

Acknowledgements

This work used the Extreme Science and Engineering Discovery Environment (XSEDE) Bridges system at the Pittsburgh Supercomputing Center through allocation BIO180028.

References

Balaban, M., Moshiri, N., Mai, U., Jia, X., and Mirarab, S. (2019). TreeCluster: Clustering biological sequences using phylogenetic trees. PloS one, 14(8), e0221068.
Bonder, M. J., Abeln, S., Zaura, E., and Brandt, B. W. (2012). Comparing clustering and pre-processing in taxonomy analysis. Bioinformatics, 28(22), 2891–2897.
Chen, Q., Wan, Y., Zhang, X., Lei, Y., Zobel, J., and Verspoor, K. (2018). Comparative analysis of sequence clustering methods for deduplication of biological databases. J. Data and Information Quality, 9(3).
Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19), 2460–2461.
Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), 3150–3152.
Ghodsi, M., Liu, B., and Pop, M. (2011). DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC bioinformatics, 12(1), 1–11.
Huang, Y., Niu, B., Gao, Y., Fu, L., and Li, W. (2010). CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 26(5), 680–682.
James, B. T. and Girgis, H. Z. (2018). MeShClust2: Application of alignment-free identity scores in clustering long DNA sequences. bioRxiv, page 451278.
Lassmann, T. (2020). Kalign 3: multiple sequence alignment of large datasets.
Li, W., Jaroszewski, L., and Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17(3), 282–283.
Marco-Sola, S., Moure López, J. C., Moreto Planas, M., and Espinosa Morales, A. (2020). Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics, (btaa777), 1–8.
Matar, J., Khoury, H. E., Charr, J.-C., Guyeux, C., and Chrétien, S. (2019). SpClust: Towards a fast and reliable clustering for potentially divergent biological sequences. Computers in biology and medicine, 114, 103439.
Meyer, R. S., Curd, E. E., Schweizer, T., Gold, Z., Ramos, D. R., Shirazi, S., Kandlikar, G., Kwan, W.-Y., Lin, M., Freise, A., et al. (2019). The California environmental DNA “CALeDNA” program. bioRxiv, page 503383.
Ratnasingham, S. and Hebert, P. D. (2007). BOLD: The Barcode of Life Data System (http://www.barcodinglife.org). Molecular ecology notes, 7(3), 355–364.
Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S., Khovanskaya, R., Leipe, D., Mcveigh, R., O’Neill, K., Robbertse, B., et al. (2020). NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database, 2020.
Yang, Z. (2014). Molecular evolution: a statistical approach. Oxford University Press.
Zheng, W., Mao, Q., Genco, R. J., Wactawski-Wende, J., Buck, M., Cai, Y., and Sun, Y. (2018). A parallel computational framework for ultra-large-scale sequence clustering analysis. Bioinformatics, 35(3), 380–388.
Zou, Q., Lin, G., Jiang, X., Liu, X., and Zeng, X. (2018). Sequence clustering in bioinformatics: an empirical study. Briefings in Bioinformatics, 21(1), 1–10.

Figure 1. Overview of AncestralClust. In (1), k random sequences are chosen for the initial clusters. In (2), a distance matrix is constructed using the k sequences. In (3), a neighbor-joining tree is constructed from the distance matrix and C − 1 cuts are made to create C clusters. In (4), the sequences in each cluster are aligned in a multiple sequence alignment and the ancestral sequence is reconstructed at the root node of each tree. The rest of the unassigned sequences are then aligned to the ancestral sequences of each cluster and the shortest distance to each ancestral sequence is calculated. The process is iterated until all sequences are assigned to a cluster.

Figure 2. Relative NMI against coefficient of variation for AncestralClust and UCLUST for 100 samples of 10,000 randomly chosen 16S, 18S, and COI reference sequences from the CALeDNA Project (Meyer et al., 2019). The similarity threshold for UCLUST was 0.58. For AncestralClust, we used 300 initial random sequences with 15 initial clusters. Relative NMI was calculated by dividing NMI by the average of 10 random samples of the same fixed cluster size.
Figure 3. Relative incompatibility against coefficient of variation for AncestralClust and UCLUST for 100 samples of 10,000 randomly chosen COI reference sequences. COI reference sequences are from the CALeDNA Project (Meyer et al., 2019). The similarity threshold for UCLUST was 0.58. For AncestralClust, we used 300 initial random sequences with 15 initial clusters.

Table 1. Comparisons of clustering methods using 13,043 COI sequences from 11 different species. The list of species can be found in Table S1. Incompatibility was calculated at the taxonomic rank of species. For UCLUST, meshclust2, and DNACLUST, the identity thresholds were chosen to force the expected number of clusters (11). For CD-HIT, the lowest possible identity threshold, 0.8, was chosen. In the case of SpClust, the coefficient of variation cannot be calculated for 1 cluster. SpClust clusters were created with version 2.

Method                  # of clusters  Time (sec)  Mem (MB)  Purity  Relative Incompat. (species)  Relative NMI  Coeff. of Var.
AncestralClust          11             293.2       19.3      1       0                             551.09        0.8574
UCLUST                  11             <1          9.9       0.8717  0.0182                        474.63        0.8300
meshclust2              11             108.14      46.5      0.9855  0                             498.898       0.1053
CD-HIT                  24             5.86        43.9      0.8561  0                             241.66        1.2031
DNACLUST                11             <1          170.6     0.9455  0.0545                        24.21         1.8987
SpClust (fast)          1              152046.5    2678.9    1       0                             1             -
SpClust (moderate)      1              188172.9    6457.6    1       0                             1             -
SpClust (maxPrecision)  1              189577.1    6452.5    1       0                             1             -
Table 2. Comparisons of clustering methods using 39 sequences from 6 homologous genes from Matar et al. (2019). ’id’ refers to the identity threshold used. We used identity thresholds of 0.4, 0.6, and 0.8 for UCLUST and meshclust2. We used precision levels of fast, moderate, and maximum for SpClust, using version 1 since version 2 only produced 1 cluster for all modes. DNACLUST has a maximum sequence length of 4500 bp and could not be used on this dataset.

Method                   # of clusters  Time (sec)  Memory (MB)  Purity  Relative NMI  Coefficient of Variation
AncestralClust           6              370.3       412.0        0.9487  1.8660        0.3982
UCLUST (id=0.4)          6              1           15.4         0.7436  1.5667        0.5396
UCLUST (id=0.6)          19             1           20.1         0.7179  1.4379        0.7166
UCLUST (id=0.8)          29             1.9         20.4         0.5641  1.1717        0.4565
meshclust2 (id=0.4)      6              1.1         7.7          0.8462  1.6625        1.2489
meshclust2 (id=0.6)      10             2.9         8.8          0.7949  1.9257        1.071
meshclust2 (id=0.8)      26             2.4         9.4          0.6410  1.2240        0.6325
SpClust (fast)           4              44.6        166.2        0.8718  2.2463        0.8432
SpClust (moderate)       4              112.5       166.1        0.9487  2.4335        0.6453
SpClust (max precision)  4              570.1       166.0        0.9487  2.9449        0.6809
CD-HIT (id=0.8)          31             0.48        39.9         0.4103  1.0950        0.4574
DNACLUST                 -              -           -            -       -             -
10_1101-332965 ---- 96865564 Identification and design of vinyl sulfone inhibitors against Cryptopain-1 – a cysteine protease from cryptosporidiosis-causing Cryptosporidium parvum

Arpita Banerjee

Author contributions: Designed the computational experiments: AB. Performed the computational experiments: AB. Analyzed the data: AB. Wrote the paper: AB.

Correspondence: arpita.005@gmail.com

bioRxiv preprint doi: https://doi.org/10.1101/332965; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract: Cryptosporidiosis, a disease marked by diarrhea in adults and stunted growth in children, is associated with the unicellular protozoan pathogen Cryptosporidium, often the species parvum. Cryptopain-1, a cysteine protease characterized in the genome of Cryptosporidium parvum, had earlier been shown to be inhibited by a vinyl sulfone compound called K11777 (or K-777). Cysteine proteases have long been established as valid drug targets, which can be covalently and selectively inhibited by vinyl sulfones. This computational study was initiated to identify purchasable vinyl sulfone compounds that could possibly inhibit cryptopain-1 with higher efficacy than K11777. Docking simulations screened a number of such possibly better inhibitors. The work was extended to probe the enzymatic pocket of cryptopain-1, through in-silico mutations, to derive a map of receptor-ligand interactions in the docked complexes.
The idea was to provide crucial clues to aid the design of inhibitors that would be able to bind the protease well by making favorable interactions with important residues of the enzyme. The analyses dictated placement of ligands towards the front of the enzymatic cleft, and disfavored interactions deep within. The S1’ and S2 subsites of the enzyme preferred to remain occupied by polar ligand subgroups. Reasonably distanced ring systems and polar backbones of ligands were desired across the cleft. Large as well as inflexible subgroups were not tolerated. Double-ringed systems such as substituted naphthalene, especially in S1, were exceptions though. The S2 subsite, which is typically a specificity determinant in papain (C1) family cysteine proteases such as cathepsin L-like cryptopain-1, can possibly accommodate polar and hydrophobic ligand subgroups alike.

Keywords: Vinyl sulfone inhibitors, Cryptopain-1, Cysteine protease, Molecular modeling, Covalent docking, In-silico mutational analysis, Drug design.

Running title: Identification and design of vinyl sulfone inhibitors against cryptopain-1

INTRODUCTION:

Cryptosporidiosis is an intestinal disease that is clinically manifested by diarrhea in adults [1] and stunted growth in children [2]. The infection can persist indefinitely in immunocompromised individuals such as HIV patients, and could be fatal in the form of life-threatening diarrhea [3].
The disease is caused by the unicellular protozoan parasite Cryptosporidium, which infects humans and animals [4] through consumption of contaminated water and/or ingestion of contaminated food products [5]. The majority of infections are caused by the Cryptosporidium species hominis and parvum [6] [7]. A cysteine protease named Cryptopain-1, characterized in the genome of Cryptosporidium parvum [8], most likely facilitates host cell invasion and nutritional uptake (through proteolytic degradation) [9] [10] [11]. The pathogenic enzyme, being cathepsin L-like, belongs to the papain-like or clan CA (family C1) cysteine proteases, which in general have been of particular use as therapeutic targets against parasitic infections [12]. The catalytic triad of such enzymes is constituted by Cys, His and Asn residues [12], [13]. Proteases orthologous to Cryptopain-1 have been validated as drug targets, viz. cruzain (from the Chagas’ disease agent Trypanosoma cruzi), rhodesain (from the sleeping sickness-causing Trypanosoma brucei), falcipain-3 (from the malarial parasite Plasmodium falciparum), SmCB1 (from the intestinal schistosomiasis-causing Schistosoma mansoni) [14] [15], etc. Vinyl sulfone compounds have been particularly effective inhibitors of such parasitic cysteine proteases [13] [14] [15] [16] [17]. These inhibitors form a covalent bond with the active site Cys thiol to bind the proteases, thereby irreversibly blocking the enzymatic pocket. Such inhibition interferes with the pathogenic activity of the proteases, which would otherwise participate in a general acid-base reaction for hydrolysis of host-protein peptide bonds [13]. Molecular modeling studies had previously shown that, unlike in serine proteases (which also cleave peptide bonds and have Ser in their active site), the catalytic His in cysteine proteases remains protonated to act as a general acid [18].
Hydrogen bonding between the protonated His and the sulfone oxygen of a vinyl sulfone compound polarizes the vinyl group of the ligand, imparting a positive charge on its beta carbon, which eventually promotes nucleophilic attack by the negatively charged Cys thiolate of the protease’s active site. The vinyl sulfone class of inhibitors is preferred over other covalent inhibitors because of its selectivity for cysteine proteases over serine proteases, relative inertness in the absence of the target protease [18] [19], and safe pharmacokinetic profile [20] [21]. The peptidyl vinyl sulfones that have been co-crystallized with cysteine proteases so far reveal that the –CO-NH- backbones of the pharmacologically active compounds fit snugly in the enzymatic cleft, with the ligand sidechains (or subgroups) protruding into the different subsites of the proteases. The subgroup near the vinyl carbon that undergoes nucleophilic attack is equivalent to P1 in the inhibitor/substrate [13]. Therefore, ligand sidegroups starting from the vinyl side are designated as P1, P2… and interact with the S1, S2… protease subsites. The ligand subgroups beyond the sulfonyl are referred to as P1’, P2’… and they occupy the S1’, S2’… subsites on the prime side of the enzyme (Figure 1). Typically, the P2-S2 interaction is the key specificity determinant in papain (C1) family cysteine proteases [12] [13], such as cryptopain-1.
K11777 (or K-777), a vinyl sulfone that binds cryptopain-1 as its target as per inhibitor competition experiments with an active site probe of the recombinant protease, has been demonstrated to arrest Cryptosporidium parvum growth in human cell lines at physiologically achievable concentrations [21]. The cryptopain-1 structure, however, by itself or in complex with K11777, has not been solved to date. K11777-bound co-crystals of other orthologous cysteine proteases such as cruzain, rhodesain and SmCB1 [14] [15] showed the orientation of the inhibitor in the cysteine proteases as depicted in Figure 1. The earlier mentioned study on cryptopain-1 had simulated the binding of K11777 within the active site of the enzyme homology model [21], and, mimicking nature, the inhibitor was put in an orientation as illustrated in Figure 1. The present computational study was initiated to explore other (purchasable) vinyl sulfones that could better bind the active site of the cryptopain-1 enzyme, with possibly higher efficacy than K11777. The study was extended to probe the enzymatic pocket of cryptopain-1 to determine the preferential binding of certain ligand chemical groups at the subsites, for the purpose of providing clues for drug design against the pathogenic cysteine protease.

MATERIALS AND METHODS:

Homology model building of enzyme

The sequence of cryptopain-1, with the accession number ABA40395.1, belonging to Cryptosporidium parvum was retrieved from GenBank [22]. The protein sequence was downloaded in FASTA format. The homology model template search for cryptopain-1 (cathepsin L-like) through NCBI BLAST against the PDB database [23] led to 3F75, which is the activated Toxoplasma gondii cathepsin L (TgCPL) in complex with its propeptide. The template shared 48% sequence identity with the sequence to be modeled.
The homology model of cryptopain-1 was built within the full refinement module of ICM [24]. The structure-guided sequence alignment between the template and the model was generated using the default matrix with a gap opening penalty of 2.40 and a gap extension penalty of 0.15. Loops were sampled for the alignment gaps where the template did not have coordinates for the model. The loop refinement parameters were used according to default settings. The acceptance ratio for the simulation process was 1.25. The generated homology model, 231 amino acids in length, was then validated on the PROCHECK [25] and PROSA [26] webservers.

Ligand structures from chemical compound database

K11777 (or K-777) was downloaded from PubChem [27] in SDF format. The vinyl sulfone substructure of K11777 was then searched in PubChem, with the additional option of ‘Ring systems not embedded’ so as to filter out those structures where the vinyl bonds would extend into ring systems. The search, which was not restricted to peptidyl vinyl sulfones, led to 10,663 hits (as of April 5, 2016). The 2115 compounds that were purchasable amongst the hits were downloaded in SDF format. The downloaded compounds were checked for redundancy. From the 1890 non-redundant vinyl sulfone compounds, 774 cyanide compounds were discarded due to the usual high toxicity profile of such compounds, and the remaining 1116 were saved to be used as ligands for docking into cryptopain-1.
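The compound-selection funnel described above (hits, then purchasable compounds, then non-redundant compounds, then removal of cyanide-containing compounds) can be sketched as a simple filter chain. This is only an illustration under stated assumptions: the `Compound` record, its fields, and the use of a shared identifier as the redundancy criterion are hypothetical, since the text does not say how purchasability, redundancy, or cyanide content were determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    cid: str           # hypothetical identifier used here as the redundancy key
    purchasable: bool  # hypothetical flag; the source does not say how it was obtained
    has_cyano: bool    # hypothetical flag marking cyanide-containing compounds


def screen(hits):
    """Apply the three filters in the order given in the text and report counts."""
    purchasable = [c for c in hits if c.purchasable]
    # Assumption: two records are redundant when they share the same identifier.
    non_redundant = list({c.cid: c for c in purchasable}.values())
    kept = [c for c in non_redundant if not c.has_cyano]
    counts = {
        "hits": len(hits),
        "purchasable": len(purchasable),
        "non_redundant": len(non_redundant),
        "cyanide_removed": len(non_redundant) - len(kept),
        "kept": len(kept),
    }
    return counts, kept


# Consistency check on the counts reported in the text:
# 1890 non-redundant compounds minus 774 cyanide compounds leaves 1116 ligands.
assert 1890 - 774 == 1116
```

The final assertion only confirms that the reported counts are internally consistent; it does not reproduce the PubChem search itself.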
Docking simulation of covalent inhibition of enzyme

The N-terminal propeptide (which is not part of the active enzyme and acts as a self-inhibitory peptide for regulatory purposes) of the cryptopain-1 homology model was deleted. The residues were then renumbered in the enzyme model, with position 1 allocated to the beginning of the mature protease. The PDB file of the edited cryptopain-1 model was then prepared as a receptor in ICM with the addition of protons and optimization of His, Pro, Asn, Gln and Cys residues. The protonation step was crucial for mimicking the reaction (and hence bonds) between a vinyl sulfone and the cysteine protease. The active site residues of the binding pocket had been derived from the structural alignment of the cryptopain-1 homology model with the orthologous cruzain bound to K11777 (PDB ID: 2OZ2), followed by mapping of the residues around K11777 in cruzain onto the cryptopain-1 sequence. The pre-determined pocket residues (except the catalytic Cys24, or C24) were selected on the prepared cryptopain-1 in the GUI of ICM, and a box of the relevant size was created on the receptor to define the area for ligand docking. Further, C24 was selected to specify the covalent docking site. From the set of preloaded reactions in ICM, the alpha,beta-unsaturated sulfone/sulfonamide/cysteine reaction was selected, which specified the simulation of covalent bond formation between the presumed thiolate (C24 of the protease) and the beta carbon atom (of the vinyl group of the ligand). The receptor maps were finally made for grid generation. K11777, downloaded from PubChem in SDF format, was read in as a chemical table in the GUI of ICM, and was specified for docking into the prepared cryptopain-1 receptor. A thoroughness of 3.00 was set in the docking protocol, and twenty conformations of the ligand in the receptor were generated.
Following K11777, covalent docking into the cryptopain-1 homology model was attempted for a total of 1116 non-cyanide vinyl sulfone compounds, using the same protocol as described above.

In-silico mutation of enzyme residues for assessing binding

Mutational analysis was undertaken to evaluate the contribution of the individual residues to the binding of the ligands. The protein-ligand stability was measured by in-silico mutation of the contact residues in the complexes. K11777-docked cryptopain-1 and the best-scored complexes (with a score of -29 or lower) were read in separately, and then, for each of them, the ligand-subgroup contacting residues were selected one at a time in the workspace panel and mutated to alanine. The outputs of the calculations were displayed in several columns. The dGwt column held the dG (Gibbs free energy) value for the wild type complex (without mutation), the dGmut column held the dG value for the mutated complex (where the residue was mutated to Ala), and the ddGbind (dGmut – dGwt) column showed the binding free energy change (in kcal/mol) upon mutation. The ddGbind value essentially predicted the stability of the native complex, thereby hinting at the contribution of the residue in question towards binding the ligand. Positive values of ddGbind implied the mutation to be less favorable, indicating greater contribution of the wild type residue towards binding. Hence, with more positive ddGbind, better binding of the ligand by the residue could be expected.
Negative values, on the other hand, implied the mutated form to be more stable, thereby delineating the native residue’s involvement in unfavorable interactions with the ligand. The residues that were detected to make a high number of favorable ligand interactions in thirty-two of the complexes (K11777-cryptopain-1 plus the thirty-one best-scored ones) were subjected to a fresh round of mutations in the updated version of the ICM software. The recalculated ddGbind values were then tallied with the placement and orientation of ligand subgroups around the residues to decipher the preference of chemical groups across the enzymatic cleft of cryptopain-1. [The GUI of ICM was used to make the enzyme/complex structure figures. Illustration and compilation of figures were done in Inkscape, which is an open-source vector graphics editor.]

RESULTS AND DISCUSSION:

Validation of theoretical enzyme structure

The Ramachandran plot for the cryptopain-1 homology model showed 98% of the residues to lie in the allowed region, and the remaining 2% to be within the generously allowed region of the plot (Supplementary Figure 1A). The PROSA Z-score for the cryptopain-1 model was -7.79, better than the -6.66 Z-score of its crystal structure template (Supplementary Figure 1B).

Screening of docked compounds

Besides K11777, a total of 1116 purchasable, non-redundant and non-cyanide vinyl sulfone compounds were docked and scored in the cryptopain-1 homology model (50 symmetric molecules could not be docked using ICM). After docking, the K11777 conformation chosen as the reference for the analysis was the lowest-scoring one in which the ligand P1’ group (beyond the sulfonyl) was oriented across the enzyme S1’ subsite and its P1..P3 groups (beyond the vinyl) were placed across the S1..S3 subsites (as in Figure 1). Such an orientation appeared first in the eighteenth pose (conformation) of K11777 docked into cryptopain-1, with a score of -19.15.
The conformations of other docked vinyl sulfone compounds that had a similar orientation (described above), with the ligand subgroups beyond the sulfonyl placed across S1’ or beyond, and that had lowest scores <= -29.0 (and hence were possibly better binders than K11777), were included in the study for further detailed analysis. [The chemical structures of K11777 and the thirty-one best-scored vinyl sulfones are provided in Supplementary Figure 2, as PubChem IDs associated with (some) chemical compounds change due to frequent updates to the database. The IDs mentioned throughout the text, tables and figures are from the current PubChem records as of May 26, 2018.]

Ligand binding to preferential enzyme residues

The residues within 4 Å of the ligand subgroups were noted for each complex. K11777-docked cryptopain-1 was taken as a reference, as K11777 had been shown experimentally (on the bench) to bind cryptopain-1. The protease subsite residues were thus primarily derived from this complex. Figure 2 shows the chosen conformation of K11777 docked into cryptopain-1 with the derived subsites colored differently. For the other best-scored complexes, the additional contact residues that showed up were assigned subsites according to their vicinity/placement relative to the already derived subsite residues in the three-dimensional structure of cryptopain-1. Figure 3 shows all the residues that were contacted by ligand subgroups across the enzymatic cleft, in one or more of the complexes.
Panels A, B, C and D of Figure 4 show the selected conformations of the other vinyl sulfones in cryptopain-1, amidst the subsites derived from the reference complex. The ligand subgroup-contacting residues in each complex had been mutated to alanine, one at a time, to identify the favorable interactions based on the ddGbind values. The interactions that showed ddGbind values worse than -1 (less than -1) were not taken into account. The residues that corresponded to the rest of the ddGbind values (greater than -1) were considered to be contributing to favorable interactions with the ligand. Supplementary Table 1 lists the ddGbind interactions in terms of residue versus ligand (represented by PubChem IDs). The columns have all the residues that had been favorably contacted in one or more of the complexes, and the rows hold the compounds whose subgroups had shown favorable interactions with the corresponding column residues. Table 1 lists the scores, contact residues, H-bonding residues and the favorably interacting subsite residues (derived from Supplementary Table 1) in the complexes. The tables also feature the additional subsite residues that showed up in the other best-scored complexes, which included ligands that, unlike K11777, were not typical peptidyl vinyl sulfones. Thirteen of the favorably interacting cryptopain-1 residues emerged as heavily contacted by ligand subgroups in the complexes (see Supplementary Table 1). The number of times each of the residues was shown to make favorable interactions ranged from 1 to 26. With a threshold of 16, Q18, K19, G22, C24, W25, G68, T69, A138, V162, N163, H164, G165, and W188 turned out to be the most frequently contacted of the favorably interacting residues. The derived residues were then subjected to ddGbind recalculations (barring A138). The results from the calculations were studied with respect to the orientation and positioning of the ligand subgroups near the mentioned residues in the complexes.
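The tallying step described above can be sketched as follows, as a minimal illustration and not the ICM procedure itself. It uses the definition ddGbind = dGmut – dGwt, counts an interaction as favorable when ddGbind is greater than -1, and keeps residues whose favorable count reaches a threshold. Two points are assumptions: the record layout (ligand, residue, ddGbind) is hypothetical, and "a threshold of 16" is read here as "at least 16", which the text does not state explicitly.

```python
from collections import Counter

FAVORABLE_CUTOFF = -1.0   # kcal/mol; ddGbind values above this count as favorable
FREQUENCY_THRESHOLD = 16  # assumption: "threshold of 16" read as "at least 16"


def ddg_bind(dg_mut, dg_wt):
    """Binding free energy change upon mutation to Ala: dGmut - dGwt."""
    return dg_mut - dg_wt


def frequently_contacted(records, cutoff=FAVORABLE_CUTOFF, min_count=FREQUENCY_THRESHOLD):
    """Return residues with at least min_count favorable interactions.

    records: iterable of (ligand_id, residue, ddg) tuples (hypothetical layout).
    """
    counts = Counter(residue for _, residue, ddg in records if ddg > cutoff)
    return sorted(res for res, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the study.
    toy = [
        ("ligand_A", "W188", ddg_bind(-5.0, -7.5)),
        ("ligand_B", "W188", 0.4),
        ("ligand_A", "K19", -3.0),
        ("ligand_B", "K19", -0.5),
    ]
    print(frequently_contacted(toy, min_count=2))
```

A positive `ddg_bind` value means the alanine mutant binds less favorably than the wild type, which is why such residues are read as contributing to ligand binding.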
The ddGbind values for the interaction of the frequently contacted residues with the ligands are listed in Table 2. The purpose was to deduce the contributing factors for binding and to shed light on the enzymatic pocket’s preference for accommodating certain ligand groups, which could ultimately be useful for designing a potent vinyl sulfone inhibitor (better than K11777) to target cryptopain-1.

Interactions: enzyme subsite residues - ligand subgroups

Unlike K11777, which occupied the central part of the pocket and was spread equally amongst all the subsites (Figure 2), the best-scored vinyl sulfones more often occupied the upper part of the cleft and tended to position themselves on the right, making contacts mostly with S1’ and S1. Ligands that lacked P1’, P2’ etc. were sometimes exceptions and were placed at the lower end of the cleft, heavily contacting S2. The positioning of the ligand-contacting residues in the three-dimensional structure of the enzyme can be seen in Figure 3, and the placement of the other vinyl sulfone ligands therein is visible in Figure 4. The accommodation of various ligand subgroups of the best-scored vinyl sulfones across the enzymatic cleft is described as follows.

S2’ enzyme subsite

The S2’ subsite residues F148 and W192, in the uppermost part of the pocket, were not amongst the frequently contacted residues, and hence they were excluded from detailed analysis.

S1’ enzyme subsite

The derived S1’ residues N163, H164 and W188 were frequently contacted by the other vinyl sulfones, along with an additional G165 (placed between N163 and H164).
Q18 and K19 also featured as additional contacts which, though positioned on the opposite side in the structure, made interactions with P1’ of the ligands. Thus the residues were categorized as part of S1’. The upper part of the heavily occupied enzymatic pocket region is constituted by S1’ residues: W188 on the right, and Q18 and K19 on the left. W188, which made most of the hydrophobic interactions with the ligand ring systems on the right side of the pocket, showed highly positive ddGbind values for the thiophene group in particular. The residue seemed to prefer pi stacking with ligand ring systems, as it showed favorable ddGbind values for in-plane ring interactions. The ligands with an ethenyl group, as well as the ones that did not place any subgroups near the residue, showed moderately favorable interactions. The ligands whose rings were out of plane with the residue’s six-membered ring, and the ones that had groups like bromopyridine near the residue, showed unfavorable interactions. For Q18, which is situated at the back of the cleft wall, the compounds’ covalent moiety with their sulfonyl group and/or benzyl/phenyl ring(s), when placed near the lower end of the residue, resulted in favorable interactions. Large halide-containing subgroups such as bromopyridine resulted in unfavorable interactions. K19, positioned at the front of the cleft, showed favorable interactions with reasonably distanced polar substituents. Interactions were favorable even when no substituent was close to the residue. Understandably, unfavorable interactions were observed when the non-polar moiety of the residue’s sidechain was near polar ligand atoms, and interactions of the non-polar ethenyl group of the ligand with the polar end of the residue also led to highly negative ddGbind values.
The mid-region of the highly occupied cleft is constituted by N163, H164 and G165 (S1’ residues) on the right. These frequently contacted residues were actually within the contact range of both P1’ and P2 of K11777. However, the proximity of the ligand’s P1’ to the sidechains of N163 and H164 in the reference complex led to the residues’ allocation to S1’, which therefore extends into the middle of the cleft. N163 showed favorable interactions with halide-containing substituents, including bromopyridine, which otherwise had unfavorable interactions with the other residues. The ligands that had their benzyl/phenyl rings at a comfortable distance from the residue showed favorable interactions. Closely spaced ligand ring systems led to clashes. H164, which is situated at the back (compared to N163) of the enzyme’s mid-pocket, preferred interactions with the ligands’ sulfonyl or backbone. The residue often, though not always, showed favorable interactions even when no ligand group was placed near it. Favorable ring interactions were observed when the ligands’ ring systems were mostly tilted towards W188. Unfavorable ddGbind values were observed for inflexible ethenyl groups in ligands. G165, which is buried in the mid-pocket, made interactions primarily with the covalent-bond-forming moieties of the ligands. The residue showed favorable interactions with reasonably distanced ring systems. Interactions were unfavorable for closely spaced rings and inflexible groups such as ethenyl.
Overall, the arrangement of the mentioned residues suggests that substituted benzene/naphthalene ring systems could be accommodated in the upper region of the subsite, where the ligand rings can engage in hydrophobic interactions with W188, and the polar substituents on those rings could interact with Q18 and K19 to the left of the pocket. However, large (polar) halide-substituted rings such as bromopyridine could lead to clashes. The S1’ in the mid-pocket shows a preference for reasonably distanced ring systems and halide-substituted ligand subgroups. The subsite is not likely to tolerate inflexible groups such as diazospiro, ethenyls etc.

S1 enzyme subsite

The frequently contacted (derived) S1 residues G22 and C24 were positioned on the left side of the mid-pocket. W25, which emerged as an additional frequent contact, was placed close to G22 and C24 on the left, and formed part of S1. G22 was observed to favor interactions with double ring systems such as substituted naphthalene or two separate benzyl/phenyl rings placed near the residue. It also showed favorable interactions with groups like sulfonyl and/or polar backbone atoms. Ring as well as polar interactions showed the most favorable ddGbind values. The interactions became unfavorable when no ligand group was in the vicinity of the residue. Bromopyridine showed unfavorable interactions with this residue too. C24, the catalytic triad residue that formed the covalent bond with the vinyl sulfones, preferred the ligands to be placed away from it and towards the front of the cleft. The favorably interacting compounds were positioned to the right and at the bottom of the residue. The compounds that were tilted towards the inside of the cleft showed moderately unfavorable interactions, and so did the ones that did not place any ring
system near the residue. Unfavorable interactions for the residue were observed when the ligands’ polar substituents or backbone were in close proximity. Again, bromopyridine interacted unfavorably with this residue. Unlike other residues, C24 had far fewer borderline interactions, and the individual ddGbind values were mostly either clearly favorable or clearly unfavorable.
W25 made favorable interactions with the ring systems of the ligands that were placed away from it and towards the right side of the pocket. The interactions improved with the number of rings: the highest ddGbind value was obtained for the compound that had four ring systems. However, close interactions with either the ligand backbone or side chain were unfavorable. Inflexible groups such as diazospiro, even if placed away from the residue, gave negative ddGbind values.
Taken together, inflexible groups such as diazospiro and ethenyl would not be tolerated by S1. The subsite can accommodate multiple ring systems. The mid-pocket would prefer the polar backbone of ligands that are positioned towards the front. The catalytic C24 of S1 likewise requires that compounds not be placed too deep inside the cleft. Large halide-containing subgroups such as bromopyridine will not be favored in the subsite. The site shows a propensity towards closely packed ring interactions.
S2 enzyme subsite
The lowest part of the heavily occupied pocket comprises the frequently contacted (derived) S2 subsite residues: G68, T69, A138 and V162. The S2 residues are distributed on both sides of the cleft: G68 and T69 are on the left, and A138 and V162 are on the right.
G68, placed above T69, engaged mostly in H-bond interactions with the backbone of the ligands rather than favorably accommodating their side chains. The residue showed favorable ddGbind values for ligand ring systems that were spaced slightly away from it. The most unfavorable interactions were observed for the compound containing bromopyridine.
For T69, the highest positive ddGbind value was observed for a halide-substituted ligand subgroup (fluoro-triazinyl group) with its polar ring and polar backbone near the residue. T69 preferred reasonably distanced ring interactions (polar and non-polar). However, with no ligand group placed near the residue, the interactions were unfavorable. Interactions with large subgroups like bromopyridine were again unfavorable.
A138 had to be excluded from the mutational analysis because the ddGbind value for an Ala-to-Ala mutation is zero and could not have provided any useful clue about the type of interactions.
V162, despite being mostly hydrophobic, showed favorable interactions with comfortably distanced polar subgroups of ligands, including the fluoro-triazinyl group-containing compound that showed the best ddGbind value. Such polar groups were presumably stabilized by the long-range electrostatic effects of other S2 residues (see tables).
Summing up, S2 can certainly accommodate polar subgroups/backbone of ligands. However, like the other subsites, it does not readily accommodate large polar subgroups like bromopyridine.
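The exclusion of A138 follows from how an alanine-scanning ddGbind value is usually defined. The relation below is a hedged sketch that assumes the common mutant-minus-wild-type convention; the exact energy function of the docking software is not restated here, and the symbols are illustrative.

```latex
\Delta\Delta G_{\mathrm{bind}}(X \to \mathrm{Ala})
  = \Delta G_{\mathrm{bind}}^{\mathrm{Ala\ mutant}}
  - \Delta G_{\mathrm{bind}}^{\mathrm{wild\ type}},
\qquad
\Delta\Delta G_{\mathrm{bind}}(\mathrm{A138} \to \mathrm{Ala}) = 0 .
```

Under this assumed convention a positive value means that replacing the residue with alanine weakens binding, which matches the way positive ddGbind values are read as favorable interactions in this section.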
Orientation and placement of ligands across the enzymatic cleft
The best-scored vinyl sulfones tended to occupy the S2’, S1’, S1 and S2 subsites. Unlike K11777, the other compounds showed optimal interactions mostly with the prime-site residues of the enzyme. The S1’ residues accounted for half of the frequently contacted favorable interactions with the ligands; the other half was accounted for by S1 and S2 members. With respect to the entire enzymatic cleft of cryptopain-1, it can be deduced that placement of the ligands towards the front of the cleft would be preferred to deep-seated interactions. Polar backbones of ligands (even if not peptidyl) would be desirable. S1’ and S2 tend to be occupied and are prone to make favorable interactions with polar subgroups of ligands. Large halide-containing subgroups are not well tolerated, presumably because of their size. Reasonably distanced ring interactions would be preferred all across the cleft. Unlike inflexible groups such as substituted naphthalene, which could be favorably accommodated in S1, the strain arising from the inflexibility of ethenyl and/or diazospiro groups is not likely to be tolerated, especially in the S1’ and S1 subsites, as per the computational mutational analysis.
Notably, compound 23520342, which showed the maximum number of favorable interactions with the frequently contacted residues (see Table 2), had all the preferred attributes and lacked the undesirable ones. The ligand-bound protease showed a very good score of -35.43. Some other compounds showed slightly better scores than 23520342: 11303991 (score: -36.41), 5279261 (score: -37.01), and 5279269 (score: -38.58). 11303991 and 5279261 were placed deep inside the cleft, which led to clashes with the covalent-bond-forming C24.
The ligands’ polar backbones, together with the occupation of the enzymatic S1’ site by polar subgroups, somewhat mitigated the unfavorable interactions overall. The compounds also had the undesirable ethenyl near S1’, which contributed to unfavorable interactions with K19 in the case of 5279261 (where the ethenyl was placed much closer to the residue). However, the overall scoring algorithm did not penalize the presence of ethenyl as much as the individual ddGbind calculations did. 5279269, which showed the best score, also had an ethenyl group (albeit not close to K19). This compound, however, was placed towards the front of the cleft, thereby avoiding unfavorable interactions with C24. The ligand also had an abundance of ring systems (six) for favorable interactions; rings made up its (polar) backbone as well as its subgroups. The ligand desirably occupied the S1’ and S2 subsites, though not with many polar subgroups.
CONCLUSION:
The efficacy of the thirty-one best-scored compounds as drug candidates within physiological limits remains to be tested at the bench. The information gathered through this study on the substrate/ligand-binding cleft of the enzyme and its interaction with the chemical groups of the docked compounds could ultimately guide the design of potent vinyl sulfone inhibitors. 23520342 and 5279269, which shared most of the preferred ligand-subgroup attributes, can serve as model compounds on which the design of effective inhibitors against cryptopain-1 could be based.
Figure 5 provides the chemical structures of the reference (K11777) and the model compounds. Unlike the other two compounds mentioned (11303991 and 5279261), the subgroups of the model ligands extended into S2, typically the key specificity determinant in cathepsin L-like cysteine proteases such as cryptopain-1. 23520342 placed a polar subgroup at S2, in contrast to the hydrophobic subgroup placed there by 5279269. Polar ligand subgroups (as in 23520342) at the enzyme’s S2 are likely to be stabilized via polar/electrostatic interactions with residues like T69, M70, T160, K161 and E215. Hydrophobic subgroups (as in 5279269) could also be accommodated by virtue of S2 residues like A138 and V162. Thus, the study attempted to identify purchasable vinyl sulfone compounds that could inhibit cryptopain-1, and it provided crucial information on receptor-ligand interactions to help the future design of other vinyl sulfones that could prove effective in curbing cryptosporidiosis.
Acknowledgement: The author would like to thank Prof. Ruben Abagyan of the University of California San Diego for providing computational resources.
Columns: Ligands | Score | Contact residues | H-bond residues | Fav S2’ residues | Fav S1’ residues | Fav S1 residues | Fav S2 residues | Fav S3 residues
9851116 (K11777) | -19.15 | A141, D142 N163, H164, W188 G22, C24, C65, D66, G67 G68, T69, A138, V162, E215 F63, L72 G68, W188 A141, D142, H164 C65, G67 G68, T69, A138, V162, E215 F63, L72
10025975 | -31.5 | Q18, K19, N20, C21, G22, C24, W25, D66, G67, G68, T69, N163, H164, G165, W188, K19, W188 Q18, K19, N20, C21, N163, G165, W188 G22, C24, W25, D66, G67 G68, T69
11303991 | -36.41 | Q18, K19, C24, W25, G68, T69, A138, V162 N163, H164, G165, W188 W25 Q18, K19, H164, G165, W188 W25 T69, A138, V162
1475343 | -30.57 | N17, Q18, K19, C24, W25, G68, T69, M70, A138, V162, N163, H164, G165, W188, E215 H164, W188 N17, Q18, K19, N163, G165 W25 T69, M70, A138
4508639 | -30.06 | Q18, K19, N20, C21, G22, C24, W25, G68, T69, Q147, N163, H164, G165, W188, W192 Q18 Q147, W192 Q18, K19, N20, C21, N163, H164, G165, W188 G22, W25 G68, T69
5279261 | -37.01 | N17, Q18, K19, N20, G22, C24, W25, D66, G67, G68, T69, M70, A138, Q147 N163, H164, G165, W188, W192 G68, W188 Q147, W192 N17, Q18, K19, N20, N163, G165, W188 W25, D66 G68, M70, A138
5279267 | -35.23 | N17, Q18, K19, G22, C24, W25, D66, G67, G68, T69, Q147, N163, H164, G165 W188 W192 N17, G68, W188 Q147, W192 N17, Q18, K19, N163, H164, G165, W188 G22, C24, W25, D66, G67 G68, T69
54579435 | -31.69 | Q18, K19, N20, C21 G22, C24, W25, D66, G67, G68, T69, N163, H164, G165 W188 Q18, W188 Q18, K19, N20, C21, N163, H164, G165, W188 G22, C24, W25, D66, G67 G68, T69
6321066 | -33.07 | N17, Q18, K19, G22, C24, W25, D66, G67, G68, T69, N163, H164, G165, W188, W192 N17, G68, W188 W192 N17, Q18, K19, N163, H164, G165, W188 G22, C24, W25, D66, G67 G68, T69
101597384 | -33.24 | N17, Q18 K19, G22, C24, D142 Q147, F148 N163, H164 W188 W192 G68, Q147, H164, W188 Q147, F148, W192 N17, Q18, K19, D142, N163, H164, W188 G22, C24
11186426 | -35.16 | Q18, K19, C24, W25, G68, M70, A138, V162, Q18, K19, G165, W188 C24, W25 G68, M70, A138
    N163, H164 G165, W188
11325689 | -35.46 | C24, W25, G68, T69, M70, A138, V162, N163, H164, G165 W188 W25 N163, G165, W188 W25 G68, T69, M70, A138, V162
3696205 | -30.81 | Q18, K19, N20, C21, G22, C24, W25, G68, T69, M70, Q147, V162, N163, H164 G165, W188, W192 Q18, H164 Q147 Q18, K19, N20, C21, N163, H164, G165, W188 G22, C24, W25 G68, T69, M70, V162
4989711 | -29.47 | Q18, K19, C24, W25, G68, T69, M70, A138, A141, D142 V162, N163 H164, G165 W188 Q18, H164 Q18, K19, A141, H164, G165 W25 T69, M70, A138, V162
5279262 | -32.38 | N17, G22, C24, W25, D66, G67, G68, T69, M70, A138, A141, D142 Q147, V162, N163, H164 G165, W188, W192 C21, C24, G68, W188 Q147, W192 N17, A141, D142, N163, H164, G165, W188 G22, W25, G67 G68, T69, M70, A138, V162
5279269 | -38.58 | N17, Q18, K19, N20, C21, G22 C24, W25, F63, D66, G67, G68, T69, M70 A138, Q147, V162, N163, H164, G165, W188, E215 G68 Q147 N17, Q18, K19, N20, C21, N163, H164, G165, W188 G22, C24, W25, D66, G67 G68, T69, M70, A138, V162, E215 F63
5471616 | -32.18 | G22, D66, G67 G68, V162 N163 G22, G67 T69, A138,
    T69, A138, T160, K161, V162, N163, H164, E215 T160, K161
6520483 | -33.02 | T69, A138, T160, K161, V162, N163 H164, E215 V162 N163, H164 T69, A138 T160, K161 V162, E215
71425014 | -29.22 | N17, Q18, K19, C24, W25, G67, G68, T69, A138, Q147, N163, H164 G165 W188 W188 Q147 N17, Q18, K19, N163, H164, G165, W188 C24, W25, G67 T69, A138
23520342 | -35.43 | Q18, K19, N20, C21, G22, C24, C65, D142, V162, N163, H164, W188 G68, N163, W188 Q18, K19, N20, C21, N163, H164, W188 C24
11269418 | -29.04 | Q18, K19, N20, C21, G22, C24, W25, D66, G67, G68, T69, N163, H164, G165 Q18, C21, W188 Q18, K19, N20, C21, N163, H164, G165 G22, C24, W25, D66, G67 G68, T69
1475342 | -29.32 | C21, G22, C24, W25 C65, D66 G68, T69, M70, A138, V162, N163, H164, G165 C21, N163, H164, G165 G22, C24, W25, C65, D66 G68, T69, M70, A138, V162
5288713 | -30.75 | G22, C24, W25, F63, C65, D66, G67, G68, T69, M70, A138, T160, K161, V162, H164, G165, W188, E215 G68, W188 H164, G165, W188 G22, W25, C65 G68, T69, M70, A138, T160, K161, V162, E215 F63
3570939 | -29.49 | Q18, G22, C24, W25 G68, T69 M70, A138, A141, D142, Q147, V162, N163, H164, G155, W188 Q18, C24, G68 Q147 Q18, A141, D142, N163, H164, G165, W188 G22, C24, W25 G68, T69, M70, A138, V162
3827331 | -31.26 | Q18, K19, N20, C21, G22, C24 W25, G68, T69, M70, A138, A141, D142, Q147, V162, N163, H164, G165 W188, W192 Q18 Q147, W192 Q18, K19, N20, C21, A141, D142, N163, H164, G165, W188 G22, W25 G68, T69, M70, A138, V162
5279260 | -34.0 | N17, Q18, K19, C24, W25, G67, G68, T69, A138, Q147 V162, N163, H164, G165 W188, W192 G68 Q147, W192 N17, Q18, K19, N163, H164, G165, W188 C24, W25, G67 G68, T69, M70, A138 V162
5279264 | -33.04 | C24, W25 G67, G68, T69, M70, A138, A141 D142, Q147 N163, H164 G165, W188, W192 Q18 A141, D142, N163, H164, W188 C24, G67 G68, T69, A138
5962006 | -30.52 | C24, W25 G68, T69 M70, A138 V162, N163, H164, G165 W188 A141, G165 W25 G68, T69, M70, A138, V162
71358204 | -29.72 | Q18, C21, G22, S23, C24, W25, C65, D66, G67, G68, T69, A138, A141, D142, Q18 Q147, W192 Q18, C21, A141, D142, N163, H164, G165, W188 G22, S23, C24, W25, C65, D66, G67 G68, T69, A138, V162
    Q147, V162, N163, H164, G165, W188
90477904 | -30.74 | N17, Q18, K19, C21, G22, S23, C24, W25, A26, F27, C65, D66, G67, A141, Q147, N163, H164, W188 W192 Q18, G68, H164, W188 Q147, W192 N17, Q18, K19, C21, A141, N163, H164, W188 G22, S23, C24, W25, A26, F27, C65, D66, G67
3673954 | -32.33 | Q18, K19, N20, C21, G22, C24, W25, G68, T69, M70, A138, A141, D142, Q147 V162, N163, H164, G165 W188 Q18 Q147 Q18, K19, N20, C21, A141, D142, H164, G165, W188 C24, W25 G68, T69, M70, A138, V162
71367133 | -31.09 | N17, Q18, K19, N20, C21, G22, C24, C65, D66, G67 Q147, N163, H164, W188 G68, N163, W188 Q147 N17, Q18, K19, N20, C21, N163, H164, W188 G22, C24, C65, D66, G67
Table 1: The contact residues around K11777 in cryptopain-1 are color-coded as per subsites. The residues around the P1’ sidegroup of K11777 (S1’ subsite) are in orange. The S1 site is in green, S2 in pink and S3 in red. The residues that made favorable contacts with K11777 are shown in bold in the subsequent columns. The residues around the ligand subgroups of the best-scored vinyl sulfone compounds (PubChem IDs in the Ligands column) are listed. The favorable interactions (including additional contact residues, which do not appear for K11777) are shown in bold and colored as per subsites. The additional S2’ subsite is shown in mauve. The scores and the H-bonding residues for the individual complexes are also listed.
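The ranking discussed in the text can be reproduced from the Score column of Table 1. The following is a minimal sketch, assuming (as the text does) that a more negative score is a better score and using the -29.0 cutoff quoted in the figure captions. Only a handful of rows are copied from Table 1 for brevity, and the names `SCORES`, `CUTOFF` and `rank_ligands` are illustrative rather than part of the study's workflow.

```python
# Docking scores copied from a few rows of Table 1 (PubChem ID -> score).
SCORES = {
    "9851116 (K11777)": -19.15,
    "23520342": -35.43,
    "11303991": -36.41,
    "5279261": -37.01,
    "5279269": -38.58,
    "11269418": -29.04,
}

CUTOFF = -29.0  # "best-scored" threshold quoted in the figure captions


def rank_ligands(scores, cutoff=CUTOFF):
    """Return (ligand, score) pairs at or below the cutoff, best score first."""
    kept = [(ligand, score) for ligand, score in scores.items() if score <= cutoff]
    # More negative scores are treated as better, so sort ascending.
    return sorted(kept, key=lambda pair: pair[1])


ranked = rank_ligands(SCORES)
for ligand, score in ranked:
    print(f"{ligand:>10}  {score:7.2f}")
```

The reference ligand K11777 falls outside the cutoff in this sketch, and 5279269 comes out first, in line with the text's statement that it showed the best score.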
Ligand  Q18  K19  G22  C24  W25  G68  T69  V162  N163  H164  G165  W188
K11777 (9851116)  -0.69  0.06  -1.02  -4.75  -0.21  -0.01  -0.04  -0.24  -0.12  -0.84  0.08  -14.27
10025975  -0.11  0.05  -0.02  -4.19  -0.05  0  -0.12  -0.22  -0.15  -1.64  0.05  -0.38
11303991  -0.06  0.05  -0.08  -38.16  0.01  0  -0.1  0.13  -0.56  -0.36  -0.06  -0.15
1475343  2.3  10.87  19.15  -61.75  13.68  -1.96  9.65  9  -0.95  14.48  11.6  12.11
4508639  78.92  0.05  -0.02  -0.2  -0.02  0  -0.09  -0.21  -0.01  -0.01  0.06  94.15
5279261  -0.27  -23.96  -1.07  -14.28  0.49  0.01  -0.08  -1.37  0.92  -1.17  -0.29  0.92
5279267  -0.01  0.05  -0.02  1.12  0.02  0  -0.06  -0.21  0.12  -0.09  0.01  -0.27
54579435  -0.08  0.05  -0.02  0.19  -0.02  0  -0.12  -0.22  -0.01  -0.62  0.05  -0.26
6321066  -0.08  0.05  -0.02  0.38  0.04  0  -0.09  -0.21  -0.01  -0.11  0.01  -0.26
101597384  -0.41  -0.3  -0.04  0.13  -0.05  0  -0.09  -1.07  -0.01  -0.14  0.01  -0.1
11186426  6.07  4.06  5.07  7.65  6.21  13.3  6.12  4.24  1.41  8.12  2.37  3.15
11325689  -0.08  0.17  -0.96  -26.39  0.77  0.01  -0.23  0.56  0.83  -5.35  -0.2  0.6
3696205  -0.06  0.06  -0.05  2.9  -0.06  0  -0.1  -0.21  -0.01  -0.06  0.01  -0.1
4989711  9.33  -124.55  -120.55  -121.95  -13.62  40.39  8.65  -28.45  -26.17  14.06  64.76  37.76
5279262  -0.06  0.07  -0.13  47.9  0.25  0  -0.07  -0.21  -0.01  -0.05  -0.01  -0.2
5279269  -0.08  0.05  20.02  0  0.07  0  -0.13  -0.21  -0.01  -0.08  0.01  -0.1
5471616  5.13  4.65  5.19  5.54  -1.93  2.95  5.84  4.95  1.51  5.13  4.34  2.05
6520483  -0.08  0.05  -0.04  1.27  0.01  0  -0.11  -0.18  0.14  -0.07  0.01  -0.1
71425014  -28.4  -7.13  -28.42  -35.48  18.81  -28.39  -2.23  -28.68  31.35  -6.79  24.72  -15.86
23520342  30.29  29.22  52.49  67.44  43.47  24.98  24.35  57.59  36.59  1.54  35.81  3.1
11269418  -0.08  0.05  -0.02  0.52  -0.02  0  -0.12  -0.22  -0.05  -0.07  0.05  -0.1
1475342  -0.36  0.05  -0.03  19.88  0.11  0  -0.1  -0.26  -0.47  -0.11  -0.01  -0.13
5288713  1.12  0.15  -0.4  -38.63  -1.21  -0.21  -0.16  -0.13  -0.03  0.25  0.08  1.13
3570939  -2.24  0.05  -0.03  0.11  -0.05  0  -0.12  -0.2  -0.01  -0.14  0.3  -0.1
3827331  -0.21  0.18  0.13  -1.65  -0.05  0  -0.09  -0.21  1.07  1.61  -0.08  -0.1
5279260  -0.08  0.05  -0.02  0.78  0.07  0  -0.11  -0.2  -0.02  -0.1  0.01  -0.1
5279264  -0.08  0.05  -0.02  93.87  -0.02  0  -0.07  -0.21  -0.01  -0.07  0.05  -0.1
5962006  10.18  12.74  -10.64  -12.15  6.82  4.4  7.33  -0.8  -2.89  -2.83  11.11  0.6
71358204  -0.08  0.05  -0.02  27.47  0.07  0  -0.05  -0.21  -0.01  -0.07  0.01  -0.1
90477904  -0.89  -0.72  -0.79  20.44  -2.89  5.42  0.04  -0.99  0.05  -2.09  -0.76  1.17
3673954  -0.08  0.06  -0.02  101.77  -0.09  0  -0.09  -0.21  -1.94  -0.07  -2.41  -0.1
71367133  8.18  -3.76  -4.19  71.7  2.88  14.46  -4.82  7.21  2.11  11.2  13.05  -1.88
Table 2: The ddGbind values for the interaction of K11777 and the best-scored ligands with the important residues of cryptopain-1 are tabulated. The residues that showed a high number of favorable interactions (Supplementary Table 1) were taken into consideration for the second round of calculations to chart this table. The values for the most favorable interactions are shown in purple, moderately favorable interactions in brown, slightly unfavorable in aquamarine and unfavorable in blue. The scale for demarcation varies for each residue, depending on the range and type of its interactions.
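The way Table 2 is read in the text (counting how many residues interact favorably with a ligand) can be sketched in a few lines. This is an illustrative simplification, not the study's procedure: it assumes that any ddGbind value above zero counts as favorable, whereas the Table 2 caption notes that the demarcation scale varies per residue. The residue-to-subsite assignment in `SUBSITE` is likewise an assumption read from the subsite descriptions, and only three rows of Table 2 are copied.

```python
from collections import Counter

# Column order of Table 2 (frequently contacted residues).
RESIDUES = ["Q18", "K19", "G22", "C24", "W25", "G68",
            "T69", "V162", "N163", "H164", "G165", "W188"]

# Assumed residue-to-subsite assignment, read from the subsite descriptions.
SUBSITE = {
    "Q18": "S1'", "K19": "S1'", "N163": "S1'",
    "H164": "S1'", "G165": "S1'", "W188": "S1'",
    "G22": "S1", "C24": "S1", "W25": "S1",
    "G68": "S2", "T69": "S2", "V162": "S2",
}

# Three rows copied from Table 2 (ddGbind per residue, in RESIDUES order).
DDG = {
    "K11777": [-0.69, 0.06, -1.02, -4.75, -0.21, -0.01,
               -0.04, -0.24, -0.12, -0.84, 0.08, -14.27],
    "23520342": [30.29, 29.22, 52.49, 67.44, 43.47, 24.98,
                 24.35, 57.59, 36.59, 1.54, 35.81, 3.1],
    "5279269": [-0.08, 0.05, 20.02, 0, 0.07, 0,
                -0.13, -0.21, -0.01, -0.08, 0.01, -0.1],
}


def favorable_residues(values, threshold=0.0):
    """List residues whose ddGbind exceeds the (assumed) favorable threshold."""
    return [res for res, value in zip(RESIDUES, values) if value > threshold]


# Number of favorably interacting residues per ligand.
counts = {ligand: len(favorable_residues(values)) for ligand, values in DDG.items()}

# Fraction of the Table 2 residues that belongs to each subsite.
subsite_counts = Counter(SUBSITE[res] for res in RESIDUES)
subsite_share = {site: n / len(RESIDUES) for site, n in subsite_counts.items()}

print(counts)
print(subsite_share)
```

With the zero threshold, 23520342 is favorable at all twelve residues, and under the assumed assignment S1’ contributes six of the twelve tabulated residues. A per-residue threshold could be passed to `favorable_residues` to follow the residue-specific scales mentioned in the caption.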
Figure 1: Illustration of the typical binding of vinyl sulfone inhibitors to cysteine protease enzymes. Colored spheres represent the different subsites of the enzyme, and the ligand sidechain/subgroups of the vinyl sulfone inhibitor are in violet rectangles. Spatial distribution of the subsites in three-dimensional protease structures differs from the linear arrangement that has been shown here for simplicity. The backbones of the enzyme and inhibitor are not shown. The site of covalent bond formation at C24 has been marked in red. The positioning/denotation of the ligand subgroups within the different subsites of the enzyme is according to their placement near the vinyl warhead, depicting what has been observed so far in the solved structures of peptidyl vinyl sulfone-bound cysteine proteases. The ligand sidegroup nearest the beta carbon of vinyl is P1, which fits into S1. The following ligand subgroups are P2, P3 etc. The groups beyond the sulfonyl are P1’, P2’ etc., which interact with the prime side subsites of the enzyme.
Figure 2: K11777 or K-777 (PubChem ID: 9851116) docked into the three-dimensional (homology) model of cryptopain-1. The selected conformation (score: -19.15) shown here conforms to the arrangement of the ligand subgroups (P1’, P1, P2, P3) in the different enzyme subsites as depicted in Figure 1, and so does the color code that demarcates the subsites.
Figure 3: All the residues that are contacted by one or more ligands in the docked complexes of K11777 and the best-scored (score <= -29.0) vinyl sulfones are labeled and shown in spacefill representation (colored as per hydrophobicity) in the three-dimensional structure (homology model) of cryptopain-1. The enzymatic triad residue C24, the site of covalent attachment, is in yellow.
Figure 4: Panels A, B, C, D show the orientation and placement of the best-scored (score <= -29.0) compounds docked into the cryptopain-1 theoretical structure. The ligands are shown with respect to the enzyme subsites that have been derived from the K11777-cryptopain-1 reference complex.
Panel 4A: 10025975, 11303991, 1475343, 4508639, 5279261, 5279267, 54579435, 6321066, 101597384
Panel 4B: 11186426, 11325689, 3696205, 4989711, 5279262, 5279269, 5471616, 6520483, 71425014
Panel 4C: 23520342, 11269418, 1475342, 5288713, 3570939, 3827331, 5279260, 5279264, 5962006
Panel 4D: 71358204, 90477904, 3673954, 71367133
Figure 5: The chemical structures (along with the PubChem identifiers) of the reference ligand K11777 or K-777 (9851116) and the two model compounds (23520342 and 5279269), which showed optimum interactions with the enzymatic cleft of cryptopain-1 and thereby could aid the design of effective inhibitors to target the protease.
It is made available under The copyright holder for this preprint (which was notthis version posted January 6, 2021. ; https://doi.org/10.1101/332965doi: bioRxiv preprint https://doi.org/10.1101/332965 http://creativecommons.org/licenses/by-nc-nd/4.0/ 10_1101-2021_01_08_425976 ---- bioRxiv.org - the preprint server for Biology Skip to main content Home About Submit ALERTS / RSS Search for this keyword Advanced Search Subject Areas All Articles Animal Behavior and Cognition Biochemistry Bioengineering Bioinformatics Biophysics Cancer Biology Cell Biology Clinical Trials Developmental Biology Ecology Epidemiology Evolutionary Biology Genetics Genomics Immunology Microbiology Molecular Biology Neuroscience Paleontology Pathology Pharmacology and Toxicology Physiology Plant Biology Scientific Communication and Education Synthetic Biology Systems Biology Zoology View by Month 10_1101-436634 ---- RegTools: Integrated analysis of genomic and transcriptomic data for the discovery of splicing variants in cancer 1 RegTools: Integrated analysis of genomic and transcriptomic data for the discovery of splicing variants in cancer Kelsy C. Cotto1,2,†, Yang-Yang Feng2,†, Avinash Ramu3, Zachary L. Skidmore1,2, Jason Kunisaki2, Megan Richters1,2, Sharon Freshour1,2, Yiing Lin4, William C. Chapman4, Ravindra Uppaluri5,6, Ramaswamy Govindan1,7, Obi L. Griffith1,2,3,7*, Malachi Griffith1,2,3,7* † denotes co-first authors. * denotes corresponding authors. Correspondence to Obi L. Griffith (obigriffith@wustl.edu) and Malachi Griffith (mgriffit@wustl.edu). Affiliations: 1. Division of Oncology, Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA 2. McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA 3. Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA 4. Department of Surgery, Washington University School of Medicine, St. Louis, MO, USA 5. 
Department of Surgery, Brigham and Women’s Hospital, Boston, MA, USA 6. Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA 7. Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, USA

bioRxiv preprint doi: https://doi.org/10.1101/436634; this version posted January 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract

Somatic mutations in non-coding regions and even in exons may have unidentified regulatory consequences which are often overlooked in analysis workflows. Here we present RegTools (www.regtools.org), a free, open-source software package designed to integrate analysis of somatic variants from genomic data with splice junctions from transcriptomic data to identify variants that may cause aberrant splicing. RegTools was applied to over 9,000 tumor samples with both tumor DNA and RNA sequence data. We discovered 235,778 events where a variant significantly increased the splicing of a particular junction, across 158,200 unique variants and 131,212 unique junctions. To characterize these somatic variants and their associated splice isoforms, we annotated them with the Variant Effect Predictor (VEP), SpliceAI, and Genotype-Tissue Expression (GTEx) junction counts and compared our results to other tools that integrate genomic and transcriptomic data. While certain events can be identified by the aforementioned tools, the unbiased nature of RegTools has allowed us to identify novel splice variants and previously unreported patterns of splicing disruption in known cancer drivers, such as TP53, CDKN2A, and B2M, as well as in genes not previously considered cancer-relevant, such as RNF145.

Introduction

Alternative splicing of messenger RNA allows a single gene to encode multiple gene products, increasing a cell’s functional diversity and regulatory precision. However, splicing malfunction can lead to imbalances in transcriptional output or even the presence of novel oncogenic transcripts1. The interpretation of variants in cancer is frequently focused on direct protein-coding alterations2. However, most somatic mutations arise in intronic and intergenic regions, and exonic mutations may also have unidentified regulatory consequences3,4,5,6. For example, mutations can affect splicing either in trans, by acting on splicing effectors, or in cis, by altering the splicing signals located on the transcripts themselves7. Increasingly, we are identifying the importance of splice variants in disease processes, including in cancer8,9. However, our understanding of the landscape of these variants is currently limited, and few tools exist for their discovery.

One approach to elucidating the role of splice variants has been to predict the strength of putative splice sites in pre-mRNA from genomic sequences, such as the method used by the SpliceAI tool10–13. With the advent of efficient and affordable RNA-seq, we are also seeing the complementary approach of evaluating alternative splicing events (ASEs) directly from RNA sequencing data. Various tools exist which allow the identification of significant ASEs from transcript-level data within sample cohorts, including SUPPA2 and SPLADDER14,15. Many of these tools have also evaluated the role of trans-acting splice mutations16. However, few tools are directed at linking specific aberrant RNA splicing events to specific genomic variants in cis to investigate the splice regulatory impact of these variants.
Those few relevant tools that do exist have significant limitations that preclude them from broad applications. The sQTL-based approach taken by LeafCutter and other tools is designed for relatively frequent single-nucleotide polymorphisms. It is thus ill-suited to studying somatic variants, or any case in which the frequency of a particular variant is very low (often unique) in a given sample population17–19. Recent tools that have been created for large-scale analysis of cancer-specific data, such as MiSplice and Veridical, ignore certain types of ASEs, are tailored to specific analysis strategies and sets of hypotheses, or are otherwise inaccessible to the end-user due to issues such as lack of documentation, difficulty with installation and integration with existing pipelines, limited computing efficiency, or licensing issues20–22. To address these needs, we have developed RegTools, a free, open-source (MIT license) software package that is well-documented, modularized for ease of use, and designed to efficiently identify potential cis-acting splice-relevant variants in tumors (www.regtools.org). RegTools is a suite of tools designed to aid users in a broad range of splicing-related analyses. At the highest level, it contains three sub-modules: a variants module to annotate variant calls with respect to their potential splicing relevance, a junctions module to analyze aligned RNA-seq data and associated splicing events, and a cis-splice-effects module that integrates genomic variant calls and transcriptomic sequencing data to identify potential splice-altering variants. Each sub-module contains one or more commands, which can be used individually or integrated into regulatory variant analysis pipelines. 
To demonstrate the utility of RegTools in identifying potential splice-relevant variants from tumor data, we analyzed a combination of data available from the McDonnell Genome Institute (MGI) at Washington University School of Medicine and The Cancer Genome Atlas (TCGA) project. In total, we applied RegTools to 9,173 samples across 35 cancer types. We contrasted our results with other tools that integrate genomic and transcriptomic data to identify potential splice altering variants, specifically Veridical, MiSplice, and SAVNet20,21,23. Novel junctions identified by RegTools were compared to data from The Genotype-Tissue Expression (GTEx) project to assess whether these junctions are present in normal tissues24. Variants significantly associated with novel junctions were processed through VEP and Illumina’s SpliceAI tool to compare our findings with splicing consequences predicted based on the variant information alone13,25. With this additional analysis, we were able to more easily identify both variants in known cancer drivers, whose splicing consequences have not been previously reported in the literature, and potentially novel cancer drivers, whose disruption relies on splice-altering mutations.

Results

The RegTools tool suite supports splice regulatory variant discovery by the integration of genome and transcriptome data.

RegTools is a suite of tools designed to aid users in a broad range of splicing-related analyses. The variants module contains the annotate command.
The variants annotate command takes a VCF of somatic variant calls and a GTF of transcriptome annotations as input. RegTools does not have any particular preference for variant callers or reference annotations. Each variant is annotated by RegTools with known overlapping genes and transcripts, and is categorized into one of several user-configurable “variant types”, based on position relative to the edges of known exons. The variant type annotation depends on the stringency for splicing-relevance that the user sets with the “splice variant window” setting. By default, RegTools marks intronic variants within 2 bp of the exon edge as “splicing intronic”, exonic variants within 3 bp as “splicing exonic”, other intronic variants as “intronic”, and other exonic variants simply as “exonic.” RegTools considers only “splicing intronic” and “splicing exonic” as important. To allow for discovery of an arbitrarily expansive set of variants, RegTools allows the user to customize the size of the exonic/intronic windows individually (e.g. -i 50 -e 5 for intronic variants 50 bp from an exon edge and exonic variants 5 bp from an exon edge) or even consider all exonic/intronic variants as potentially splicing-relevant (e.g. -E or -I) (Figure 1A). The junctions module contains the extract and annotate commands. The junctions extract command takes an alignment file containing aligned RNA-seq reads, infers the exon-exon boundaries based on the CIGAR strings26, and outputs each “junction” as a feature in BED12 format. The junctions annotate command takes a file of junctions in BED12 format (such as the one output by junctions extract), a FASTA file containing the reference genome, and a GTF file containing reference transcriptome annotations and generates a TSV file, annotating each junction with: the number of acceptor sites, donor sites, and exons skipped, and the identities of known overlapping transcripts and genes. 
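As a rough illustration of how exon-exon boundaries can be inferred from CIGAR strings, the following Python sketch collects the reference span of every `N` (skipped region) operation in a single alignment. It is not the RegTools implementation: the function name `junctions_from_cigar`, the 0-based half-open coordinate convention, and the absence of any filtering or BED12 output are assumptions made only for this example.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
CONSUMES_REFERENCE = set("MDN=X")


def junctions_from_cigar(ref_start, cigar):
    """Return (start, end) reference intervals, 0-based half-open,
    for every N operation (skipped region) in one alignment.

    Hypothetical helper for illustration; ref_start is the 0-based
    leftmost aligned reference position of the read.
    """
    junctions = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # The skipped region is the putative intron between two exons.
            junctions.append((pos, pos + length))
        if op in CONSUMES_REFERENCE:
            pos += length
    return junctions


if __name__ == "__main__":
    # A read with 50 aligned bases, a 200 bp skipped region, then 50 aligned bases.
    print(junctions_from_cigar(100, "50M200N50M"))
```

Soft clips and insertions do not advance the reference position, which is why only `M`, `D`, `N`, `=`, and `X` move `pos`. Turning these intervals into per-junction BED12 features with read counts would need further bookkeeping that is not shown here.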
We also annotate the “junction type”, which denotes if and how the junction is novel (i.e. different compared to provided transcript annotations). If the donor is known, but the acceptor is not or vice-versa, it is marked as “D” or “A”, respectively. If both are known, but the pairing is not known, it is marked as “NDA”, whereas if both are unknown, it is marked as “N”. If the junction is not novel (i.e. it appears in at least one transcript in the supplied GTF), it is marked as “DA” (Figure 1B).

The cis-splice-effects module contains the identify command, which identifies potential splice-altering variants from sequencing data. The following are required as input: a VCF file containing variant calls, an alignment file containing aligned RNA-sequencing reads, a reference genome FASTA file, and a reference transcriptome GTF file. The identify pipeline internally relies on variants annotate, junctions extract, and junctions annotate to output a TSV containing junctions proximal to putatively splicing-relevant variants. The identify pipeline can be customized using the same parameters as in the individual commands. Briefly, cis-splice-effects identify first performs variants annotate to determine the splicing-relevance of each variant in the input VCF. For each variant, a “splice junction region” is determined by finding the largest span of sequence space between the exons that flank the exon associated with the variant. From here, junctions extract identifies splicing junctions present in the RNA-seq BAM.
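The junction type labels described above (DA, D, A, NDA, and N) can be illustrated with a small Python sketch. It assumes, purely for the example, that the reference annotation has already been reduced to a set of known (donor, acceptor) coordinate pairs. The function name `junction_type` is hypothetical, and strand and chromosome handling are deliberately left out.

```python
def junction_type(donor, acceptor, known_junctions):
    """Label a junction relative to annotated junctions.

    known_junctions is a set of (donor, acceptor) coordinate pairs taken
    from a reference annotation (an assumed, simplified representation).
    """
    if (donor, acceptor) in known_junctions:
        return "DA"  # junction appears in at least one annotated transcript

    known_donors = {d for d, _ in known_junctions}
    known_acceptors = {a for _, a in known_junctions}
    donor_known = donor in known_donors
    acceptor_known = acceptor in known_acceptors

    if donor_known and acceptor_known:
        return "NDA"  # both ends known, but this pairing is not annotated
    if donor_known:
        return "D"    # known donor, novel acceptor
    if acceptor_known:
        return "A"    # novel donor, known acceptor
    return "N"        # neither end is annotated


if __name__ == "__main__":
    annotated = {(100, 200), (300, 400)}
    for junction in [(100, 200), (100, 400), (100, 250), (150, 200), (150, 250)]:
        print(junction, junction_type(*junction, annotated))
```

Checking the exact pair first matters, because a DA junction also has a known donor and a known acceptor and would otherwise be mislabeled as NDA.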
Next, junctions annotate labels each extracted junction with information from the reference transcriptome as described above and its associated variants based on splice junction region overlap (Figure 1C). For our analysis, we annotated the pairs of associated variants and junctions identified by RegTools, which we refer to as “events”, with additional information such as whether this association was identified by a comparable tool, whether the junction was found in GTEx, and whether the event occurred in a cancer gene according to Cancer Gene Census (CGC) (Figure 1C)24,27. Finally, we created IGV sessions for each event identified by RegTools that contained a BED file with the junction, a VCF file with the variant, and an alignment (BAM) file for each sample that contained the variant28. These IGV sessions were used to manually review candidate events to assess whether the association between the variant and junction makes sense in a biological context.

RegTools is designed for broad applicability and computational efficiency. By relying on well-established standards for sequence alignments, annotation files, and variant calls and by remaining agnostic to downstream statistical methods and comparisons, our tool can be applied to a broad set of scientific queries and datasets. Moreover, performance tests show that cis-splice-effects identify can process a typical candidate variant list of 1,500,000 variants and a corresponding RNA-seq BAM file of 82,807,868 reads in just ~8 minutes (Supplementary Figure 1).

Pan-cancer analysis of 35 tumor types identifies somatic variants that alter canonical splicing

RegTools was applied to 9,173 samples over 35 cancer types. Thirty-two of these cohorts came from TCGA while the remaining three were obtained from other projects being conducted at MGI. Cohort sizes ranged from 21 to 1,022 samples. In total, 6,370,631 variants (Figure 2A) and 2,387,989,201 junction observations (Figure 2B) were analyzed by RegTools.
By comparing the number of initial variants per cohort to the number of statistically significant variants, we were able to show that RegTools produces a prioritized list of potential splice relevant variants (Supplementary Figure 2). Additionally, when analyzing the junctions within each sample, we found that junctions present in the reference transcriptome are frequently seen within GTEx data while junctions observed from a sample’s own transcriptome data that were not present in the reference are rarely seen within GTEx (Supplementary Figure 3).

We found 235,778 significant variant-junction pairings for junctions that use a known donor and novel acceptor (D), novel donor and known acceptor (A), or novel combination of a known donor and a known acceptor (NDA), with novel here meaning that the junction was not found in the reference transcriptome (Methods, Figure 2C, Supplemental Files 1 and 2). While our analysis primarily focuses on variants in relation to novel splice events because of the potential importance of these events within tumor processes, we also wanted to assess how often a variant was significantly associated with a known junction. We found 5,157 variant-junction pairings for junctions known to the reference (DA junctions) (Supplemental Files 3 and 4). This finding indicates that while splice variants usually result in a novel junction occurring, they sometimes alter the expression of known junctions. Generally, significant events were evenly split among each of the novel junction types considered (D, A, and NDA).
The number of significant events increased as the splice variant window size increased, with both the E and I results being comparable in number. Notably, hepatocellular carcinoma (HCC) was the only cohort that had whole genome sequencing (WGS) data available and, as expected, it exhibited a marked increase in the number of significant events for its results within the “I” splice variant window. This observation highlights the low sequence coverage of intronic regions that occurs with WES, which subsequently leads to underpowered discovery of potential splice altering variants within introns.

Variants were analyzed across tumor types for how often they result in either a single or multiple novel junctions (Figure 3A). While a single variant resulting in a single novel junction is most commonly observed (72.27-83.78%), a single variant also commonly results in multiple junctions being created, either of the same type (6.56-10.94%) or of different types (9.66-16.79%) (Figure 3B). Variants that are associated with multiple novel junctions of different types were further investigated to identify how often a particular junction type occurred with another (Figure 3C). Most commonly, we observed an alternate donor or acceptor site being used in conjunction with an exon skipping event. These events were particularly common within the default window (2 intronic bases or 3 exonic bases from the exon edge), as an SNV or indel within these positions has a high probability of disrupting the natural splice site, thus causing the splicing machinery to use a cryptic splice site nearby or skip the splice site entirely. The next most common event was an alternate donor site and an alternate acceptor site both being used as the result of a single variant.
The combination of a novel acceptor site and novel donor site being used in conjunction with an exon-skipping event occurred the least, and the occurrence of this type of event remains fairly low even as the search space increases within the larger splice variant windows. This finding indicates the low likelihood of a single variant resulting in simultaneous disruption of a splice acceptor and donor as well as complete skipping of an exon. Overall, this analysis highlights that there is evidence that a single variant can lead to multiple novel junctions being expressed. Tools that only allow for a single junction to be predicted or associated with a variant therefore may not be completely describing the effect of the variant in question in up to ~27% of cases.

RegTools identifies splice altering variants missed by other splice variant predictors and annotators

To evaluate the performance of RegTools, we compared our results to those of SAVNet, MiSplice, Veridical, VEP, and SpliceAI13,20,21,23,25. These tools vary in their inputs and methodology for identifying splice altering variants (Figure 4A). Both VEP and SpliceAI only consider information about the variant and its genomic sequence context and do not consider information from a sample’s transcriptome. A variant is considered to be splice relevant according to VEP if it occurs within 1-3 bases on the exonic side or 1-8 bases on the intronic side of a splice site.
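The window-based definitions used by RegTools (by default 2 intronic or 3 exonic bases from the exon edge) and by VEP (1-3 exonic or 1-8 intronic bases, as stated above) can be compared with a small Python sketch. The function `variant_type` and its arguments are hypothetical. In particular, the convention that a distance of 1 means the base directly adjacent to the exon edge is an assumption made for this example, not a documented RegTools or VEP behavior.

```python
def variant_type(region, distance, intronic_window=2, exonic_window=3):
    """Classify a variant by its distance (in bp) from the nearest exon edge.

    region   -- "intronic" or "exonic" (side of the exon edge the variant is on)
    distance -- assumed convention: 1 is the base adjacent to the exon edge
    The default windows mirror the RegTools defaults quoted in the text.
    """
    if region not in ("intronic", "exonic"):
        raise ValueError("region must be 'intronic' or 'exonic'")
    window = intronic_window if region == "intronic" else exonic_window
    if 1 <= distance <= window:
        return "splicing " + region
    return region


if __name__ == "__main__":
    # Default RegTools-style windows (2 bp intronic, 3 bp exonic).
    print(variant_type("intronic", 2))   # inside the default intronic window
    print(variant_type("intronic", 5))   # outside the default intronic window
    # The same variant under a VEP-like intronic window of 8 bp.
    print(variant_type("intronic", 5, intronic_window=8))
    # A widened search comparable to the "-i 50 -e 5" example in the text.
    print(variant_type("intronic", 40, intronic_window=50, exonic_window=5))
```

Because the window sizes are ordinary parameters, the same variant can be labeled splicing-relevant under one tool's definition and not under another's, which is one reason the call sets compared in this section differ.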
SpliceAI does not have restrictions on where the variant can occur in relation to the splice site but by default, it predicts one new donor and acceptor site within 50 bp of the variant, based on reference transcript sequences from GENCODE. Like RegTools, SAVNet, MiSplice, and Veridical integrate genomic and transcriptomic data in order to identify splice altering variants. MiSplice only considers junctions that occur within 20 bp of the variant. Additionally, SAVNet, MiSplice, and Veridical filter out any transcripts found within the reference transcriptome. SAVNet, MiSplice, and Veridical employ different statistical methods for the identification of splice altering variants. In contrast to RegTools, none of the mentioned tools allow the user to set a custom window in which they wish to focus splice altering variant discovery (e.g. around the splice site, all exonic variants, etc.). These tools have different levels of code availability. MiSplice is available via GitHub as a collection of Perl scripts that are built to run via Load Sharing Facility (LSF) job scheduling. To run MiSplice without an LSF cluster, the authors mention code changes are required. Veridical is available via a subscription through CytoGnomix’s MutationForecaster. Similar to RegTools, SAVNet is available via GitHub or through a Docker image. However, SAVNet relies on splicing junction files generated by STAR29 whereas RegTools can use RNA-Seq alignment files from HISAT230, TopHat231, or STAR, thus allowing it to be integrated into bioinformatic workflows more easily. In their recent publications, SAVNet23, MiSplice20, and Veridical21,22 also analyzed data from TCGA, with only minor differences in the number of samples included for each study. VEP and SpliceAI results were obtained by running each tool on all starting variants for the 35 cohorts included in this study. In order to efficiently compare this data, an UpSet plot (Figure 4B) was created32. 
Only 343 variants are identified as splice altering by all six tools. Comparatively, MiSplice and SAVNet find few splice altering variants, potentially indicating that these tools do not capture the complete set of variants that have an effect on splicing. In contrast, Veridical identifies by far the most splice altering variants across all tools, with 94.54 percent of its calls being found by it alone. SpliceAI and VEP called a large number of variants, either alone or in agreement, that none of the tools that integrate transcriptomic data from samples identify. This highlights a limitation of using tools that only focus on genomic data, particularly in a disease context where transcripts are unlikely to have been annotated before. RegTools addresses these shortcomings by identifying what pieces of information to extract from a sample’s genome and transcriptome in a very basic, unbiased way that allows for generalization. Other tools either only analyze genomic data, focus on junctions where either the canonical donor or acceptor site is affected (missing junctions that result from complete exon skipping), or consider only those variants within a very narrow distance from known splice sites. RegTools can include any kind of junction type, including exon-exon junctions that have ends that are not known donor/acceptor sites according to the GTF file (N junction according to RegTools), any distance size to make variant-junction associations, and any window size in which to consider variants.
Due to these advantages, RegTools identified events missed by one or multiple of the tools to which we compared (Figure 4B; Supplementary Figures 4 and 5).

Pan-cancer analysis reveals novel splicing patterns within known cancer genes and potential cancer drivers

While efforts have been made to associate variants with specific cancer types, there has been little focus on identifying such associations in splice-altering variants, even those in known cancer genes. TP53 is a rare example whose splice-altering variants are well characterized in numerous cancer types33. As such, we further analyzed significant events to identify genes that had recurrent splice altering variants. Within each cohort, we looked for recurrent genes using two separate metrics: a binomial test p-value and the fraction of samples (see Methods). For ranking and selecting the most recurrent genes, each metric was computed by pooling across all cohorts. For assessing cancer-type specificity, each metric was then also computed using only results from a given cancer cohort. Since the mechanisms underlying the creation of novel junctions versus the disruption of existing splicing patterns may be different, analysis was performed separately for D/A/NDA junctions (Figure 5, Supplementary Figure 6, Supplementary File 5) and DA junctions (Supplementary Figure 7, Supplementary File 6), which allowed multiple test correction in accordance with the noise of the respective data. We identified 6,954 genes in which there was at least one variant predicted to influence the splicing of a D/A/NDA junction. The 99th percentile of these genes, when ranked by either metric, are significantly enriched for known cancer genes, as annotated by the CGC (p=1.26E-19, ranked by binomial p-values, p=2.97E-24, ranked by fraction of samples; hypergeometric test). We also identified 3,643 genes in which there was at least one variant predicted to influence the splicing of a DA (known) junction.
The 99th percentile of these genes, when ranked by either metric, are also significantly enriched for known cancer genes, as annotated by the Cancer Gene Census (p=1.00E-04, ranked by binomial p-values, p=3.56E-07, ranked by fraction of samples; hypergeometric test). We also performed the same analyses using either the TCGA or MGI cohorts alone. The TCGA-only analyses gave very similar results to the combined analyses, with the 99th percentile of genes found in the D/A/NDA and DA analyses again being enriched for cancer genes (Supplementary Figures 8 and 9; Supplemental Files 5 and 6). Due to small cohort sizes, in the MGI-only analyses, we identified only 329 and 208 genes in the D/A/NDA and DA analyses, respectively. The 99th percentile of genes from these analyses, respectively, were not significantly enriched for cancer genes (Supplementary Figures 10 and 11; Supplemental Files 5 and 6).

When analyzing D, A, and NDA junctions, we saw an enrichment for known tumor suppressor genes among the most splice disrupted genes, including several examples where splice disruption is a known mechanism such as TP53, PTEN, CDKN2A, and RB1. Specifically, in the case of TP53, we identified 428 variants that were significantly associated with at least one novel splicing event. One such example is the intronic SNV (GRCh38, chr17:g.7673609C>A) that was identified in an OSCC sample and was associated with an exon skipping event and an alternate acceptor site usage event, with 23 and 41 reads of support, respectively (Supplemental Figure 12).
The cancer types in which we find splice disruption of TP53 and other known cancer genes are in concordance with associations between genes and cancer types described by CGC and CHASMplus27,34. Our analysis’s recovery of known drivers, many of which have known susceptibilities to splicing dysregulation in cancer, indicates the ability of our method to identify true splicing effects that are likely cancer-relevant.

Another cancer gene that we found to have a recurrence of splicing altering variants was B2M. Specifically, we identified six samples with intronic variants on either side of exon 2 (Figure 6). While mutations have been identified and studied within exon 2, we did not find literature that specifically identified intronic variants near exon 2 as a mechanism for disrupting B2M35. These mutations were identified by VEP to be either a splice acceptor variant or a splice donor variant and were also identified by Veridical. MiSplice was able to predict one of the novel junctions for each variant but failed to predict additional novel junctions due to the limitation of that tool to only predict one novel acceptor and donor site per variant. Notably, 4 out of the 6 samples that these variants were found in are MSI-H (Microsatellite instability-high) tumors36. Mutations in B2M, particularly within colorectal MSI-H tumors, have been identified as a method for tumors to become incapable of HLA class I antigen-mediated presentation37. Furthermore, in a study of patients treated with immune checkpoint blockade (ICB) therapy, defects to B2M were observed in 29.4% of patients with progressing disease38. In the same study, B2M mutations were exclusively seen in pretreatment samples from patients who did not respond to ICB or in post-progression samples after initial response to ICB38. Several genes that are responsible for the processing, loading, and presentation of antigens have been shown to be mutated in cancers39.
However, no proteins can be substituted for B2M in HLA class I presentation, thus making the loss of B2M a particularly robust method for ICB resistance40. We also observe exonic variants and variants further in intronic regions that disrupt canonical splicing of B2M. These findings indicate that intronic variants that result in alternative splice products within B2M may be a mechanism for immune escape within tumor samples.

We also identify recurrent splice altering variants in genes not known to be cancer genes (according to CGC), such as RNF145. RegTools identified a recurrent single base pair deletion that results in an exon skipping event of exon 8 (Supplementary Figure 13). This gene is a paralog of RNF139, which has been found to be mutated in several cancer types41. This variant junction association was found in STAD, UCEC, COAD, and ESCA tumors, all of which are considered to be MSI-H tumors36. After analyzing the effect of the exon skipping event on the mRNA sequence, we concluded that the reading frame remains intact, possibly leading to a gain of function event. Additionally, the skipping of exon 8 leads to the removal of a transmembrane domain and a phosphorylation site, S352, which could be important for the regulation of this gene42. Based on these findings, RNF145 may play a role similar to RNF139 and may be an important driver event in certain tumor samples.

While most of our analysis focused on splice altering variants that resulted in D, A, or NDA junctions, we also wanted to investigate variants that shifted the usage of known donor and acceptor sites. Through this analysis, we identified CDKN2A, a tumor suppressor gene that is frequently mutated in numerous cancers43, to have several variants that led to alternate donor usage (Supplementary Figure 14). When these variants are present, an alternate known donor site is used that leads to the formation of the transcript ENST00000579122.1 instead of ENST00000304494.9, the transcript that encodes for p16ink4a, a known tumor suppressor. The transcript that results from use of this alternate donor site is missing the last twenty-eight amino acids that form the C-terminal end of p16ink4a. Notably, this removes two phosphorylation sites within the p16 protein, S140 and S152, which when phosphorylated promote the association of p16ink4a with CDK444. This finding highlights the importance of including known transcripts in alternative splicing analyses as variants may alter splice site usage in a way that results in a known but pathogenic transcript product.

Discussion

Splice associated variants are often overlooked in traditional genomic analysis. To address this limitation, we created RegTools, a software suite for the analysis of variants and junctions in a splicing context. By relying on well-established standards for analyzing genomic and transcriptomic data and allowing flexible analysis parameters, we enable users to apply RegTools to a wide set of scientific methodologies and datasets. To ease the use and integration of RegTools into analysis workflows, we provide documentation and example workflows via regtools.org and provide a Docker image with all necessary software installed.
In order to demonstrate the utility of our tool, we applied RegTools to 9,173 tumor samples across 35 tumor types to profile the landscape of this category of variants. From this analysis, we report 133,987 variants that cause novel splicing events that were missed by VEP or SpliceAI. Only 1.4 percent of these mutations were previously discovered by similar attempts, while 98.6 percent are novel findings. We demonstrate that there are splice altering variants that occur beyond the splice site consensus sequence, shift transcript usage between known transcripts, and create novel exon-exon junctions that have not been previously described. Specifically, we describe notable findings within B2M, RNF145, and CDKN2A. These results demonstrate the utility of RegTools in discovering novel splice-altering mutations and confirm the importance of integrating RNA and DNA sequencing data in understanding the consequences of somatic mutations in cancer. To allow further investigation of these identified events, we make all of our annotated result files (Supplemental Files 1-4) and recurrence analysis files (Supplemental Files 5-6) available. Understanding the splicing landscape is crucial for unlocking potential therapeutic avenues in precision medicine and elucidating the basic mechanisms of splicing. The exploration of novel tumor-specific junctions will undoubtedly lead to translational applications, from discovering novel tumor drivers, diagnostic and prognostic biomarkers, and drug targets, to identifying a previously untapped source of neoantigens for personalized immunotherapy.
While our analysis focuses on splice altering variants within cancers, we believe RegTools will play an important role in answering this broad range of questions by helping users extract splicing information from transcriptome data and linking it to somatic (or germline) variant calls. The computational efficiency of RegTools and increasing availability and size of such datasets may also allow for improved understanding of splice regulatory motifs that have proven difficult to accurately define such as exonic and intronic splicing enhancers and silencers. Any group with paired DNA and RNA-seq data for the same samples stands to benefit from the functionality of RegTools. Methods Software implementation RegTools is written in C++. CMake is used to build the executable from source code. We have designed the RegTools package to be self-contained in order to minimize external software dependencies. A Unix platform with a C++ compiler and CMake is the minimum prerequisite for installing RegTools. Documentation for RegTools is maintained as text files within the source repository to minimize divergence from the code. We have implemented common file handling tasks in RegTools with the help of open-source code from Samtools/HTSlib26 and BEDTools45 in an effort to ensure fast performance, consistent file handling, and interoperability with any aligner that adheres to the BAM specification. Statistical tests are conducted within RegTools using the RMath framework. Travis CI and Coveralls are used to automate and monitor software compilation and unit tests to ensure software functionality. We utilized the Google Test framework to write unit tests. RegTools consists of a core set of modules for variant annotation, junction extraction, junction annotation, and GTF utilities.
Higher level modules such as cis-splice-effects make use of the lower level modules to perform more complex analyses. We hope that bioinformaticians familiar with C/C++ can re-use or adapt the RegTools code to implement similar tasks. Benchmarking Performance metrics were calculated for all RegTools commands. Each command was run with default parameters on a single blade server (Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz) with 10 GB of RAM and 10 replicates for each data point (Supplementary Figure 1). Specifically for cis-splice-effects identify, we started with random selections of somatic variants, ranging from 10,000-1,500,000, across 8 data subsets. Using the output from cis-splice-effects identify, variants annotate was run on somatic variants from the 8 subsets (range: 0-17,742) predicted to have a splicing consequence. The function junctions extract was performed on the HCC1395 tumor RNA-seq data aligned with HISAT to GRCh37 and randomly downsampled at intervals ranging from 10-100%. Using output from junctions extract, junctions annotate was performed for 7 data subsets ranging from 1,000-500,000 randomly selected junctions. Benchmark tests revealed an approximately linear performance for all functions. Variance between real and CPU time is highly dependent on the I/O speed of the write-disk and could account for artificially inflated real time values given multiple jobs writing to the same disk at once.
The most computationally expensive function in a typical analysis workflow was junctions extract, which on average processed 33,091 reads/second (CPU) and took an average of 43.4 real vs 41.7 CPU minutes to run on a full BAM file (82,807,868 reads total). The function junctions annotate was the next most computationally intensive function and took an average of 33.0 real/8.55 CPU minutes to run on 500,000 junctions, processing 975 junctions/second (CPU). The other functions were comparatively faster with cis-splice-effects identify and variants annotate able to process 3,105 and 118 variants per second (CPU), respectively. Processing a typical candidate variant list of 1,500,000 variants and a corresponding RNA-seq BAM file of 82,807,868 reads with cis-splice-effects identify takes ~8.20 real/8.05 CPU minutes (Supplementary Figure 1). Performance metrics were also calculated for the statistics script and its associated wrapper script that handles dividing the variants into smaller chunks for processing to limit RAM usage. This command, compare_junctions, was benchmarked in January 2020 using Amazon Web Services (AWS) on a m5.4xlarge instance, based on the Amazon Linux 2 AMI, with 64 GB of RAM, 16 vCPUs, and a mounted 1 TB SSD EBS volume with 3000 IOPS. These data were generated from running compare_junctions on each of the included cohorts, with the largest being our BRCA cohort (1,022 samples), which processed 3.64 events per second (CPU). Using RegTools to identify cis-acting, splice altering variants RegTools contains three sub-modules: “variants”, “junctions”, and “cis-splice-effects”. For complete instructions on usage, including a detailed workflow for how to analyze cohorts using RegTools, please visit regtools.org. Variants annotate This command takes a list of variants in VCF format. The file should be gzipped and indexed with Tabix46. The user must also supply a GTF file that specifies the reference transcriptome used to annotate the variants.
The INFO column of each line in the VCF is populated with comma-separated lists of the variant-overlapping genes, variant-overlapping transcripts, the distance between the variant and the associated exon edge for each transcript (i.e. each start or end of an exon whose splice variant window included the variant) defined as min(distance_from_start_of_exon, distance_from_end_of_exon), and the variant type for each transcript. Internally, this function relies on HTSlib to parse the VCF file and search for features in the GTF file which overlap the variant. The splice variant window size (i.e. the maximum distance from the edge of an exon used to consider a variant as splicing-relevant) can be set by the options “-e ” and “-i ” for exonic and intronic variants, respectively. The variant type for each variant thus depends on the options used to set the splice variant window size. Variants captured by the window set by “-e” or “-i” are annotated as “splicing_exonic” and “splicing_intronic”, respectively. Alternatively, to analyze all exonic or intronic variants, the “-E” and “-I” options can be used. Otherwise, the “-E” and “-I” options themselves do not change the variant type annotation, and variants found in these windows are labeled simply as “exonic” or “intronic”. By default, single exon transcripts are ignored, but they can be included with the “-S” option. By default, output is written to STDOUT in VCF format. To write to a file, use the option “-o ”.
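The window-based variant typing described above can be illustrated with a minimal Python sketch. This is not the RegTools implementation (which is written in C++); the function names, the single-exon simplification, and the exact boundary convention (an exonic edge base has distance 0, so a window of 3 covers the three outermost exon bases) are assumptions made for illustration. The default window sizes follow the “-i 2 -e 3” default stated in this manuscript.

```python
from typing import Optional


def exon_edge_distance(pos: int, exon_start: int, exon_end: int) -> int:
    """min(distance_from_start_of_exon, distance_from_end_of_exon), in bases."""
    return min(abs(pos - exon_start), abs(pos - exon_end))


def classify_variant(
    pos: int,
    exon_start: int,
    exon_end: int,
    exonic_window: int = 3,     # analogous to "-e"
    intronic_window: int = 2,   # analogous to "-i"
    all_exonic: bool = False,   # analogous to "-E"
    all_intronic: bool = False, # analogous to "-I"
) -> Optional[str]:
    """Return a variant type for one exon, or None if the variant is not captured.

    Assumption: coordinates are 1-based and inclusive, and every position
    outside the exon is treated as intronic (neighbouring exons are ignored).
    """
    distance = exon_edge_distance(pos, exon_start, exon_end)
    if exon_start <= pos <= exon_end:
        # Exonic: distances 0 .. exonic_window - 1 fall inside the window.
        if distance < exonic_window:
            return "splicing_exonic"
        return "exonic" if all_exonic else None
    # Intronic: distances 1 .. intronic_window fall inside the window.
    if distance <= intronic_window:
        return "splicing_intronic"
    return "intronic" if all_intronic else None
```

With the default windows, a variant two bases upstream of an exon spanning 100-200 (position 98) is typed “splicing_intronic”, while a variant at position 150 is only reported, as “exonic”, when the all-exonic flag is set.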
Junctions extract This command takes an alignment file containing aligned RNA-seq reads and infers junctions (i.e. exon-exon boundaries) based on skipped regions in alignments as determined by the CIGAR string operator codes. These junctions are written to STDOUT in BED12 format. Alternatively, the output can be redirected to a file with the “-o ”. RegTools ascertains strand information based on the XS tags set by the aligner, but can also determine the inferred strand of transcription based on the BAM flags if a stranded library strategy was employed. In the latter case, the strand specificity of the library can be provided using “-s ” where 0 = unstranded, 1 = first-strand/RF, 2 = second-strand/FR. We suggest that users align their RNA-seq data with HISAT230, TopHat231, or STAR29, as these are the aligners we have tested to date. If RNA-seq data is unstranded and aligned with STAR, users must run STAR with the --outSAMattributes option to include XS tags in the BAM output. Users can set thresholds for minimum anchor length and minimum/maximum intron length. The minimum anchor length determines how many contiguous, matched base pairs on either side of the junction are required to include it in the final output. The required overlap can be observed amongst separated reads, whose union determines the thickStart and thickEnd of the BED feature. By default, a junction must have 8 bp anchors on each side to be counted but this can be set using the option “-a ”. The intron length is simply the end coordinate of the junction minus the start coordinate. By default, the junction must be between 70 bp and 500,000 bp, but the minimum and maximum can be set using “-i ” and “-I ”, respectively. For efficiency, this tool can be used to process only alignments in a particular region as opposed to analyzing the entire BAM file. The option “-r :-” can be used to set a single contiguous region of interest. 
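As an illustration of how junctions can be inferred from skipped regions in a CIGAR string and filtered by anchor and intron length, the following Python sketch processes a single alignment. It is a simplified stand-in, not the RegTools code: the function names are hypothetical, anchors are evaluated per read rather than pooled across reads, and an anchor is taken to be the run of M/= operations immediately adjacent to the skipped region. The defaults (8 bp anchors, introns of 70 to 500,000 bp) are those stated above.

```python
import re
from typing import List, Sequence, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def _anchor_length(ops: Sequence[Tuple[int, str]]) -> int:
    """Sum contiguous match operations, starting from the junction outward."""
    total = 0
    for length, op in ops:
        if op not in ("M", "="):
            break
        total += length
    return total


def junctions_from_cigar(
    ref_start: int,
    cigar: str,
    min_anchor: int = 8,
    min_intron: int = 70,
    max_intron: int = 500_000,
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for each N operation that passes the filters.

    ref_start is the 0-based leftmost reference position of the alignment;
    the intron length is end - start, as described in the text.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    junctions = []
    pos = ref_start
    for idx, (length, op) in enumerate(ops):
        if op == "N":
            left = _anchor_length(ops[:idx][::-1])
            right = _anchor_length(ops[idx + 1:])
            if (left >= min_anchor and right >= min_anchor
                    and min_intron <= length <= max_intron):
                junctions.append((pos, pos + length))
        if op in "MDN=X":  # operations that consume the reference
            pos += length
    return junctions
```

For example, an alignment starting at position 1000 with CIGAR 50M200N50M yields the single junction (1050, 1250), whereas 5M200N95M yields none because the left anchor is shorter than 8 bp.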
Multiple jobs can be run in parallel to analyze separate non-contiguous regions. Junctions annotate This command takes a list of junctions in BED12 format as input and annotates them with respect to a reference transcriptome in GTF format. The observed splice-sites used are recorded based on a reference genome sequence in FASTA format. The output is written to STDOUT in TSV format, with separate columns for the number of splicing acceptors skipped, number of splicing donors skipped, number of exons skipped, the junction type, whether the donor site is known, whether the acceptor site is known, whether this junction is known, the overlapping transcripts, and the overlapping genes, in addition to the chromosome, start, stop, junction name, junction score, and strand taken from the input BED12 file. This output can be redirected to a file with “-o /PATH/TO/FILE”. By default, single exon transcripts are ignored in the GTF but can be included with the option “-S”. Cis-splice-effects identify This command combines the above utilities into a pipeline for identifying variants which may cause aberrant splicing events by altering splicing motifs in cis. As such, it relies on essentially the same inputs: a gzipped and Tabix-indexed VCF file containing a list of variants, an alignment file containing aligned RNA-seq reads, a GTF file containing the reference transcriptome of interest, and a FASTA file containing the reference genome sequence of interest. First, the list of variants is annotated.
The splice variant window size is set using the options “-e”, “-i”, “-E”, and “-I”, just as in variants annotate. The splice junction region size (i.e. the range around a particular variant in which an overlapping junction is associated with the variant) can be set using “-w ”. By default, this range is not a particular number of bases but is calculated individually for each variant, depending on the variant type annotation. For “splicing_exonic”, “splicing_intronic”, and “exonic” variants, the region extends from the 3’ end of the exon directly upstream of the variant-associated exon to the 5’ end of the exon directly downstream of it. For “intronic” variants, the region is limited to the intron containing the variant. Single-exon transcripts can be kept with the “-S” option. The annotated list of variants in VCF format (analogous to the output of variants annotate) can be written to a file with “-v /PATH/TO/FILE”. The BAM file is then processed in the splice junction regions to produce the list of junctions. A file containing these junctions in BED12 format (analogous to the output of junctions extract) can be written using “-j /PATH/TO/FILE”. The minimum anchor length, minimum intron length, and maximum intron length can be set using “-a”, “-i”, and “-I” options, just as in junctions extract. The list of junctions produced by the preceding step is then annotated with the information presented in junctions annotate. Additionally, each junction is annotated with a list of associated variants (i.e. variants whose splice junction regions overlapped the junction). The final output is written to STDOUT in TSV format (analogous to the output of junctions annotate) or can be redirected to a file with “-o /PATH/TO/FILE”. Cis-splice-effects associate This command is similar to cis-splice-effects identify, but takes the BED output of junctions extract in lieu of an alignment file with RNA alignments.
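The default splice junction region and the variant-junction association used by both cis-splice-effects commands can be sketched as follows. This is an illustrative approximation under stated assumptions, not the RegTools implementation: exons are given as (start, end) pairs sorted by genomic coordinate for a single transcript, a missing neighbouring exon falls back to the boundary of the variant-associated exon, and “overlapping” is interpreted as any interval overlap. All names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def splice_junction_region(
    exons: List[Interval],
    variant_pos: int,
    variant_type: str,
    exon_index: int = 0,
) -> Interval:
    """Default region in which junctions are associated with a variant.

    exon_index is the index of the variant-associated exon and is ignored
    for "intronic" variants, whose region is the containing intron.
    """
    if variant_type == "intronic":
        for (_, upstream_end), (downstream_start, _) in zip(exons, exons[1:]):
            if upstream_end < variant_pos < downstream_start:
                return (upstream_end, downstream_start)
        raise ValueError("variant is not inside an intron of this transcript")
    # splicing_exonic, splicing_intronic, and exonic variants: from the end of
    # the neighbouring exon on one side to the start of the one on the other.
    start = exons[exon_index - 1][1] if exon_index > 0 else exons[exon_index][0]
    if exon_index + 1 < len(exons):
        end = exons[exon_index + 1][0]
    else:
        end = exons[exon_index][1]
    return (start, end)


def associated_junctions(region: Interval, junctions: List[Interval]) -> List[Interval]:
    """Junctions overlapping the region (assumption: any overlap counts)."""
    region_start, region_end = region
    return [j for j in junctions if j[0] < region_end and j[1] > region_start]
```

For a transcript with exons at 100-200, 300-400, and 500-600, a “splicing_exonic” variant in the middle exon gives the region (200, 500), so the junctions 200-300, 400-500, and the exon-skipping junction 200-500 would all be associated with it.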
As with cis-splice-effects identify, each junction is annotated with a list of associated variants (i.e. variants whose splice junction regions overlapped the junction). The resulting output is then the same as cis-splice-effects identify, but limited to the junctions provided as input. Analysis Dataset Description 32 cancer cohorts were analyzed from TCGA. These cancer types are Adrenocortical carcinoma (ACC), Bladder Urothelial Carcinoma (BLCA), Brain Lower Grade Glioma (LGG), Breast invasive carcinoma (BRCA), Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Cholangiocarcinoma (CHOL), Colon adenocarcinoma (COAD), Esophageal carcinoma (ESCA), Glioblastoma multiforme (GBM), Head and Neck squamous cell carcinoma (HNSC), Kidney Chromophobe (KICH), Kidney renal clear cell carcinoma (KIRC), Kidney renal papillary cell carcinoma (KIRP), Liver hepatocellular carcinoma (LIHC), Lung adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Lymphoid Neoplasm Diffuse Large B cell Lymphoma (DLBC), Mesothelioma (MESO), Ovarian serous cystadenocarcinoma (OV), Pancreatic adenocarcinoma (PAAD), Pheochromocytoma and Paraganglioma (PCPG), Prostate adenocarcinoma (PRAD), Rectum adenocarcinoma (READ), Sarcoma (SARC), Skin Cutaneous Melanoma (SKCM), Stomach adenocarcinoma (STAD), Testicular Germ Cell Tumors (TGCT), Thymoma (THYM), Thyroid carcinoma (THCA), Uterine Carcinosarcoma (UCS), Uterine Corpus Endometrial Carcinoma (UCEC), and Uveal Melanoma (UVM). Three cohorts were derived from patients at Washington University in St. Louis.
These cohorts are Hepatocellular Carcinoma (HCC), Oral Squamous Cell Carcinoma (OSCC), and Small Cell Lung Cancer (SCLC). Sample processing We applied RegTools to 35 tumor cohorts. Genomic and transcriptomic data for 32 cohorts were obtained from The Cancer Genome Atlas (TCGA). Information regarding the alignment and variant calling for these samples is described by the Genomic Data Commons data harmonization effort47. Whole exome sequencing (WES) mutation calls for these samples from MuSE48, MuTect249, VarScan250, and SomaticSniper51, were left-aligned, trimmed, and decomposed to ensure the correct representation of the variants across the multiple callers. Samples for the remaining three cohorts, HCC, SCLC, and OSCC, were sequenced at Washington University in St. Louis. Genomic data were produced by WES for SCLC and OSCC and whole genome sequencing (WGS) for HCC. Normal genomic data of the same sequencing type and tumor RNA-seq data were also available for all subjects. Sequence data were aligned using the Genome Modeling System (GMS)52 using TopHat2 for RNA and BWA-MEM53 for DNA. HCC and SCLC were aligned to GRCh37 while OSCC was aligned to GRCh38. Somatic variant calls were made using Samtools v0.1.126, SomaticSniper2 v1.0.251, Strelka V0.4.6.254, and VarScan v2.2.650,54 through the GMS. High-quality mutations for all samples were then selected by requiring that a variant be called by two of the four variant callers. Candidate junction filtering To generate results for 4 splice variant window sizes, we ran cis-splice-effects identify with 4 sets of splice variant window parameters. For our “i2e3” window (RegTools default), to examine intronic variants within 2 bases and exonic variants within 3 bases of the exon edge, we set “-i 2 -e 3”. Similarly, for “i50e5”, to examine intronic variants within 50 bases and exonic variants within 5 bases of the exon edge, we set “-i 50 -e 5”. 
To view all exonic variants, we simply set “-E”, without “-i” or “-e” options. To view all intronic variants, we simply set “-I”, without “-i” or “-e” options. TCGA samples were processed with GRCh38.d1.vd1.fa (downloaded from the GDC reference file page at https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files) as the reference fasta file and gencode.v29.annotation.gtf (downloaded via the GENCODE FTP) as the reference transcriptome. OSCC was processed with Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa and Homo_sapiens.GRCh38.79.gtf (both downloaded from Ensembl). HCC and SCLC were processed with Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa and Homo_sapiens.GRCh37.87.gtf (both downloaded from Ensembl). Statistical filtering of candidate events We refer to a statistical association between a variant and a junction as an “event”. For each event identified by RegTools, a normalized score (norm_score) was calculated for the junction of the event by dividing the number of reads supporting that junction by the sum of all reads for all junctions within the splice junction region for the variant of interest. This metric is conceptually similar to a “percent-spliced in” (PSI) index, but measures the presence of entire exon-exon junctions, instead of just the inclusion of individual exons. If there were multiple samples that contained the variant for the event, then the mean of the normalized scores for the samples was computed (mean_norm_score). If only one sample contained the variant, its mean_norm_score was thus equal to its norm_score.
This value was then compared to the distribution of samples which did not contain the variant to calculate a p-value as the percentage of the norm_scores from these samples which are at least as high as the mean_norm_score computed for the variant-containing samples. We performed separate analyses for events involving canonical junctions (DA) and those involving novel junctions which used at least one known splice site (D/A/NDA), based on annotations in the corresponding reference GTF. For this study, we filtered out any junctions which did not use at least one known splice site (N) and junctions which did not have at least 5 reads of evidence across variant-containing samples. The Benjamini-Hochberg procedure was then applied to the remaining events. Following correction, an event was considered significant if its adjusted p-value was ≤ 0.05. Annotation with GTEx junction data and other splice prediction tools Events identified by RegTools as significant were annotated with information from GTEx, VEP, SpliceAI, MiSplice, and Veridical. GTEx junction information was obtained from the GTEx Portal. Specifically, the exon-exon junction read counts file from the v8 release was used for data aligned to GRCh38 while the same file from the v7 release was used for the data aligned to GRCh37. Mappings between tumor cohorts and GTEx tissues can be found in Supplemental File 7. We annotated all starting variants with VEP in the “per_gene” and “pick” modes.
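The scoring and filtering steps described under “Statistical filtering of candidate events” can be sketched in Python as below. This is a minimal illustration rather than the compare_junctions script itself: the function names are hypothetical, the p-value is expressed as a fraction rather than a percentage, and returning 1.0 when no variant-free samples exist is an assumption. The Benjamini-Hochberg step is the standard step-up adjustment.

```python
from statistics import mean
from typing import List, Sequence


def norm_score(junction_reads: int, region_junction_reads: Sequence[int]) -> float:
    """Reads for one junction divided by reads for all junctions in the region.

    region_junction_reads is assumed to include the junction of interest.
    """
    total = sum(region_junction_reads)
    return junction_reads / total if total else 0.0


def empirical_p_value(mean_norm_score: float, non_variant_scores: Sequence[float]) -> float:
    """Fraction of variant-free samples scoring at least as high as the variant samples."""
    if not non_variant_scores:
        return 1.0  # assumption: no background distribution means no evidence
    hits = sum(1 for score in non_variant_scores if score >= mean_norm_score)
    return hits / len(non_variant_scores)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Example: two variant-containing samples and five variant-free samples.
variant_scores = [norm_score(30, [30, 60, 10]), norm_score(50, [50, 50])]
mean_norm_score = mean(variant_scores)
p_value = empirical_p_value(mean_norm_score, [0.0, 0.1, 0.4, 0.5, 0.05])
```

An event would then be retained if its adjusted p-value from benjamini_hochberg is at most 0.05, matching the threshold used in this study.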
The “per_gene” setting outputs only the most severe consequence per gene while the “pick” setting picks one line or block of consequence data per variant. We considered any variant with at least one splicing-related annotation to be “VEP significant”. All variants were also processed with SpliceAI using the default options. A variant was considered to be “SpliceAI significant” if it had at least one score greater than 0.2, the developers’ value for high recall of their model. Variants identified by MiSplice20 were obtained from the paper supplemental tables and were lifted over to GRCh38. Variants identified by SAVNet23 were obtained from the paper supplemental tables and were lifted over to GRCh38. Variants identified by Veridical21,22 were obtained via download from the link reference within the manuscript and lifted over to GRCh38. Visual exploration of statistically significant candidate events IGV sessions were created for each event identified by RegTools that was statistically significant. Each IGV session file contained a bed file with the junction, a vcf file with the variant, and an alignment file for each sample that contained the variant. Additional information, such as the splice sites predicted by SpliceAI, were also added to these session files to enhance the exploration of these events. Events of interest were manually reviewed in IGV to assess whether the association between the variant and junction made sense in a biological context (e.g. affected a known splice site, altered a genomic sequence to look more like a canonical splice site, or the novel junction disrupted active or regulatory domains of the protein product). An extensive review of literature and visualizations of junction usage in the presence and absence of the variant were also used to identify novel, biologically relevant events. 
Identification of genes with recurrent splice altering variants For each cohort, we calculated a p-value to assess whether the splicing profile from a particular gene was significantly more likely to be altered by somatic variants. Specifically, we performed a 1-tailed binomial test, considering the number of samples in a cohort as the number of attempts. Success was defined by whether the sample had evidence of at least one splice-altering variant in that gene. The null probability of success, pnull was calculated as where s is the total number of base positions residing in any of the gene’s splice variant windows, V is the event that a somatic variant occurred at such a base position, and A is the event that this variant was deemed to be significantly associated with at least one junction in our analysis. The joint probability that both V and A occurred was estimated by dividing the total of events across all samples in which each junction was detected by s. The value of s was computed based on the exon and transcript definitions in the reference GTF used for performing RegTools analyses on a given cohort. We also calculated overall metrics, in order to rank genes. For each set of cohorts (e.g. TCGA-only, MGI-only, combined), an overall p-value was computed for each gene according to the above formula, pooling all of the samples across the included cohorts, and the fraction of samples was simply calculated by dividing the number of samples in which an event occurred within the given gene by the total number of samples, pooled across the included cohorts.
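The 1-tailed binomial test used for the recurrence analysis can be sketched as follows. This is a generic upper-tail binomial computation, not the analysis script used in this study; the function name is hypothetical, and the null probability is taken as an input (p_null) rather than derived. The probability mass is evaluated in log space so that the calculation stays stable for pooled cohorts with thousands of samples.

```python
from math import exp, lgamma, log, log1p


def binomial_upper_tail(successes: int, trials: int, p_null: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p_null).

    trials is the number of samples in the cohort; successes is the number of
    samples with at least one splice-altering variant in the gene.
    """
    if not 0.0 < p_null < 1.0:
        raise ValueError("p_null must lie strictly between 0 and 1")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    log_p = log(p_null)
    log_q = log1p(-p_null)
    tail = 0.0
    for k in range(successes, trials + 1):
        # log of C(trials, k) * p^k * (1 - p)^(trials - k)
        log_pmf = (lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
                   + k * log_p + (trials - k) * log_q)
        tail += exp(log_pmf)
    return min(tail, 1.0)
```

For instance, observing at least one success in two samples with a null probability of 0.5 gives a p-value of 0.75, and a gene would be ranked by how small this tail probability is for its observed number of affected samples.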
The reference GTF used for analyzing the TCGA samples (i.e. gencode.v29.annotation.gtf) was used for all sets of cohorts. Code availability RegTools is open source (MIT license) and available at https://github.com/griffithlab/regtools/. All scripts used in the analyses presented here are also provided. For ease of use, a Docker container has been created with RegTools, R, and Python 3 installed (https://hub.docker.com/r/griffithlab/regtools/). This Docker container allows a user to run the workflow we outline at https://regtools.readthedocs.io/en/latest/workflow/. Docker is an open- source software platform that enables applications to be readily installed and run on any system. The availability of RegTools with all its dependencies as a Docker container also facilitates the integration of the RegTools software into workflow pipelines that support Docker images. Data availability Sequence data for each cohort analyzed in this study are available through dbGaP at the following accession IDs: phs000178 for TCGA cohorts, phs001106 for HCC, phs001049 for SCLC, and phs001623 for OSCC. Statistically significant events for D, A, and NDA junctions across the four variant splicing windows used are available via Supplemental Files 1 and 2. Statistically significant events for DA junctions are available as Supplemental Files 3 and 4. Complete results of gene recurrence analysis are available as Supplemental Files 5 and 6. Acknowledgments We thank the patients and their families for donation of their samples and participation in clinical trials. We would like to thank Donald Conrad for his initial idea to compare to variant effect predictor tools. Kelsy Cotto was supported by Siteman Cancer Center under fund number #3477-92400 and T32CA113275. Avinash Ramu was supported by the ‘Burroughs Wellcome Fund Institutional Program Unifying Population and Laboratory Based Sciences Award’ at Washington University. 
Malachi Griffith was supported by the National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) under Award Number R00HG007940. Malachi Griffith and Obi Griffith were supported by the NIH National Cancer Institute (NCI) under Award Numbers U01CA209936, U01CA231844, U01CA248235, and U24CA237719. Malachi Griffith and Megan Richters were supported by the V Foundation for Cancer Research under Award Number V2018-007. The results published here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. Contributions K.C.C. and Y.-Y.F. were involved in all aspects of this study, including designing methodology, developing and testing the tool software, analyzing and interpreting data, and writing the manuscript, with input from A.R., Z.L.S., M.R., S.F., J.K., O.L.G., and M.G. A.R. designed the tool and led software development efforts. Y.L., W.C.C., R.U., and R.G. provided unpublished tumor datasets and provided critical feedback on the manuscript. O.L.G. and M.G. supervised the study. All authors read and approved the final manuscript. Conflicts of Interest W. Chapman serves on the advisory board for Novartis Pharmaceutical and reports intellectual property with Pathfinder Therapeutics. R. Uppaluri reports grants and personal fees from Merck Inc. R. Govindan served as consultant for Horizon Pharmaceuticals and GenePlus. References 1. Chabot, B. & Shkreta, L. Defective control of pre-messenger RNA splicing in human disease. J. Cell Biol. 212, 13–27 (2016). 2. Vogelstein, B. et al. 
Cancer genome landscapes. Science 339, 1546–1558 (2013). 3. Soemedi, R. et al. Pathogenic variants that alter protein code often disrupt splicing. Nat. Genet. 49, 848–855 (2017). 4. Supek, F., Miñana, B., Valcárcel, J., Gabaldón, T. & Lehner, B. Synonymous mutations frequently act as driver mutations in human cancers. Cell 156, 1324–1335 (2014). 5. Jung, H. et al. Intron retention is a widespread mechanism of tumor-suppressor inactivation. Nat. Genet. 47, 1242–1248 (2015). 6. Venables, J. P. Aberrant and alternative splicing in cancer. Cancer Res. 64, 7647–7654 (2004). 7. Climente-González, H., Porta-Pardo, E., Godzik, A. & Eyras, E. The Functional Impact of Alternative Splicing in Cancer. Cell Rep. 20, 2215–2226 (2017). 8. Chen, J. & Weiss, W. A. Alternative splicing in cancer: implications for biology and therapy. Oncogene 34, 1–14 (2015). 9. Xiong, H. Y. et al. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015). 10. Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol. 11, 377–394 (2004). .CC-BY-NC-ND 4.0 International licensea certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint https://doi.org/10.1101/436634 http://creativecommons.org/licenses/by-nc-nd/4.0/ 19 11. Fairbrother, W. G., Yeh, R.-F., Sharp, P. A. & Burge, C. B. Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007–1013 (2002). 12. Wang, Z. et al. Systematic identification and analysis of exonic splicing silencers. Cell 119, 831–845 (2004). 13. Jaganathan, K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535–548.e24 (2019). 14. 
Kahles, A., Ong, C. S., Zhong, Y. & Rätsch, G. SplAdder: identification, quantification and testing of alternative splicing events from RNA-Seq data. Bioinformatics 32, 1840–1847 (2016). 15. Trincado, J. L. et al. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Genome Biol. 19, 40 (2018). 16. Kahles, A. et al. Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients. Cancer Cell 34, 211–224.e6 (2018). 17. Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018). 18. Monlong, J., Calvo, M., Ferreira, P. G. & Guigó, R. Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat. Commun. 5, 4698 (2014). 19. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016). 20. Jayasinghe, R. G. et al. Systematic Analysis of Splice-Site-Creating Mutations in Cancer. Cell Rep. 23, 270–281.e3 (2018). 21. Viner, C., Dorman, S. N., Shirley, B. C. & Rogan, P. K. Validation of predicted mRNA splicing mutations using high-throughput transcriptome data. F1000Res. 3, (2014). 22. Shirley, B. C., Mucaki, E. J. & Rogan, P. K. Pan-cancer repository of validated natural and cryptic mRNA splicing mutations. F1000Res. 7, 1908 (2018). 23. Shiraishi, Y. et al. A comprehensive characterization of cis-acting splicing-associated .CC-BY-NC-ND 4.0 International licensea certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint https://doi.org/10.1101/436634 http://creativecommons.org/licenses/by-nc-nd/4.0/ 20 variants in human cancer. Genome Res. 28, 1111–1125 (2018). 24. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 
45, 580– 585 (2013). 25. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016). 26. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078– 2079 (2009). 27. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018). 28. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011). 29. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). 30. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015). 31. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). 32. Conway, J. R., Lex, A. & Gehlenborg, N. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33, 2938–2940 (2017). 33. Surget, S., Khoury, M. P. & Bourdon, J.-C. Uncovering the role of p53 splice variants in human malignancy: a clinical perspective. Onco. Targets. Ther. 7, 57–68 (2013). 34. Tokheim, C. & Karchin, R. CHASMplus Reveals the Scope of Somatic Missense Mutations Driving Human Cancers. Cell Syst 9, 9–23.e8 (2019). 35. Bicknell, D. C., Kaklamanis, L., Hampson, R., Bodmer, W. F. & Karran, P. Selection for β2- microglobulin mutation in mismatch repair-defective colorectal carcinomas. Curr. Biol. 6, 1695–1697 (1996). 36. Bonneville, R. et al. Landscape of Microsatellite Instability Across 39 Cancer Types. JCO Precis Oncol 2017, (2017). 37. Kloor, M. et al. Immunoselective pressure and human leukocyte antigen class I antigen .CC-BY-NC-ND 4.0 International licensea certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. 
It is made available under The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint https://doi.org/10.1101/436634 http://creativecommons.org/licenses/by-nc-nd/4.0/ 21 machinery defects in microsatellite unstable colorectal cancers. Cancer Res. 65, 6418– 6424 (2005). 38. Sade-Feldman, M. et al. Resistance to checkpoint blockade therapy through inactivation of antigen presentation. Nat. Commun. 8, 1136 (2017). 39. Seliger, B., Maeurer, M. J. & Ferrone, S. Antigen-processing machinery breakdown and tumor growth. Immunol. Today 21, 455–464 (2000). 40. Güssow, D. et al. The human beta 2-microglobulin gene. Primary structure and definition of the transcriptional unit. J. Immunol. 139, 3132–3138 (1987). 41. Wang, L., Yin, W. & Shi, C. E3 ubiquitin ligase, RNF139, inhibits the progression of tongue cancer. BMC Cancer 17, 452 (2017). 42. Hornbeck, P. V. et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 43, D512–20 (2015). 43. Zhao, R., Choi, B. Y., Lee, M.-H., Bode, A. M. & Dong, Z. Implications of Genetic and Epigenetic Alterations of CDKN2A (p16(INK4a)) in Cancer. EBioMedicine 8, 30–39 (2016). 44. Gump, J., Stokoe, D. & McCormick, F. Phosphorylation of p16 INK4A Correlates with Cdk4 Association. J. Biol. Chem. 278, 6619–6622 (2003). 45. Quinlan, A. R. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr. Protoc. Bioinformatics 47, 11.12.1–34 (2014). 46. Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27, 718–719 (2011). 47. GDC Data Processing. https://gdc.cancer.gov/about-data/gdc-data-processing. 48. Fan, Y. et al. Accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling for sequencing data. bioRxiv 055467 (2016) doi:10.1101/055467. 49. Cibulskis, K. et al. 
Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013). .CC-BY-NC-ND 4.0 International licensea certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint https://doi.org/10.1101/436634 http://creativecommons.org/licenses/by-nc-nd/4.0/ 22 50. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012). 51. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012). 52. Griffith, M. et al. Genome Modeling System: A Knowledge Management Platform for Genomics. PLoS Comput. Biol. 11, e1004274 (2015). 53. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009). 54. Saunders, C. T. et al. Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics 28, 1811–1817 (2012). .CC-BY-NC-ND 4.0 International licensea certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was notthis version posted January 5, 2021. ; https://doi.org/10.1101/436634doi: bioRxiv preprint https://doi.org/10.1101/436634 http://creativecommons.org/licenses/by-nc-nd/4.0/ 23 Main Figures Figure 1: Flexible, streamlined discovery of cis-acting splice variants with RegTools modules and cis-splice-effects identify workflow. A) By default, variants annotate marks variants within 3bp on the exonic side and 2bp on the intronic side of an exon edge as potentially splicing-relevant. 
This “splice variant window” can be modified individually for the exonic side and intronic side using the “-e” and “-i” options, respectively. With cis-splice-effects identify, for each variant in the splice variant window, a “splice junction region” is determined by finding the largest span of sequence space between the exons that flank the exon associated with the splicing-relevant variant. The splice junction region can also be set manually to contain the entire sequence space n bases upstream and downstream of the variant using the “-w” option. Junctions overlapping the splice junction region are associated with the variant. The -E option considers all exonic variants as potentially splicing-relevant, but is otherwise the same. The -I option considers all intronic variants and also limits the splice junction region to the intronic region in which the variant is found, excluding the flanking exons. B) Cis-splice-effects identify and the underlying junctions annotate command annotate splicing events based on whether the donor and acceptor site combination is found in the reference transcriptome GTF. In this example, there are two known transcripts (shown in blue) which overlap a set of junctions from RNA-seq data (depicted as junction supporting reads in red).
Comparing the observed junctions to the reference junctions in the first transcript (top panel), RegTools checks to see if the observed donor and acceptor splice sites are found in any of the reference exons and also counts the number of exons, acceptors, and donors skipped by a particular junction. Double arrows represent matches between observed and reference acceptor/donor sites while single arrows show novel splice sites. These steps are repeated for the rest of the relevant transcripts, keeping track of whether there are known acceptor-donor combinations. Junctions with a known donor but novel acceptor or vice-versa are annotated as “D” or “A”, respectively. If both sites are known but do not appear in combination in any transcripts, the junction is annotated as “NDA”, whereas if both sites are unknown, the junction is annotated as “N”. If the junction is known to the reference GTF, it is marked as “DA”. C) The cis-splice-effects identify command relies on the variants annotate, junctions extract, and junctions annotate submodules. This pipeline takes variant calls and RNA-seq alignments along with genome and transcriptome references and outputs information about novel junctions and associated potential cis splice-altering sequence variants. RegTools is agnostic to downstream research goals and its output can be filtered through user-specific methods and thus can be applied to a broad set of scientific questions.
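The rules in the Figure 1 caption (panels A and B) can be illustrated with a minimal Python sketch. It is not RegTools code: the function names, the tuple representation of a junction, and the 1-based inclusive coordinates are assumptions made for illustration only. The parameters `exonic_bp` and `intronic_bp` stand in for the “-e” and “-i” options, with the defaults of 3 and 2 taken from the caption. Known donor and acceptor sites are derived here from the reference junctions rather than from reference exons, and strand and chromosome are ignored, which simplifies what the caption describes.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: a junction is a (donor_position, acceptor_position)
# pair on one chromosome and strand. None of these names belong to the RegTools API.
Junction = Tuple[int, int]


def in_splice_variant_window(pos: int, exon_start: int, exon_end: int,
                             exonic_bp: int = 3, intronic_bp: int = 2) -> bool:
    """Return True if a variant at `pos` lies in the splice variant window.

    Assumes 1-based inclusive exon coordinates and treats both exon edges as
    splice sites (a simplification for the first and last exon of a transcript).
    """
    # Window around the exon start: intronic bases just before, exonic bases just after.
    near_start = exon_start - intronic_bp <= pos <= exon_start + exonic_bp - 1
    # Window around the exon end: exonic bases just before, intronic bases just after.
    near_end = exon_end - exonic_bp + 1 <= pos <= exon_end + intronic_bp
    return near_start or near_end


def annotate_junction(junction: Junction, reference: Iterable[Junction]) -> str:
    """Label a junction as DA, NDA, D, A, or N using the rules of Figure 1B."""
    known: Set[Junction] = set(reference)
    known_donors = {donor for donor, _ in known}
    known_acceptors = {acceptor for _, acceptor in known}
    donor, acceptor = junction

    if junction in known:
        return "DA"   # donor-acceptor combination present in the reference
    donor_known = donor in known_donors
    acceptor_known = acceptor in known_acceptors
    if donor_known and acceptor_known:
        return "NDA"  # both sites known, but never used together
    if donor_known:
        return "D"    # known donor, novel acceptor
    if acceptor_known:
        return "A"    # known acceptor, novel donor
    return "N"        # both sites novel


if __name__ == "__main__":
    reference_junctions = [(100, 200), (300, 400)]
    for observed in [(100, 200), (100, 400), (100, 250), (150, 400), (150, 250)]:
        print(observed, annotate_junction(observed, reference_junctions))
    print(in_splice_variant_window(1001, exon_start=1000, exon_end=1100))
```

Checking the exact match first keeps “DA” distinct from “NDA”, since in both cases the donor and the acceptor are individually known.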
Figure 2. Overview of input data considered and significant events identified by RegTools for each tumor type. A) Summary of initial variants considered for analysis by RegTools per sample per tumor cohort. Each sample’s variant count is plotted and violin plots are overlaid for each cohort. B) Summary of unique exon-exon junction observations for each sample. Each sample’s unique junction count is plotted and violin plots are overlaid for each cohort. C) Summary of significant junction types for each cohort across each of the variant window sizes that were used in this analysis.
Figure 3. Splice regulatory variants often lead to the expression of multiple alternative junctions. A) A single variant can result in one or more alternatively spliced junctions.
Depicted is a variant resulting in a single novel transcript product (purple), a variant resulting in two novel transcript products that both use alternate donor sites (yellow), and a variant resulting in multiple junctions of different types (teal). B) Stacked bar graph visualizing how often a variant leads to each of the categories mentioned above across the four RegTools variant windows used. This analysis is for all variants that RegTools identified as significant. C) Bar chart showing how often each of the described junction combinations occurs when a single variant results in multiple junction types across each of the RegTools splice variant windows used.
Figure 4. Comparison of RegTools with other tools that identify potential splice altering variants. A) Conceptual diagram contrasting the approaches used by tools/methods that identify splice regulatory variants. A red dot indicates that the source only considers genomic data for making its calls, as opposed to a combination of genomic and transcriptomic data. B) UpSet plot comparing splice altering variants identified by RegTools to those identified by other splice variant predictors and annotators.
Each tool and its total number of variant predictions are shown in the bar graph on the left. The numbers of variants specific to each tool or shared between different combinations of tools are indicated by the bar graph along the top and the individual or connected dots.
Figure 5. Pan-cancer analysis of cohorts from TCGA and MGI reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. Results of analysis for recurrently disrupted genes in each cohort. Columns correspond to the 20 most frequently recurring genes, as ranked by fraction of samples. Genes are clustered by whether they were annotated by the CGC as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. Shading corresponds to −log10(p value) and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort.
Figure 6. Several SNVs in B2M associated with alternate acceptor and alternate donor usage. A) IGV snapshot of three intronic variant positions found to be associated with usage of an alternate acceptor and alternative donor site that leads to formation of novel transcript products. This result was found using the default splice variant window parameter (i2e3). B) Zoomed-in view of the variants identified by RegTools that are associated with alternate acceptor and donor usage. Two of these variant positions flank the affected acceptor site and one flanks the affected donor site. C) Sashimi plot visualizations for samples containing the identified variants that show alternate acceptor usage (red) or alternate donor usage (orange).

Supplemental Figures
Supplementary Figure 1. Benchmarking of each RegTools command.
The total CPU time (System Time + User Time) and real time are plotted against the number of entries processed for each available RegTools function using 10 total replicates. For the cis-splice-effects identify/cis-splice-effects associate/variants annotate workflows, the number of entries corresponds to the number of somatic variants, whereas the number of entries in the junctions extract/junctions annotate/compare_junctions workflows corresponds to the number of reads processed from a downsampled BAM file, the number of junctions processed, and the number of candidate variant junction pairings processed, respectively. For compare_junctions, candidate variant junction pairings were compared across the number of samples in that cohort, with the largest being 1022 samples that comprise our BRCA cohort. LOESS curves are fitted onto each plot.
Supplementary Figure 2. Summary of variants analyzed by RegTools in each tumor cohort. Summary of the starting number of high quality variants per sample, the number of initial variants considered for analysis by RegTools for each variant window used per tumor cohort, and the number of significant variants for each variant window used per tumor cohort.
Supplementary Figure 3. Visualization of junctions across cohorts. Summary of the total junction read counts, unique junctions (all types), unique known (DA) junctions, unique known (DA) junctions not found in GTEx, unique D, A, NDA junctions, and unique D, A, NDA junctions not found in GTEx per sample per cohort.
Supplementary Figure 4: Intronic SNV in CTTN associated with an exon skipping event. A) IGV snapshot of a single nucleotide variant (GRCh38, chr11:g.70407517G>C) within an intron of CTTN in LUAD sample TCGA-86-6851-01A. This variant is associated with an exon skipping event causing the formation of an NDA junction, JUNC00027688, which has 44 reads of support. The variant was identified by RegTools, VEP, and Veridical but no other tools. This result was found using the default splice variant window parameter (i2e3). B) Sashimi plot visualization of the novel junction.
Supplementary Figure 5: Exonic SNV in LZTR1 associated with alternative donor usage.
A) IGV snapshot of a single nucleotide variant (GRCh38, chr22:g.20995026G>C) within an exon of LZTR1 in LUAD sample TCGA-38-4631-01A. This variant is associated with the formation of an A junction, JUNC00075013, which has 49 reads of support. The variant was identified by RegTools, VEP, and SpliceAI but no other tools. This result was found using the default splice variant window parameter (i2e3). B) Sashimi plot visualization of the novel junction.
Supplementary Figure 6. Pan-cancer analysis of cohorts from TCGA and MGI reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. Results of analysis for recurrently disrupted genes in each cohort. A) Rows correspond to the 40 most frequently recurring genes, as ranked by binomial p-value. Genes are clustered by whether they were annotated by the CGC as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. Shading corresponds to −log10(p value) and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond to the 40 most frequently recurring genes, as ranked by fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort.
Supplementary Figure 7. Pan-cancer analysis of cohorts from TCGA and MGI reveals genes recurrently disrupted by variants which promote splicing of particular canonical junctions. Results of analysis for recurrently disrupted genes in each cohort. A) Rows correspond to the 40 most frequently recurring genes, as ranked by binomial p-value. Genes are clustered by whether they were annotated by the CGC as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. Shading corresponds to −log10(p value) and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond to the 40 most frequently recurring genes, as ranked by fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort.
Supplementary Figure 8.
TCGA pan-cancer analysis reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows correspond to the 40 most frequently recurring genes, as ranked by binomial p-value. Genes are clustered by whether they were annotated by the CGC as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. Shading corresponds to −log10(p value) and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond to the 40 most frequently recurring genes, as ranked by fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort.
Supplementary Figure 9. TCGA pan-cancer analysis reveals genes recurrently disrupted by variants which promote splicing of particular canonical junctions. Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows correspond to the 40 most frequently recurring genes, as ranked by binomial p-value. Genes are clustered by whether they were annotated by the CGC as an oncogene (red), an oncogene and tumor suppressor gene (yellow), a tumor suppressor gene (green), or another type of cancer-relevant gene. Shading corresponds to −log10(p value) and columns represent cancer types.
Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort. B) Rows correspond to the 40 most frequently recurring genes, as ranked by fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types. Red marks within cells indicate that the gene was annotated by CHASMplus as a driver within a given TCGA cohort.
Supplementary Figure 10. Analysis of HCC, OSCC, and SCLC cohorts reveals genes recurrently disrupted by variants which cause non-canonical splicing patterns. Results of analysis for recurrently disrupted genes in each MGI cohort. A) Rows correspond to the 3 most frequently recurring genes, as ranked by binomial p-value. Shading corresponds to −log10(p value) and columns represent cancer types. B) Rows correspond to the 3 most frequently recurring genes, as ranked by fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types.
Supplementary Figure 11.
Analysis of HCC, OSCC, and SCLC cohorts reveals genes recurrently disrupted by variants which promote splicing of particular canonical junctions. Results of analysis for recurrently disrupted genes in each TCGA cohort. A) Rows correspond to the 4 most frequently recurring genes, as ranked by binomial p-value. Shading corresponds to −log10(p value) and columns represent cancer types. B) Rows correspond to the 4 most frequently recurring genes, as ranked by fraction of samples. Shading corresponds to the fraction of samples and columns represent cancer types.
Supplementary Figure 12: Intronic SNV in TP53 associated with alternative donor usage. A) IGV snapshot of a single nucleotide variant (GRCh38, chr17:g.7673609C>A) within an intron of TP53 in an OSCC sample. This variant is associated with an exon skipping event with 23 reads of support and an alternate acceptor site usage with 41 reads of support. This result was found using the default splice variant window parameter (i2e3). B) Sashimi plot visualization of the novel junction.
Supplementary Figure 13: Intronic deletion in RNF145 associated with alternative donor usage.
A) IGV snapshot of a single nucleotide variant (GRCh38, chr5:g.159169058delA) within an intron of RNF145 in COAD samples. This variant is associated with an exon skipping event with 8 and 6 reads of support for the samples shown. This result was found using the default splice variant window parameter (i2e3). B) Sashimi plot visualization of the novel junction.
Supplementary Figure 14: Several SNVs in CDKN2A associated with alternate donor usage. A) IGV snapshot of three variant positions in CDKN2A found to be associated with usage of an alternate donor site that leads to formation of an alternate known transcript. This result was found using the default splice variant window parameter (i2e3) for known (DA) junctions. B) Zoomed in view of the variants identified by RegTools that are associated with alternate donor usage. Two of these variant positions flank the donor site that is no longer being used. C) Sashimi plot visualizations for samples containing the identified variants that show alternate donor usage.
10_1101-2021_01_08_425952 ---- rdrugtrajectory: An R Package for the Analysis of Drug Prescriptions in Electronic Health Care Records
Journal of Statistical Software, MMMMMM YYYY, Volume VV, Issue II. doi: 10.18637/jss.v000.i00
Anthony Nash (University of Oxford), Tingyee E. Chang (University of Oxford), Benjamin Wan (King's College London), M. Zameel Cader (University of Oxford)
Abstract
Primary care electronic health care records are rich with patient and clinical information. Studying electronic health care records has resulted in marked improvements to national health care processes and patient-care decision making, and is a powerful supplementary source of data for drug discovery efforts. We present the R package rdrugtrajectory, designed to yield demographic and patient-level characteristics of drug prescriptions in the UK Clinical Practice Research Datalink dataset. The package operates over Clinical Practice Research Datalink Gold clinical, referral and therapy datasets and includes features such as first drug prescription analysis, cohort-wide prescription information, cumulative drug prescription events, the longitudinal trajectory of drug prescriptions, and a survival analysis timeline builder to identify risks related to drug prescription switching. The rdrugtrajectory package has been made freely available via the GitHub repository.
Keywords: EHR, electronic health care records, CPRD, Clinical Practice Research Datalink, prescriptions, R, therapeutics, drug discovery, clinical epidemiology.
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.08.425952; this version posted January 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license (http://creativecommons.org/licenses/by-nd/4.0/).
1. Introduction
The UK Clinical Practice Research Datalink (CPRD) service offers high quality longitudinal data on 50 million patients with up to 20 years of follow-up for 25% of those patients. The service provides drug treatment patterns, feasibility studies and health care resource use studies. Patient electronic health care records (EHR) are stored as coded and anonymised data and sourced from over 1,800 primary care practices across England. CPRD holds information on consultation events, medical diagnoses, symptoms, prescriptions, vaccination history, laboratory tests, and referrals. CPRD can provide routine linkage to other health-related patient datasets, for example: small area level data, such as patient and/or practice postcode-linked deprivation measures; data from NHS Digital, which include Hospital Episode Statistics, outpatient, and accident and emergency data; and cancer data from Public Health England. Evidence from EHRs is making an impact on primary care decision-making and best practice (Oyinlola et al. 2016). With nationwide longitudinal datasets more readily available, the evaluation of treatments over long timescales can contribute to clinical decision-making (Hepp et al. 2017).
For example, adverse events caused by prescription medication can be studied using retrospective data in situations where randomized clinical trials may prove impractical (Ghosh et al. 2019; Bally et al. 2017).
This publication serves as an introduction to the rdrugtrajectory R package. Whilst it is by no means a complete tutorial, we expand on some of the main package features, such as how to: isolate patients by first drug prescriptions at given clinical events; calculate time-invariant prescriptions; construct survival analysis timelines (compatible with Cox proportional hazards regression and Kaplan-Meier curves); and visualise patient prescription switching. For a comprehensive list of functions please visit the GitHub repository https://github.com/acnash/rdrugtrajectory. Almost all features can be controlled by covariates or stratified by some variable, for example, by gender, age, medical codes or treatment product codes.
The example code, figures and data structures presented here mimic a small fraction of our own research. In the interest of patient confidentiality, the clinical data used in the analysis have been fabricated. We present a brief tour of some of the functions available, starting with a discussion of the CPRD data structure and how records must be formatted. A glossary of terms has been provided (Table 1) to assist the reader.
2. rdrugtrajectory package and data structures
2.1. rdrugtrajectory availability and installation
rdrugtrajectory is free to download from the GitHub repository https://github.com/acnash/rdrugtrajectory and holds an MIT license. Fabricated CPRD clinical and CPRD prescription records, in addition to age, gender and index of multiple deprivation scores, are included for test and tutorial purposes.
Before installing the package, the following R dependencies are required: plyr, dplyr, foreach, doParallel, data.table, parallel, splus2R, rlist, reda, ggplot2, ggalluvial, stats, utils and useful. The latest rdrugtrajectory release is installed from the downloaded tar file using:
install.packages("path/to/tar/file", repos = NULL, type = "source")
rdrugtrajectory was developed and tested on R version 4.0.1. Please consult the GitHub page for release notes, the latest version and up-to-date installation instructions.
Table 1: Table of frequently used terms.
Term: Description
rdrugtrajectory: An R package designed for the management of CPRD prescription data.
clinical: The ClinicalNNN.txt dataset presented in a rdrugtrajectory dataframe.
referral: The ReferralNNN.txt dataset presented in a rdrugtrajectory dataframe.
therapy: The TherapyNNN.txt dataset presented in a rdrugtrajectory dataframe.
AdditionalNNN.txt: The CPRD dataset of additional clinical information, for example, patient smoking status and alcohol consumption. Data can be retrieved using CPRDLookups.R.
medcode: A CPRD identifier that denotes medical conditions, diagnoses and complaints made by a patient. medcodes are recorded in the ClinicalNNN.txt and ReferralNNN.txt files.
prodcode: A CPRD identifier that denotes treatment products, including drugs, foods, and medical apparatus. prodcodes are recorded in the TherapyNNN.txt files.
patid: A unique CPRD patient identifier. Used to link datasets.
event: Any prodcode or medcode in a patient's EHR.
eventdate: The date of an event recorded by a general practitioner. Present in all three datasets and the corresponding rdrugtrajectory dataframes.
IMD: Index of Multiple Deprivation score, a UK Government socioeconomic measurement based on the postcode of the clinic or a patient's registered address.
Prescription: A general term for any prodcode prescribed for treatment.
medical history: Indicates a combination of one or more sets of CPRD data, for example, the collection of all clinical and therapy EHR for patients with a medcode for migraine.
product.txt: A plain text file that contains all prodcodes with a description and comes bundled with the CPRD Data Dictionary. The file is used to link a prodcode with a description.
2.2. CPRD product description
Several rdrugtrajectory functions use the CPRD product.txt file for assigning a text description to a prescription prodcode. The product.txt file (and medical.txt for medcode descriptions) is available in the CPRD Data Dictionary Windows software. It is important that the file remains in plain text, with columns tab-delimited. The files can be simplified by removing all non-essential products. Finally, all the eleven columns that make up the product.txt file must be available, with the first column containing all prodcodes and the fourth column containing the product description. A simplified product.txt file, presented below, can be downloaded from the GitHub page.
> library(rdrugtrajectory)
> productDF <- read.csv("../RDrugTrajectory_Data/product.txt",
+                       sep="\t",
+                       header=FALSE)
> head(productDF)
  V1       V2       V3                         V4                          V5
1  5 60153020 14958680      Atenolol 50mg tablets                    Atenolol
2 24 60152020  5354283     Atenolol 100mg tablets                    Atenolol
3 26 67920020  6869099      Atenolol 25mg tablets                    Atenolol
4 49 58950020  4920857 Amitriptyline 25mg tablets Amitriptyline hydrochloride
5 65 68572020  4771731    Lisinopril 10mg tablets                  Lisinopril
6 78 68571020  4006669     Lisinopril 5mg tablets                  Lisinopril
     V6     V7   V8                         V9
1  50mg Tablet Oral                    2040000
2 100mg Tablet Oral                    2040000
3  25mg Tablet Oral                    2040000
4  25mg Tablet Oral 04030100/04070300/04070402
5  10mg Tablet Oral                    2050501
6   5mg Tablet Oral                    2050501
                                                                                  V10
1                                                    Beta-adrenoceptor Blocking Drugs
2                                                    Beta-adrenoceptor Blocking Drugs
3                                                    Beta-adrenoceptor Blocking Drugs
4 Tricyclic And Related Antidepressant Drugs/Neuropathic Pain/Prophylaxis Of Migraine
5                                            Angiotensin-converting Enzyme Inhibitors
6                                            Angiotensin-converting Enzyme Inhibitors
     V11     V12
1 Feb-09 3059002
2 Feb-09 3059001
3 Feb-09 5070002
4 Feb-09 2776002
5 Feb-09 5250003
6 Feb-09 5250002
2.3. rdrugtrajectory package structure
rdrugtrajectory contains three R files: (1) all functions related to data curating and searching reside within PRDDrugTrajectory.R; (2) analysis tools and timeline construction reside within CPRDDrugTrajectoryStats.R; and, (3) all utilities including input/output operations reside within CPRDDrugTrajectoryUtils.R.
The package contains several fabricated CPRD datasets: testClinicalDF, testTherapyDF, ageGenderDF, imdDF, and drugListDF. A description of each, along with information on data types and structures, is given below.
2.4. The CPRD EHR data structure
The structure of CPRD Gold data may depend on whether the CPRD license holder performs intermediate data management steps before releasing data to the user. Typically, however, CPRD Gold data follow the CPRD Gold specification https://cprdcw.cprd.com/_docs/CPRD_GOLD_Full_Data_Specification_v2.0.pdf. Currently, rdrugtrajectory supports EHR data from the flat files ClinicalNNN.txt, ReferralNNN.txt, and TherapyNNN.txt. The Additional Clinical Details files (AdditionalNNN.txt) are currently supported using our released R script CPRDLookups.R https://github.com/acnash/CPRD_Additional_Clinical. Patients are assigned a unique numerical patid value. The operations performed by rdrugtrajectory require the patid to identify patients and subset patient groups. We recommend that patid, medcode and prodcode are kept as character data throughout any preliminary data curating steps. Medical events are recorded as codes and stored in the ClinicalNNN.txt and ReferralNNN.txt files under the column header medcode.
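The recommendation to keep patid, medcode and prodcode as character data can be sketched in base R as follows. This is a minimal illustration only: the file contents are fabricated, the leading zeros are hypothetical, and the dd/mm/yyyy date layout is an assumption about the flat file rather than something stated here.

```r
# Write a tiny fabricated, tab-delimited file standing in for ClinicalNNN.txt.
clinicalFile <- tempfile(fileext = ".txt")
writeLines(c("patid\teventdate\tmedcode\tconsid",
             "0001001\t24/02/2005\t5767\t540850",
             "0001002\t26/01/2006\t0769\t540865"),
           clinicalFile)

# colClasses = "character" stops R from guessing numeric types, so the
# identifiers are kept exactly as they appear in the file.
clinicalDF <- read.delim(clinicalFile,
                         header = TRUE,
                         sep = "\t",
                         colClasses = "character",
                         stringsAsFactors = FALSE)

# Convert the date text into the Date class (assumed dd/mm/yyyy layout).
clinicalDF$eventdate <- as.Date(clinicalDF$eventdate, format = "%d/%m/%Y")
str(clinicalDF)
```

Reading everything as character first and converting only the date column keeps type decisions explicit during preliminary curating.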
Prescription events, such as drug prescriptions, are also recorded as codes and stored in the TherapyNNN.txt file under the column header prodcode, and the sequences of repeat prescriptions are stored under the issueseq column header. Dates associated with medical and prescription events, recorded by the General Practitioner, are stored under the column header eventdate.
2.5. Essential data types and data structures
rdrugtrajectory can operate over CPRD Gold EHR clinical, referral and prescription data provided each dataset is presented as a separate R dataframe or combined into a rdrugtrajectory medical history dataframe. The construction of clinical, referral and prescription dataframes requires, as a minimum, a patid and an eventdate column, and either a medcode or a prodcode column (for therapy data, issueseq is also necessary), presented in that order. Every record of medcode or prodcode must be accompanied by an eventdate entry (encoded as a Date class of the form YYYY-MM-DD). Patients can have duplicate events within the same dataset and between datasets. Medical and prescription codes can be retrieved from the corresponding medical.txt and product.txt files, which come bundled with the CPRD Data Dictionary Windows application. rdrugtrajectory comes packaged with fabricated EHR data in the structure of:
> library(rdrugtrajectory)
> #fabricated clinical data (referral data follows the same format)
> names(testClinicalDF)
[1] "patid" "eventdate" "medcode" "consid"
> #fabricated prescription data
> names(testTherapyDF)
[1] "patid" "eventdate" "prodcode" "consid" "issueseq"
Users can check whether the structure of an EHR dataframe meets the requirements for this package by calling checkCPRDRecord; additional columns such as the consultation identification number (consid) are not considered. In the following instance, a prescription dataset with the required columns and the optional consultation identification number is presented.
> library(rdrugtrajectory)
> #check the structure of testTherapy, specify that it is therapy data
> checkCPRDRecord(df=testTherapyDF, dataType="therapy")
[1] "The data.frame is appropriately formatted. Returning TRUE."
[1] TRUE
> #display the rdrugtrajectory EHR therapy dataframe
> str(testTherapyDF, strict.width="wrap")
'data.frame': 91647 obs. of 5 variables:
 $ patid    : int 3515 3515 3515 3515 3515 3515 3515 3653 3653 3653 ...
 $ eventdate: Date, format: "2005-02-24" "2006-01-26" ...
 $ prodcode : int 83 83 83 707 707 707 707 297 297 297 ...
 $ consid   : int 540850 540865 540892 541108 541114 541118 541133 571336 571345 571357 ...
 $ issueseq : int 0 0 0 0 0 0 0 0 1 2 ...
Users can combine any number of patient-level and EHR variables with the rdrugtrajectory EHR dataframes to act as covariates and stratifying variables; typically this can be done using the R cbind operation. For example, BMI and smoking status, both of which can be retrieved from the AdditionalNNN.txt dataset files using CPRDLookups.R, can be linked by searching for and binding with the record patid values. The rdrugtrajectory package contains several utility functions to retrieve CPRD data, including patient year of birth, gender (male or female) and either patient-level or clinic-level index of multiple deprivation score (IMD).
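Linking covariates by patid can be sketched in base R. Because cbind only lines rows up by position, the sketch below swaps in merge, which matches on the patid values themselves; this is an alternative shown for illustration, not a function of the package. All values are fabricated and the covariate column names (bmi, smoker) are hypothetical.

```r
# Fabricated therapy records (one row per prescription event).
therapyDF <- data.frame(patid = c(1, 1, 2, 3),
                        eventdate = as.Date(c("2005-02-24", "2006-01-26",
                                              "2007-03-10", "2008-11-02")),
                        prodcode = c(83, 83, 707, 297))

# Fabricated covariates (one row per patient, deliberately in a different order).
covariateDF <- data.frame(patid = c(3, 1, 2),
                          bmi = c(31.0, 24.5, 27.2),
                          smoker = c(FALSE, TRUE, FALSE))

# merge() aligns rows on patid, so the row order of covariateDF does not matter.
# all.x = TRUE keeps therapy records for patients without covariate data.
linkedDF <- merge(therapyDF, covariateDF, by = "patid", all.x = TRUE)
linkedDF <- linkedDF[order(linkedDF$patid, linkedDF$eventdate), ]
print(linkedDF)
```

Each prescription record now carries the patient's covariates, which can then be used for stratification.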
The patient age can be determined by adding 1800 to the value in the yob column of the Patient CPRD EHR dataset and then subtracting that value (the birth year) from the year of the CPRD database release. These data require preliminary treatment before being presented to the rdrugtrajectory package. Patient age, gender and IMD score must be presented in a dataframe with the linked patient column patid, along with the columns age, gender, and score. Provided the patid column is preserved, patient characteristics can be presented in separate dataframes, for example:
> library(rdrugtrajectory)
> #patient age and gender as one dataframe
> str(ageGenderDF, strict.width="wrap")
'data.frame': 3838 obs. of 3 variables:
 $ patid : int 1 2 3 4 5 6 7 8 9 10 ...
 $ yob   : num 45 35 33 42 63 57 34 51 51 22 ...
 $ gender: int 2 2 1 2 2 1 2 2 2 1 ...
> #clinic-level IMD score as one dataframe
> str(imdDF, strict.width="wrap")
'data.frame': 2126 obs. of 3 variables:
 $ patid : int 6 11 16 34 42 44 54 60 63 79 ...
 $ pracid: int 184 31 66 344 66 47 18 90 379 317 ...
 $ score : int 1 3 1 4 1 2 1 5 1 2 ...
The patid patient identifier is fundamental to every operation performed by rdrugtrajectory. The examples presented here and those in the reference manual rely on searching and subsetting EHR data using a list or vector of patient identifiers. The function getUniquePatidList will retrieve an R List of patient identification numbers from any dataframe with a patid column.
The aforementioned rdrugtrajectory EHR dataframes, clinical, referral and therapy, can be combined into a single dataframe. We refer to this dataset instance as the patient's medical history; it can be constructed using constructMedicalHistory. This dataframe expects events to be in chronological order, and introduces two new columns, code and codetype, to denote each of the combined events. The code (medcode and/or prodcode) can be distinguished by a codetype value of c (clinical events), r (referral events), and t (prescription events). Events are returned in chronological order using the eventdate data. The following code demonstrates how to retrieve a list of patient identifiers from a prescription dataframe and from a medical history dataframe, followed by how to subset using base R operations and, finally, the medical history dataframe structure.
> library(rdrugtrajectory)
> #Retrieve patids from therapy data.
> idList <- getUniquePatidList(testTherapyDF)
> medHistoryDF <- constructMedicalHistory(testClinicalDF, NULL, testTherapyDF)
[1] "Using clinical data."
[1] "Using therapy data."
[1] "Building with clinical and therapy data."
> #Retrieve patid from medical history.
> medHistoryIDList <- getUniquePatidList(medHistoryDF)
> numOfPatients <- length(medHistoryIDList)
> #Subset using the first 100 patients.
> smallMedHistoryDF <- subset(medHistoryDF,
+                             medHistoryDF$patid %in% medHistoryIDList[1:100])
> #Keep only the clinical records of those 100 patients.
> smallClinicalOnlyDF <- subset(smallMedHistoryDF,
+                               smallMedHistoryDF$codetype == "c")
> #Keep only the therapy records of those 100 patients.
> smallTherapyOnlyDF <- subset(smallMedHistoryDF,
+                              smallMedHistoryDF$codetype == "t")
> #Subset only those patient records after 31st Jan 2010.
> laterMedHistoryDF <- subset(medHistoryDF,
+                             medHistoryDF$eventdate > as.Date("2010-01-31"))
> #Medical history dataframe structure
> str(medHistoryDF, strict.width="wrap")
'data.frame': 103336 obs.
of 4 variables:
 $ patid    : int 1 1 1 1 1 1 1 2 2 3 ...
 $ eventdate: Date, format: "2002-06-07" "2005-07-25" ...
 $ code     : int 5767 5767 5767 707 707 707 707 5767 769 5767 ...
 $ codetype : chr "c" "c" "c" "t" ...
The patid data can also be used to retrieve patient characteristics, for example, the gender of the patient using getGenderOfPatients:
> library(rdrugtrajectory)
> idList <- getUniquePatidList(testTherapyDF)
> #Only use half of the cohort.
> idList <- idList[1:(length(idList)/2)]
> #Get gender data by specific gender.
> maleCode <- 1
> femaleCode <- 2
> malePatientsDF <- getGenderOfPatients(idList, ageGenderDF, maleCode)
> femalePatientsDF <- getGenderOfPatients(idList, ageGenderDF, femaleCode)
> #Get all gender data
> allPatientsDF <- getGenderOfPatients(getUniquePatidList(testTherapyDF),
+                                      ageGenderDF)
> #Structure of the patient gender data.
> str(allPatientsDF, strict.width="wrap")
'data.frame': 3838 obs. of 2 variables:
 $ patid : int 1 2 3 4 5 6 7 8 9 10 ...
 $ gender: int 2 2 1 2 2 1 2 2 2 1 ...
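The age derivation described earlier in this section (adding 1800 to yob, then subtracting the birth year from the database release year) can be sketched in base R. The patient table and the release year below are hypothetical values chosen for illustration; they are not taken from the package's fabricated datasets.

```r
# Fabricated patient table; yob is assumed to be stored as an offset from 1800,
# following the description in this section.
patientDF <- data.frame(patid = c(1, 2, 3),
                        yob = c(165, 172, 150),
                        gender = c(2, 2, 1))

# Hypothetical year of the CPRD database release.
releaseYear <- 2020

# Birth year is the stored offset plus 1800; age is release year minus birth year.
birthYear <- patientDF$yob + 1800
patientDF$age <- releaseYear - birthYear
print(patientDF[, c("patid", "age", "gender")])
```

The resulting dataframe has the patid, age and gender columns that the package expects for patient characteristics.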
IMD data can be retrieved by combining the getUniquePatidList and getIMDOfPatients functions:
> library(rdrugtrajectory)
> idList <- getUniquePatidList(testTherapyDF)
> #Get patients with an IMD score of 1 or 2
> onePatientsDF <- getIMDOfPatients(idList, imdDF, 1)
> twoPatientsDF <- getIMDOfPatients(idList, imdDF, 2)
> #Get all IMD scores for all patients in testTherapyDF
> allPatientsDF <- getIMDOfPatients(getUniquePatidList(testTherapyDF), imdDF)
> #Structure of the patient IMD data.
> str(allPatientsDF, strict.width="wrap")
'data.frame': 2123 obs. of 2 variables:
 $ patid: int 6 11 16 34 42 44 54 60 63 79 ...
 $ score: int 1 3 1 4 1 2 1 5 1 2 ...
The final example of EHR dataframe manipulation presented here demonstrates how to retrieve all prescription records for patients prescribed a specific treatment. For example, such an operation can be used to retrieve all prescription records for any patient prescribed amitriptyline. In addition, it is also possible to return only the prescription records matching specific treatments. Importantly, prescription prodcodes can be grouped into lists and used to collect those patients with at least one record that matches an element of that list. This approach is useful if the dose is not relevant to the study or the prescription is dispensed under multiple product names.
> library(rdrugtrajectory)
> #It is easy to retrieve a list of all unique prodcodes in the cohort.
> prodCodesVector <- unique(testTherapyDF$prodcode)
> reducedProdCodesVector <- prodCodesVector[1:10]
> #All records are maintained for those patients with a matching prodcode.
> therapyOfInterestDF <- getPatientsWithProdCode(testTherapyDF,
+                                                reducedProdCodesVector)
> #Only those records that match are retained.
> reducedTherapyOfInterestDF <- getPatientsWithProdCode(testTherapyDF,
+                                                       reducedProdCodesVector,
+                                                       removeExcessDrugs=TRUE)
3. EHR drug prescription results and discussion
Having briefly demonstrated some basic operations for retrieving patient records by matching EHR dataframes against sets of patid values, we move on to showcase several operations available to the user. We begin by presenting examples of cohort prescription summary statistics, followed by methods of dataset curating and stratifying by patient groups. We then present examples of how to search for patients prescribed a first-line treatment, followed by presenting some of these patient groups as sequences of prescriptions. Finally, we demonstrate several examples of building timelines. For further examples, please see the GitHub page and reference manual.
3.1. Cohort summary statistics
getEventdateSummaryByPatient
rdrugtrajectory can return summary statistics on patient-level and cohort-level prescription data with getEventdateSummaryByPatient and getPopulationDrugSummary, respectively.
For example, the prescription history of a single patient (selected via getUniquePatidList and [] dataframe subsetting) returns the patient patid, the number of prescription events, the median number of days between events, the fewest number of days between events, the greatest number of days between events (maxTime and longestDuration are the same), and the record duration (number of days between the first and last prescription event on record):
> library(rdrugtrajectory)
> idList <- getUniquePatidList(testTherapyDF)
> resultList <- getEventdateSummaryByPatient(
+     testTherapyDF[testTherapyDF$patid==idList[[1]],])
> str(resultList, strict.width="wrap")
List of 2
 $ TimeSeriesList: num [1:6] 336 652 2540 34 42 44
 $ SummaryDF :'data.frame': 1 obs. of 7 variables:
  ..$ patid          : int 3515
  ..$ numberOfEvents : int 7
  ..$ medianTime     : num 190
  ..$ minTime        : num 34
  ..$ maxTime        : num 2540
  ..$ longestDuration: num 2540
  ..$ recordDuration : int 3648
 - attr(*, "class")= chr "EventdateSummaryObj"
getPopulationDrugSummary
This approach can be extended across the cohort of patients with getPopulationDrugSummary. The returned PopulationEventdateSummary S3 object is a list of two elements. The first element is the SummaryDF dataframe derived from calling getEventdateSummaryByPatient per patient, with the set of statistics retrievable through the accompanying patid. The second element is the TimeSeriesList, which holds a vector per patient of the number of days between consecutive prescription events.
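How the per-patient statistics relate to the gaps between consecutive events can be sketched in base R. This is an illustration, not the package's implementation: the event dates are reconstructed from the gaps printed for patient 3515 above (only the first two dates appear in the original output), and the column names simply mirror the printed SummaryDF.

```r
# Reconstruct seven event dates from the first date and the printed gaps.
eventDates <- as.Date("2005-02-24") + cumsum(c(0, 336, 652, 2540, 34, 42, 44))

# Days between consecutive events, after sorting chronologically.
daysBetween <- as.numeric(diff(sort(eventDates)))

# Summary statistics in the spirit of the SummaryDF columns.
summaryDF <- data.frame(
  numberOfEvents = length(eventDates),
  medianTime = median(daysBetween),
  minTime = min(daysBetween),
  maxTime = max(daysBetween),
  recordDuration = as.numeric(difftime(max(eventDates), min(eventDates),
                                       units = "days")))
print(daysBetween)
print(summaryDF)
```

With these dates the sketch gives a median of 190 days, a minimum of 34, a maximum of 2540 and a record duration of 3648 days, matching the values printed for this patient.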
Vectors can be accessed using the patid element name:

> library(rdrugtrajectory)
> resultList <- getPopulationDrugSummary(df = testTherapyDF,
+                                        prodCodesVector = NULL)
> str(resultList, strict.width="wrap", list.len = 5)
List of 2
 $ SummaryDF :'data.frame': 3838 obs. of 7 variables:
  ..$ patid : int [1:3838] 3515 3653 3756 3813 435 553 731 891 1781 1991 ...
  ..$ numberOfEvents : int [1:3838] 7 21 1 1 13 2 15 2 23 79 ...
  ..$ medianTime : num [1:3838] 190 60 0 0 28.5 ...
  ..$ minTime : num [1:3838] 34 34 0 0 11 ...
  ..$ maxTime : num [1:3838] 2540 1623 0 0 322 ...
  .. [list output truncated]
 $ TimeSeriesList:List of 3838
  ..$ 3515: num [1:6] 336 652 2540 34 42 44
  ..$ 3653: num [1:20] 890 222 182 301 539 ...
  ..$ 3756: num 0
  ..$ 3813: num 0
  ..$ 435 : num [1:12] 26 23 24 24 32 322 31 29 11 51 ...
  .. [list output truncated]
 - attr(*, "class")= chr "PopulationEventdateSummary"
> #Get all patids for patients younger than 40.
> ageIDList <- getUniquePatidList(ageGenderDF[ageGenderDF$yob < 40,])
> timeSeriesList <- resultList[[2]]
> #Get all patids of available data.
> recordPatids <- names(timeSeriesList)
> #Get time data for the intersect of those patids of patients < 40 and the patids
> #of available data.
> subTimeList <- timeSeriesList[intersect(ageIDList, recordPatids)]
> str(subTimeList, strict.width="wrap", list.len = 5)
List of 640
 $ 2 : num 0
 $ 3 : num 0
 $ 7 : num 25
 $ 10 : num 0
 $ 15 : num 0
 [list output truncated]

3.2. Curating drug prescription records

There is no direct link between a prescription event and a medcode in the CPRD data. The relationship between the two can be inferred from the event dates of the prescription and clinical events, in addition to the information provided by the consultation ID and the prescription issue number.

matchDrugWithDisease

rdrugtrajectory provides several methods for curating prescription datasets with the aim of establishing a relationship between prescription and clinical events. The matchDrugWithDisease function returns a subset of all prescription events with an established relationship between therapy and clinical event. How strictly patients are matched is controlled with a function argument. There are three scenarios: all patients with a record of a specific prescription event and a specific clinical event, at any point; all patients with a record of a specific prescription event on the same date as a specific clinical event; and all patients with a record of a specific prescription event on the same date as a specific clinical event and clear of additional clinical events on that day.
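To make the three scenarios concrete, the sketch below reproduces them with base R set operations and merge on invented toy data. It is an illustration under stated assumptions, not the logic of matchDrugWithDisease itself: the data, the variable names level1 to level3, and the patient-level handling of the third scenario are all hypothetical.

```r
# Toy data; column names follow the text, codes and dates are invented.
therapy <- data.frame(
  patid     = c(1, 2, 3, 4),
  eventdate = as.Date(rep("2005-03-01", 4)),
  prodcode  = c(83, 83, 83, 999)
)
clinical <- data.frame(
  patid     = c(1, 2, 2, 3, 4),
  eventdate = as.Date(c("2004-01-10", "2005-03-01", "2005-03-01",
                        "2005-03-01", "2005-03-01")),
  medcode   = c(10, 10, 55, 10, 10)
)
drug_codes    <- 83   # prescription of interest (hypothetical code)
disease_codes <- 10   # clinical event of interest (hypothetical code)

drug_events    <- therapy[therapy$prodcode %in% drug_codes, ]
disease_events <- clinical[clinical$medcode %in% disease_codes, ]

# Scenario 1: drug and disease both on record, at any point.
level1 <- intersect(drug_events$patid, disease_events$patid)

# Scenario 2: drug and disease recorded on the same date.
same_day <- merge(drug_events, disease_events, by = c("patid", "eventdate"))
level2 <- unique(same_day$patid)

# Scenario 3: as scenario 2, but with no other clinical event on that date.
# Assumption: a patient is dropped if any matched day has an off-topic event.
other_events <- clinical[!(clinical$medcode %in% disease_codes), ]
busy_days <- merge(same_day, other_events, by = c("patid", "eventdate"))
level3 <- setdiff(level2, busy_days$patid)

print(list(level1 = level1, level2 = level2, level3 = level3))
```

In this toy cohort each stricter scenario returns a subset of the previous one, which is the pattern the text anticipates.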
One would expect fewer patients as the stringency of the search criteria is increased:

> library(rdrugtrajectory)
> prodcodes <- unique(testTherapyDF$prodcode)
> amitriptylineCodes <- prodcodes[1:5]
> propranololCodes <- prodcodes[6:11]
> medcodeList <- unique(testClinicalDF$medcode)
> headacheCodes <- medcodeList[1:10]
> amitriptylineResult1 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                              therapyDF = testTherapyDF,
+                                              medcodeList = headacheCodes,
+                                              drugcodeList = amitriptylineCodes,
+                                              severity = 1)
> amitriptylineResult2 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                              therapyDF = testTherapyDF,
+                                              medcodeList = headacheCodes,
+                                              drugcodeList = amitriptylineCodes,
+                                              severity = 2)
> amitriptylineResult3 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                              therapyDF = testTherapyDF,
+                                              medcodeList = headacheCodes,
+                                              drugcodeList = amitriptylineCodes,
+                                              severity = 3)
> propranololResult1 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                            therapyDF = testTherapyDF,
+                                            medcodeList = headacheCodes,
+                                            drugcodeList = propranololCodes,
+                                            severity = 1)
> propranololResult2 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                            therapyDF = testTherapyDF,
+                                            medcodeList = headacheCodes,
+                                            drugcodeList = propranololCodes,
+                                            severity = 2)
> propranololResult3 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                            therapyDF = testTherapyDF,
+                                            medcodeList = headacheCodes,
+                                            drugcodeList = propranololCodes,
+                                            severity = 3)

getGenderOfPatients

The example above demonstrates how to identify patients prescribed amitriptyline and patients prescribed propranolol (there is patient overlap, which is easily controlled for by subsetting) whilst controlling for clinical overlap, with or without consideration of off-topic clinical events.
With the identified patients, we can, for example, stratify by gender:

> library(rdrugtrajectory)
> library(ggplot2)
> ami1Gender <- getGenderOfPatients(amitriptylineResult1, ageGenderDF)
> ami2Gender <- getGenderOfPatients(amitriptylineResult2, ageGenderDF)
> ami3Gender <- getGenderOfPatients(amitriptylineResult3, ageGenderDF)
> prop1Gender <- getGenderOfPatients(propranololResult1, ageGenderDF)
> prop2Gender <- getGenderOfPatients(propranololResult2, ageGenderDF)
> prop3Gender <- getGenderOfPatients(propranololResult3, ageGenderDF)
> amiDF <- data.frame(Freq=c(nrow(ami1Gender[ami1Gender$gender==1, ]),
+                            nrow(ami2Gender[ami2Gender$gender==1, ]),
+                            nrow(ami3Gender[ami3Gender$gender==1, ]),
+                            nrow(ami1Gender[ami1Gender$gender==2, ]),
+                            nrow(ami2Gender[ami2Gender$gender==2, ]),
+                            nrow(ami3Gender[ami3Gender$gender==2, ])
+                     ),
+                     Search=c("Prescribed","With headache","No comorbidities",
+                              "Prescribed","With headache","No comorbidities"),
+                     Drug="Amitriptyline",
+                     Gender=c("Male","Male","Male",
+                              "Female","Female","Female")
+                     )
> propDF <- data.frame(Freq=c(nrow(prop1Gender[prop1Gender$gender==1, ]),
+                             nrow(prop2Gender[prop2Gender$gender==1, ]),
+                             nrow(prop3Gender[prop3Gender$gender==1, ]),
+                             nrow(prop1Gender[prop1Gender$gender==2, ]),
+                             nrow(prop2Gender[prop2Gender$gender==2, ]),
+                             nrow(prop3Gender[prop3Gender$gender==2, ])
+                      ),
+                      Search=c("At any time","With clinical","Clinical & No comorbidities",
+                               "At any time","With clinical","Clinical & No comorbidities"),
+                      Drug="Propranolol",
+                      Gender=c("Male","Male","Male",
+                               "Female","Female","Female")
+                      )
> drugPrescriptionDF <- rbind(amiDF, propDF)
> ggPrescriptionAmi <- ggplot(drugPrescriptionDF[
+                               drugPrescriptionDF$Drug=="Amitriptyline",],
+                             aes(x=Search, y=Freq, fill=Gender)) +
+   geom_bar(stat="identity", position=position_dodge()) +
+   theme_bw() + xlab("Search criteria (severity)") + ylab("Patient count") +
+   theme(axis.text.x = element_text(angle=45,hjust=1)) +
+   ggtitle("Amitriptyline")
> ggPrescriptionProp <- ggplot(drugPrescriptionDF[
+                                drugPrescriptionDF$Drug=="Propranolol",],
+                              aes(x=Search, y=Freq, fill=Gender)) +
+   geom_bar(stat="identity", position=position_dodge()) +
+   theme_bw() + xlab("Search criteria (severity)") + ylab("Patient count") +
+   theme(axis.text.x = element_text(angle=45,hjust=1)) +
+   ggtitle("Propranolol")

Figure 1: The number of patients prescribed (A) amitriptyline or (B) propranolol, shown as patient count by search criteria (severity) and gender. The criteria used to match against clinical data are indicated: at any time, with a clinical record, and with a clinical record clear of off-topic clinical events.

Filtering through prescription events can also be controlled by a date range. For example, to calculate the number of patients prescribed amitriptyline per year from 2000 to 2004 and matched to a headache event, one can apply a date range:

> library(rdrugtrajectory)
> library(ggplot2)
> prodcodes <- unique(testTherapyDF$prodcode)
> amitriptylineCodes <- prodcodes[1:5]
> #Clinical events of interest are headaches.
> medcodeList <- unique(testClinicalDF$medcode)
> #Medcodes can be refined further.
> headacheCodes <- medcodeList[1:10]
> #Dataframes defined for binned dates are constructed by providing all the
> #patients to consider and the binned start and stop date.
> date2000DF <- data.frame(patid=unlist(getUniquePatidList(testTherapyDF)),
+                          start=as.Date("2000-01-01"),
+                          stop=as.Date("2000-12-31"))
> date2001DF <- data.frame(patid=unlist(getUniquePatidList(testTherapyDF)),
+                          start=as.Date("2001-01-01"),
+                          stop=as.Date("2001-12-31"))
> date2002DF <- data.frame(patid=unlist(getUniquePatidList(testTherapyDF)),
+                          start=as.Date("2002-01-01"),
+                          stop=as.Date("2002-12-31"))
> date2003DF <- data.frame(patid=unlist(getUniquePatidList(testTherapyDF)),
+                          start=as.Date("2003-01-01"),
+                          stop=as.Date("2003-12-31"))
> date2004DF <- data.frame(patid=unlist(getUniquePatidList(testTherapyDF)),
+                          start=as.Date("2004-01-01"),
+                          stop=as.Date("2004-12-31"))
> #Retrieve prescription frequencies per binned range
> amitResult2000 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                        therapyDF = testTherapyDF,
+                                        medcodeList = headacheCodes,
+                                        drugcodeList = amitriptylineCodes,
+                                        severity = 1,
+                                        dateDF = date2000DF)
> amitResult2001 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                        therapyDF = testTherapyDF,
+                                        medcodeList = headacheCodes,
+                                        drugcodeList = amitriptylineCodes,
+                                        severity = 1,
+                                        dateDF = date2001DF)
> amitResult2002 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                        therapyDF = testTherapyDF,
+                                        medcodeList = headacheCodes,
+                                        drugcodeList = amitriptylineCodes,
+                                        severity = 1,
+                                        dateDF = date2002DF)
> amitResult2003 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                        therapyDF = testTherapyDF,
+                                        medcodeList = headacheCodes,
+                                        drugcodeList = amitriptylineCodes,
+                                        severity = 1,
+                                        dateDF = date2003DF)
> amitResult2004 <- matchDrugWithDisease(clinicalDF = testClinicalDF,
+                                        therapyDF = testTherapyDF,
+                                        medcodeList = headacheCodes,
+                                        drugcodeList = amitriptylineCodes,
+                                        severity = 1,
+                                        dateDF = date2004DF)
> #The number of patids returned by matchDrugWithDisease is equal to the number
> #of patients with a drug - disease match per year
> dataDF <- data.frame(Year=c("2000","2001","2002","2003","2004"),
+                      Count=c(length(amitResult2000),length(amitResult2001),
+                              length(amitResult2002),length(amitResult2003),
+                              length(amitResult2004)))
> ggPrescriptionYear <- ggplot(dataDF, aes(x=Year, y=Count)) +
+   geom_bar(stat = "identity") + theme_bw()

getPatientsWithFirstDrugWithDisease

Unlike matchDrugWithDisease, which retrieves patients with a prescription event matching clinical criteria at any time within a CPRD EHR record, getPatientsWithFirstDrugWithDisease identifies patients with a first prescription event that matches a desired clinical event. Please note, care must be taken when searching for medication with off-label uses.
For example, beta-blockers are frequently prescribed to treat hypertension and arrhythmia; however, the beta-blocker propranolol is also prescribed to treat migraine. Without in-depth analysis of the patient history, propranolol patients with records for hypertension or arrhythmia in addition to migraine on the same eventdate as the first propranolol prescription could result in a misleading disease-drug association.

Figure 2: The number of patients prescribed amitriptyline from the start of the year 2000 to the end of 2004, stratified in year intervals (count by year).

In cases where a health care professional suggests a change in the patient's lifestyle choices, that patient may have several clinical events free from prescriptions before the first prescription of interest is issued. Using basic subsetting, one can calculate the number of clinical events before the patient's first prescription intervention (Figure 3 A). Furthermore, we can stratify patients into subgroups (Figure 3 B):

> library(rdrugtrajectory)
> library(ggplot2)
> #A vector of prescriptions of interest.
> drugList <- unique(testTherapyDF$prodcode)
> sampleDrugs <- drugList[1:8]
> #A vector of clinical events to match prescriptions against.
> medCodes <- unique(testClinicalDF$medcode)
> sampleMedCodes <- medCodes[1:30]
> #Returns the subset of the first prescription event prescribed on the same
> #eventdate as those clinical events of interest
> firstDF <- getPatientsWithFirstDrugWithDisease(clinicalDF = testClinicalDF,
+                                                therapyDF = testTherapyDF,
+                                                medCodesVector = sampleMedCodes,
+                                                drugCodesVector = sampleDrugs)
> #Keep only clinical data for patients with an assumed first drug-disease match
> firstClinicalDF <- subset(testClinicalDF,
+                           testClinicalDF$patid %in% getUniquePatidList(firstDF))
> #Only keep the diseases of interest
> firstClinicalDF <- subset(firstClinicalDF,
+                           firstClinicalDF$medcode %in% sampleMedCodes)
> #Only keep the prescriptions of interest
> firstDF <- subset(firstDF, firstDF$prodcode %in% sampleDrugs)
> idList <- getUniquePatidList(firstClinicalDF)
> beforeResultDF <- data.frame(patid=unlist(idList), Freq=0)
> for(id in idList) {
+   #Retrieve the clinical/therapy data for each patient, one by one.
+   indClinicalDF <- subset(firstClinicalDF, firstClinicalDF$patid == id)
+   indTherapyDF <- subset(firstDF, firstDF$patid == id)
+   #Get the first event date on record; this will match a clinical date.
+   firstEventDate <- indTherapyDF$eventdate[1]
+   clinicalBeforeTherapyDF <- subset(indClinicalDF,
+                                     indClinicalDF$eventdate < firstEventDate)
+   #Number of clinical complaints before first prescription.
+   nComplaints <- nrow(clinicalBeforeTherapyDF)
+   beforeResultDF[beforeResultDF$patid==id,]$Freq <- nComplaints
+ }
> ggBefore <- ggplot(beforeResultDF, aes(x=Freq)) +
+   geom_histogram(binwidth=1, color="black", fill="white") +
+   ylab("Patients") + xlab("Clinical events before prescription") +
+   theme_bw()
> #Note: not every patient will have a clinical IMD score.
> imdIDsDF <- getIMDOfPatients(idList = idList,
+                              imdDF = imdDF)
> #Only work with those with an IMD score.
> imdResultsDF <- subset(beforeResultDF,
+                        beforeResultDF$patid %in% getUniquePatidList(imdIDsDF))
> imdResultsDF <- imdResultsDF[order(imdResultsDF$patid),]
> imdIDsDF <- imdIDsDF[order(imdIDsDF$patid),]
> imdResultsDF <- cbind(imdResultsDF, IMD_score=as.factor(imdIDsDF$score))
> ggBeforeIMD <- ggplot(imdResultsDF,
+                       aes(x=Freq, fill=IMD_score)) +
+   geom_histogram(binwidth=1) + theme_bw() +
+   ylab("Patients") + xlab("Clinical events before prescription")

Figure 3: The number of clinical events before the first treatment across the whole cohort (A), and by IMD score (B).

getMultiPrescriptionSameDayPatients

The function getMultiPrescriptionSameDayPatients returns all prescription events for those patients prescribed more than two prescriptions on the same date.
All events of patients without a prescription event for the specified prodcodes can be removed. Combining getMultiPrescriptionSameDayPatients with getPatientsWithFirstDrugWithDisease or matchDrugWithDisease is useful for filtering patients for specific prescription patterns. For example, to retrieve all patient prescription records where specific prescriptions are (a) never recorded together on the same date and (b) used as a first-line treatment for a given complaint:

> library(rdrugtrajectory)
> prodcodesVector <- unique(testTherapyDF$prodcode)[1:8]
> #Ensure only patients with specific prescriptions are returned, provided a
> #patient is prescribed those drugs on different dates, never on the same date.
> uniqueTherapyDF <- getMultiPrescriptionSameDayPatients(df = testTherapyDF,
+                                        prodCodesVector = prodcodesVector,
+                                        removePatientsWithoutDrugs = TRUE)
> #Ensure that the patients (patid) in the therapy and clinical dataframes
> #are the same. Subsetting might not be enough.
> reducedClinicalDF <- subset(testClinicalDF,
+              testClinicalDF$patid %in% getUniquePatidList(uniqueTherapyDF))
> #Specific medcodes have not been provided. All medcodes in the clinical
> #dataframe are considered. This is possible if one is either not interested
> #in the nature of the clinical complaint or the clinical dataframe has been
> #adjusted to only include clinical complaints of interest.
> firstDF <- getPatientsWithFirstDrugWithDisease(clinicalDF = reducedClinicalDF,
+                                                therapyDF = uniqueTherapyDF,
+                                                drugCodesVector = prodcodesVector)

In the above example, patients with more than one prescription on the same date or without a prescription at all (from the set of desired prescription prodcodes) were removed from the cohort. This reduced the number of patients from 3838 to 2930. Next, only those patients with a first-line treatment (first prescription event on the same date as a clinical event) were kept, reducing the sample size to 587 patients.

removePatientsByDuration

Longitudinal EHR cohort studies often require careful time-related consideration. Currently, rdrugtrajectory provides two functions that identify prescription records of patients that match two time constraints. The first, removePatientsByDuration, removes all patients with prescription events that are no more than n years between consecutive events, or removes patients if the duration between the first and last prescription event on record is less than n years.

> library(rdrugtrajectory)
> df <- removePatientsByDuration(minObsYr = 5,
+                                minBreakYr = 2,
+                                therapyDF = testTherapyDF)

getBurnInPatients

The second time-related function, getBurnInPatients, identifies all patient prescription records with at least n days free from prescription events before a specific prescription event.
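The burn-in idea can be sketched in base R as follows. This is an illustrative approximation on invented toy data rather than the behaviour of getBurnInPatients: the helper has_burn_in is hypothetical, and requiring at least one earlier event on record is an assumption about how a prescription-free period is established.

```r
# Toy prescription records; codes and dates are invented.
therapy <- data.frame(
  patid     = c(1, 1, 1, 2, 2, 3),
  eventdate = as.Date(c("2003-01-01", "2003-02-01", "2004-01-01",
                        "2003-01-01", "2003-03-01", "2003-05-01")),
  prodcode  = c(5, 5, 297, 5, 297, 297)
)
start_codes  <- 297   # prescription of interest (hypothetical code)
burn_in_days <- 172   # required prescription-free period in days

# Hypothetical helper: TRUE if the first prescription of interest is preceded
# by an earlier event and by at least burn_in_days without any prescription.
has_burn_in <- function(events) {
  events <- events[order(events$eventdate), ]
  idx <- which(events$prodcode %in% start_codes)[1]
  # Assumption: an earlier event is needed to show the record was active.
  if (is.na(idx) || idx == 1) return(FALSE)
  gap <- as.numeric(events$eventdate[idx] - events$eventdate[idx - 1])
  gap >= burn_in_days
}

keep <- sapply(split(therapy, therapy$patid), has_burn_in)
burn_in_patids <- names(keep)[keep]
print(burn_in_patids)
```

Here patient 1 qualifies (334 days without a prescription), patient 2 does not (59 days), and patient 3 has no earlier event from which a prescription-free period could be measured.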
This is useful if one requires a period of time free from prescription intervention before a given prescription event:

> library(rdrugtrajectory)
> drugOfInterestVector <- c(83,49,297,1888,940,5)
> patientList <- getBurnInPatients(df = testTherapyDF,
+                                  startCodesVector = drugOfInterestVector,
+                                  periodDaysBefore = 172)
> burnInTherapyDF <- subset(testTherapyDF,
+                           testTherapyDF$patid %in% patientList)

In the above example, from a cohort of 3838 patients, 426 patients had a period of at least 172 days free from prescription events before the first prescription prodcode specified via the startCodesVector argument. The functionality relies on the patient having prescription events before the burn-in period (required to define whether the patient had a CPRD record early enough before the burn-in period began). For example, this patient had over three years of prescription events before the prescription of interest (from 2003-05-29 to 2007-10-17), with over 172 days free from exposure before the prescription event of interest, prodcode 297:

> head(burnInTherapyDF[burnInTherapyDF$patid == 332412,], n=9)
[1] patid eventdate prodcode consid issueseq
<0 rows> (or 0-length row.names)

3.3. First drug prescriptions

getFirstDrugPrescription

A patient's first prescription event on CPRD record can be identified by supplying getFirstDrugPrescription with a list of prescription prodcodes. The function returns FirstDrugObject, an R S3 object of type List.
Only the first prescription event to match any one of the prescription prodcodes provided is identified. The first element of FirstDrugObject contains a named list of patid vectors. Each vector contains the patids of all those patients that share the same first prescription prodcode. The list element is named after the corresponding prescription prodcode. The second element in FirstDrugObject, like the first, is a list of Date vectors, each named after the corresponding prescription prodcode. Each Date vector contains the eventdate of the prescription event for the patient identified by the patid in the identical position of the preceding List. The third list element contains a table of prescription frequencies for each first prescription prodcode on record. The prodcode is accompanied by a product description, provided a file of CPRD prescription products has been supplied. Below we demonstrate how to retrieve information on first-line treatment:

> library(rdrugtrajectory)
> library(ggplot2)
> #An adjusted data dictionary file.
> fileLocation <- "product.txt"
> #Without supplying a vector of prodcodes all prodcodes in the therapy
> #dataset are considered.
> resultFDO <- getFirstDrugPrescription(df = testTherapyDF,
+                                       idList = NULL,
+                                       prodCodesVector = NULL,
+                                       descriptionFile = fileLocation)
> patidList <- resultFDO[[1]]
> eventdateList <- resultFDO[[2]]
> drugFrequencyDF <- resultFDO[[3]]
> drugFrequencyDF <- drugFrequencyDF[order(drugFrequencyDF$Frequency,
+                                          decreasing = TRUE), ]
> ggFreq <- ggplot(data=drugFrequencyDF, aes(x=description, y=Frequency)) +
+   geom_bar(stat="identity") + theme_bw() +
+   theme(axis.text.x = element_text(angle=45, hjust=1)) +
+   xlab("Drug product description")
> #The structure of the FirstDrugObject.
> str(resultFDO, strict.width="wrap", list.len = 5)

Figure 4: The frequency of first line treatment prescription (frequency by drug product description).

getAgeGroupByEvents

In the next example we explore stratifying first-line prescription events by patient characteristics, such as age, gender, IMD, and number of medcodes (for instance, by comorbidities) or prodcodes (for instance, to separate those patients by additional prescriptions), or by any additional clinical event retrieved using CPRDLookups.R (?). rdrugtrajectory provides several utility functions to stratify patients (see the reference manual for further information). The function getAgeGroupByEvents calculates the number of first-line prescription events by patient age.
By specifying a set of patids and eventdates from the FirstDrugObject, we can calculate the number of first-line prescriptions by age-group for patients linked with a specified medical condition:

> library(rdrugtrajectory)
> fileLocation <- "product.txt"
> resultFDO <- getFirstDrugPrescription(df = testTherapyDF,
+                                       idList = NULL,
+                                       prodCodesVector = NULL,
+                                       descriptionFile = fileLocation)
> patidList <- resultFDO[[1]]
> eventdateList <- resultFDO[[2]]
> names(ageGenderDF) <- c("patid","age","gender")
> #The age-groups: [18,25), [25,30), [30,35), ..., [60,60+).
> ageGroupVector <- c(18,25,30,35,40,45,50,55,60)
> #CPRD database release year.
> ageAtYear <- "2017"
> ageGroupList <- getAgeGroupByEvents(idList = as.list(patidList[1:2]),
+                                     eventdateList = eventdateList[1:2],
+                                     ageDF = ageGenderDF,
+                                     ageGroupVector = ageGroupVector,
+                                     ageAtYear = ageAtYear)
> ageGroupList
[[1]]
  18-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60+
1   103    94   106   131   165   182   153   185 240

[[2]]
  18-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60+
1    45    39    35    23    43    34    32    18  25

In the above example, the age of each patient (ageDF) was derived from the year of birth, calculated against the release year of the CPRD Gold database (explained above). By providing the database release year (in ageAtYear) and the first prescription eventdate (in eventdateList), the age of each patient is adjusted against the prescription eventdate year.
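The age adjustment and binning can be illustrated with base R's cut function. The sketch below uses invented toy data and hypothetical variable names; it assumes that the stored age refers to the database release year and that the age at the prescription event is obtained by subtracting the years between the event and the release. It is not the code of getAgeGroupByEvents.

```r
# Toy data: age as recorded at the database release year (values invented).
release_year <- 2017
patients <- data.frame(patid = c(1, 2, 3, 4), age = c(27, 42, 67, 22))
first_rx <- data.frame(
  patid     = c(1, 2, 3, 4),
  eventdate = as.Date(c("2010-05-01", "2012-07-15", "2015-01-20", "2014-03-03"))
)
age_breaks <- c(18, 25, 30, 35, 40, 45, 50, 55, 60)

# Assumption: age at the event = age at release minus years elapsed since event.
event_year   <- as.integer(format(first_rx$eventdate, "%Y"))
age_release  <- patients$age[match(first_rx$patid, patients$patid)]
age_at_event <- age_release - (release_year - event_year)

# Labels "18-24", "25-29", ..., "55-59", "60+" for left-closed intervals.
labels <- c(paste(head(age_breaks, -1), age_breaks[-1] - 1, sep = "-"),
            paste0(tail(age_breaks, 1), "+"))
age_group <- cut(age_at_event, breaks = c(age_breaks, Inf),
                 labels = labels, right = FALSE)

counts <- table(age_group)
print(counts)
```

Setting right = FALSE makes each interval include its lower bound and exclude its upper bound, which matches the [18,25), [25,30), ... groups listed in the comment of the example above.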
Finally, by using a list slice on idList and eventdateList (individual prescriptions can be specified using their prodcode, for example, eventdateList$`105`), first-prescription frequencies by age-group are retrievable (Figure 5).

> library(ggplot2)
> ageGroupDrugDF <- data.frame(Age=names(ageGroupList[[1]]),
+                              Count=unlist(ageGroupList[[1]]),
+                              Drug="Amitriptyline 10mg")
> ggAmitriptyline <- ggplot(ageGroupDrugDF, aes(x=Age, y=Count)) +
+   geom_bar(stat="identity") +
+   theme_bw() + ggtitle("Amitriptyline 10mg") +
+   theme(axis.text.x = element_text(angle=45, hjust=1)) +
+   xlab("Age-group") + ylab("Frequency")

Figure 5: The distribution of Amitriptyline 10mg as a first-line treatment by age-group (frequency by age-group).

3.4. Prescription sequences

mapDrugTrajectory

Identifying patient prescription trajectories in longitudinal EHRs remains our biggest motivator behind the development of rdrugtrajectory. Therefore, we developed mapDrugTrajectory to identify the chronological order of patient prescription events. We restrict the calculation to only look for prescription prodcodes as supplied to groupingList as a named list (named prodcode vectors). The required number of grouped prescription events is set with minDepth, and the maximum number of those changes to display is set with maxDepth.
By keeping minDepth and maxDepth the same, only patients with a valid number of prescription changes are displayed (Figure 6 (A) and (C)). Patient records with fewer than minDepth changes to prescription sequences are ignored (Figure 6 (B)). For further information please refer to the reference manual.

Figure 6: The distribution of grouped prodcodes across three patients. (A) Five groups of valid prescription prodcodes, (B) only three groups, (C) five valid groups, in addition to prodcodes 101 and 1 which are ignored.

In the code below, mapDrugTrajectory returns patients with at least five grouped prescriptions. Prodcodes that have not been grouped are ignored. Duplication of prodcodes (those from the same group) does not count as a change in treatment:
> library(rdrugtrajectory)
> library(ggplot2)
> library(ggalluvial)
> structureList <- list(Amitriptyline = c(83,49,1888),
+                       Propranolol = c(707,297,769),
+                       Topiramate = c(11237),
+                       Venlafaxine = c(470,301,39359),
+                       Lisinopril = c(78,65,277),
+                       Atenolol = c(5,24,26),
+                       Candesartan = c(531)
+                       )
> resultList <- mapDrugTrajectory(df = testTherapyDF,
+                                 minDepth = 5,
+                                 maxDepth = 5,
+                                 groupingList = structureList,
+                                 removeUndefinedCode = TRUE)
> df <- resultList[[3]]
> ggSwitch <- ggplot(df,
+                    aes(y = Freq, axis1 = FirstDrug, axis2 = Switch1,
+                        axis3 = Switch2, axis4 = Switch3, axis5 = Switch4)) +
+   geom_alluvium(aes(fill = FirstDrug), width = 1/12) +
+   geom_stratum(width = 1/12, fill = "black", color = "grey") +
+   geom_label(stat = "stratum", infer.label = TRUE) +
+   scale_fill_brewer(type = "qual", palette = "Set1") +
+   theme_bw() + theme(legend.position = "none") +
+   scale_x_discrete(limits = c("First Drug", "1st Switch", "2nd Switch",
+                               "3rd Switch","4th Switch"),
+                    expand = c(.05, .05)) +
+   ggtitle("Migraine Preventative Switching Among Patients")
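The grouping and switching rules described in this section can be sketched in base R with rle, which collapses runs of repeated values. The example below is a simplified illustration on invented toy data and is not the implementation of mapDrugTrajectory: the helper trajectory, the lookup code_to_group, and the choice to drop ungrouped prodcodes before collapsing repeats are all assumptions.

```r
# Hypothetical grouping of prodcodes into drugs (a subset of codes, for brevity).
grouping <- list(Amitriptyline = c(83, 49),
                 Propranolol   = c(707, 297),
                 Topiramate    = c(11237))

# Toy prescription records; dates are invented, prodcode 101 is ungrouped.
therapy <- data.frame(
  patid     = c(1, 1, 1, 1, 1, 2, 2, 2),
  eventdate = as.Date(c("2001-01-01", "2001-02-01", "2001-03-01",
                        "2001-04-01", "2001-05-01",
                        "2001-01-01", "2001-02-01", "2001-03-01")),
  prodcode  = c(83, 49, 101, 707, 11237, 297, 297, 83)
)

# Lookup from prodcode (as character) to group name.
code_to_group <- setNames(rep(names(grouping), lengths(grouping)),
                          as.character(unlist(grouping, use.names = FALSE)))

# Hypothetical helper: ordered sequence of drug groups for one patient.
trajectory <- function(events, min_depth, max_depth) {
  events <- events[order(events$eventdate), ]
  groups <- code_to_group[as.character(events$prodcode)]
  groups <- unname(groups[!is.na(groups)])   # ignore ungrouped prodcodes
  steps  <- rle(groups)$values               # repeats are not a switch
  if (length(steps) < min_depth) return(NULL)
  head(steps, max_depth)
}

trajectories <- lapply(split(therapy, therapy$patid), trajectory,
                       min_depth = 3, max_depth = 3)
trajectories <- Filter(Negate(is.null), trajectories)
print(trajectories)
```

Patient 1 is kept with the sequence Amitriptyline, Propranolol, Topiramate, while patient 2 has only two distinct steps and is dropped because that is below min_depth.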
; https://doi.org/10.1101/2021.01.08.425952doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425952 http://creativecommons.org/licenses/by-nd/4.0/ Journal of Statistical Software 25 Venlafaxine Propranolol Lisinopril Atenolol Amitriptyline Candesartan Venlafaxine Propranolol Lisinopril Atenolol Amitriptyline TopiramateCandesartan Venlafaxine Propranolol Lisinopril Atenolol Amitriptyline Topiramate Candesartan Venlafaxine Propranolol Lisinopril Atenolol Amitriptyline TopiramateCandesartan Venlafaxine Propranolol Lisinopril Atenolol Amitriptyline 0 100 200 300 First Drug 1st Switch 2nd Switch 3rd Switch 4th Switch F re q Migraine Preventative Switching Among Patients Figure 7: Prescription pattern switching of seven different migraine preventatives. A patient required a a minimum of five changes in prescriptions (including the initial prescription) and, equally, the display was set to five changes in prescription. 3.5. Prescription timeline construction rdrugtrajectory contains several functions that transforms patient data into a format com- patible with mean cumulative function (MCF) semi-parametric estimates, prescription per- sistence, prescription incidence, and survival analysis. generateMCFOneGroup Prescription events are binned into weekly units to increase the statistical power at each time point. The user presents a group at a time, for example, all clinical events of male patients with a first-line prescription of amitriptyline for a migraine. The clinical data has already been refined using the steps for first-line prescription, as described above. The function generateMCFOneGroup accepts a dataframe or events, the MCF start date (eventdates are adjusted so all patient records in the dataset begin at the same time), and the minimum number of events per patients (by default this is two events). 
The following example presents the calculation of first prescription events, the assignment of gender and the calculation of the MCF of prescription (therapy dataframe) burden of amitriptyline and propranolol:

> library(rdrugtrajectory)
> fileLocation <- "product.txt"
> resultList <- getFirstDrugPrescription(df = testTherapyDF,
+     idList = NULL,
+     prodCodesVector = NULL,
+     descriptionFile = fileLocation)
> patidList <- resultList[[1]]
> eventdateList <- resultList[[2]]
> drugFrequencyDF <- resultList[[3]]
> drugFrequencyDF <- drugFrequencyDF[order(drugFrequencyDF$Frequency,
+     decreasing = TRUE), ]
> amitriptylinePatid <- patidList$`83`
> propranololPatid <- patidList$`707`
> maleCode <- 1
> malePatidsDF <- getGenderOfPatients(idList = getUniquePatidList(testTherapyDF),
+     genderDF = ageGenderDF,
+     genderCodeVector = maleCode)
> amitriptylineMalePatids <- subset(amitriptylinePatid,
+     amitriptylinePatid %in% malePatidsDF$patid)
> propranololMalePatids <- subset(propranololPatid,
+     propranololPatid %in% malePatidsDF$patid)
> amiMaleTherapyDF <- subset(testTherapyDF,
+     testTherapyDF$patid %in% amitriptylineMalePatids)
> propMaleTherapyDF <- subset(testTherapyDF,
+     testTherapyDF$patid %in% propranololMalePatids)
> amiMaleMCFDF <- generateMCFOneGroup(therapyDF = amiMaleTherapyDF,
+     startDateCharVector = "2000-01-01",
+     minRecords = 2)
> propMaleMCFDF <- generateMCFOneGroup(therapyDF = propMaleTherapyDF,
+     startDateCharVector = "2000-01-01",
+     minRecords = 2)
> amiMaleMCFDF <- cbind(amiMaleMCFDF, Drug = "Amitriptyline")
> propMaleMCFDF <- cbind(propMaleMCFDF, Drug = "Propranolol")
> drugMCFDF <- rbind(amiMaleMCFDF, propMaleMCFDF)
> resultMCF <- reda::mcf(reda::Recur(week, id, No.) ~ Drug, data = drugMCFDF)
> mcfPlot <- reda::plot(resultMCF, conf.int=TRUE) +
+     ggplot2::xlab("Weeks") + ggplot2::theme_bw() + ggplot2::ggtitle("")

[Figure 8 plot: x-axis: Weeks; y-axis: MCF Estimates; legend (Drug): Amitriptyline, Propranolol]

Figure 8: MCF of drug prescriptions of patients with a first drug prescription for either amitriptyline or propranolol, stratified by gender. The dotted lines indicate a 95% confidence interval.

getFirstDrugIncidenceRate

Prescription incidence can be calculated with getFirstDrugIncidenceRate. The following code demonstrates how to use a FirstDrugObject to calculate incidence rates for a set of prodcodes. The study observation starts from the enrollmentDate and ends at the studyEndDate:

> library(rdrugtrajectory)
> fileLocation <- "product.txt"
> drugList <- unique(testTherapyDF$prodcode)
> requiredProds <- drugList[1:10]
> firstDrugObject <- getFirstDrugPrescription(df = testTherapyDF,
+     idList = NULL,
+     prodCodesVector = requiredProds,
+     descriptionFile = fileLocation)
> medhistoryDF <- constructMedicalHistory(testClinicalDF, NULL, testTherapyDF)
> patidList <- unlist(firstDrugObject$patidList)
> resultMatrix <- getFirstDrugIncidenceRate(firstDrugObject = firstDrugObject,
+     medHistoryDF = medhistoryDF,
+     enrollmentDate = as.Date("2000-01-01"),
+     studyEndDate = as.Date("2016-12-31"))
> incidenceDF <- as.data.frame(t(resultMatrix), stringsAsFactors = TRUE)

The above example returns an incidence rate of 0.11 per 17 person years over a cohort of 3838 patients. For a detailed description please see Detail for getFirstDrugIncidenceRate in the reference manual.

getDrugPersistence

Prescription persistence is calculated as the fraction of patients with a prescription for a specific treatment N days after the first prescription event.
For example, to calculate the fraction of patients with a prescription 365 days after their first prescription, with a 30-day buffer either side, one specifies a duration of 395 days and a preceding buffer of 60 days (thereby capturing the range 335 to 395 days, that is, 30 days either side of one calendar year):

> library(rdrugtrajectory)
> patientList <- getDrugPersistence(therapyDF = testTherapyDF,
+     idList = NULL,
+     prodcodeList = NULL,
+     duration = 395,
+     buffer = 60,
+     endOfRecordDate = "2017-12-31")

Of 3838 patient therapy records, 954 patients had a prescription 365 (+/- 30) days after the first prescription event on record, resulting in a crude persistence fraction of only 0.25. Without a buffer, getDrugPersistence only observes events recorded precisely duration days after the first prescription. The buffer can be used to identify patients who received a prescription shortly after the end of the duration, but more importantly, to ensure patients actively undergoing treatment (indicated by a prescription shortly before the desired duration days) are included. As the buffer is reduced, the fraction of prescription persistence is reduced until the algorithm attempts to identify only patients with a prescription exactly duration days after the first prescription. Future software updates will incorporate repeat prescription data to increase the accuracy of the calculation.

4. Closing remarks and future work

rdrugtrajectory is an R package which has the potential for exciting applications such as improving clinical decision-making, identifying possible new treatments and analysing outcomes from existing treatments. We have demonstrated several functions, some of which detail sorting and matching records whilst others demonstrate fundamental statistical analysis.
We used fabricated clinical and prescription dataframes, along with the age, gender and index of multiple deprivation score of each patient, and presented analyses of cohort-wide prescription patterns, first-line treatment distributions, stratification by patient characteristics, and some basic tools to assist longitudinal analysis of prescriptions. The descriptions presented in this publication are not substitutes for the material in the reference manual. We recommend that the reader consult the R ? help command or the reference manual before running a function. In particular, functions related to the construction of timelines for survival analysis (time dependent/independent Cox regression, Kaplan Meier survival curves and mean cumulative function) or of a matrix for drug incidence rate require fine tuning of several parameters.

[Figure 9 plot: x-axis: Buffer size (n days before 365); y-axis: Fraction of prescription persistence]

Figure 9: The fraction of prescription persistence adjusted by a buffer number of days before a calendar year. As the buffer approaches the value of duration the fraction approaches 1.

The latest release of rdrugtrajectory, along with source code and reference manual, is available for download from https://github.com/acnash/rdrugtrajectory. As active members of the scientific research community, we will continue to add new features to rdrugtrajectory whilst making necessary improvements to existing features.
Acknowledgements

Oxford Science Innovation, NIHR Oxford Biomedical Research Centre and NIHR Oxford Health Biomedical Research Centre (Informatics and Digital Health theme, grant BRC-1215-20005). Thanks to Dr Michelle Hardy for assistance with the article.

References

Bally M, Dendukuri N, Rich B, Nadeau L, Helin-Salmivaara A, Garbe E, Brophy JM (2017). “Risk of Acute Myocardial Infarction with NSAIDs in Real World Use: Bayesian Meta-Analysis of Individual Patient Data.” British Medical Journal, 357, j1909. doi:10.1136/bmj.j1909.

Ghosh RE, Crellin E, Beatty S, Donegan K, Myles P, Williams R (2019). “How Clinical Practice Research Datalink data are used to support pharmacovigilance.” Therapeutic Advances in Drug Safety, 10, 1–7. doi:10.1177/2042098619854010.

Hepp Z, Dodick DW, Varon SF, Chia J, Matthew N, Gillard P, Hansen RN, Devine EB (2017). “Persistence and Switching Patterns of Oral Migraine Prophylactic Medications Among Patients with Chronic Migraine: A Retrospective Claims Analysis.” Cephalalgia, 37(5), 470–485. doi:10.1177/0333102416678382.

Oyinlola JO, Campbell J, Kousoulis AA (2016). “Is Real World Evidence Influencing Practice? A Systematic Review of CPRD Research in NICE Guidance.” BMC Health Services Research, 16(299), 1–12. doi:10.1186/s12913-016-1562-8.
Affiliation:
Nuffield Department of Clinical Neurosciences
Medical Sciences Division
University of Oxford
Oxford
UK
OX3 9DU
E-mail: anthony.nash@ndcn.ox.ac.uk

Journal of Statistical Software http://www.jstatsoft.org/
published by the Foundation for Open Access Statistics http://www.foastat.org/
MMMMMM YYYY, Volume VV, Issue II
doi:10.18637/jss.v000.i00
Submitted: yyyy-mm-dd
Accepted: yyyy-mm-dd

10_1101-2021_01_07_425637 ---- A mammalian methylation array for profiling methylation levels at conserved sequences

A mammalian methylation array for profiling methylation levels at conserved sequences

Adriana Arneson 1,2, Amin Haghani 3, Michael J. Thompson 4, Matteo Pellegrini 4, Soo Bin Kwon 1,2, Ha Vu 1,2, Caesar Z. Li 5, Ake T. Lu 3, Bret Barnes 6, Kasper D. Hansen 7,8, Wanding Zhou 9, Charles E. Breeze 10, Jason Ernst 1,2,11-15#, Steve Horvath 3,5#

Affiliations
1 Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095, USA
2 Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, California, USA
3 Dept. of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
4 Molecular, Cell and Developmental Biology, University of California Los Angeles, Los Angeles, CA 90095, USA
5 Dept. of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, CA 90095, USA
6 Illumina, Inc, 5200 Illumina Way, San Diego, CA 92122, USA
7 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
8 Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, USA
9 Van Andel Research Institute, Grand Rapids, Michigan, USA
10 Altius Institute for Biomedical Sciences, Seattle, WA, USA
11 Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at University of California, Los Angeles, Los Angeles, California, USA
12 Computer Science Department, University of California, Los Angeles, Los Angeles, California, USA
13 Department of Computational Medicine, University of California, Los Angeles, Los Angeles, California, USA
14 Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, California, USA
15 Molecular Biology Institute, University of California, Los Angeles, Los Angeles, California, USA
# Joint senior authorship
Correspondence: shorvath@mednet.ucla.edu and jason.ernst@ucla.edu

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.07.425637; this version posted January 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

SUMMARY
Infinium methylation arrays are widely used to robustly measure methylation of DNA in humans.
However, such arrays are not available for the vast majority of non-human mammals. Moreover, even if species-specific arrays were available, probe differences between them would confound cross-species comparisons. To address these challenges, we developed the Mammalian Methylation Array, a single custom Infinium array that measures cytosine methylation levels of over 35 thousand CpG sites that are well conserved across species within the mammalian class. By design, the probes on the array tolerate cross-species mutations. To design the array, we developed the Conserved Methylation Array Probe Selector (CMAPS) algorithm, which takes as input a multi-species sequence alignment and probe design constraints. A greedy search algorithm was used to identify oligonucleotide sequences (probes) with high coverage across different mammalian species. We annotate the probes on the array with respect to genes in 159 different species and provide details on the sequence context including CpG island status and chromatin states. Our calibration experiments demonstrate the high fidelity of this array in humans, rats, and mice. The mammalian methylation array has several strengths: it applies to all mammalian species even those that have not yet been sequenced, it provides deep coverage of specific cytosines facilitating the development of highly robust epigenetic biomarkers, and it covers highly conserved CpGs which greatly increases the probability that biological insights gained in one species will readily translate to others.
The mammalian methylation array is expected to find many applications in preclinical studies, comparative biology, and epigenetic studies of aging and development.

Introduction

Methylation of DNA by the attachment of a methyl group to cytosines is one of the most widely studied epigenetic modifications in vertebrates, due to its implications in regulating gene expression across many biological processes including disease (Ooi et al., 2007; Robertson, 2005; Smith and Meissner, 2013). A variety of different assays have been proposed for measuring DNA methylation, including microarray-based methylation arrays (Bibikova et al., 2009, 2011) and sequencing-based assays such as whole genome bisulfite sequencing (WGBS) (Cokus et al., 2008; Lister et al., 2009) and reduced representation bisulfite sequencing (RRBS) (Meissner et al., 2005). Despite the availability of sequencing-based assays, array-based technology remains widely used for measuring DNA methylation due to its low cost, high reproducibility, and reliability (Pidsley et al., 2016). The first human methylation array (Illumina Infinium 27K) was introduced by Illumina Inc in 2009 (Bibikova et al., 2009) and was followed by the 450K (Bibikova et al., 2011) and EPIC arrays with larger coverage (Pidsley et al., 2016). More recently, Illumina released a mouse methylation array (Infinium Mouse Methylation BeadChip) that profiles over 285k markers across diverse murine strains. It will probably not be economical to develop similar methylation arrays for less frequently studied mammalian species (e.g. elephants or marine mammals) due to insufficient demand. Moreover, even if costs were no impediment, species-specific arrays would likely be sub-optimal in comparative studies across different species, as the measurement platforms would be different.
To address these challenges, we developed a single mammalian methylation array designed to measure DNA methylation across mammals. The array targets CpGs for which the CpG and flanking sequence are highly conserved across many mammals, so that the methylation of many of these CpGs can be measured in each mammal. The design repurposes the degenerate base technology (originally used by Illumina Infinium probes to tolerate within-human variation) to tolerate cross-species mutations across mammalian species. To select the specific probe sequences, including tolerated mutations, that appear on the array, we developed the Conserved Methylation Array Probe Selector (CMAPS). CMAPS takes as input a multiple sequence alignment to a reference genome and a set of probe design constraints, and selects a set of probe sequences, including tolerated mutations, that can be used to query methylation in many species. We applied CMAPS to select over 35 thousand CpGs for the mammalian methylation array, which we complemented with close to two thousand known human biomarker CpGs. We characterize the CpGs on the mammalian methylation array with various genomic annotations. Further, we use calibration data to evaluate the fidelity of individual probes in humans, mice, and rats. CMAPS has led to the design of the mammalian methylation array, which will facilitate the study of cytosine methylation at conserved loci across all mammalian species.
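To make the idea of tolerated mutations concrete, the following Python sketch checks whether a human-derived probe would be expected to work for an aligned sequence from another species, and greedily picks up to three tolerated mutations to maximize the number of species covered. It is a minimal illustration under stated assumptions, not the CMAPS implementation: the function names (mismatches, works, species_covered, greedy_tolerated), the 10 bp toy sequences, and the simplification that each degenerate position tolerates exactly one alternative nucleotide are all hypothetical.

```python
# Toy illustration only; real probes are 50 bp and CMAPS applies further
# design constraints that are not modelled here.

def mismatches(human: str, other: str) -> dict[int, str]:
    """Positions where an aligned species sequence differs from the human sequence."""
    return {i: b for i, (a, b) in enumerate(zip(human, other)) if a != b}


def works(human: str, other: str, tolerated: dict[int, str]) -> bool:
    """A probe is expected to work if every difference is a tolerated mutation."""
    return all(tolerated.get(i) == b for i, b in mismatches(human, other).items())


def species_covered(human: str, aligned: dict[str, str], tolerated: dict[int, str]) -> int:
    """Number of aligned species in which the probe is expected to work."""
    return sum(works(human, seq, tolerated) for seq in aligned.values())


def greedy_tolerated(human: str, aligned: dict[str, str], max_degenerate: int = 3) -> dict[int, str]:
    """Greedily add the (position, base) pair that most increases species coverage."""
    tolerated: dict[int, str] = {}
    for _ in range(max_degenerate):
        best, best_count = None, species_covered(human, aligned, tolerated)
        # Candidate mutations: every observed difference at a not-yet-degenerate position.
        candidates = sorted({(i, b)
                             for seq in aligned.values()
                             for i, b in mismatches(human, seq).items()
                             if i not in tolerated})
        for pos, base in candidates:
            count = species_covered(human, aligned, {**tolerated, pos: base})
            if count > best_count:
                best, best_count = (pos, base), count
        if best is None:  # no single addition improves coverage
            break
        tolerated[best[0]] = best[1]
    return tolerated


# Hypothetical aligned sequences (species labels are for illustration only).
human_seq = "ACGTACGTAC"
aligned_seqs = {
    "mouse": "ACGTACGTAC",  # identical to human
    "rat":   "ACGTTCGTAC",  # differs at position 4
    "dog":   "ACGTTCGTAC",  # differs at position 4
    "cow":   "ACGAACGTAC",  # differs at position 3
    "bat":   "GCGTACGTAT",  # differs at positions 0 and 9
}

chosen = greedy_tolerated(human_seq, aligned_seqs)
covered = species_covered(human_seq, aligned_seqs, chosen)
print(chosen, covered)
```

In this toy input the sketch tolerates T at position 4 and A at position 3, which covers four of the five species. The fifth species needs two tolerated mutations at once, so no single greedy step improves coverage and the search stops, which illustrates a known limitation of one-step greedy selection.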
Results

Designing the Mammalian Methylation Array

The CMAPS algorithm is designed to select a set of Illumina Infinium array probes such that, for a target set of species, many probes are expected to work in each species (Methods). Array probes are sequences of length 50bp flanking a target CpG based on the human reference genome. Selecting sequences present in the human reference genome increases the likelihood that measurements in other species will transfer to human. The mammalian methylation array adapts the degenerate base technology for tolerating human SNPs so that probes can tolerate a limited number of cross-species mutations. The CMAPS algorithm is provided as input a multiple-species sequence alignment to a reference genome, together with the probe design constraints. CMAPS then uses these inputs to select the CpGs to target on the array. As part of selecting the CpGs, CMAPS also selects the probe sequence design to target them, including the specific set of degenerate bases. For designing the mammalian methylation array, CMAPS was applied to the subset of 62 mammals within a 100-way alignment of 99 vertebrate genomes with the human genome (Haeussler et al., 2019), but we note that the CMAPS method is general.

In designing a probe for a CpG, CMAPS considers multiple different options. One option is the type of probe. Illumina’s current methylation array technology allows up to two types of probes: Infinium I and Infinium II. The latter is a newer technology requiring only one silica bead to query the methylation of a CpG, while the former requires two beads.
Because they require only one bead, Infinium II probes allow more CpGs to be interrogated under fixed array capacity limits, though Infinium I probes are better able to query CpGs in CpG-rich regions (Bibikova et al., 2011). Another option for each of these two types of probes is whether the probe is on the forward or reverse genomic strand, giving four total combinations of options for probe type and strand for each CpG. In addition, CMAPS has options for the positions and nucleotide identities of the tolerated mutations, which correspond to degenerate bases. The array degenerate base technology allows for potentially up to three degenerate bases per probe sequence, which are positions that can be designed to tolerate variation in the sequence being interrogated. For some probes fewer than three degenerate bases could be designed, which was determined based on a design score computed by Illumina for each probe and, in the case of Infinium II probes, also on the number of CpGs within the probe sequence.

CMAPS uses a greedy algorithm to select the tolerated mutations for each combination of probe type and strand. The algorithm aims to maximize the number of species in the alignment in which the probe is expected to work based on local alignment information alone, that is, without considering how uniquely mappable the probe is across the genome. A probe for a CpG is expected to work in a non-human species based on local alignment information if there are no differences in the alignment between the human genome sequence and the other species, excluding those accounted for by the probe’s degenerate bases (Figure 1a, Methods). For each CpG site in the human genome, CMAPS retained for further consideration the Infinium I probe out of the two options (forward or reverse strand of the CpG) which had the greater number of species for which the probe was expected to work, and likewise for Infinium II.

We next applied a series of rules to identify a reduced subset of candidate probes. First, we included all 36,133 Infinium II probes that were expected to work in mouse (based on the mm10 genome), which maximizes the expected array utility for one of the most widely used model organisms. For the remaining set of CpGs not selected in the previous step, we sorted them in descending order of the number of species for which an Infinium II probe was expected to work. We then added the top 16,867 CpG sites for a total of 53,000 CpG sites. Next, we ranked the CpGs targeted on the Illumina EPIC array (Pidsley et al., 2016) in descending order of the number of species for which a probe targeting the CpG is expected to work. For this ranking the probe was required to be of the same probe type and strand as on the EPIC array, but used the degenerate bases picked by the CMAPS algorithm. The probe was allowed to differ in terms of degenerate base positions, as EPIC probes typically do not account for degenerate bases across species. From this ranking we selected the top 3,000 CpG sites that had not already been picked based on the earlier criteria.
Lastly, we sorted the CpG sites in descending order of the number of species they can target and picked the top 4,000 CpGs targeted by Infinium I probes that had not already been included. The Infinium I probes were selected to allow querying CpG-dense regions such as CpG islands, as for these probes CpGs within the probe sequence do not count towards the limited number of positions of variation, unlike for Infinium II probes. This resulted in a set targeting 60,000 CpGs (Figure 1b).

For some of these 60,000 CpGs, the sequence of the probe targeting it can map to multiple locations in a genome, which could result in a confounded signal coming from multiple CpG sites. This issue is compounded by individual probes corresponding to multiple sequences reflecting different possible combinations of the degenerate bases. To identify a subset of probes less susceptible to such confounders, for 16 high-quality genomes, we computed for each probe how many of its versions map uniquely in that genome (see Methods). We then filtered CpGs by requiring that all versions of a probe targeting a CpG map uniquely in at least 80% of the species they are expected to target out of the 16 high-quality genomes, unless the probe is expected to target at least 40 mammals from the alignment, in which case the mapping criterion was not applied. This reduced the set of candidate CpGs to 35,989 CpGs. We added probes targeting 1986 CpGs to the mammalian methylation array based on their utility for human biomarker studies (Supplementary Data).
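The selection tiers and the mappability rule described above can be written out as a short Python sketch. The tier sizes are the counts reported in the text; everything else is an assumption made for illustration: the names TIERS, running_totals, and passes_mappability, the genome labels, and the choice to treat a probe with no applicable high-quality genome as failing are hypothetical and are not taken from the authors' pipeline.

```python
# Tier sizes (numbers of CpGs) as reported in the text.
TIERS = [
    ("Infinium II probes expected to work in mouse (mm10)", 36_133),
    ("top-ranked Infinium II probes among remaining CpGs", 16_867),
    ("top-ranked CpGs from the EPIC array", 3_000),
    ("top-ranked CpGs targeted by Infinium I probes", 4_000),
]


def running_totals(tiers):
    """Cumulative number of candidate CpGs after each selection tier."""
    totals, total = [], 0
    for _, size in tiers:
        total += size
        totals.append(total)
    return totals


def passes_mappability(unique_in_genome, n_expected_mammals,
                       min_fraction=0.8, waive_at=40):
    """Apply the mappability rule to one probe.

    unique_in_genome: dict mapping each high-quality genome in which the probe
        is expected to work to True if all versions of the probe map uniquely.
    n_expected_mammals: number of mammals in the alignment the probe is
        expected to target.
    """
    if n_expected_mammals >= waive_at:
        return True  # criterion not applied for very broadly conserved probes
    if not unique_in_genome:
        return False  # assumption of this sketch, not stated in the text
    fraction = sum(unique_in_genome.values()) / len(unique_in_genome)
    return fraction >= min_fraction


totals = running_totals(TIERS)
print(totals)

# Hypothetical probes: genome labels are placeholders.
probe_a = {"g1": True, "g2": True, "g3": True, "g4": True, "g5": False}   # 4 of 5 unique
probe_b = {"g1": True, "g2": False, "g3": True, "g4": False, "g5": True}  # 3 of 5 unique
print(passes_mappability(probe_a, n_expected_mammals=20))
print(passes_mappability(probe_b, n_expected_mammals=20))
print(passes_mappability(probe_b, n_expected_mammals=45))
```

The running totals reproduce the 53,000 and 60,000 candidate counts given in the text. In the toy calls, the first probe passes at exactly 80%, the second fails at 60%, and the same second probe passes once it is expected to target at least 40 mammals.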
These added probes, which were previously implemented in human Illumina Infinium arrays (EPIC, 450K, 27K), were chosen for human biomarker studies estimating age, blood cell counts, or the proportion of neurons in brain tissue (Guintivano et al., 2013; Hannum et al., 2013; Horvath, 2013; Horvath and Levine, 2015; Horvath et al., 2018; Houseman et al., 2012; Levine et al., 2018). The final manufactured mammalian methylation array measures cytosine methylation levels of 37,492 cytosines: 37,488 of these cytosines are followed by a guanine (CpGs) and 4 are followed by another nucleotide (non-CpGs). The probe identifiers (cg numbers) of 86 of these cytosines end with either ".1" or ".2", i.e. these are duplicate probes for 43 genomic locations. A detailed analysis of the Infinium probe context of the mammalian array and its relation to human and mouse arrays is presented in Supplementary Figure S1. The mammalian methylation array's focus on highly conserved regions led to an array that is distinct from other currently available Infinium arrays that focus on specific species. For example, the mammalian array only shares 3107 probes with the Illumina MouseMethylation array and only 7111 CpGs with the Illumina EPIC array.

Mappability analysis

All 37488 CpGs profiled on the mammalian methylation array apply to humans, but only a subset of these CpGs applies to other species. When conducting analyses in a specific species it can thus be desirable to restrict analyses to the subset of CpGs that apply in that species.
One approach for doing this is simply omit CpGs whose detection p-values from normalization methods (Methods) are insignificant. This approach has the advantage of being applicable to species that have not yet been sequenced. Mapping sequences to genomes has the added benefit of providing a candidate position of the sequence in the target genome from which other information about the CpG can be inferred such as the nearest gene or CpG island status. We have mapped the array CpGs to 159 species, which also provides a candidate position from which a gene for the CpG can be associated. As expected, the closer a species is to humans, the more CpGs map to the genome of this species. Over 30k CpGs on the array map to most placental mammalian genomes (eutherians, Figure 2a, Supplementary Data). Roughly 15K CpGs map to most non-placental mammalian genomes (marsupials), such as kangaroos or opossums. Far fewer CpGs map to egg laying mammalian genomes (monotremes), such as platypus (Figure 2). A CpG that is adjacent to a given gene in humans may not map to a position adjacent the corresponding (orthologous) gene in another species. Between 15k to 22k CpGs (over 70%) were assigned to human orthologous species based on their mapped position in most phylogenetic orders (rodents, bats, carnivores, Figure 2b,c and Supplementary Data). These numbers surrounding orthologous genes are probably overly conservative (i.e. lower than the true numbers) because we found the majority of CpGs (about 58%) that do not map to orthologous .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 8, 2021. 
genes in the non-human species are located in intergenic regions outside of promoters (Methods), which suggests that one of the gene assignments was inaccurate.

Chromosome and gene region coverage of array

We analyzed the chromosome and gene region coverage of the mammalian methylation array for human and mouse. The mammalian methylation array has substantial coverage of all chromosomes (human, 235-3938; and mouse, 687-3179 probes per chromosome), with the exception of chrY, which has only 2 probes in both species (Supplementary Figure S2a). When we assign the probes to the closest gene neighbor, around 80% of the probes are proximal to a gene in both of these species (Supplementary Figure S2b). The remaining 20% of probes are aligned to neither a promoter nor a gene body. The distribution of gene regions and the distances to transcriptional start sites are comparable between human and mouse (Supplementary Figure S2b). CpGs on the mammalian array cover 6871 human and 5659 mouse genes when each CpG is assigned uniquely to its closest gene neighbor (Supplementary Figure S2c). The gene coverage is uneven: while on average a gene is covered by 2 CpGs, some genes are covered by as many as 150 CpGs. In mouse, 73% of CpGs (21,664) were assigned to human orthologous genes (Supplementary Figure S2d), suggesting that many CpG measurements from the array in mice will be informative for humans (and vice versa).

Gene sets represented in mammalian array

We analyzed gene set enrichments of all genes that are represented on the mammalian array using GREAT (McLean et al., 2010). The significant gene sets included sets that were found to play a role in development, growth, transcriptional regulation, metabolism, cancer, mortality, aging, and survival (Supplementary Figure S3).
We also used the TissueEnrich software (Jain and Tuteja, 2019) to analyze gene expression (Methods). The majority of mammalian methylation array probes (~65%) are adjacent to genes that are expressed in all considered human and mouse tissues (Supplementary Figure S4a,b). However, the mammalian array also contains CpGs that are adjacent to genes that are expressed in a tissue-specific manner, notably in testis and cerebral cortex (Supplementary Figure S4c).

CpG island and methylation status

We analyzed the CpG island and DNA methylation properties of CpGs on the mammalian array. On average across species, 5563 (18%) of the probes on the mammalian array are located in CpG islands (Figure 3a). We used a CpG island detection algorithm (gCluster software (Li et al., 2020)) that additionally provided several species-level quantitative measures for each CpG island, including the length, GC content, and CpG density, which we provide as a resource (Supplementary Data). We also analyzed the DNA methylation levels in human using fractional methylation called from whole genome bisulfite sequencing data across 37 human tissues (Roadmap Epigenomics Consortium et al., 2015) (Supplementary Figure S5). This confirmed that the mammalian methylation array targets CpGs across a wide range of fractional methylation levels.

Chromatin state annotation of array probes

We analyzed the overlap of human CpGs targeted on the mammalian methylation array with chromatin states for 127 cell and tissue types.
The CpGs cover all available chromatin states including different types of promoters (including bivalent promoters), regions repressed by polycomb group proteins, transcription start and end sites, and enhancer regions (Figure 3b). Among enhancers, CpGs had greater overlap with brain and neurosphere than with other tissue groups. In addition to analyzing the overlap of the array CpGs with cell and tissue specific chromatin states, we also analyzed them for a universal chromatin state annotation, which provides a single annotation to the genome per position based on data from more than 100 cell and tissue types (Vu and Ernst, 2020) (Supplementary Figure S6). This revealed the greatest enrichment for bivalent promoter states and also strong enrichment for other promoter states and a state associated with polycomb repression. While the mammalian methylation array was specifically designed to profile CpGs in highly conserved stretches of DNA based on sequence conservation, we assessed whether there was also evidence of conservation at the functional genomics level using human-mouse LECIF scores (Kwon and Ernst, 2020). The human-mouse LECIF score quantifies evidence of conservation between human and mouse at the functional genomics level using chromatin state and other functional genomic annotations. Probes on the array generally had higher LECIF scores than regions that align between human and mouse in general (Figure 3c).
Mammalian array study of calibration data

To validate the accuracy of the mammalian methylation array we applied it to synthetic DNA methylation samples for three species: human (n=10 arrays), mouse (n=20), and rat (n=15), where the methylation levels were known. The DNA samples from human, mouse and rat were engineered such that the fractional methylation at all CpG sites in their genomes was approximately 0%, 25%, 50%, 75% and 100% (Methods). The calibration data thus allow us to define a benchmark annotation measure "ProportionMethylated" (with ordinal values 0, 0.25, 0.5, 0.75, 1). The distribution of the intensity of the probes in each human sample is roughly centered around the benchmark measure (ProportionMethylated) (Figure 4a). However, as expected, the distributions over all probes in the mouse and rat samples show somewhat different patterns compared to the human samples, likely because many probes in the design of our array do not map to these genomes (Figure 4b-c). We also evaluated these distributions for each species after removing the probes that were not designed to map to that species, and normalizing the array data using the SeSAMe package, which defines beta (relative intensity) values for each probe (Zhou et al., 2018). After this procedure, we see sharper peaks close to 0 and 1, though the quantification of absolute methylation levels is somewhat degraded around the beta value 0.75 as we move away from humans (Figure 4d-f).
Additionally, for each species and each CpG, we computed the correlation of the DNA methylation levels with the benchmark variable "ProportionMethylated" across the arrays. High positive correlations would be evidence for the accuracy of the array, which is indeed what we observe. CpGs that map to the human, mouse, and rat genomes have a median Pearson correlation with the benchmark variable ProportionMethylated of r=0.986 with an interquartile range (IQR) of [0.96,0.99], r=0.959 with IQR=[0.92,0.98], and r=0.956 with IQR=[0.91,0.98] in the respective species. The numbers of CpGs on the mammalian array that pass a given correlation threshold (irrespective of the mappability to a given species) are reported in Table 1. We also compare the SeSAMe normalization with the "noob" normalization that is implemented in the minfi R package (Aryee et al., 2014; Triche et al., 2013) (Table 1). We find that SeSAMe slightly outperforms minfi when it comes to the number of CpGs that exceed a given correlation threshold with ProportionMethylated.

Comparison with the human EPIC methylation array study in calibration data

We compared the mammalian methylation array to the human EPIC methylation array, which profiles 866k CpGs in the human genome, for non-human samples. Some of the EPIC array probes are expected to apply to the mouse and rat genomes as well (Needhamsen et al., 2017). To facilitate a comparison between the mammalian methylation array and the human EPIC array for non-human samples we applied the latter to calibration data from mouse (n=15 arrays) and rat (n=10). The same engineered DNA methylation data were analyzed on the human EPIC array as on the mammalian methylation array above. In particular, we were able to correlate each
CpG on the EPIC array with a benchmark measure (ProportionMethylated) in mice and rats (Table 1). Only 2356 (out of 866k) CpGs on the human EPIC array exceed a correlation of 0.90 with ProportionMethylated in mice. By contrast, 24050 CpGs on the mammalian array exceed the same correlation threshold in mice. Similarly, the mammalian array outperforms the EPIC array in rats: only 6159 CpGs on the EPIC array exceed a correlation of 0.90 with ProportionMethylated compared with 22427 CpGs on the mammalian array. The results are similar for the correlation thresholds of 0.85 and 0.95 (Table 1). The EPIC array contains 5574 CpGs that were also prioritized by the CMAPS algorithm based on high levels of conservation, excluding the 1986 CpGs from human biomarker studies. Out of these 5574 shared CpGs, 4341 and 3948 CpGs map to the mouse and rat genome, respectively. Although the EPIC and mammalian probes target the same CpG, the mammalian probe typically differs from the corresponding EPIC probe due to differences in probe type (type I versus type II probe), DNA strand, or the handling of mutations across species with degenerate bases. In the following comparison, we limited the analysis to the 4341 and 3948 probes when analyzing calibration data from mice or rats, respectively. We find that the mammalian array probes are better calibrated than the corresponding EPIC array probes when applied to mouse and rat calibration data, according to two different analyses that focus on shared CpGs between the two platforms. First, the mammalian array outperforms the EPIC array when considering mean methylation levels across the shared CpGs (Figure 5).
Second, when correlating each of the shared CpGs with the benchmark value ProportionMethylated we observe a median correlation of 0.72 for both mouse and rat calibration data generated on the EPIC array. For the same probes we observe median correlations of 0.94 and 0.93 for mouse and rat calibration data generated on the mammalian array (SeSAMe normalization), respectively. We are distributing the methylation data and results from our calibration data analysis in three species (Supplementary Data). These calibration results will allow users to select cytosines whose methylation levels have a high correlation with the benchmark data in human, mouse or rat.

DISCUSSION

The mammalian methylation array, which was enabled by the CMAPS algorithm for selecting conserved probes, is applicable to all mammals and hence drives down the cost per chip through economies of scale. The mammalian methylation array has unique strengths: it applies to all mammalian species, even those that have not yet been sequenced; it provides deep coverage of specific cytosines, which is a prerequisite for developing robust epigenetic biomarkers; and its focus on highly conserved CpGs increases the chances that findings in one species will translate to those in another species. We expect that the mammalian methylation array is particularly well suited for DNA methylation based biomarker studies in mammals. Our calibration data demonstrate that the array largely leads to high quality measurements in three species: human, mouse and rat.
Our calibration data show that the mammalian methylation array greatly outperforms the human EPIC chip when it comes to high-fidelity measurement applications in mice and rats. The array thus should be preferable for most non-human applications unless high-fidelity measurements are not needed, in which case the larger content of the EPIC array may make it preferable. The mammalian methylation array has several limitations. First, only a fraction of genes in a given species are represented by CpGs. Second, it focuses on CpGs in highly conserved stretches of DNA and hence does not cover parts that are specific to a given species. Third, it provides worse coverage in more distal species, for example worse coverage in marsupials than in placental mammals (eutherians). Finally, the calibration data suggest there are some shifts in the absolute methylation levels detected for intermediate methylation levels, but the relative order is preserved. The correct relative ordering of beta values is of primary importance in most statistical tests and analyses.

Several software tools have been adapted for use with the mammalian methylation array that range from normalization to higher level gene enrichment analysis. Software tools for generating normalized data include SeSAMe and the minfi R package (Aryee et al., 2014; Zhou et al., 2018). The eFORGE software (Breeze et al., 2019), which has been adapted for use with the mammalian array, facilitates chromatin state analysis and transcription factor binding site analysis.
Many researchers will be interested in genome coordinates of the mammalian CpGs in different species. Toward this end, we provide genome coordinates in 159 species. This list of species will increase as more high quality genomes become available. Detailed gene annotations in many species are available, including details on gene region (e.g. exon, promoter, 5 prime untranslated region) and CpG island status (Supplementary Data). For human and mouse we provide chromatin state annotations (Ernst and Kellis, 2012; Gorkin et al., 2020; Roadmap Epigenomics Consortium et al., 2015; Vu and Ernst, 2020) and the LECIF score on evidence of conservation at the functional genomics level between human and mouse (Kwon and Ernst, 2020). In other articles, we will describe the application of the mammalian methylation array to many different mammalian species. These upcoming studies will demonstrate that the mammalian methylation array is useful for many applications that involve mammalian species.

Methods

Conserved Methylation Array Probe Selector (CMAPS)

Given a multi-species sequence alignment and reference genome, for each CG site and each of the four different possible probe designs, CMAPS computes an estimate of the number of species from the alignment that could be targeted if the use of degenerate base technology is optimized for tolerated mutations. The four probe designs involve each combination of probe type (Infinium I vs. Infinium II), and whether the probe sequence is on the forward or reverse DNA strand. For
each probe option, CMAPS conducts a greedy search to select tolerated mutations, including position and allele that maximize species coverage for the probe. The maximum number of degenerate bases that can be included in a probe is a function of a design score provided by Illumina Inc. For Infinium II probes only, CpGs present in the probe sequence count as if they are a degenerate base. More specifically, the algorithm for determining the number of species and selecting the mutations to handle performs the following steps for each probe design:

1. Let M be the maximum number of degenerate bases that can be designed into a specific probe, based on the design score
2. For each species s in the alignment, let Ms be the number of mismatches in the alignment between that species and the human reference sequence of the probe
   a. If Ms > M or the species does not have the target CpG, continue to next species
   b. If Ms <= M,
      i. For each mismatch in species s, add each degenerate position to a multiset P
      ii. Add the species to a set F of feasible species to target with this probe
3. For all |P| choose M combinations of possible degenerate positions:
   a. For each unique position in the combination
      i. For each possible alternate nucleotide count the number of species in F that contain that alternate nucleotide
      ii. Pick the top k alternate nucleotides based on the count in i., where k is the number of occurrences of the current position in S
   b. Compute the number of species that match the human reference when accounting for the degenerate substitutions handled in a
4. Select the combination of positions in S that maximizes 3.b
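The enumerated steps can be illustrated with a small Python sketch. This is not the CMAPS implementation: the function name `select_degenerate_bases` and its inputs are hypothetical, the sequences are assumed to be gap-free and already aligned to the human probe, every listed species is assumed to contain the target CpG, S is assumed to denote the combination currently being evaluated, and neither the Illumina design score nor the Infinium II rule for CpGs inside the probe is modeled.

```python
from collections import Counter
from itertools import combinations


def select_degenerate_bases(human_probe, species_probes, max_degenerate):
    """Return (number of species covered, {position: set of alternate bases}).

    human_probe    -- human reference probe sequence
    species_probes -- dict: species name -> aligned sequence of equal length
    max_degenerate -- M, the maximum number of degenerate bases allowed
    """
    # Step 2: keep species with at most M mismatches and collect the
    # mismatch positions in a multiset P (a list with repeats).
    feasible = {}
    positions = []
    for species, seq in species_probes.items():
        mismatches = [i for i, (h, b) in enumerate(zip(human_probe, seq)) if h != b]
        if len(mismatches) > max_degenerate:
            continue
        feasible[species] = seq
        positions.extend(mismatches)

    best_count, best_design = -1, {}
    size = min(max_degenerate, len(positions))
    # Step 3: enumerate the distinct combinations of M elements of P.
    for combo in set(combinations(sorted(positions), size)):
        design = {}
        for pos, k in Counter(combo).items():
            # Steps 3.a.i and 3.a.ii: keep the k most frequent alternate bases.
            alt_counts = Counter(
                seq[pos] for seq in feasible.values() if seq[pos] != human_probe[pos]
            )
            design[pos] = {base for base, _ in alt_counts.most_common(k)}
        # Step 3.b: count species matched once the degenerate bases are allowed.
        covered = sum(
            all(b == h or b in design.get(i, ())
                for i, (h, b) in enumerate(zip(human_probe, seq)))
            for seq in feasible.values()
        )
        # Step 4: remember the combination that covers the most species.
        if covered > best_count:
            best_count, best_design = covered, design
    return best_count, best_design


# Toy input with invented sequences (not real probe data).
toy_species = {
    "mouse": "ACGTACGT",
    "rat": "ATGTACGT",
    "dog": "ATGTACGT",
    "cat": "ACGTACGA",
    "cow": "TTGTACGA",
}
toy_count, toy_design = select_degenerate_bases("ACGTACGT", toy_species, 1)
```

In this toy case one degenerate base is allowed, so the sketch chooses the position that is shared by two species rather than the position that rescues only one. The enumeration is exhaustive and therefore only practical for small M.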
Our procedure for selecting the specific targeted CpGs and probe designs is described in the main text. We note that 30 of the CpGs selected for the mammalian methylation array based on the conservation criteria (using the sequence alignment) overlap with the 1986 human biomarker CpGs. The design of the probes targeting them could differ, however. The probe names of different probes targeting the same CpG are distinguished by the extensions ".1" and ".2". For example, cg00350702.1 and cg00350702.2 target the same cytosine but use different probe chemistry. The array contains four probes, selected on the basis of human biomarker studies, that measure cytosines that are not followed by a guanine; these are indicated with a "ch" instead of a "cg". The CMAPS algorithm was applied with human hg19 as the reference genome and using the Multiz alignment of 99 vertebrates with the hg19 human genome downloaded from the UCSC Genome Browser (Haeussler et al., 2019; Rosenbloom et al., 2015). For the purpose of designing the mammalian array, only the 62 mammalian species in this alignment were considered, and 16 species were used for the mappability analysis. However, the current version of the mappability analysis provides genome coordinates for 159 species. The mammalian methylation array includes an additional 62 human SNP markers (whose probe names start with "rs" for human studies), which can be used to detect plate map errors when dealing with multiple tissue samples collected from the same person. Finally, the mammalian array also adopted a standard suite of probes from the Illumina EPIC array for measuring bisulfite conversion efficiency in humans.

Mapping probes to genomic coordinates

We used two different approaches for mapping probes to genomes.
The first approach (BSBolt software) was primarily used in designing the array. Subsequently, we adopted a second mappability approach (QUASR software) that allowed us to map more probes to more species.

Mappability Approach 1: BSBolt

For version 1 of our mappability analysis (i.e. for designing the array), we applied the BSBolt mapping approach to 16 high quality genomes from: Baboon (papHam1), Cat (felCat5), Chimp (panTro4), Cow (bosTau7), Dog (canFam3), Gibbon (nomLeu3), Green Monkey (chlSab1), Horse (equCab2), Human (hg19), Macaque (macFas5), Marmoset (calJac3), Mouse (mm10), Rabbit (oryCun2), Rat (rn5), Rhesus Monkey (rheMac3), Sheep (oviAri3). We utilized the BSBolt software package (Farrell et al., 2020) from https://github.com/NuttyLogic/BSBolt to perform the alignments. For each species' genome sequence, BSBolt creates an 'in silico' bisulfite-treated version of the genome. As many of the currently available genomes are in a low quality assembly state (e.g. thousands of contigs or scaffolds), we used the utility "Threader" (which can be found in BSBolt's forebear BSseeker2 (Guo et al., 2013) as a standalone executable) to reformat these fasta files into concatenated and padded pseudo-chromosomes. The set of nucleotide sequences of the designed probes, which includes degenerate base positions, was explicitly expanded into a larger set of nucleotide sequences representing every possible combination of those degenerate bases.
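The expansion of degenerate positions can be sketched in a few lines of Python. The sketch assumes that degenerate positions are written with IUPAC ambiguity codes, which the text does not state, and the function name `expand_degenerate` is hypothetical rather than part of any tool used here.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (assumed encoding of degenerate bases).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(probe):
    """List every concrete sequence encoded by a probe with degenerate bases."""
    choices = (IUPAC[base] for base in probe.upper())
    return ["".join(combo) for combo in product(*choices)]


# A toy probe with two two-fold degenerate positions yields 2 x 2 sequences.
expanded = expand_degenerate("ARY")
```

Because the number of sequences is the product of the number of options at each position, a modest number of degenerate bases per probe is enough to turn a set of probes into a several-fold larger set of sequences to align.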
For Infinium I probes, which have both a methylated and an unmethylated version of the probe sequence, only the methylated version was used, as BSBolt's version of the genome treats all CG sites as methylated. The initial 37K probe sequences resulted in a set of 184,352 sequences to be aligned against the various species genomes. We then ran BSBolt with parameters Align -M 0 -DB [path to bisulfite-treated genome] -BT2 bowtie2 -BT2-p 4 -BT2-k 8 -BT2-L 20 -F1 [Probe Sequence File] -O [Alignment Output File] -S to align the enlarged set of probe sequences to each prepared genome. As we were not interested in the final BSBolt style output, we made a small modification to the code to retain its temporary output of alignment results in "sam" format. From these files, we collected only alignments where the entire length of the probe perfectly matched the genome sequence (i.e. the CIGAR string "50M" and flag XM=0). Then, for each genome we collapsed all the sequence variant alignments for each probeID down to a list of loci for that genome and for that probe.

Mappability Approach 2: QUASR

For version 2 of our mappability analysis, we aligned the probe sequences to all available mammalian genomes in the ENSEMBL and NCBI RefSeq databases using the QUASR package (Gaidatzis et al., 2015). The fasta sequence files for each genome were downloaded from these public databases. The alignment assumed that the DNA has been subjected to a bisulfite conversion treatment.
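As an illustration of what an in silico bisulfite treatment does to a sequence, the following Python sketch converts cytosines to thymines on the forward strand. It is a simplified illustration rather than the code used by BSBolt or QUASR: the function name `bisulfite_convert` and the `protect_cpg` switch are hypothetical, and the reverse-strand conversion (G to A) is omitted.

```python
def bisulfite_convert(seq, protect_cpg=False):
    """Return the forward-strand bisulfite-converted version of a sequence.

    Unmethylated cytosines read as thymines after bisulfite treatment.
    With protect_cpg=True, cytosines followed by a guanine are left intact,
    which mimics treating every CG site as methylated.
    """
    seq = seq.upper()
    converted = []
    for i, base in enumerate(seq):
        in_cpg = base == "C" and i + 1 < len(seq) and seq[i + 1] == "G"
        if base == "C" and not (protect_cpg and in_cpg):
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


fully_converted = bisulfite_convert("ACGTCCA")
cpg_protected = bisulfite_convert("ACGTCCA", protect_cpg=True)
```

Aligning probes against a converted genome is what allows a C in the original genome to pair with a T in the probe without being counted as a mismatch.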
For each species' genome sequence, QUASR creates an in silico bisulfite-treated version of the genome. The probes were aligned to these bisulfite-treated genome sequences, so that C-T is not considered a mismatch. The alignment was run with QUASR (a wrapper for Bowtie2) with parameters -k 2 --strata --best -v 3 and bisulfite = "undir" to align the enlarged set of probe sequences to each prepared genome. From these files, we collected the best candidate unique alignment to the genome. Additionally, the estimated CpG coordinates at the end of each probe were used to extract the sequence from each genome fasta file and exclude any probes with mismatches in the target CpG location.

Genomic loci annotations

Gene annotations (gff3) for each genome considered were also downloaded from the same sources as the genome. Following the alignment, the CpGs were annotated to genes based on the distance to the closest transcriptional start site using the ChIPseeker package (Yu et al., 2015). The genomic location of each CpG was categorized as either intergenic region, 3' UTR, 5' UTR, promoter (minus 10 kb to plus 100 bp from the nearest TSS), exon, or intron. The unique region assignment is prioritized as follows: exons, promoters, introns, 5' UTR, 3' UTR, and intergenic. Additional genomic annotations, including human ortholog ENSEMBL ID, were extracted from the BioMart ENSEMBL database (Yates et al., 2020). The candidate gene for each probe was compared with the human orthologous ENSEMBL ID to examine the similarity of the alignment with the human.
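The promoter window and the priority order of the region categories described above can be sketched as follows. This is an illustrative sketch, not the ChIPseeker implementation: the names `PRIORITY`, `is_promoter` and `assign_region` are hypothetical, and measuring the window relative to the strand of the gene is an assumption.

```python
# Priority order stated in the text, from highest to lowest.
PRIORITY = ["exon", "promoter", "intron", "5' UTR", "3' UTR", "intergenic"]


def is_promoter(cpg_pos, tss_pos, strand="+"):
    """True if the CpG lies from 10 kb upstream to 100 bp downstream of the TSS."""
    offset = cpg_pos - tss_pos if strand == "+" else tss_pos - cpg_pos
    return -10_000 <= offset <= 100


def assign_region(overlapping):
    """Pick one category from the set of categories a CpG overlaps."""
    for region in PRIORITY:
        if region in overlapping:
            return region
    # A CpG that overlaps no annotated feature is treated as intergenic.
    return "intergenic"


example_region = assign_region({"intron", "promoter"})
```

A fixed priority list makes the assignment unique even when a CpG overlaps several features, for instance an intron of one transcript and the promoter window of another.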
For each probe, we examined whether the assigned species ENSEMBL ID is identical to the human-to-other-species orthologous ENSEMBL ID in the human mappability file. The orthologous comparison with human was done for genomes that could be matched to the human genome by "targetSpecies_homolog_associated_gene_name" in BioMart using the getLDS() function. Cell and tissue specific chromatin state annotations were based on the 25-state ChromHMM model based on imputed data for 12 marks (Ernst and Kellis, 2015; Roadmap Epigenomics Consortium et al., 2015). The chromatin state annotations from a ChromHMM model that was not specific to a single cell or tissue type were from (Vu and Ernst, 2020). We also provide in the annotation files of the array ChromHMM chromatin state annotations for mouse from (Gorkin et al., 2020). The human-mouse LECIF score was from (Kwon and Ernst, 2020).

CpG island annotation

We called CpG islands using the "gCluster" algorithm (Gómez-Martín et al., 2018). This algorithm, run with the default parameters, uses clustering methods to identify the sequences that have high G+C content and CpG density. Besides CpG island status, this algorithm calculated several other attributes including length, GC content, and CpG density for each defined island. The outcome of this algorithm was a BED file that was used to annotate the probes using the "annotatr" package in R by checking the overlap of the aligned probes and CpG island genomic coordinates.

Human DNA methylation distribution
We downloaded the fraction methylated values based on whole genome bisulfite sequencing data from 37 different cell and tissue types from the Roadmap Epigenomics Consortium (http://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/FractionalMethylation.tar.gz) (Roadmap Epigenomics Consortium et al., 2015). For each CpG, we averaged the fractional methylation values across the Roadmap samples.

GREAT analysis

We applied the GREAT analysis software tool (McLean et al., 2010) to conduct gene set enrichments for genes near CpGs on the array in human and mouse. The GREAT software performs both a binomial test (over genomic regions) and a hypergeometric test over genes when using a whole genome background. We performed the enrichment based on default settings (Proximal: 5.0 kb upstream, 1.0 kb downstream, plus Distal: up to 1,000 kb) for gene sets associated with GO terms, MSigDB, PANTHER and KEGG pathways. To avoid large numbers of multiple comparisons, we restricted the analysis to the gene sets with between 10 and 3,000 genes. We report nominal P values and two adjustments for multiple comparisons: Bonferroni correction and the Benjamini-Hochberg false discovery rate.

Tissue enrichment analysis

The enrichment of tissue specific genes was assessed with the TissueEnrich R package (Jain and Tuteja, 2019) using the teEnrichment() function, limited to the Human Protein Atlas (Uhlén et al., 2015) and mouse ENCODE (Yue et al., 2014) databases.

Normalization methods
R software scripts implementing normalization methods can be accessed through our webpage (see the section on Data availability). Two software scripts are currently available for extracting beta values from raw signal intensities, based on minfi and SeSAMe, respectively. Both methods use the noob method (Triche et al., 2013) for background subtraction. The two scripts evaluate each probe's hybridization and extension performance using normalization control probes and Infinium-I probe out-of-band measurements (the pOOBAH method (Zhou et al., 2018)), respectively. Users can use the detection p-values for each CpG to filter out non-significant methylation readouts from probes unlikely to work in the target species.

Calibration data

We generated methylation data on two different platforms: the mammalian methylation array (HorvathMammalMethylChip40) and the human EPIC methylation array. The DNA samples from each species were enzymatically manipulated so that they would exhibit 0%, 25%, 50%, 75% and 100% methylation at each CpG location, respectively. We purchased premixed DNA standards from EpigenDx Inc (products 80-8060H-PreMixHuman, 80-8060M-PreMixMouse, and Standard80-8060R-PreMixRat Premixed Calibration Standard). The variable "ProportionMethylated" (with ordinal values 0, 0.25, 0.5, 0.75, 1) can be interpreted as a benchmark for each CpG that maps to the respective genome. Thus, the DNA methylation levels of each CpG are expected to have a high positive correlation with ProportionMethylated across the array measurements from a given species.
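The benchmark comparison amounts to computing, for each CpG, the Pearson correlation between its beta values and ProportionMethylated across arrays, and then counting the CpGs above a threshold. The Python sketch below illustrates this with invented toy values; the function names `pearson` and `count_calibrated` and the probe labels are hypothetical, and the authors' own analysis scripts are written in R.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (nan if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def count_calibrated(betas_by_cpg, proportion_methylated, threshold=0.90):
    """Count CpGs whose correlation with the benchmark exceeds the threshold."""
    return sum(
        pearson(betas, proportion_methylated) > threshold
        for betas in betas_by_cpg.values()
    )


# Invented toy data: one array per benchmark level, three hypothetical probes.
proportion_methylated = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_betas = {
    "probe_a": [0.05, 0.27, 0.49, 0.70, 0.93],  # tracks the benchmark closely
    "probe_b": [0.50, 0.50, 0.50, 0.50, 0.50],  # constant, correlation undefined
    "probe_c": [0.40, 0.10, 0.60, 0.30, 0.50],  # weakly related to the benchmark
}
n_calibrated = count_calibrated(toy_betas, proportion_methylated)
```

A constant probe yields an undefined correlation, which the sketch treats as not passing the threshold, since a comparison with nan is false.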
The mammalian array was applied to synthetic DNA data from 3 species: human (n=10 mammalian arrays), mouse (n=20), and rat (n=15). Similarly, the human EPIC array was applied to calibration data from mouse (n=15 EPIC arrays) and rat (n=10). Thus, we applied 3 EPIC arrays and 2 EPIC arrays per value (0, 0.25, 0.5, 0.75, 1) of ProportionMethylated in our mouse and rat studies, respectively. The EPIC array data were normalized using the noob method (R function preprocessNoob in minfi).

Data availability

The mammalian methylation array (HorvathMammalMethylChip40) is registered at the NCBI Gene Expression Omnibus (GEO) as platform GPL28271. The chip manifest file, calibration data, supplementary data, and R software scripts are or will be available from https://github.com/shorvath/MammalianMethylationConsortium/ or the Gene Expression Omnibus.

Acknowledgements and Funding

This work was supported by the Paul G. Allen Frontiers Group (SH) and NSF CAREER award #1254200, National Institutes of Health (DP1DA044371) and a JCCC-BSCRC Ablon Scholars Award (JE).

Conflict of Interest Statement

The Regents of the University of California is the sole owner of a provisional patent application directed at this invention for which AA, JE and SH are named inventors. SH is a founder of the non-profit Epigenetic Clock Development Foundation, which plans to license several patents from his employer UC Regents, and distributes the mammalian methylation array.
Bret Barnes is an employee of Illumina Inc, which manufactures the mammalian methylation array. The other authors declare no conflicts of interest.

No. of CpGs whose correlation with ProportionMethylated > threshold

Species  Threshold  Mammal+Sesame  Mammal+Minfi  EPIC+Minfi
Mouse    0.85       27,868         26,944        4,550
Mouse    0.90       24,050         22,207        2,356
Mouse    0.95       16,444         12,797        604
Rat      0.85       26,425         25,779        17,650
Rat      0.90       22,427         20,989        6,159
Rat      0.95       15,101         12,848        819
Human    0.85       36,438         35,761        NA
Human    0.90       34,547         33,402        NA
Human    0.95       30,327         28,445        NA

Table 1. Correlating DNA methylation levels with calibration data. We evaluated the Mammalian Methylation Array with two different software methods for normalization: SeSAMe and Minfi (noob normalization). The EPIC array data were normalized only with the noob method in Minfi. As indicated in the first column, the DNA samples came from three species: human (n=10 arrays), mouse (n=20), and rat (n=15). For each species, the “artificial” chromosomes exhibited on average 0%, 25%, 50%, 75% and 100% methylation at each CpG location. Thus, the variable “ProportionMethylated” (with ordinal values 0, 0.25, 0.5, 0.75, 1) can be considered a benchmark/gold standard. The table reports the number of CpGs for which the Pearson correlation with ProportionMethylated was greater than the correlation threshold (second column).
Figures

Figure 1. Overview of mammalian methylation array design process. (a) Toy example of a multiple sequence alignment at a CpG site considered by the CMAPS algorithm. The orange coloring highlights the CpG being targeted. Positions where other species have an alignment that matches the human sequence are in dark blue; positions where other species have an alignment that does not match the human sequence are in neon yellow; positions where other species have no alignment are in grey. (b) Flowchart detailing the selection of probes on the array by the CMAPS algorithm. A small fraction of the designed probes were dropped during the manufacturing process.

Figure 2. CpG and gene coverage of probes on the mammalian methylation array across different phylogenetic orders. (a) Probe localization based on the QuasR package (Gaidatzis et al., 2015). The rows correspond to different phylogenetic orders, arranged according to the phylogenetic tree and increasing distance to human. The boxplots report the median number of mapped probes across species from the given phylogenetic order.
(b) The number of probes mapped to human orthologous genes for a subset of genomes (Methods). (c) Percentage of the probes associated with human orthologous genes among mapped probes in these species.

Figure 3. CpG island and chromatin state analysis of mammalian methylation probes. We characterize the CpGs located on the mammalian methylation array regarding (a) CpG island status in different phylogenetic orders, (b) chromatin state, and (c) the LECIF score of evidence of human-mouse conservation at the functional genomics level. (a) The boxplots report the median number (and interquartile range) of CpGs that map to CpG islands in mammalian species of a given phylogenetic order (x-axis). The notch around the median depicts the 95% confidence interval. (b) The heatmap visualizes the ChromHMM chromatin state annotations of the locations of the CpGs on the array (rows) in different human tissues (columns) (Ernst and Kellis, 2012, 2015). The colors correspond to 25 human chromatin states as detailed in the right panel. The probes in the left panel heatmap are ordered by the chromatin state with the maximum median frequency across 127 human cell and tissue types. The right panel indicates the distribution of chromatin states in each tissue type represented on the mammalian methylation array. (c) Comparison of the distribution of LECIF scores for probes on the array and for aligning bases between human and mouse.
The LECIF score has been binned as shown on the x-axis, and the fraction of probes or aligning bases with scores in each bin is shown on the y-axis.

Figure 4. Distribution of probe intensities within sample, colored by the expected percentage of methylation at each site. (a-c) Distribution of beta values (relative intensity) of all probes on the array before normalization for (a) human samples, (b) mouse samples, and (c) rat samples. (d-f) Distribution of probe intensity after SeSAMe normalization, restricting probes to those that CMAPS designed to (d) the human genome in human samples, (e) the mouse genome in mouse samples, and (f) the rat genome in rat samples.

Figure 5. Calibration data: mean methylation across probes shared between the human EPIC array and the mammalian array. The mammalian methylation array contains 5574 probes that target CpGs also found on the human EPIC array and that were not included on the basis of being human biomarkers. However, the mammalian array probes were engineered differently from the EPIC probes so that they would be more likely to work across mammals.
By applying both array types to calibration data, we are able to compare the calibration of the overlapping probes in mice (a,c) and rats (b,d). Upper panels (a,b) and lower panels (c,d) present the results for the mammalian array and the EPIC array, respectively. Each panel plots the benchmark measure (ProportionMethylated, x-axis) against the mean methylation across roughly 4341 CpGs that map to mice (a,c) or roughly 3948 CpGs that map to rats (b,d). The mean methylation (y-axis) was formed across a subset of CpGs that i) are present on the human EPIC array, ii) are present on the mammalian array, and iii) apply to the respective species according to the mappability analysis genome coordinate file.

Panel annotations: (a) Mouse, mammalian array, SeSAMe: cor=0.96, p=2.2e-11. (b) Rat, mammalian array, SeSAMe: cor=0.96, p=1.5e-08. (c) Mouse DNA, EPIC array: cor=0.79, p=0.0013. (d) Rat DNA, EPIC array: cor=0.79, p=0.11.
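The y-axis quantity in Figure 5 can be sketched as follows. This is a hedged illustration only: the function names, probe identifiers and beta values are hypothetical, and the sketch assumes every shared CpG has a beta value on every array. It intersects the three probe sets named in conditions i) to iii) and averages the beta values over that intersection for each array, which gives one point per array to plot against ProportionMethylated.

```python
from statistics import fmean


def shared_cpgs(epic_ids, mammal_ids, maps_to_species):
    """CpGs present on both arrays that also map to the target species."""
    return set(epic_ids) & set(mammal_ids) & set(maps_to_species)


def mean_methylation_per_array(betas_by_array, cpgs):
    """Average beta value over the given CpGs, separately for each array.

    betas_by_array: dict mapping an array id to a dict of CpG id -> beta value.
    Raises KeyError if an array lacks one of the requested CpGs.
    """
    ordered = sorted(cpgs)
    return {array: fmean(betas[c] for c in ordered)
            for array, betas in betas_by_array.items()}


# Hypothetical toy input: four EPIC probes, four mammalian-array probes,
# and the subset of probes that maps to the mouse genome.
epic_ids = ["cg1", "cg2", "cg3", "cg4"]
mammal_ids = ["cg2", "cg3", "cg4", "cg5"]
mouse_mappable = ["cg2", "cg3", "cg5"]
shared = shared_cpgs(epic_ids, mammal_ids, mouse_mappable)

betas_by_array = {
    "array_0pct": {"cg2": 0.1, "cg3": 0.2, "cg4": 0.9},
    "array_100pct": {"cg2": 0.8, "cg3": 0.9, "cg4": 0.1},
}
proportion_methylated = {"array_0pct": 0.0, "array_100pct": 1.0}

means = mean_methylation_per_array(betas_by_array, shared)
# One (x, y) point per array: benchmark value versus mean methylation.
points = sorted((proportion_methylated[a], m) for a, m in means.items())
```

Probe cg4 is on both arrays but does not map to the mouse genome in this toy input, so it is excluded from the means even though its beta values run against the benchmark.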
Supplementary Figures

Supplementary Figure S1: Comparison of probe context between the Illumina EPIC, 450K and the Mammalian Methylation array: analysis of (a) CpG and non-CpG (CH) probes, (b) color channel assignment, (c) type I and type II probes, and (d) next base reveals similar percentages across probes from these three array platforms. Color channel assignment and probe base-pair context are important for DNA methylation array analysis, and the similarity between these arrays can facilitate the extension of published analysis and normalization methods. Analysis of type I and type II probes shows a slightly lower percentage of type I probes for the mammalian methylation array. Type I probes assay DNA methylation using one color channel and two bead types, i.e. one unmethylated bead type and one methylated bead type. Conversely, type II probes assay DNA methylation using one bead type and two color channels indicating methylated and unmethylated cytosines. Adjustment for DNA methylation signal detected by these different probe types is one of the most important steps in DNA methylation array normalization, and a sufficient number of type I probes were included in the Mammalian Methylation array to facilitate the extension of published data normalization methods.
(e) Comparison of shared and non-shared probes between the Mammalian Methylation array and MouseMethylation array loci reveals 3107 shared probes. (f) Comparison of shared and non-shared probes between the EPIC, 450K and Mammalian Methylation arrays. Comparative analysis was performed using Illumina probe IDs, which are unique to each probe. Intersection of IDs between arrays reveals over 5,000 probes that are common to all platforms (center). These probes can be used to follow up published human epigenome-wide association study (EWAS) results in model organisms such as mouse (Mus musculus) or rat (Rattus norvegicus), or across a range of other species, including all primates and other mammals.

Supplementary Figure S2. Chromosome and gene region analysis of mammalian methylation probes in humans and mice. The analysis is based on mapping probes on the mammalian methylation array to the human (hg19) and mouse (mm10) genomes using the QuasR package (Gaidatzis et al., 2015). (a) The number of probes per human and mouse chromosome. (b) The left panel reports the percentage of probes that are located in different gene regions (promoters, 5' UTR, 3' UTR, introns, exons) in humans and mice. The right panel reports the distribution of the probes relative to the nearest transcriptional start site. (c) Histogram of CpG number in different gene regions in human and mouse genomes (as defined in the legend of panel d). (d) Alignment to orthologous genes between humans and mice. The colors indicate the mapped gene region in the mouse genome.
The unique region assignments are prioritized as follows: exons, promoters, introns, 5' UTR, 3' UTR.

Supplementary Figure S3. GREAT gene set enrichment analysis of all probes on the mammalian methylation array. The figure shows the top enriched pathways based on gene-level enrichment analysis for genes proximal to probes using GREAT (McLean et al., 2010). The two columns correspond to enrichment analysis for the human (hg19) and mouse (mm10) genomes, respectively, using the whole genome as background. The top five enriched datasets from each category (canonical pathways, diseases, gene ontology, human and mouse phenotypes, and upstream regulators) were selected and further filtered for significance at p < 10^-5. The category is indicated by the shape, the number of genes by the size of the shape, and the significance of the enrichment by the color scale.

Supplementary Figure S4. Human and mouse tissue-specific probes on mammalian methylation array.
Characterization of the tissue specificity of CpG probes on the mammalian methylation array using the Human Protein Atlas (Uhlén et al., 2015) and mouse ENCODE gene expression data (Yue et al., 2014). The left and right panels report results for human and mouse genomes, respectively. Each probe is mapped to the closest gene, while other genes in the flanking region are ignored in this analysis. (a) The number of genes and (b) the number of CpG probes versus a categorical measure of tissue specificity. The categories on the y-axis are defined in the TissueEnrich software as follows. "Tissue Enriched" labels genes with an expression level greater than 1 (TPM or FPKM) that also have at least five-fold higher expression levels in a particular tissue compared to all other tissues. "Group Enriched" labels genes with an expression level greater than 1 (TPM or FPKM) that also have at least five-fold higher expression levels in a group of 2-7 tissues compared to all other tissues, and that are not considered Tissue Enriched. "Tissue Enhanced" labels genes with an expression level greater than 1 (TPM or FPKM) that also have at least five-fold higher expression levels in a particular tissue compared to the average levels in all other tissues, and that are not considered Tissue Enriched or Group Enriched. (c) The number of tissue-enriched genes represented on the mammalian array versus the background in the human and mouse transcriptomes.

Supplementary Figure S5. Distribution of DNA methylation levels.
Distribution of average fractional methylation across 37 cell and tissue types (Roadmap Epigenomics Consortium et al., 2015) at CpG sites on the array (blue) and all sites in the genome (red).

Supplementary Figure S6: Mammalian Methylation Array enrichment for Universal Chromatin State Annotations. (Left) Distribution of probe overlap with a universal chromatin state annotation by the stacked modeling approach of ChromHMM applied to data from more than 100 cell or tissue types (Vu and Ernst, 2020). (Right) The same as left, but showing the fold enrichments of the state relative to a uniform background. The strongest enrichment is seen for some bivalent promoter states. A full characterization of the states can be found in Vu and Ernst (2020).

References

Aryee, M.J., Jaffe, A.E., Corrada-Bravo, H., Ladd-Acosta, C., Feinberg, A.P., Hansen, K.D., and Irizarry, R.A. (2014). Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30, 1363–1369.
Bibikova, M., Le, J., Barnes, B., Saedinia-Melnyk, S., Zhou, L., Shen, R., and Gunderson, K.L. (2009).
Genome-wide DNA methylation profiling using Infinium® assay. Epigenomics 1, 177–200.
Bibikova, M., Barnes, B., Tsan, C., Ho, V., Klotzle, B., Le, J.M., Delano, D., Zhang, L., Schroth, G.P., Gunderson, K.L., et al. (2011). High density DNA methylation array with single CpG site resolution. Genomics 98, 288–295.
Breeze, C.E., Reynolds, A.P., van Dongen, J., Dunham, I., Lazar, J., Neph, S., Vierstra, J., Bourque, G., Teschendorff, A.E., Stamatoyannopoulos, J.A., et al. (2019). eFORGE v2.0: updated analysis of cell type-specific signal in epigenomic data. Bioinformatics 35, 4767–4769.
Cokus, S.J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C.D., Pradhan, S., Nelson, S.F., Pellegrini, M., and Jacobsen, S.E. (2008). Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452, 215–219.
Ernst, J., and Kellis, M. (2012). ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216.
Ernst, J., and Kellis, M. (2015). Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nat. Biotechnol. 33, 364–376.
Farrell, C., Thompson, M., Tosevska, A., Oyetunde, A., and Pellegrini, M. (2020). BiSulfite Bolt: A BiSulfite Sequencing Analysis Platform. BioRxiv 2020.10.06.328559.
Gaidatzis, D., Lerch, A., Hahne, F., and Stadler, M.B. (2015). QuasR: quantification and annotation of short reads in R. Bioinformatics 31, 1130–1132.
Gómez-Martín, C., Lebrón, R., Oliver, J.L., and Hackenberg, M. (2018). Prediction of CpG Islands as an Intrinsic Clustering Property Found in Many Eukaryotic DNA Sequences and Its Relation to DNA Methylation. Methods Mol. Biol. 1766, 31–47.
Gorkin, D.U., Barozzi, I., Zhao, Y., Zhang, Y., Huang, H., Lee, A.Y., Li, B., Chiou, J., Wildberg, A., Ding, B., et al. (2020). An atlas of dynamic chromatin landscapes in mouse fetal development. Nature 583, 744–751.
Guintivano, J., Aryee, M.J., and Kaminsky, Z.A. (2013).
A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. Epigenetics 8, 290–302.
Guo, W., Fiziev, P., Yan, W., Cokus, S., Sun, X., Zhang, M.Q., Chen, P.-Y., and Pellegrini, M. (2013). BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics 14, 774.
Haeussler, M., Zweig, A.S., Tyner, C., Speir, M.L., Rosenbloom, K.R., Raney, B.J., Lee, C.M., Lee, B.T., Hinrichs, A.S., Gonzalez, J.N., et al. (2019). The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858.
Hannum, G., Guinney, J., Zhao, L., Zhang, L., Hughes, G., Sadda, S., Klotzle, B., Bibikova, M., Fan, J.-B., Gao, Y., et al. (2013). Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol. Cell 49, 359–367.
Horvath, S. (2013). DNA methylation age of human tissues and cell types. Genome Biol. 14, R115.
Horvath, S., and Levine, A.J. (2015). HIV-1 Infection Accelerates Age According to the Epigenetic Clock. J. Infect. Dis. 212, 1563–1573.
Horvath, S., Oshima, J., Martin, G.M., Lu, A.T., Quach, A., Cohen, H., Felton, S., Matsuyama, M., Lowe, D., Kabacik, S., et al. (2018). Epigenetic clock for skin and blood cells applied to Hutchinson Gilford Progeria Syndrome and ex vivo studies. Aging 10, 1758–1775.
Houseman, E.A., Accomando, W.P., Koestler, D.C., Christensen, B.C., Marsit, C.J., Nelson, H.H., Wiencke, J.K., and Kelsey, K.T. (2012). DNA methylation arrays as surrogate measures of cell mixture distribution.
BMC Bioinformatics 13, 86.
Jain, A., and Tuteja, G. (2019). TissueEnrich: Tissue-specific gene enrichment analysis. Bioinformatics 35, 1966–1967.
Kwon, S.B., and Ernst, J. (2020). Learning a genome-wide score of human-mouse conservation at the functional genomics level. BioRxiv 2020.09.08.288092.
Levine, M.E., Lu, A.T., Quach, A., Chen, B.H., Assimes, T.L., Bandinelli, S., Hou, L., Baccarelli, A.A., Stewart, J.D., Li, Y., et al. (2018). An epigenetic biomarker of aging for lifespan and healthspan. Aging 10, 573–591.
Li, X., Chen, F., and Chen, Y. (2020). Gcluster: a simple-to-use tool for visualizing and comparing genome contexts for numerous genomes. Bioinformatics 36, 3871–3873.
Lister, R., Pelizzola, M., Dowen, R.H., Hawkins, R.D., Hon, G., Tonti-Filippini, J., Nery, J.R., Lee, L., Ye, Z., Ngo, Q.-M., et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322.
McLean, C.Y., Bristor, D., Hiller, M., Clarke, S.L., Schaar, B.T., Lowe, C.B., Wenger, A.M., and Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501.
Meissner, A., Gnirke, A., Bell, G.W., Ramsahoye, B., Lander, E.S., and Jaenisch, R. (2005). Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 33, 5868–5877.
Needhamsen, M., Ewing, E., Lund, H., Gomez-Cabrero, D., Harris, R.A., Kular, L., and Jagodic, M. (2017). Usability of human Infinium MethylationEPIC BeadChip for mouse DNA methylation studies. BMC Bioinformatics 18, 486.
Ooi, S.K.T., Qiu, C., Bernstein, E., Li, K., Jia, D., Yang, Z., Erdjument-Bromage, H., Tempst, P., Lin, S.-P., Allis, C.D., et al. (2007). DNMT3L connects unmethylated lysine 4 of histone H3 to de novo methylation of DNA. Nature 448, 714–717.
Pidsley, R., Zotenko, E., Peters, T.J., Lawrence, M.G., Risbridger, G.P., Molloy, P., Van Djik, S., Muhlhausler, B., Stirzaker, C., and Clark, S.J. (2016). Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome Biol. 17, 208.
Roadmap Epigenomics Consortium, Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330.
Robertson, K.D. (2005). DNA methylation and human disease. Nat. Rev. Genet. 6, 597–610.
Rosenbloom, K.R., Armstrong, J., Barber, G.P., Casper, J., Clawson, H., Diekhans, M., Dreszer, T.R., Fujita, P.A., Guruvadoo, L., Haeussler, M., et al. (2015). The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 43, D670–D681.
Smith, Z.D., and Meissner, A. (2013). DNA methylation: roles in mammalian development. Nat. Rev. Genet. 14, 204–220.
Triche, T.J., Weisenberger, D.J., Van Den Berg, D., Laird, P.W., and Siegmund, K.D. (2013). Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res. 41, e90.
Uhlén, M., Fagerberg, L., Hallström, B.M., Lindskog, C., Oksvold, P., Mardinoglu, A., Sivertsson, Å., Kampf, C., Sjöstedt, E., Asplund, A., et al. (2015). Proteomics. Tissue-based map of the human proteome. Science 347, 1260419.
Vu, H., and Ernst, J. (2020). Universal annotation of the human genome through integration of over a thousand epigenomic datasets. BioRxiv 2020.11.17.387134.
Yates, A.D., Achuthan, P., Akanni, W., Allen, J., Allen, J., Alvarez-Jarreta, J., Amode, M.R., Armean, I.M., Azov, A.G., Bennett, R., et al. (2020). Ensembl 2020. Nucleic Acids Res. 48, D682–D688.
Yu, G., Wang, L.-G., and He, Q.-Y. (2015). ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383.
Yue, F., Cheng, Y., Breschi, A., Vierstra, J., Wu, W., Ryba, T., Sandstrom, R., Ma, Z., Davis, C., Pope, B.D., et al. (2014). A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364.
Zhou, W., Triche, T.J., Jr, Laird, P.W., and Shen, H. (2018). SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res. 46, e123.

10_1101-2021_01_08_425887 ---- Auto-CORPus: Automated and Consistent Outputs from Research Publications

Yan Hu1,a, Shujian Sun1,a, Thomas Rowlands2, Tim Beck2,3,b, and Joram M. Posma1,3,b
1 Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, SW7 2AZ, United Kingdom
2 Department of Genetics and Genome Biology, University of Leicester, LE1 7RH, United Kingdom
3 Health Data Research (HDR) UK, United Kingdom
a These authors contributed equally.
b These authors contributed equally.
Abstract

Motivation: The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyse larger corpora using open source tools. Text mining of biomedical literature is one area for which NLP has been used in recent years with large untapped potential. However, in order to generate corpora that can be analyzed using machine learning NLP algorithms, these need to be standardized. Summarizing data from literature to be stored into databases typically requires manual curation, especially for extracting data from result tables.

Results: We present here an automated pipeline that cleans HTML files from biomedical literature. The output is a single JSON file that contains the text for each section, table data in machine-readable format and lists of phenotypes and abbreviations found in the article. We analyzed a total of 2,441 Open Access articles from PubMed Central, from both Genome-Wide and Metabolome-Wide Association Studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions.

Availability: The Auto-CORPus package is freely available with detailed instructions from GitHub at https://github.com/jmp111/AutoCORPus/.

information artefact ontology | natural language processing | text standardization

Correspondence: timbeck [at] leicester.ac.uk and jmp111 [at] ic.ac.uk

Introduction

Natural language processing (NLP) is a branch of artificial intelligence that uses computers to process, understand and use human language. NLP is applied in many different fields including language modelling, speech recognition, text mining and translation systems.
In the biomedical realm, NLP has been applied to extract, for example, medication data from electronic health records and patient clinical history from clinical notes, significantly speeding up extraction that would otherwise be performed manually by experts (1). Biomedical publications, unlike structured electronic health records, are semi-structured, and this makes it difficult to extract and integrate the relevant information (2). The format of research articles differs between publishers, and sections describing the same entity, for example statistical methods, can be found in different locations in different publications. Both unstructured text and semi-structured document elements, such as headings, main texts and tables, can contain important information that can be extracted using text mining (3).

The development of the genome-wide association study (GWAS) has been driven by the ongoing revolution in high-throughput genomic screening and a deeper understanding of the relationship between genetic variations and diseases/traits (4). In a typical GWAS, researchers collect data from study participants, use single nucleotide polymorphism (SNP) arrays to detect the common variants among participants, and conduct statistical tests to determine if the association between the variants and traits is significant. The results are mostly represented in publication tables, but can also be found in the main text, and there are multiple community efforts to store these reported associations in queryable, online databases (5, 6). These efforts involve time-intensive and costly manual data curation to transcribe results from the publications, and supplementary information, into databases. Summary-level GWAS results are generally reported in the literature according to community norms (e.g.
a SNP associated with a phenotype with a probability value), hence NLP algorithms can be trained to recognize the formats in which data are reported to facilitate faster and scalable information extraction that is less prone to human error. Development of effective automatic text mining algorithms for GWAS literature can also potentially benefit other fields in biomedical research as the body of biomedical literature grows every day. Yet previous attempts at mining scientific literature focused mainly on information extraction from abstracts, and some on the main text, while for the most part ignoring tables. To facilitate the process of preparing a corpus for NLP tasks such as named-entity recognition (NER), text classification or relationship extraction, we have developed an Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPus) as a Python package. The main aims of Auto-CORPus are:

• To provide clean text outputs for each publication section with standardized section names
• To represent each publication’s tables in a JavaScript Object Notation (JSON) format to facilitate data import into databases
• To use the text outputs to find abbreviations used in the text

We exemplify the package on a corpus of 1,200 Open Access GWAS publications whose data have been manually added to the GWAS Central database to list phenotypes, SNPs and P-values found in the cleaned text (Figure 1). In addition, we include data on 1,200+ Metabolome-Wide Association Studies (MWAS) to ensure the methods are not biased towards one domain. MWAS focus on small molecules, some of which are end-products of cellular regulatory processes, that are the response of the human body to genetic or environmental variations (7).

Hu and Sun, et al. | bioRχiv | January 8, 2021 | 1–10
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.08.425887; this version posted January 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Materials and Methods

Data. Hypertext Markup Language (HTML) files for 1,200 Open Access GWAS publications whose data exist in the GWAS Central database (5) were downloaded from PubMed Central (PMC) in March 2020. A further 1,241 Open Access publications of MWAS on cancer, gastrointestinal diseases, metabolic syndrome, sepsis and neurodegenerative, psychiatric, and brain illnesses were also downloaded in the same format. Publisher versions of ca. 10% of these publications were downloaded in July 2020 to test the algorithms on publications with different HTML formats. The GWAS dataset was randomly divided into 700 training publications to develop algorithms, and a test set of the remaining 500 publications.

Processing. HTML files were loaded using the Beautifulsoup4 HTML parser package (v4.9.0). Beautifulsoup4 was used to convert HTML files to tree-like structures with each branch representing an HTML section and each leaf an HTML element.
After HTML files were loaded, all superscripts, subscripts, and italics were converted to plain text. Using the default configuration, Auto-CORPus extracts h1, h2 and h3 tags for titles and headings, and p tags for paragraph texts. The headings and paragraphs are saved in a structured JavaScript Object Notation (JSON) file for each HTML file. Tables are extracted from the document using a different set of configuration files (separate configurations for different table structures can be defined and used) and saved in a new JSON model. This model ensures that tables of all formats and origins, not only those from GWAS publications, can be described in the same structure, so that they can be used as input to rule-based or deep learning algorithms for data extraction. The data cells are stored in the “result” key, and their corresponding section name and header names are stored in the “section_name” and “columns” keys respectively. Therefore, extracting relationships between cells only requires simple rules.

Fig. 1. Workflow of the Auto-CORPus package.

Ontologies for entity recognition. The Information Artifact Ontology (IAO) was created to serve as a domain-neutral resource for the representation of types of information content entities such as documents, databases, and digital images (8). We used the v2020-06-10 model (9) in which 37 different terms exist that describe headers typically found in biomedical literature.
The extracted headers in the JSON file were first mapped to the IAO terms using the Lexical OWL Ontology Matcher (10). We then used fuzzy matching, with the fuzzywuzzy package (v0.17.0), to map headers to the preferred section header terms and synonyms, with a similarity threshold of 0.8. This threshold was validated by two independent researchers, who confirmed that all matches were accurate.

After the direct IAO mapping and fuzzy matching, unmapped headers still exist. To map these headings, we developed a new method using a directed graph (digraph) for representation, since headers are not repeated within a document, are sequential and have a set order that can be exploited. Digraphs consist of nodes (entities, headers) and edges (links between nodes), and the weight of the nodes and edges is proportional to the number of publications in which these are found. While digraphs from individual publications are acyclic, the combined graph can contain cycles, hence digraphs, as opposed to directed acyclic graphs, are used. Unmapped headers are assigned a section based on the digraph and the headers in the publication that could be mapped (anchor points). For example, at this point in this article the main headers are ‘abstract’ followed by ‘introduction’ and ‘materials and methods’, which could make up a digraph. Another article with headers ‘abstract’, ‘background’ and ‘materials and methods’ has two anchor points that match the digraph, and the unmapped header (‘background’) can be inferred, from its position between the anchor points in the digraph (‘abstract’, ‘materials and methods’), to be ‘introduction’. We use this process to evaluate new potential synonyms for existing terms and identify new potential terms for sections found in biomedical literature.

We used the Human Phenotype Ontology (HPO) to identify disease traits in the full texts. The HPO was developed with the goal of covering all common phenotypic abnormalities in human monogenic diseases (11).
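The fuzzy matching and anchor-based inference described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the Auto-CORPus implementation: the synonym lists and digraph weights are invented toy values, the function names are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for the fuzzywuzzy package so that the sketch has no dependencies.

```python
from difflib import SequenceMatcher

# Hypothetical, heavily reduced set of section terms and synonyms.
IAO_SYNONYMS = {
    "abstract": ["abstract"],
    "introduction": ["introduction"],
    "methods": ["materials and methods", "methods", "experimental section"],
    "results": ["results"],
    "discussion": ["discussion"],
}

# Hypothetical digraph: section -> {next section: number of publications}.
DIGRAPH = {
    "abstract": {"introduction": 90, "methods": 5},
    "introduction": {"methods": 70, "results": 20},
    "methods": {"results": 75, "discussion": 3},
    "results": {"discussion": 85},
}


def fuzzy_map(header, threshold=0.8):
    """Return the best-matching section term, or None below the threshold."""
    cleaned = header.strip().lower()
    best_term, best_score = None, 0.0
    for term, synonyms in IAO_SYNONYMS.items():
        for synonym in synonyms:
            score = SequenceMatcher(None, cleaned, synonym).ratio()
            if score > best_score:
                best_term, best_score = term, score
    return best_term if best_score >= threshold else None


def infer_between(previous, following):
    """Sections the digraph places directly between two anchor points."""
    candidates = [
        middle
        for middle in DIGRAPH.get(previous, {})
        if following in DIGRAPH.get(middle, {})
    ]
    # Rank candidates by the weaker of the two edges on the path.
    return sorted(
        candidates,
        key=lambda m: min(DIGRAPH[previous][m], DIGRAPH[m][following]),
        reverse=True,
    )


def map_headers(headers):
    """Assign each header a list of candidate sections (possibly empty)."""
    anchors = [fuzzy_map(h) for h in headers]
    assigned = []
    for i, term in enumerate(anchors):
        if term is not None:
            assigned.append([term])  # mapped directly or by fuzzy matching
        elif 0 < i < len(anchors) - 1 and anchors[i - 1] and anchors[i + 1]:
            assigned.append(infer_between(anchors[i - 1], anchors[i + 1]))
        else:
            assigned.append([])  # no anchor on both sides: stays unmapped
    return assigned


print(map_headers(["Abstract", "Background", "Materials and Methods"]))
```

Returning a list of candidates per header, rather than a single label, mirrors the point made later in the text that an unmapped header may be assigned to more than one possible section.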
Use cases: regular expression algorithms. Abbreviations in the full text are found using an adaptation of a previously published methodology (12) based on regular expressions, using the abbreviations package (v0.2.5). In brief, the method finds all brackets within a corpus. If the number of words in a bracket is <3, it considers whether the bracketed text could be an abbreviation. It searches for the characters within the brackets in the text on either side of the brackets, one by one. The first character of one of these words must match the first character within the bracket, and the other characters within the bracket must be contained in the words that follow that word. We combine the output of the package with abbreviations defined in the abbreviations section (if found) from the IAO/digraph model.

For phenotype entity recognition, first any abbreviations in paragraphs extracted from the full text are replaced by their definition. This text is then tokenized using the spacy package (v2.3) (model en_core_web_sm) and compared against phenotypes and their synonyms defined by HPO for disease trait matching. P-values and SNPs were identified in the full text and tables based on regular expressions, as they have a standard form. Pairs of P-value-SNP associations are found in the text using dependency parse trees (13).

Use cases: deep learning-based named-entity recognition. The first example of a use case is to recognize the assay with which the data were acquired; however, no models exist for this purpose. We fine-tuned a pre-existing model trained for biomedical NER, the biomedical Bidirectional Encoder Representations from Transformers (bioBERT) (14), using part of our corpus where only MWAS assays were tagged. We applied our fine-tuned model only on the paragraphs in the materials and methods sections to recognize the assays used.
A second bioBERT-based model was fine-tuned on phenotypes, which already exist in the data, and enriched in phenotypes associated with the MWAS publications. This model was applied only on the abstract and paragraphs from the results section. The third example was applied only on paragraphs from the results and discussion sections using an existing model specifically trained to recognize chemical entities, ChemListem (v0.1.0) (15).

Use cases: paragraph classification. Unmapped headers may be mapped to multiple sections if the anchor points are far apart. In order to test the applicability of a machine learning model to classify paragraphs, we trained a random forest classifier on a dataset consisting of 1,242 abstract paragraphs and 936 non-abstract paragraphs. 80% of the data was used for training and the remainder as the test set.

Results

The order of sections in biomedical literature. A total of 21,849 headers were extracted from the 2,441 publications, mapped to IAO (v2020-06-10) terms and visualized by means of a digraph with 372 unique nodes and 806 directed edges (Figure 2A). The major unmapped node is ‘associated data’, which is a header specific to PMC articles that appears at the beginning of each article before the abstract. The main structure of the biomedical articles that were analyzed is: abstract → introduction → materials → results → discussion → conclusion → acknowledgements → footnotes section → references. IAO has separate definitions for ‘materials’ (IAO:0000633), ‘methods’ (IAO:0000317) and ‘statistical methods’ (IAO:0000644) sections, hence they are separate nodes in the graph, and introduction is also often followed by headers that reflect the methods section (and synonyms). There is also a major directed edge from introduction directly to results, with materials and methods placed after the discussion and/or conclusion sections.
All unmapped headers were investigated to evaluate whether some could be used as synonyms for existing categories. The digraph was also inspected by visualizing individual ego-networks, which show the edges around a specific node mapped to an existing IAO term. Figure 2B shows the ego-network for abstract, and four main categories and one potential new synonym (precis, in red) were identified. The majority of unmapped headers (in purple) that follow the abstract relate to a document that is written as one coherent whole, with specific headers for each section or a general header for the full/main text. An additional four unmapped headers relate to ‘materials and methods’ in their broader sense, and these are data, data description, participants and sample. The remaining two categories of unmapped headers to/from abstract can be classified as new sections ‘graphical abstract’ and ‘highlights’. These headers were found alongside, and appear to be distinct from, the (textual) abstract.

Based on the digraph, we then assigned data and data description to be synonyms of the materials section, and participants and sample as a new category termed ‘participants’, which is related to, but deemed distinct from, the existing patients section (IAO:0000635). The same process was applied to ego-networks from other nodes linked to existing IAO terms to add additional synonyms to simplify the digraph. Figure 2C shows the resulting digraph with only existing and newly proposed section terms.
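The construction of the combined digraph and of an ego-network can be illustrated with a small sketch. It assumes each publication is reduced to its ordered list of (already standardized) headers; the three toy publications and all function names are hypothetical, and only the standard library is used rather than whichever graph tooling produced Figure 2.

```python
from collections import Counter

# Hypothetical toy corpus: one ordered list of headers per publication.
PUBLICATIONS = [
    ["abstract", "introduction", "methods", "results", "discussion"],
    ["abstract", "introduction", "results", "discussion", "methods"],
    ["abstract", "graphical abstract", "introduction", "methods",
     "results", "discussion"],
]


def build_digraph(publications):
    """Weight nodes and directed edges by the number of publications
    in which they occur."""
    node_weight, edge_weight = Counter(), Counter()
    for headers in publications:
        node_weight.update(set(headers))
        # Consecutive headers form a directed edge (source, target).
        edge_weight.update(set(zip(headers, headers[1:])))
    return node_weight, edge_weight


def ego_network(edge_weight, ego):
    """Edges that start or end at the ego node."""
    return {edge: w for edge, w in edge_weight.items() if ego in edge}


def strongest_successor(edge_weight, node):
    """Target of the highest-weight outgoing edge from a node."""
    outgoing = {dst: w for (src, dst), w in edge_weight.items() if src == node}
    return max(outgoing, key=outgoing.get) if outgoing else None


node_weight, edge_weight = build_digraph(PUBLICATIONS)
print(ego_network(edge_weight, "abstract"))
print(strongest_successor(edge_weight, "introduction"))
```

Each toy publication is acyclic on its own, yet the combined graph contains the cycle methods → results → discussion → methods, which is why a general digraph rather than a directed acyclic graph is the appropriate structure.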
New proposed elements for the IAO. Each existing IAO term contains one or more synonyms, and extracted headers were first mapped directly to these terms. Any headers that could not be mapped directly are mapped in the second step using fuzzy matching (e.g. the typographical error ‘experemintal section’ in PMC4286171 is correctly mapped to the methods section). The last step involves mapping remaining unmapped headers to existing terms based on the digraph and using the structure (anchor headers) of the publication. Headers that can be mapped to existing terms in the second and third steps are included as synonyms in the model. The existing categories for which new potential synonyms were identified are listed in Tables 1a and 1b with their existing synonyms and newly identified synonyms.

From the analysis of ego-networks, four new potential categories were identified: disclosure, graphical abstract, highlights and participants. Table 2 details the proposed definition and synonyms for these categories. In the digraph in Figure 2C the disclosure section is located towards the end of a publication and in some instances is followed by the conflict of interest section.

Table data extraction with different configurations. PMC articles are standardized, which makes data extraction more straightforward; however, some publications are not deposited into PMC or other repositories and can only be found via publisher websites. While the package has been developed using a large set of PMC articles, we compared the Auto-CORPus output for PMC articles with the output for the equivalent articles made available by the publishers. We found no differences in how headers were extracted and paragraphs were classified based on the digraph. However, the representation of tables does differ substantially between publishers, hence a model developed on PMC articles alone will fail to extract the data.
We circumvent this issue by defining configuration files for different table formats, and we compare the accuracy of the data represented in the JSON format (Figure 3) between PMC and publisher versions of the same papers.

Using the default (PMC) configuration on non-PMC articles, none of the 302 tables are represented accurately in the JSON. Auto-CORPus allows a variety of configuration files (a single file, or all as a batch) to be used to extract data from tables. One configuration file, different to the default, correctly represented the data in JSON format for 93% (280) of tables. The remaining 22 tables could be represented correctly using 8 different configuration files. When the right configuration file is used for non-PMC articles, all tables (100%) are represented identically to the JSON output from the matching PMC version.

Use cases. The extracted paragraphs were classified into one (or more) categories based on the digraph. This is the purpose of the Auto-CORPus package: to prepare a corpus for analysis so that different sections can be used for specific purposes. We detail how these standardized texts can be used for entity recognition.

Paragraph classification. While many headers can be mapped using fuzzy matching plus the digraph structure, some headers remain unmapped (e.g. the headers in purple in Figure 2B: full text, main text, etc.) while others can be assigned to multiple (possible) sections. The choice of assigning multiple categories to unmapped headers based on the digraph is deliberate, as it ensures the algorithm does not wrongly assign a header to only one category (e.g. ‘materials’ over ‘methods’). The next step is to perform the paragraph classification using NLP algorithms to learn from the word usage and context. We show that random forests can be used to this end by training one to distinguish between abstracts and other paragraphs.
435 paragraphs from the test set were predicted using a random forest trained on 1,743 paragraphs. For the test set, we obtained an F1-score of 0.90 for classifying abstracts (precision = 0.91, recall = 0.90) and 0.88 for classifying non-abstracts (precision = 0.87, recall = 0.88).

Abbreviation identification. The abbreviation detection algorithm searches through each paragraph using a rule-based approach to find all abbreviations used. Auto-CORPus then investigates whether a paragraph is mapped to the abbreviations category and, if found, it combines these two lists of abbreviations found in the publication. For example, when applied on an MWAS publication (16) which contains a header titled “ABBREVIATIONS”, the algorithm combines the 9 abbreviations listed by the authors with a further 7 identified from the text (Figure 4), including an abbreviation used with two spellings in the text.

Fig. 2. Digraph generated from analyzing section headers from 2,441 Open Access publications from PubMed Central. (A) The digraph of the v2020-06-10 IAO model consists of 372 unique nodes, of which 24 could be directly mapped to section terms (in orange) and the remainder are unmapped headers (in grey), and 806 directed edges. Relative node sizes and edge widths are directly proportional to the number of publications with these (subsequent) headers.
Blue edges indicate the edge with the highest weight from the source node, edges that exist in fewer than 1% of publications are shown in light grey and the remainder in black. (B) Unmapped nodes connected to ‘abstract’ as ego node, excluding corpus-specific nodes, grouped into different categories. Unlabeled nodes are titles of paragraphs in the main text. (C) Final digraph model used in Auto-CORPus to classify paragraphs after fuzzy matching. This model includes new (proposed) section terms and each section contains new synonyms identified in this analysis. ‘Associated Data’ is included as this is a PMC-specific header found before abstracts and can be used to indicate the start of most articles.

Rule-based extraction of GWAS summary-level data. GWAS Central relies on curated data extracted manually from publications or other databases. We investigated whether a rule-based approach to recognize phenotypes, SNPs and P-values can correctly identify data from publications contained within the database. A rule-based approach, applying the HPO to the 500 GWAS publications from the test set, identified a total of 9,599 unique disease traits (major and minor) in these publications. 949 traits are recorded for these publications in GWAS Central, and the rule-based approach found 449 with a perfect match. For 65% of the publications all traits were correctly identified. SNPs have standardized formats, hence rule-based approaches are well suited for their identification. Likewise, P-values in GWAS publications are typically represented using scientific notation and can also be identified using rule-based methods. A total of 26,031 SNP/P-value pairs were found across the main text and tables of the 500 publications. For 62.4% of publications all associations recorded in the GWAS Central database are also found using this approach.
While 57.6% of these publications present results (SNP/P-value pairs) only in tables, and 94.3% of pairs are found in tables, 276 associations were identified from the main text that are not represented in tables. 2,673 pairs match those recorded in the database (out of a total of 6,969 pairs for these publications); however, many associations in the database are not represented in the main text or tables but in supplementary materials. Auto-CORPus includes a separate function to convert csv/tsv data to the table JSON format (Figure 3), as summary-level results are often saved in these file formats as part of the supplementary information.

Named-entity recognition. Three different deep learning models were used for NER on specific paragraphs of publications. A pre-trained biomedical entity recognition algorithm (14) was fine-tuned using the results from the rule-based approach applied on GWAS data. Example sentences that contain HPO terms were used to fine-tune the transformer model, which was then applied on 928 MWAS publications from four broad and distinct phenotypes (cancer, gastrointestinal diseases, metabolic syndrome, and neurodegenerative, psychiatric and brain illnesses). The fine-tuned deep learning algorithm obtained accuracies between 0.76 and 0.97, averaging around 82.3% (0.823) across all papers (Table 3).

We then fine-tuned the same base model for recognizing assays in text by training on sentences identified from the text that contain assays routinely used in MWAS. The first pass consisted of a rule-based approach, with fuzzy matching, to find sentences with terms, and these were then used to fine-tune the deep learning model. Figure 5 shows the resulting output in JSON format for one MWAS publication (16). Lastly, we applied a domain-specific algorithm for recognizing chemical entities in the text and tables (15) to identify metabolites in the same publication (Figure 5).

Table 1a. Newly identified synonyms for existing IAO terms (00003xx) from the digraph mapping of 2,441 publications. Elements in italics have previously been submitted by us for inclusion into IAO and added in the latest release (v2020-12-09).

Category (IAO identifier) | Existing synonyms (IAO v2020-06-10) | New synonyms identified (a)
abstract (IAO:0000315) | abstract | precis
acknowledgements (IAO:0000324) | acknowledgements, acknowledgments | acknowledgement, acknowledgment, acknowledgments and disclaimer
author contributions (IAO:0000323) | author contributions, contributions by the authors | authors’ contribution, authors’ contributions, authors’ roles, contributorship, main authors by consortium and author contributions
discussion (IAO:0000319) | discussion, discussion section | discussions
footnote (IAO:0000325) | endnote, footnote | footnotes
introduction (IAO:0000316) | background, introduction | introductory paragraph
methods (IAO:0000317) | experimental, experimental procedures, experimental section, materials and methods, methods | analytical methods, concise methods, experimental methods, method, method validation, methodology, methods and design, methods and procedures, methods and tools, methods/design, online methods, star methods, study design, study design and methods
references (IAO:0000320) | bibliography, literature cited, references | literature cited, reference, references, reference list, selected references, web site references
supplementary material (IAO:0000326) | additional information, appendix, supplemental information, supplementary material, supporting information | additional file, additional files, additional information and declarations, additional points, electronic supplementary material, electronic supplementary materials, online content, supplemental data, supplemental material, supplementary data, supplementary figures and tables, supplementary files, supplementary information, supplementary materials, supplementary materials figures, supplementary materials figures and tables, supplementary materials table, supplementary materials tables

Table 1b. Newly identified synonyms for existing IAO terms (00006xx) from the digraph mapping of 2,441 publications. Elements in italics have previously been submitted by us for inclusion into IAO and added in the latest release (v2020-12-09).

Category (IAO identifier) | Existing synonyms (IAO v2020-06-10) | New synonyms identified (a)
abbreviations (IAO:0000606) | abbreviations, abbreviations list, abbreviations used, list of abbreviations, list of abbreviations used | abbreviation and acronyms, abbreviation list, abbreviations and acronyms, abbreviations used in this paper, definitions for abbreviations, glossary, key abbreviations, non-standard abbreviations, nonstandard abbreviations, nonstandard abbreviations and acronyms
author information (IAO:0000607) | author information, authors’ information | biographies, contributor information
availability (IAO:0000611) | availability, availability and requirements | availability of data, availability of data and materials, data archiving, data availability, data availability statement, data sharing statement
conclusion (IAO:0000615) | concluding remarks, conclusion, conclusions, findings, summary | conclusion and perspectives, summary and conclusion
conflict of interest (IAO:0000616) | competing interests, conflict of interest, conflict of interest statement, declaration of competing interests, disclosure of potential conflicts of interest | authors’ disclosures of potential conflicts of interest, competing financial interests, conflict of interests, conflicts of interest, declaration of competing interest, declaration of interest, declaration of interests, disclosure of conflict of interest, duality of interest, statement of interest
consent (IAO:0000618) | consent | informed consent
ethical approval (IAO:0000620) | ethical approval | ethics approval and consent to participate, ethical requirements, ethics, ethics statement
funding source declaration (IAO:0000623) | funding, funding information, funding sources, funding statement, funding/support, source of funding, sources of funding | financial support, grants, role of the funding source, study funding
future directions (IAO:0000625) | future challenges, future considerations, future developments, future directions, future outlook, future perspectives, future plans, future prospects, future research, future research directions, future studies, future work | outlook
materials (IAO:0000633) | materials | data, data description
statistical analysis (IAO:0000644) | statistical analysis | statistical methods, statistical methods and analysis, statistics
study limitations (IAO:0000631) | limitations, study limitations | strengths and limitations, study strengths and limitations

Table 2. Newly proposed categories of entities found in 2,441 publications in the biomedical literature that could not be mapped to existing terms in IAO. Elements in italics have previously been submitted by us for inclusion into IAO and added in the latest release (v2020-12-09).

Proposed category | Proposed definition | Proposed synonyms
disclosure | “A part of a document used to disclose any associations by authors that might be perceived as to potentially interfere with or prevent them from reporting research with complete objectivity.” | author disclosure statement, declarations, disclosure, disclosure statement, disclosures
graphical abstract | “An abstract that is a pictorial summary of the main findings described in a document.” | central illustration, graphical abstract, TOC image, visual abstract
highlights | “A short collection of key messages that describe the core findings and essence of the article in concise form. It is distinct and separate from the abstract and only conveys the results and concept of a study. It is devoid of jargon, acronyms and abbreviations and targeted at a broader, non-technical audience.” | author summary, editors’ summary, highlights, key points, overview, research in context, significance, TOC
participants | “A section describing the recruitment of subjects into a research study. This section is distinct from the ‘patients’ section and mostly focusses on healthy volunteers.” | participants, sample

Table 3. Summary of results for named-entity recognition (NER) of phenotypes in MWAS papers.

Known phenotype | Papers | Accuracy
cancer | 492 | 0.84
gastrointestinal diseases | 37 | 0.97
metabolic syndrome | 286 | 0.80
neurodegenerative, psychiatric, brain illnesses | 113 | 0.76

Discussion

The analysis of our corpus of 2,441 Open Access publications has resulted in identifying well over 100 new synonyms for existing terms used in biomedical literature to indicate what a paragraph is about. In addition, we identified four new potential categories not previously included in the IAO. We previously submitted a subset of the synonyms reported here and one of the new categories for inclusion in the IAO. These have been accepted by the IAO and are included in the latest release (v2020-12-09), hence we presented our analyses using the previous version of IAO that does not include part of our work. In the latest release, the ‘graphical abstract’ section has been added (IAO:0000707) based on our contribution. Also, a new ‘research participants’ (IAO:0000703) section has been added as a contribution by others in the same release; therefore synonyms found here for the new category ‘participants’ section will be proposed in future as synonyms for the ‘research participants’ section. While the disclosure section appears to be distinct from the conflict of interest section due to a directed edge in the digraph, its synonyms could also be proposed to be part of the existing conflict of interest section in IAO.

Standardization of text for NLP is an important step in preparing a corpus. Auto-CORPus outputs a JSON file of cleaned text, with standardized headers as well as all data presented in tables in JSON format. Standardizing headers is important because some sections are more important than others for specific tasks.
For example, no new findings can be found in an introduction; however, it is well suited to discovering the main phenotypes under study. Details on how these phenotypes are studied, and with what technologies, can be found only in materials/methods, and findings can be found only in results (and discussion) sections. Hence it is important to classify these paragraphs, and Auto-CORPus does this by using the structure of the publication and the digraph. We showed that we can further improve the assignment by training machine learning models with good accuracy to distinguish between different types of texts in cases where there may be ambiguity; this can be further improved by using a multi-class classifier and using all paragraphs. These data are then available for use in downstream analyses using dedicated algorithms for entity recognition or other methods. Auto-CORPus is able to process all HTML formatted tables from both GWAS and MWAS corpora, as opposed to previous methods which could only operate on 86% of 3,573 tables (17). It takes Auto-CORPus on average 0.77 seconds to process all tables within a publication compared to several minutes if this is done manually. Moreover, Auto-CORPus also supports parallel computing, thereby further reducing the time needed to process publications as these can be run in batch. The structured JSON output is machine readable and can be used to support data import into a database. Here we used the JSON output of Auto-CORPus in several examples to demonstrate some potential use cases. We demonstrated that existing algorithms trained on biomedical data can be fine-tuned to recognize new entities such as assays and phenotypes, which also opens up the possibility of using these data to train new deep learning algorithms for recognizing new entities such as metabolites (as opposed to chemical entities), SNPs and P-values, as well as identifying the relationships between them from text.
NER algorithms have difficulty recognizing terms that are abbreviated; therefore, the list of abbreviations found by Auto-CORPus can be used to replace all abbreviations in the text with their definitions.
Conclusion
The Auto-CORPus package is freely available and can be deployed on local machines as well as using high-performance computing to process publications in batch. A step-by-step guide detailing how to use Auto-CORPus is supplied with the package. The key features of Auto-CORPus are that it:
1. outputs all text and table data in a standardized JSON format,
2. classifies each paragraph into separate categories of text, and
3. is implemented in pure Python code and does not have non-Python dependencies.
Fig. 3. Example of JSON format for table data from this work (shown for Table 3). The Auto-CORPus output for tables consists of ‘status’, ‘error message’ and ‘tables’ as top level fields, ‘tables’ has fields ‘identifier’, ‘title’, ‘columns’, ‘section’ and ‘footer’, and ‘section’ contains ‘section name’ and ‘results’.
Fig. 4. Example of JSON output of abbreviation detection using a rule-based approach on an MWAS publication (16).
Fig. 5. Example of JSON output of named-entity recognition (NER) on an MWAS publication (16) using a fine-tuned transformer-based deep learning model for assays and a bidirectional long short-term memory network for chemical entity recognition.
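The table output described in the Fig. 3 caption can be sketched in a few lines of Python. This is an illustration only, not the package's own code: the field names come from the caption, while the helper name `make_table_output`, the `"success"` status value, the use of a list for 'section', and the storage of cell values as strings are all assumptions made for this example.

```python
import json


def make_table_output(identifier, title, columns, section_name, results, footer=""):
    """Assemble a dict using the field names listed in the Fig. 3 caption.

    The nesting (a list of tables, each holding a list of sections) is an
    assumption; the caption only names the fields.
    """
    return {
        "status": "success",  # assumed value, not taken from the paper
        "error message": "",
        "tables": [
            {
                "identifier": identifier,
                "title": title,
                "columns": columns,
                "section": [
                    {"section name": section_name, "results": results}
                ],
                "footer": footer,
            }
        ],
    }


# Table 3 of the paper, re-entered by hand for illustration.
table3 = make_table_output(
    identifier="3",
    title="Summary of results for named-entity recognition (NER) "
          "of phenotypes in MWAS papers.",
    columns=["Known phenotype", "Papers", "Accuracy"],
    section_name="",
    results=[
        ["cancer", "492", "0.84"],
        ["gastrointestinal diseases", "37", "0.97"],
        ["metabolic syndrome", "286", "0.80"],
        ["neurodegenerative, psychiatric, brain illnesses", "113", "0.76"],
    ],
)

# Serialize to JSON text, as a database import step might expect.
serialized = json.dumps(table3, ensure_ascii=False, indent=2)
```

A structure like this can be read back with `json.loads` and walked column by column, which is the kind of downstream import the text refers to.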
ACKNOWLEDGEMENTS
We thank Mohamed Ibrahim (University of Leicester) for identifying different configurations of tables for different HTML formats, and Joy Li and Filip Makraduli (Imperial College London) for testing the package and providing feedback.
AUTHOR CONTRIBUTIONS
TB and JMP designed and supervised the research. SS and YH developed the pipeline and analyzed data. SS developed the initial table extraction algorithm and implemented the phenotype recognition algorithm. YH developed the section header standardization algorithm and implemented the abbreviation recognition algorithm. SS fine-tuned the table extraction algorithm for use on non-PMC texts. TR refined standardization of full texts and contributed algorithms for UTF-8 and UTF-16 conversions of non-ASCII characters to Unicode. SS, YH, TB and JMP wrote the manuscript.
FUNDING
This work has been supported by Health Data Research (HDR) UK and the Medical Research Council via an UKRI Innovation Fellowship to TB (MR/S003703/1) and a Rutherford Fund Fellowship to JMP (MR/S004033/1).
FOOTNOTE
ORCID: 0000-0002-4971-9003 (JMP).
Bibliography
1. Seyedmostafa Sheikhalishahi, Riccardo Miotto, Joel T Dudley, Alberto Lavelli, Fabio Rinaldi, and Venet Osmani. Natural language processing of clinical notes on chronic diseases: Systematic review. JMIR Med Inform, 7(2):e12239, 4 2019. ISSN 2291-9694. doi: 10.2196/12239.
2. Ramón A-A. Erhardt, Reinhard Schneider, and Christian Blaschke. Status of text-mining techniques applied to biomedical text. Drug Discovery Today, 11(7):315–325, 2006. ISSN 1359-6446. doi: https://doi.org/10.1016/j.drudis.2006.02.011.
3. Nikola Milosevic, Cassie Gregson, Robert Hernandez, and Goran Nenadic. A framework for information extraction from tables in biomedical literature. International Journal on Document Analysis and Recognition (IJDAR), 22(1):55–78, 2 2019. doi: 10.1007/s10032-019-00317-0.
4. Peter M. Visscher, Naomi R. Wray, Qian Zhang, Pamela Sklar, Mark I.
McCarthy, Matthew A. Brown, and Jian Yang. 10 years of GWAS discovery: Biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017. ISSN 0002-9297. doi: https://doi.org/10.1016/j.ajhg.2017.06.005.
5. Tim Beck, Tom Shorter, and Anthony J Brookes. GWAS Central: a comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. Nucleic Acids Research, 48(D1):D933–D940, 10 2019. ISSN 0305-1048. doi: 10.1093/nar/gkz895.
6. Annalisa Buniello, Jacqueline A L MacArthur, Maria Cerezo, Laura W Harris, James Hayhurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, Elliot Sollis, Daniel Suveges, Olga Vrousgou, Patricia L Whetzel, Ridwan Amode, Jose A Guillen, Harpreet S Riat, Stephen J Trevanion, Peggy Hall, Heather Junkins, Paul Flicek, Tony Burdett, Lucia A Hindorff, Fiona Cunningham, and Helen Parkinson. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Research, 47(D1):D1005–D1012, 11 2018. ISSN 0305-1048. doi: 10.1093/nar/gky1120.
7. Jeremy K. Nicholson, Elaine Holmes, and Paul Elliott. The metabolome-wide association study: A new look at human disease risk factors. Journal of Proteome Research, 7(9):3637–3638, 2008. doi: 10.1021/pr8005099. PMID: 18707153.
8.
Werner Ceusters. An information artifact ontology perspective on data collections and associated representational artifacts. Studies in health technology and informatics, 180:68–72, 2012. ISSN 0926-9630.
9. Alan Ruttenberg, Adam Goldstein, Albert Goldfain, Barry Smith, Bjoern Peters, Carlo Torniai, Chris Mungall, Chris Stoeckert, Christian A. Boelling, Darren Natale, David Osumi-Sutherland, Gwen Frishkoff, Holger Stenzhorn, James A. Overton, James Malone, Jennifer Fostel, Jie Zheng, Jonathan Rees, Larisa Soldatova, Lawrence Hunter, Mathias Brochhausen, Matt Brush, Melanie Courtot, Michel Dumontier, Paolo Ciccarese, Pat Hayes, Philippe Rocca-Serra, Randy Dipert, Ron Rudnicki, Satya Sahoo, Sivaram Arabandi, Werner Ceusters, William Duncan, William Hogan, and Yongqun (Oliver) He. Information artefact ontology (v2020-06-10). https://raw.githubusercontent.com/information-artifact-ontology/IAO/v2020-06-10/iao.owl, 2020. Accessed: 2020-06-21.
10. A. Ghazvinian, N. F. Noy, and M. A. Musen. Creating mappings for ontologies in biomedicine: simple methods work. AMIA Annu Symp Proc, 2009:198–202, 11 2009.
11. Peter N. Robinson, Sebastian Köhler, Sebastian Bauer, Dominik Seelow, Denise Horn, and Stefan Mundlos. The human phenotype ontology: A tool for annotating and analyzing human hereditary disease. The American Journal of Human Genetics, 83(5):610–615, 2008. ISSN 0002-9297. doi: https://doi.org/10.1016/j.ajhg.2008.09.017.
12. Ariel Schwartz and Marti Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, 4:451–62, 02 2003. doi: 10.1142/9789812776303_0042.
13. Katrin Fundel, Robert Küffner, and Ralf Zimmer. RelEx—Relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371, 12 2006. ISSN 1367-4803. doi: 10.1093/bioinformatics/btl616.
14. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 09 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz682.
15. Peter Corbett and John Boyle. Chemlistem: chemical named entity recognition using recurrent neural networks. Journal of Cheminformatics, 10(1), 12 2018. doi: 10.1186/s13321-018-0313-8.
16. Charles R. Evans, Alla Karnovsky, Melissa A. Kovach, Theodore J. Standiford, Charles F. Burant, and Kathleen A. Stringer. Untargeted LC–MS metabolomics of bronchoalveolar lavage fluid differentiates acute respiratory distress syndrome from health. Journal of Proteome Research, 13(2):640–649, 12 2013. doi: 10.1021/pr4007624.
17. Nikola Milosevic, Cassie Gregson, Robert Hernandez, and Goran Nenadic. Disentangling the structure of tables in scientific literature. In Elisabeth Métais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera, editors, Natural Language Processing and Information Systems, pages 162–174. Springer International Publishing, 2016. ISBN 978-3-319-41754-7. doi: https://doi.org/10.1007/978-3-319-41754-7_14.
10_1101-2021_01_08_425897 ---- 62441649 APOBEC1 mediated C-to-U RNA editing: target sequence and trans-acting factor contribution to 177 RNA editing events in 119 murine transcripts in-vivo.
Saeed Soleymanjahi1, Valerie Blanc1 and Nicholas O. Davidson1,2
1Division of Gastroenterology, Department of Medicine, Washington University School of Medicine, St. Louis, MO 63105
2To whom communication should be addressed: Email: nod@wustl.edu
Running title: APOBEC1 mediated C to U RNA editing
Keywords: RNA folding; A1CF; RBM47;
January 8, 2021
bioRxiv preprint doi: https://doi.org/10.1101/2021.01.08.425897; this version posted January 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
ABSTRACT (184 words)
Mammalian C-to-U RNA editing was described more than 30 years ago as a single nucleotide modification in APOB RNA in small intestine, later shown to be mediated by the RNA-specific cytidine deaminase APOBEC1. Reports of other examples of C-to-U RNA editing, coupled with the advent of genome-wide transcriptome sequencing, identified an expanded range of APOBEC1 targets. Here we analyze the cis-acting regulatory components of verified murine C-to-U RNA editing targets, including nearest neighbor as well as flanking sequence requirements and folding predictions. We summarize findings demonstrating the relative importance of trans-acting factors (A1CF, RBM47) acting in concert with APOBEC1. Using this information, we developed a multivariable linear regression model to predict APOBEC1-dependent C-to-U RNA editing efficiency, incorporating factors independently associated with editing frequencies based on 103 Sanger-confirmed editing sites, which accounted for 84% of the observed variance. Co-factor dominance was associated with editing frequency, with RNAs targeted by both RBM47 and A1CF observed to be edited at a lower frequency than RBM47-dominant targets. The model also predicted a composite score for available human C-to-U RNA targets, which again correlated with editing frequency.
INTRODUCTION
Mammalian C-to-U RNA editing was identified as the molecular basis for human intestinal APOB48 production more than three decades ago (Chen et al. 1987; Hospattankar et al. 1987; Powell et al. 1987). A site-specific enzymatic deamination of C6666 to U of Apob mRNA was originally considered the sole example of mammalian C-to-U RNA editing, occurring at a single nucleotide in a 14 kilobase transcript and mediated by an RNA-specific cytidine deaminase (APOBEC1) (Teng et al. 1993). With the advent of massively parallel RNA sequencing technology we now appreciate that APOBEC1-mediated RNA editing targets hundreds of sites (Rosenberg et al. 2011; Blanc et al. 2014), mostly within 3’ untranslated regions of mRNA transcripts. This expanded range of targets of C-to-U RNA editing prompted us to reexamine key functional attributes in the regulatory motifs (both cis-acting elements and trans-acting factors) that impact editing frequency, focusing primarily on data emerging from studies of mouse cell and tissue-specific C-to-U RNA editing. Earlier studies identified RNA motifs (Davies et al. 1989) contained within a 26-nucleotide segment flanking the edited cytidine base in vivo (in cell lines) or within 55 nucleotides using S100 extracts from rat hepatoma cells (Bostrom et al. 1989; Driscoll et al. 1989). Those, and other studies, established that Apob RNA editing reflects both the tissue/cell of origin as well as RNA elements remote and adjacent to the edited base (Bostrom et al. 1989; Davies et al. 1989).
A granular examination of the regions flanking the edited base in Apob RNA demonstrated a critical 3’ sequence 6671-6681, downstream of C6666, in which mutations reduced or abolished editing activity (Shah et al. 1991). This 3’ site, termed a “mooring sequence”, was associated with a 27S “editosome” complex (Smith et al. 1991), which was both necessary and sufficient for site-specific Apob RNA editing and editosome assembly (Backus and Smith 1991). Other cis-acting elements include a 5 nucleotide spacer region between the edited cytidine and the mooring sequence, and also sequences 5’ of the editing site that regulate editing efficiency (Backus and Smith 1992; Backus et al. 1994) along with AU-rich regions both 5’ and 3’ of the edited cytidine that together function in concert with the mooring sequence (Hersberger and Innerarity 1998). Advances in our understanding of physiological Apob RNA editing emerged in parallel from both the delineation of key RNA regions (summarized above) and the identification of components of the Apob RNA editosome (Sowden et al. 1996). APOBEC1, the catalytic deaminase (Teng et al. 1993), is necessary for physiological C-to-U RNA editing in vivo (Hirano et al. 1996) and in vitro (Giannoni et al. 1994). Using the mooring sequence of Apob RNA as bait, two groups identified APOBEC1 complementation factor (A1CF), an RNA-binding protein sufficient in vitro to support efficient editing in the presence of APOBEC1 and Apob mRNA (Lellek et al. 2000; Mehta et al. 2000). Those findings reinforced the importance of both the mooring sequence and an RNA binding component of the editosome in promoting Apob RNA editing.
However, while A1CF and APOBEC1 are sufficient to support in vitro Apob RNA editing, neither heterozygous (Blanc et al. 2005) nor homozygous genetic deletion of A1cf impaired Apob RNA editing in vivo in mouse tissues (Snyder et al. 2017), suggesting that an alternate complementation factor was likely involved. Other work identified a homologous RNA binding protein, RBM47, that functioned to promote Apob RNA editing both in vivo and in vitro (Fossat et al. 2014), and more recent studies utilizing conditional, tissue-specific deletion of A1cf and Rbm47 indicate that both factors play distinctive roles in APOBEC1-mediated C-to-U RNA editing, including Apob as well as a range of other APOBEC1 targets (Blanc et al. 2019). These findings together establish important regulatory roles for both cis-acting elements and trans-acting factors in C-to-U mRNA editing. However, the majority of studies delineating cis-acting elements reflect earlier, in vitro experiments using ApoB mRNA and relatively little is known regarding the role of cis-acting elements in tissue-specific C-to-U RNA editing of other transcripts, in vivo. Here we use statistical modeling to investigate the independent roles of candidate regulatory factors in mouse C-to-U mRNA editing using data from in vivo studies from over 170 editing sites in 119 transcripts (Meier et al. 2005; Rosenberg et al. 2011; Gu et al. 2012; Blanc et al. 2014; Rayon-Estrada et al. 2017; Snyder et al. 2017; Blanc et al. 2019; Kanata et al. 2019). We also examined these regulatory factors in known human mRNA targets (Chen et al. 1987; Powell et al. 1987; Skuse et al. 1996; Mukhopadhyay et al. 2002; Grohmann et al.
2010; Schaefermeier and Heinze 2017).
RESULTS
Descriptive data
A total of 177 C-to-U RNA editing sites were identified based on eight studies that met inclusion and exclusion criteria (Meier et al. 2005; Rosenberg et al. 2011; Gu et al. 2012; Blanc et al. 2014; Rayon-Estrada et al. 2017; Snyder et al. 2017; Blanc et al. 2019; Kanata et al. 2019), representing 119 distinct RNA editing targets. 84% (100/119) of RNA targets were edited at one chromosomal location (Figure 1C) and 75% (89/119) of mRNA targets were edited at both a single chromosomal location and also within a single tissue (Figure 1D). The majority of editing sites occur in the 3’ untranslated region (142/177; 80%), with exonic editing sites the next most abundant subgroup (28/177; 16%, Figure 1E). Chromosome X harbors the highest number of editing sites (18/177; 10%), followed by chromosomes 2 and 3 (15/177; 8.5% for both, Supplemental Figure 1). 103/177 editing sites were confirmed by Sanger sequencing, with a mean editing frequency of 37 ± 22%.
Base content of sequences flanking edited and mutated cytidines
AU content was enriched (~87%) in nucleotides both immediately upstream and downstream of the edited cytidine across mouse RNA editing targets (Figure 2A and 2C). The average AU content across the region 10 nucleotides upstream to 20 nucleotides downstream of the edited cytidine was ~70% (60-87%). Because APOBEC1 has been shown to be a DNA mutator (Harris et al. 2003; Wolfe et al. 2019; Wolfe et al. 2020), we determined the AU content of the mutated deoxycytidine region flanking human DNA targets (Nik-Zainal et al. 2012) to be ~66% at a site one nucleotide downstream of the edited base (Figure 2B, C).
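The window-based AU content reported above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `au_content`, the choice to exclude the edited cytidine itself from the window, and the toy sequences are assumptions made for this example. DNA input is handled by treating T as U.

```python
def au_content(seq, site, upstream=10, downstream=20):
    """Percent A/U among the bases flanking seq[site].

    `site` is the 0-based index of the edited cytidine. The cytidine
    itself is excluded from the window (an assumption of this sketch).
    Windows are truncated at the sequence ends.
    """
    left = seq[max(0, site - upstream):site]
    right = seq[site + 1:site + 1 + downstream]
    window = (left + right).upper().replace("T", "U")
    if not window:
        raise ValueError("no flanking bases available")
    au = sum(base in "AU" for base in window)
    return 100.0 * au / len(window)


# Made-up toy sequence with the edited C at index 4.
toy_rna = "AAUUCAUUGG"
toy_percent = au_content(toy_rna, site=4)

# DNA input: T is counted as U.
toy_dna_percent = au_content("TTCTT", site=2)
```

Restricting `upstream` and `downstream` to 1 gives the "immediately flanking" AU content, while the defaults correspond to the 10-upstream to 20-downstream region mentioned in the text.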
The average AU content in the sequence 10 nucleotides upstream and 10 nucleotides downstream of mutated deoxycytidines is 59% (57-66.0%). The average AU content was 90% and 80% in nucleotides immediately upstream and downstream, respectively, of the targeted deoxycytidine in a subgroup of over 700 DNA editing events of the C to T type (Nik-Zainal et al. 2012), which is closer to the distribution found in C to U RNA editing targets. These features suggest that AU enrichment is an important component of the editing function of APOBEC1 on both RNA and DNA targets, especially for the C/dC to U/dT change.
Factors influencing editing frequency
Regulatory-spacer-mooring cassette: We observed no significant associations between editing frequency and mismatches in motif A (r=-0.05, P=.46) or motif B (r=-0.1, P=.20) (Supplemental Figure 2), while mismatches in motif C (r=-0.24, P=.001) and motif D (r=-0.20, P=.008) negatively impacted editing frequency (Figure 3B). AU content of motif B showed a trend towards negative association with editing frequency (r=-0.13, P=.08, Figure 3C), but AU contents of motifs A (r=0.06, P=.4), C (r=-0.02, P=.8), and D (r=-0.02, P=.78) did not impact editing frequency (Supplemental Figure 2). The abundance of G in motif C (r=0.17, P=.02), abundance of C in motif B (r=0.13, P=.08), and G/C fraction in motif C (r=0.14, P=.04) showed either a significant association or a trend toward association with editing frequency. The spacer sequence averaged 5 ± 4 nucleotides, ranging from 0 to 20, with a trend toward association between length and editing frequency (r=-0.14, P=.09).
The mean spacer sequence AU content was 73 ± 23%, with no association between editing frequency and AU content (r=-0.1, P=.2, Supplemental Figure 3). However, G abundance (r=-0.23, P=.01) and G/C fraction (r=-0.20, P=.03) of the spacer showed significant associations with editing frequency in Sanger-confirmed targets. The mean number of mismatches in the first 4 nucleotides of the spacer sequence was 2.5 ± 1, with a higher number of mismatches exerting a significant negative impact on editing frequency (r=-0.24, P=.01) (Figure 3D). The mean number of mismatches in the mooring sequence was 2.1 ± 1.8, ranging from 0 to 8 nucleotides. The number of mismatches showed a significant negative association with editing frequency (r=-0.30, P=.0003, Figure 3E). The base content of individual nucleotides surrounding the edited cytidine showed significant associations with editing frequency, which were more pronounced in nucleotides closer to the edited cytidine (Figure 3F, Supplemental Table 1). Furthermore, overall AU content of downstream sequence +16 to +20 had a positive impact on editing frequency (r=0.17, P=.02) (Supplemental Figure 3). However, G abundance in the downstream 20 nucleotides (r=-0.24, P=.001) and G/C fraction in the downstream 10 nucleotides (r=-0.16, P=.09) showed a significant negative association and a trend toward negative association, respectively, with editing frequency in Sanger-confirmed targets. Secondary structure: We generated a predicted secondary structure for 172 editing sites, with four subgroups based on overall structure and location of the edited cytidine: loop (Cloop), stem (Cstem), tail (Ctail), and non-canonical structure (NC).
The majority of editing sites were in the Cloop subgroup (59%), followed by Cstem (20%), Ctail (13%), and NC (8%) subgroups (Figure 4A). Editing sites in the Ctail subgroup exhibited lower editing frequencies compared to editing sites in Cloop (29 ± 12 vs 41 ± 23%, P=.02) or Cstem (37 ± 21%, P=.04) subgroups. No significant differences were detected in other comparisons (Figure 4B). The edited cytidine was located in the loop, stem, and tail of the secondary structure in 110 (64%), 38 (22%), and 24 (14%) of the edited RNAs, respectively. Editing sites with the edited cytidine within the loop exhibited significantly higher editing frequency compared to those with the edited cytidine in the tail (40 ± 24% vs 28 ± 12%, P=.04). Other subgroups exhibited comparable editing frequencies (Supplemental Figure 4). The majority (78%) of editing sites contained a mooring sequence located in the main stem-loop structure (Figure 4C), with the remainder located in the tail or secondary loop. Average editing efficiency was significantly higher in targets where the mooring sequence was located in the main stem-loop (Figure 4D). We also calculated the proportion of total nucleotides that constitute the main stem-loop in the secondary structure. The average ratio was 0.62 ± 0.18, ranging from 0.28 to 1 (Supplemental Table 2), with higher ratios associated with higher editing frequency of the corresponding editing site (r=0.20, P=.007) (Figure 4E). Finally, we considered the orientation of free tails in the secondary structure in terms of length and symmetry. Symmetric free tails were observed in 59% of editing sites (Supplemental Figure 4).
The length of the 5’ free tail showed a negative association with editing frequency (r=-0.14, P=.04, Figure 4F), while no significant associations were detected between either the length of the 3’ tail or symmetry of tails and editing frequency (Supplemental Figure 4). Trans-acting factors and tissue specificity: Data for relative dominance of cofactors in APOBEC1-dependent RNA editing were available for 72 editing sites for targets in small intestine or liver (Blanc et al. 2019). RBM47 was identified as the dominant factor in 60/72 (83%) sites; A1CF was the dominant factor in 5/72 (7%) editing sites, with the remaining sites (7/72; 10%) exhibiting equal codominance (Figure 5A). The average editing frequencies at editing sites revealed differences across the groups, with 41 ± 20% in RBM47-dominant targets, 23 ± 14% in A1CF-dominant targets, and 27 ± 11% in the co-dominant group (P=.03) (Figure 5B). The majority of RNA editing targets were edited in one tissue (103/119; 86%, Figure 5C), while the maximum number of tissues in which an editing target is edited (at the same site) is 5 (Cd36). The small intestine harbors the highest number of verified editing sites (95/177; 54%), followed by liver (31/177; 17%), and adipose tissue (19/177; 11%, Figure 5D). Sites edited in brain tissue showed the highest average editing frequency (54 ± 35%, n=11), followed by bone marrow myeloid cells (50 ± 22%, n=4), and kidney (47 ± 29%, n=10, Figure 5E). We then developed a multivariable linear regression model to predict APOBEC1-dependent C-to-U RNA editing efficiency, incorporating factors independently associated with editing frequencies (Table 1). This model, based on 103 Sanger-confirmed editing sites with available data for all of the parameters mentioned, accounted for 84% of variance in editing frequency of the editing sites included (R²=0.84, P<.001, Table 1).
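To make the "variance accounted for" statement concrete, the following sketch fits an ordinary least squares model with an intercept and reports R². It is a generic illustration, not the authors' analysis: the function name `fit_ols`, the two toy predictors and the toy responses are invented for this example, and the solver (normal equations with Gauss-Jordan elimination) is chosen only to keep the code dependency-free.

```python
def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    X is a list of predictor rows, y the responses. Returns
    (coefficients, r_squared), where coefficients[0] is the intercept.
    Solves the normal equations (X^T X) b = X^T y by Gauss-Jordan
    elimination with partial pivoting.
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(p):
            if k != col:
                factor = A[k][col] / A[col][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    predicted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Toy data built so that y = 50 - 5*x1 + 2*x2 exactly (invented values).
X_toy = [[0, 1], [1, 0], [2, 2], [3, 1], [4, 3]]
y_toy = [52.0, 45.0, 44.0, 37.0, 36.0]
beta_toy, r2_toy = fit_ols(X_toy, y_toy)
```

Refitting with one predictor column removed and comparing the two R² values is the kind of comparison used to judge how much a single factor contributes to the model.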
The final multivariable model revealed several factors independently associated with editing frequency, specifically the number of mismatches in mooring sequence; regulatory sequence motif D; AU content of regulatory sequence motif B; overall secondary structure for group Ctail vs group Cloop; location of mooring sequence in secondary structure; and a “base content score” parameter that represents base content of the sequences flanking the edited cytidine (Table 1). Removing “base content score” from the model reduced the explanatory power from R²=0.84 to R²=0.59. Next, we added a co-factor dominance variable and fit the model using the 72 editing sites with available data for cofactor dominance. Along with other factors mentioned above, co-factor dominance showed significant association with editing frequency (Table 1), with RNAs targeted by both RBM47 and A1CF observed to be edited at a lower frequency than RBM47-dominant targets. Factors associated with co-factor dominance (Figure 6, Supplemental Table 3, Supplemental Figure 5) included tissue-specificity, with higher frequency of RBM47-dominant sites in small intestine compared to liver (91 vs 63%, P=.008) and A1CF-dominant and co-dominant editing sites more prevalent in liver. The number of mooring sequence mismatches also varied among the three subgroups: 1.1 ± 1.3 in the RBM47-dominant subgroup; 2.0 ± 2.5 in the A1CF-dominant subgroup; and 2.9 ± 0.4 in the co-dominant subgroup (P=.004). This was also the case regarding mismatches in the spacer: 2.4 ± 1.2 in the RBM47-dominant subgroup; 2.7 ± 1.5 in the A1CF-dominant subgroup; 3.8 ± 0.4 in the co-dominant subgroup (P=.02).
AU content (%) of downstream sequence +6 to +10 was higher in the RBM47-dominant subgroup (P=.01). Finally, the location of the edited cytidine in the secondary structure of the mRNA strand was different across the three subgroups (P=.04, Figure 6). We used pairwise multinomial logistic regression to determine factors independently associated with co-factor dominance (Figure 6C, Supplemental Table 4). Ctail editing sites, those with more mismatches in mooring and regulatory motif C, lower AU content in downstream sequence, and higher AU content in regulatory motif D were more likely co-dominant. Editing sites from small intestine and those with higher AU content of downstream sequence were more likely RBM47-dominant. Editing sites from liver and those with higher mismatches in regulatory motif B were more likely A1CF-dominant (Figure 6C).
Human mRNA targets
Finally, we turned to an analysis of human C-to-U RNA editing targets for which this same panel of parameters was available (Table 2). Aside from APOB RNA, which is known to be edited in the small intestine (Chen et al. 1987; Powell et al. 1987), other targets have been identified in central or peripheral nervous tissue (Skuse et al. 1996; Mukhopadhyay et al. 2002; Meier et al. 2005; Schaefermeier and Heinze 2017). The human targets were categorized into low editing (NF1, GLYRα2, GLYRα3) and high editing (APOB, TPH2B exon3, TPH2B exon7) subgroups using 20% as cut-off.
A composite score (maximum=6) was generated based on six parameters introduced in the mouse model with notable variance between the two subgroups, including mismatches in mooring sequence, spacer length, location of the edited cytidine, and relative abundance of stem-loop bases (Table 2). High editing targets exhibited a significantly higher composite score (4.7 vs 2, P=.001) compared to low editing targets, and the composite score correlated significantly with editing frequency in individual targets (r=0.95, P=.005). The canonical editing target ApoB (Chen et al. 1987; Powell et al. 1987) achieved a score of 5 (out of 6), reflecting the observation that one of the six parameters (AU% of regulatory motifs) in human APOB is non-preferential compared to the editing-promoting features identified in the mouse multivariable model.

DISCUSSION

The current study reflects our analysis of 177 C-to-U RNA editing sites from 119 target mRNAs, with the majority residing within the 3’ untranslated region. Our multivariable model identified several key factors influencing editing frequency, including host tissue, base content of nucleotides surrounding the edited cytidine, number of mismatches in regulatory and mooring sequences, AU content of the regulatory sequence, overall secondary structure, location of the mooring sequence, and co-factor dominance. These factors, each exerting independent effects, together accounted for 84% of the variance in editing frequency.
Our findings also showed that mismatches in the mooring and regulatory sequences, AU content of regulatory and downstream sequences, host tissue and secondary structure of target mRNA were associated with the pattern of co-factor dominance. Several aspects of these primary conclusions merit further discussion. Previous studies investigating the key factors that regulate C-to-U mRNA editing were confined to in vitro studies and predicated on a single mRNA target (ApoB) (Backus and Smith 1991; Shah et al. 1991; Smith et al. 1991; Backus and Smith 1992; Hersberger and Innerarity 1998). With the expanded range of verified C-to-U RNA editing targets now available for interrogation, we revisited the original assumptions to understand more globally the determinants of C-to-U mRNA editing efficiency. In undertaking this analysis, we were reminded that the requirements for C-to-U mRNA editing in vitro often appear more stringent than in vivo (Backus and Smith 1991; Shah et al. 1991), which further emphasizes the importance of our findings. In addition, our approach included both cis-acting sequence- and folding-related predictions along with the role of trans-acting factors, and took advantage of statistical modeling to adjust for confounding or modifier effects between these factors to identify their role in editing frequency.

We began with the assumptions established for Apob RNA editing, which identified a 26-nucleotide segment encompassing the edited base, spacer, mooring sequence, and part of the regulatory sequence as the minimal sequence competent for physiological editing in vitro and in vivo (Davies et al. 1989; Shah et al. 1991; Backus and Smith 1992).
Those studies identified an 11-nucleotide mooring sequence as essential and sufficient for editosome assembly and site-specific C-to-U editing (Backus and Smith 1991; Shah et al. 1991; Backus and Smith 1992) and established optimal positioning of the mooring sequence relative to the edited base in Apob RNA (Backus and Smith 1992). The current work supports the key conclusions of this original mooring sequence model as applied to the entire range of C-to-U RNA editing targets. We observed that mismatches in either the mooring or regulatory sequences were independent factors governing editing frequency. By contrast, while mismatches in the spacer sequence also showed a negative association with editing frequency, the impact of spacer mismatches was not retained in the final model, nor was the length of the spacer associated with editing frequency. Furthermore, we found mismatches in the regulatory sequence motif C to be more important than mismatches in motif B. These inconsistencies might conceivably reflect the context in which an RNA segment is studied (Backus and Smith 1992). For example, our analysis reflects physiological conditions in which naturally occurring mRNA targets are edited, while the aforementioned study used in vitro data based on varying lengths of Apob mRNA embedded within different mRNA contexts (Apoe RNA) (Backus and Smith 1992).

In addition to the components of the mooring sequence model, we examined variations in the base content in different segments/motifs as well as among individual nucleotides surrounding the edited cytidine. As expected, we found that sequences flanking the edited cytidine exhibited high AU content. We further observed a similarly high AU content in the flanking sequences of a range of proposed APOBEC-mediated DNA mutation targets in human cancer tissues and cell lines (Alexandrov et al. 2013; Petljak et al. 2019), especially in targets with dC/dT change (Nik-Zainal et al. 2012). This observation implies that APOBEC-mediated DNA and RNA editing frequency may each be functionally modified by AU enrichment in the flanking sequences surrounding modifiable bases. The base content in individual nucleotides surrounding the edited cytidine also exerted significant impact on editing frequency, particularly in a 10-nucleotide segment spanning the edited cytidine (Supplemental Table 1), accounting for 25% of the variance in editing frequency independent of the mooring sequence model. Our findings regarding individual nucleotides surrounding the edited cytidine are consistent with findings for both DNA and RNA editing targets, particularly in the setting of cancers (Backus and Smith 1992; Conticello 2012; Roberts et al. 2013; Saraconi et al. 2014; Gao et al. 2018; Arbab et al. 2020). Recent work examining the sequence-editing relationship of a large in vitro library of DNA targets edited by different synthetic cytidine base editors (CBEs) (Arbab et al. 2020) showed that the base content of a 6-nucleotide window spanning the edited cytidine explained 23-57% of the editing variance, in particular one or two nucleotides immediately 5’ of the edited nucleotide. That study also demonstrated that occurrence of T and C nucleotides at position -1 increased, while a G nucleotide at that position decreased, editing frequency (Arbab et al. 2020). However, in contrast to our findings, the presence of A at position -1 had either a negative or null effect on DNA editing activity (Arbab et al. 2020). This latter finding is consistent with the lower AU content observed in nucleotides adjacent to the edited cytidine in Apobec-1 DNA targets compared to the AU content in RNA targets.
Our findings assign greater importance to adjacent nucleotides in RNA editing frequency, similar to earlier reports that the five bases immediately 5’ of the edited cytidine in Apob mRNA exert a greater impact on editing activity compared to nucleotides further upstream of this segment (Backus and Smith 1991; Shah et al. 1991; Backus and Smith 1992). The G/C fraction of a 6-nucleotide window spanning the edited cytidine in DNA targets is associated with editing activity of the synthetic CBEs (Arbab et al. 2020). Although we found significant associations of RNA editing with G/C fraction in segments surrounding the edited cytidine in univariate analyses, these associations were not retained in the final model. In contrast, the AU content of regulatory sequence motif B remained as an independent factor determining editing frequency in the final model.

The conserved 26-nucleotide sequence around the edited C forms a stem-loop secondary structure, where the editing site is in an octa-loop (Richardson et al. 1998) as predicted for the 55-nucleotide sequence of ApoB mRNA (Shah et al. 1991). This stem-loop structure is predicted to play an important role in recognition of the editing site by the editing factors (Bostrom et al. 1989; Davies et al. 1989; Driscoll et al. 1989; Chen et al. 1990). Mutations resulting in loss of base pairing in peripheral parts of the stem did not impact the editing frequency (Shah et al. 1991). Editing sites with the cytidine located in central parts (e.g. loop) exhibited higher editing frequencies than those with the edited cytidine located in peripheral parts (e.g.
tail). It is worth noting that the computer-based stem-loop structure was independently confirmed by NMR studies of a 31-nucleotide human ApoB mRNA (Maris et al. 2005). Those studies demonstrated that the location of the mooring sequence in the ApoB mRNA secondary structure plays a critical role in the RNA recognition by A1CF (Maris et al. 2005). In line with those findings, the current findings emphasize that the location of the mooring sequence in the secondary structure of the target mRNA exerts a significant independent impact on editing frequency. These predictions were confirmed in crystal structure studies of the carboxyl-terminal domain of APOBEC-1 and its interaction with cofactors and substrate RNA (Wolfe et al. 2020). Our conclusions regarding the determinants of murine C-to-U editing frequency, such as mooring sequence, base content, and secondary structure, appear consistent with a similar regulatory role among the smaller number of verified human targets. That being said, further study and expanded understanding of the range of C-to-U editing targets in human tissues will be needed as recently suggested (Destefanis et al. 2020), analogous to that for A-to-I editing (Bahn et al. 2012; Bazak et al. 2014).

We recognize that other factors likely contribute to the variance in RNA editing frequency not covered by our model. We did not consider the role of naturally occurring variants in APOBEC1, for example, which may be a relevant consideration since mutations in APOBEC family genes were shown to modify the editing activity of related hybrid DNA cytosine base editors (Arbab et al. 2020).
Furthermore, genetic variants of APOBEC1 in humans were associated with altered frequency of GlyR editing (Kankowski et al. 2017). Other factors not considered in our approach include entropy-related features, tertiary structure of the mRNA target and other regulatory co-factors. Another limitation is that the tissue-specific designation used to categorize editing frequency may have overlooked cell-specific features of editing frequency. For example, small intestinal and liver preparations are likely a blend of cell types (MacParland et al. 2018; Elmentaite et al. 2020) and tumor tissues are highly heterogeneous in cellular composition (Barker et al. 2009). The current findings provide a platform for future approaches to resolve these questions.

MATERIALS AND METHODS

Search strategy

We conducted a comprehensive literature review covering 1987 (when ApoB RNA editing was first reported (Chen et al. 1987; Powell et al. 1987)) to November 2020, including studies published in English that reported C-to-U mRNA editing frequencies of individual or transcriptome-wide target genes. Databases searched included Medline, Scopus, Web of Science, Google Scholar, and ProQuest (for theses). The references of full texts retrieved were also scrutinized for additional papers not indexed in the initial search.

Study selection

Primary records (N=528) were screened for relevance, and in vivo studies reporting editing frequencies of individual or transcriptome-wide APOBEC1-dependent C-to-U mRNA targets were selected, using a threshold of 10% editing frequency.
For analyses based on RNA sequence information, only targets with available sequence information or chromosomal location for the edited cytidine were included. Exclusion criteria included: studies that reported C-to-U mRNA editing frequencies of target genes in other species, studies reporting editing frequencies of target genes in animal models overexpressing APOBEC1, exclusively in vitro studies, and conference abstracts.

Human targets

We included studies reporting human C-to-U mRNA targets (Chen et al. 1987; Powell et al. 1987; Skuse et al. 1996; Mukhopadhyay et al. 2002; Grohmann et al. 2010; Schaefermeier and Heinze 2017). We also included work describing APOBEC1-mediated mutagenesis in human breast cancer (Nik-Zainal et al. 2012).

Data extraction

Two reviewers (SS and VB) conducted the extraction process independently, and discrepancies were resolved by consensus with input from a third reviewer (NOD). The parameters were categorized as follows:

General parameters: Gene name (RNA target), chromosomal and strand location of the edited cytidine, tissue site, editing frequency determined by RNA-seq or Sanger sequencing as illustrated for ApoB (Figure 1A). Editing frequencies determined by the two approaches were highly correlated (r=0.8, P<0.0001), and where both methodologies were available we used RNA-seq. We also defined relative dominance of editing co-factors (A1CF-dominant, RBM47-dominant, or co-dominant), relative mRNA expression (edited gene vs unedited gene) by RNA-seq or quantitative RT-PCR, and abundance of corresponding protein (edited gene vs unedited gene) by western blotting or proteomic comparison.
Co-factor dominance was determined based on the relative contribution of each co-factor to editing frequency. For each editing site, editing frequencies in mouse tissues deficient in A1cf or Rbm47 were compared to that of wild-type mice. The relative contribution of each co-factor was calculated by subtracting the editing frequency for each target in A1cf or Rbm47 knockout tissue from the total editing frequency in wild-type control. Editing sites with <20% difference between contributions of RBM47 and A1CF were considered co-dominant. Sites with ≥20% difference were considered either RBM47- or A1CF-dominant, depending on the co-factor with higher contribution (Blanc et al. 2019).

Sequence-related parameters: A sequence spanning 10 nucleotides upstream and 30 nucleotides downstream of the edited cytidine was extracted for each C-to-U mRNA editing site. These sequences were extracted either directly from the full text or using the online UCSC Genome Browser on Mouse (NCBI37/mm9) and Human (GRCh38/hg38) (https://genome.ucsc.edu/cgi-bin/hgGateway). Using the mooring sequence model (Backus and Smith 1992), three cis-acting elements were considered for each site. These elements included 1) a 10-nucleotide segment immediately upstream of the edited cytidine as “regulatory sequence”; 2) a 10-nucleotide segment downstream of the edited cytidine with complete or partial consensus with the canonical “mooring sequence” of ApoB mRNA; 3) the sequence between the edited cytidine and the 5’ end of the mooring sequence, referred to as “spacer”.
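As an illustration of the co-factor dominance rule described above, the classification can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: the function and argument names are hypothetical, and it assumes that the 20% cut-off refers to the absolute difference, in percentage points of editing frequency, between the two co-factor contributions.

```python
def cofactor_dominance(wt_freq, a1cf_ko_freq, rbm47_ko_freq, cutoff=20.0):
    """Classify one editing site by co-factor dominance.

    All frequencies are editing percentages (0-100). The names and the
    percentage-point reading of the cut-off are assumptions for illustration.
    """
    # Contribution of a co-factor = editing lost when that co-factor is knocked out.
    a1cf_contribution = wt_freq - a1cf_ko_freq
    rbm47_contribution = wt_freq - rbm47_ko_freq

    difference = rbm47_contribution - a1cf_contribution
    if abs(difference) < cutoff:
        return "co-dominant"
    return "RBM47-dominant" if difference > 0 else "A1CF-dominant"


# Hypothetical sites (not data from the study).
example_sites = {
    "site_1": cofactor_dominance(80.0, 70.0, 10.0),
    "site_2": cofactor_dominance(50.0, 30.0, 25.0),
    "site_3": cofactor_dominance(60.0, 10.0, 50.0),
}
```

Under this reading, a site that loses most of its editing only in the Rbm47 knockout is labeled RBM47-dominant, while a site that loses similar amounts in both knockouts is labeled co-dominant.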
We used an unbiased approach to identify potential mooring sequences by taking the nearest segment to the edited cytidine with the lowest number of mismatches compared to the canonical mooring sequence of ApoB RNA. For each of the three segments, we investigated the number of mismatches compared to the corresponding segment of the ApoB gene (Blanc et al. 2014), as well as length of spacer, the abundance of A and U nucleotides (AU content) and the G to C abundance ratio (G/C fraction (Arbab et al. 2020)). We also calculated relative abundance of A, G, C, and U individually across a region 10 nucleotides upstream and 20 nucleotides downstream of the edited cytidine across all editing sites. For comparison, we examined the base content of a sequence spanning 10 nucleotides upstream and downstream of the mutated deoxycytidine for over 6000 proposed C to X (T, A, and G) DNA mutation targets of the APOBEC family in human breast cancer (Nik-Zainal et al. 2012) along with relative deoxynucleotide distribution in proximity to the edited site.

Secondary structure parameters: We used RNAstructure (Reuter and Mathews 2010) and Mfold (Zuker 2003) to determine the secondary structure of an RNA cassette consisting of regulatory sequence, edited cytidine, spacer, and mooring sequence. Secondary structures similar to that of the cassette for ApoB chr12: 8014860, consisting of one loop and stem (with or without unassigned nucleotides with ≤4 unpaired bases inside the stem) as the main stem-loop with or without free tail(s) in one or both ends of the stem, were considered as canonical. Two other types of secondary structure were considered as non-canonical structures (Figure 1B), with ≥2 loops located either at ends of the stem or inside the stem. Loops inside the stem were circular open structures with ≥5 unpaired bases. Editing sites with canonical structure were further categorized into three subgroups based on location of the edited cytidine: loop (Cloop), stem (Cstem), or tail (Ctail).
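The sequence-related parameters described above (mooring sequence search, mismatch count, spacer length, and AU content) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, the canonical mooring sequence is passed in as an argument (the default shown is the 11-nucleotide ApoB mooring sequence as commonly reported, and should be checked against the primary references), and "nearest segment with the lowest number of mismatches" is read as "fewest mismatches, ties broken by proximity to the edited cytidine".

```python
def count_mismatches(segment, reference):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(segment, reference))


def find_mooring(downstream, canonical="UGAUCAGUAUA"):
    """Locate the best mooring-like window 3' of the edited cytidine.

    `downstream` is the RNA sequence starting immediately after the edited C.
    Returns (spacer_length, mismatches), or None if the sequence is too short.
    """
    width = len(canonical)
    best = None
    for start in range(len(downstream) - width + 1):
        mismatches = count_mismatches(downstream[start:start + width], canonical)
        # Strict '<' keeps the window nearest to the edited C when counts tie.
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


def au_content(sequence):
    """Percentage of A and U nucleotides in a sequence."""
    return 100.0 * sum(base in "AU" for base in sequence) / len(sequence)


# Toy downstream sequence: a 4-nt spacer followed by an exact mooring match.
toy_downstream = "AAUU" + "UGAUCAGUAUA" + "UUA"
spacer_length, mismatches = find_mooring(toy_downstream)
spacer_au = au_content(toy_downstream[:spacer_length])
```

The window start doubles as the spacer length because the spacer is defined as the sequence between the edited cytidine and the 5’ end of the mooring sequence.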
In addition to overall secondary structure, we considered location of the edited cytidine, location of mooring sequence, symmetry of the free tails, and proportion of the nucleotides in the target cassette that constitute the main stem-loop. This proportion is 1.0 in the case of ApoB chr12: 8014860, where all the bases are part of the main stem-loop structure. Symmetry was defined based on existence of free tails in both ends of the RNA strand.

Statistical methodology

Continuous variables are reported as means ± SD, with relative proportions for binary and categorical variables. T-test and ANOVA tests were used to compare continuous parameters of interest between two or more than two groups, respectively. Chi-squared testing was used to compare binary or categorical variables among different groups. Pearson r testing was used to investigate correlation of two continuous variables. We used linear regression analyses to develop the final model of independent factors that correlate with editing frequency. We used the Hosmer and Lemeshow approach for model building (Hosmer Jr et al. 2013) to fit the multivariable regression model. In brief, we first used bivariate and/or simple regression analyses with a P value of 0.2 as the cut-off point to screen the variables and detect primary candidates for the multivariable model. Subsequently, we fitted the primary multivariable model using candidate variables from the screening phase. A backward elimination method was employed to reach the final multivariable model. Parameters with P values <0.05 or those that added to the model fitness were retained.
Next, the eliminated parameters were added back individually to the final model to determine their impact. Plausible interaction terms between final determinants were also checked. The final model was screened for collinearity. We used the same approach to develop a multinomial logistic regression model to identify factors that were independently associated with co-factor dominance in RNA editing sites. R² and pseudo-R² were used to estimate the proportion of variance in the response variable that could be explained by the multivariable linear regression and multinomial logistic regression models, respectively. The same screening and retaining methods were used to investigate association of base content in a sequence 10 nucleotides upstream and 20 nucleotides downstream of the edited cytidine with editing frequency. However, after determining the nucleotides that were retained in the final regression model, a proxy parameter named “base content score” was calculated for each editing site based on the β coefficient values retrieved for individual nucleotides in the model. This parameter was used in the final model as a representative variable for base content of the aforementioned sequence in each editing site.
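The text does not give an explicit formula for the “base content score”. One plausible reading, sketched below, is a linear predictor: for each site, sum the β coefficients of the retained position/nucleotide indicators that are present in that site's sequence. The function name, the coefficient dictionary layout, and the coefficient values are all hypothetical and are shown only to make the idea concrete.

```python
def base_content_score(sequence, edited_index, coefficients):
    """Sum the coefficients of position/base indicators present in one site.

    sequence      RNA sequence around the editing site
    edited_index  0-based index of the edited cytidine within `sequence`
    coefficients  {(offset, base): beta}; offset is relative to the edited C
                  (negative = upstream). Values are illustrative assumptions.
    """
    score = 0.0
    for (offset, base), beta in coefficients.items():
        position = edited_index + offset
        # Count an indicator only if the position exists and carries that base.
        if 0 <= position < len(sequence) and sequence[position] == base:
            score += beta
    return score


# Hypothetical coefficients (not values from the study).
toy_coefficients = {(-1, "A"): 2.0, (1, "A"): 1.5, (-2, "G"): -1.0}
toy_score = base_content_score("GACAU", edited_index=2, coefficients=toy_coefficients)
```

Collapsing many nucleotide indicators into a single score in this way lets the flanking base content enter the final model as one variable rather than many.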
ACKNOWLEDGMENTS

This work was supported by National Institutes of Health grants DK-119437 and DK-112378 and by Washington University Digestive Diseases Research Core Center grant P30 DK-52574 (to NOD).

REFERENCES

UCSC Genome Browser on Mouse (NCBI37/mm9; 2007) and Human (GRCh38/hg38; 2013) assemblies.
Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Borresen-Dale AL et al. 2013. Signatures of mutational processes in human cancer. Nature 500: 415-421.
Arbab M, Shen MW, Mok B, Wilson C, Matuszek Z, Cassa CA, Liu DR. 2020. Determinants of Base Editing Outcomes from Target Library Analysis and Machine Learning. Cell 182: 463-480 e430.
Backus JW, Schock D, Smith HC. 1994. Only cytidines 5' of the apolipoprotein B mRNA mooring sequence are edited. Biochim Biophys Acta 1219: 1-14.
Backus JW, Smith HC. 1991. Apolipoprotein B mRNA sequences 3' of the editing site are necessary and sufficient for editing and editosome assembly. Nucleic Acids Res 19: 6781-6786.
Backus JW, Smith HC. 1992. Three distinct RNA sequence elements are required for efficient apolipoprotein B (apoB) RNA editing in vitro. Nucleic Acids Res 20: 6007-6014.
Bahn JH, Lee JH, Li G, Greer C, Peng G, Xiao X. 2012. Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome Res 22: 142-150.
Barker N, Ridgway RA, van Es JH, van de Wetering M, Begthel H, van den Born M, Danenberg E, Clarke AR, Sansom OJ, Clevers H. 2009. Crypt stem cells as the cells-of-origin of intestinal cancer. Nature 457: 608-611.
Bazak L, Haviv A, Barak M, Jacob-Hirsch J, Deng P, Zhang R, Isaacs FJ, Rechavi G, Li JB, Eisenberg E et al. 2014. A-to-I RNA editing occurs at over a hundred million genomic sites, located in a majority of human genes. Genome Res 24: 365-376.
Blanc V, Henderson JO, Newberry EP, Kennedy S, Luo J, Davidson NO. 2005. Targeted deletion of the murine apobec-1 complementation factor (acf) gene results in embryonic lethality. Molecular and cellular biology 25: 7260-7269.
Blanc V, Park E, Schaefer S, Miller M, Lin Y, Kennedy S, Billing AM, Ben Hamidane H, Graumann J, Mortazavi A et al. 2014. Genome-wide identification and functional analysis of Apobec-1-mediated C-to-U RNA editing in mouse small intestine and liver. Genome Biol 15: R79.
Blanc V, Xie Y, Kennedy S, Riordan JD, Rubin DC, Madison BB, Mills JC, Nadeau JH, Davidson NO. 2019. Apobec1 complementation factor (A1CF) and RBM47 interact in tissue-specific regulation of C to U RNA editing in mouse intestine and liver. RNA 25: 70-81.
Bostrom K, Lauer SJ, Poksay KS, Garcia Z, Taylor JM, Innerarity TL. 1989. Apolipoprotein B48 RNA editing in chimeric apolipoprotein EB mRNA. J Biol Chem 264: 15701-15708.
Chen SH, Habib G, Yang CY, Gu ZW, Lee BR, Weng SA, Silberman SR, Cai SJ, Deslypere JP, Rosseneu M et al. 1987. Apolipoprotein B-48 is the product of a messenger RNA with an organ-specific in-frame stop codon. Science 238: 363-366.
Chen SH, Li XX, Liao WS, Wu JH, Chan L. 1990. RNA editing of apolipoprotein B mRNA. Sequence specificity determined by in vitro coupled transcription editing. J Biol Chem 265: 6811-6816.
Conticello SG. 2012. Creative deaminases, self-inflicted damage, and genome evolution.
Annals of the New York Academy of Sciences 1267: 79-85.
Davies MS, Wallis SC, Driscoll DM, Wynne JK, Williams GW, Powell LM, Scott J. 1989. Sequence requirements for apolipoprotein B RNA editing in transfected rat hepatoma cells. J Biol Chem 264: 13395-13398.
Destefanis E, Avsar G, Groza P, Romitelli A, Torrini S, Pir P, Conticello SG, Aguilo F, Dassi E. 2020. A mark of disease: how mRNA modifications shape genetic and acquired pathologies. RNA.
Driscoll DM, Wynne JK, Wallis SC, Scott J. 1989. An in vitro system for the editing of apolipoprotein B mRNA. Cell 58: 519-525.
Elmentaite R, Ross ADB, Roberts K, James KR, Ortmann D, Gomes T, Nayak K, Tuck L, Pritchard S, Bayraktar OA et al. 2020. Single-Cell Sequencing of Developing Human Gut Reveals Transcriptional Links to Childhood Crohn's Disease. Dev Cell.
Fossat N, Tourle K, Radziewic T, Barratt K, Liebhold D, Studdert JB, Power M, Jones V, Loebel DA, Tam PP. 2014. C to U RNA editing mediated by APOBEC1 requires RNA-binding protein RBM47. EMBO Rep 15: 903-910.
Gao J, Choudhry H, Cao W. 2018. Apolipoprotein B mRNA editing enzyme catalytic polypeptide-like family genes activation and regulation during tumorigenesis. Cancer science 109: 2375-2382.
Giannoni F, Bonen DK, Funahashi T, Hadjiagapiou C, Burant CF, Davidson NO. 1994. Complementation of apolipoprotein B mRNA editing by human liver accompanied by secretion of apolipoprotein B48. J Biol Chem 269: 5932-5936.
Grohmann M, Hammer P, Walther M, Paulmann N, Buttner A, Eisenmenger W, Baghai TC, Schule C, Rupprecht R, Bader M et al. 2010. Alternative splicing and extensive RNA editing of human TPH2 transcripts. PloS one 5: e8956.
Gu T, Buaas FW, Simons AK, Ackert-Bicknell CL, Braun RE, Hibbs MA. 2012. Canonical A-to-I and C-to-U RNA editing is enriched at 3'UTRs and microRNA target sites in multiple mouse tissues. PLoS One 7: e33720.
Harris RS, Bishop KN, Sheehy AM, Craig HM, Petersen-Mahrt SK, Watt IN, Neuberger MS, Malim MH. 2003. DNA deamination mediates innate immunity to retroviral infection. Cell 113: 803-809.
Hersberger M, Innerarity TL. 1998. Two efficiency elements flanking the editing site of cytidine 6666 in the apolipoprotein B mRNA support mooring-dependent editing. J Biol Chem 273: 9435-9442.
Hirano K, Young SG, Farese RV, Jr., Ng J, Sande E, Warburton C, Powell-Braxton LM, Davidson NO. 1996. Targeted disruption of the mouse apobec-1 gene abolishes apolipoprotein B mRNA editing and eliminates apolipoprotein B48. J Biol Chem 271: 9887-9890.
Hosmer Jr DW, Lemeshow S, Sturdivant RX. 2013. Applied logistic regression. John Wiley & Sons.
Hospattankar AV, Higuchi K, Law SW, Meglin N, Brewer HB, Jr. 1987. Identification of a novel in-frame translational stop codon in human intestine apoB mRNA. Biochem Biophys Res Commun 148: 279-285.
Kanata E, Llorens F, Dafou D, Dimitriadis A, Thune K, Xanthopoulos K, Bekas N, Espinosa JC, Schmitz M, Marin-Moreno A et al. 2019. RNA editing alterations define manifestation of prion diseases. Proc Natl Acad Sci U S A 116: 19727-19735.
Kankowski S, Forstera B, Winkelmann A, Knauff P, Wanker EE, You XA, Semtner M, Hetsch F, Meier JC. 2017. A Novel RNA Editing Sensor Tool and a Specific Agonist Determine Neuronal Protein Expression of RNA-Edited Glycine Receptors and Identify a Genomic APOBEC1 Dimorphism as a New Genetic Risk Factor of Epilepsy.
Front Mol Neurosci 10: 439.
Lellek H, Kirsten R, Diehl I, Apostel F, Buck F, Greeve J. 2000. Purification and molecular cloning of a novel essential component of the apolipoprotein B mRNA editing enzyme-complex. J Biol Chem 275: 19848-19856.
MacParland SA, Liu JC, Ma XZ, Innes BT, Bartczak AM, Gage BK, Manuel J, Khuu N, Echeverri J, Linares I et al. 2018. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat Commun 9: 4383.
Maris C, Masse J, Chester A, Navaratnam N, Allain FH. 2005. NMR structure of the apoB mRNA stem-loop and its interaction with the C to U editing APOBEC1 complementary factor. RNA 11: 173-186.
Mehta A, Kinter MT, Sherman NE, Driscoll DM. 2000. Molecular cloning of apobec-1 complementation factor, a novel RNA-binding protein involved in the editing of apolipoprotein B mRNA. Mol Cell Biol 20: 1846-1854.
Meier JC, Henneberger C, Melnick I, Racca C, Harvey RJ, Heinemann U, Schmieden V, Grantyn R. 2005. RNA editing produces glycine receptor alpha3(P185L), resulting in high agonist potency. Nat Neurosci 8: 736-744.
Mukhopadhyay D, Anant S, Lee RM, Kennedy S, Viskochil D, Davidson NO. 2002. C-->U editing of neurofibromatosis 1 mRNA occurs in tumors that express both the type II transcript and apobec-1, the catalytic subunit of the apolipoprotein B mRNA-editing enzyme. Am J Hum Genet 70: 38-50.
Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, Jones D, Hinton J, Marshall J, Stebbings LA et al. 2012. Mutational processes molding the genomes of 21 breast cancers. Cell 149: 979-993.
Petljak M, Alexandrov LB, Brammeld JS, Price S, Wedge DC, Grossmann S, Dawson KJ, Ju YS, Iorio F, Tubio JMC et al. 2019. Characterizing Mutational Signatures in Human Cancer Cell Lines Reveals Episodic APOBEC Mutagenesis. Cell 176: 1282-1294 e1220. Powell LM, Wallis SC, Pease RJ, Edwards YH, Knott TJ, Scott J. 1987. A novel form of tissue- specific RNA processing produces apolipoprotein-B48 in intestine. Cell 50: 831-840. Rayon-Estrada V, Harjanto D, Hamilton CE, Berchiche YA, Gantman EC, Sakmar TP, Bulloch K, Gagnidze K, Harroch S, McEwen BS et al. 2017. Epitranscriptomic profiling across cell types reveals associations between APOBEC1-mediated RNA editing, gene expression outcomes, and cellular function. Proc Natl Acad Sci U S A 114: 13296- 13301. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425897doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425897 28 Reuter JS, Mathews DH. 2010. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11: 129. Richardson N, Navaratnam N, Scott J. 1998. Secondary structure for the apolipoprotein B mRNA editing site. Au-binding proteins interact with a stem loop. J Biol Chem 273: 31707-31717. Roberts SA, Lawrence MS, Klimczak LJ, Grimm SA, Fargo D, Stojanov P, Kiezun A, Kryukov GV, Carter SL, Saksena G et al. 2013. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nat Genet 45: 970-976. Rosenberg BR, Hamilton CE, Mwangi MM, Dewell S, Papavasiliou FN. 2011. Transcriptome- wide sequencing reveals numerous APOBEC1 mRNA-editing targets in transcript 3' UTRs. Nat Struct Mol Biol 18: 230-236. Saraconi G, Severi F, Sala C, Mattiuz G, Conticello SG. 2014. 
The RNA editing enzyme APOBEC1 induces somatic mutations and a compatible mutational signature is present in esophageal adenocarcinomas. Genome Biol 15: 417. Schaefermeier P, Heinze S. 2017. Hippocampal Characteristics and Invariant Sequence Elements Distribution of GLRA2 and GLRA3 C-to-U Editing. Mol Syndromol 8: 85-92. Shah RR, Knott TJ, Legros JE, Navaratnam N, Greeve JC, Scott J. 1991. Sequence requirements for the editing of apolipoprotein B mRNA. J Biol Chem 266: 16301-16304. Skuse GR, Cappione AJ, Sowden M, Metheny LJ, Smith HC. 1996. The neurofibromatosis type I messenger RNA undergoes base-modification RNA editing. Nucleic Acids Res 24: 478- 485. Smith HC, Kuo SR, Backus JW, Harris SG, Sparks CE, Sparks JD. 1991. In vitro apolipoprotein B mRNA editing: identification of a 27S editing complex. Proc Natl Acad Sci U S A 88: 1489-1493. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 8, 2021. ; https://doi.org/10.1101/2021.01.08.425897doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425897 29 Snyder EM, McCarty C, Mehalow A, Svenson KL, Murray SA, Korstanje R, Braun RE. 2017. APOBEC1 complementation factor (A1CF) is dispensable for C-to-U RNA editing in vivo. RNA 23: 457-465. Sowden M, Hamm JK, Spinelli S, Smith HC. 1996. Determinants involved in regulating the proportion of edited apolipoprotein B RNAs. RNA 2: 274-288. Teng B, Burant CF, Davidson NO. 1993. Molecular cloning of an apolipoprotein B messenger RNA editing protein. Science 260: 1816-1819. Wolfe AD, Arnold DB, Chen XS. 2019. Comparison of RNA Editing Activity of APOBEC1-A1CF and APOBEC1-RBM47 Complexes Reconstituted in HEK293T Cells. J Mol Biol 431: 1506-1517. Wolfe AD, Li S, Goedderz C, Chen XS. 2020. The structure of APOBEC1 and insights into its RNA and DNA substrate selectivity. NAR Cancer 2: zcaa027. Zuker M. 2003. 
Table 1. Multivariable linear regression model for determinant factors of editing frequency in mouse APOBEC1-dependent C-to-U mRNA editing sites.
Determinant of editing frequency | Subgroup | β (95% CI) | P value
Model without co-factor group (N=103; R2=0.84; P<.001)
Base content score | per unit increments | 1.00 [0.83, 1.17] | <0.001
Count of mismatches in mooring sequence | per unit increments | -5.89 [-7.48, -4.31] | <.001
Count of mismatches in regulatory sequence motif D (whole sequence) | per unit increments | -2.00 [-3.58, -0.43] | .01
AU content of regulatory sequence motif B | per 10% increments | -2.41 [-4.38, -0.45] | .02
Overall secondary structure | C loop | Reference |
Overall secondary structure | C stem | 1.20 [-5.07, 7.47] | .7
Overall secondary structure | C tail | -12.19 [-20.80, -3.58] | .006
Overall secondary structure | Non-canonical | -10.67 [-20.92, -0.43] | 0.04
Location of mooring sequence | Stem-loop | Reference |
Location of mooring sequence | Other | -11.56 [-17.35, -5.77] | <.001
After adding co-factor group to the model (N=72; R2=0.84; P<.001)
Co-factor group | RBM47 dominant | Reference |
Co-factor group | Co-dominant | -12.30 [-20.63, -3.97] | .005
Co-factor group | A1CF dominant | 11.54 [-0.64, 23.72] | .07
β: represents average change (%) in the editing frequency compared to the reference group
CI: confidence interval
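One way to read the coefficients in Table 1 is as additive shifts in editing frequency (in percentage points) relative to the reference categories. The Python sketch below is illustrative only. The function names and dictionary keys are hypothetical, and the coefficients are copied from the model without co-factor group. Because Table 1 reports no intercept, the sketch computes only the model-implied difference between two sites, not an absolute editing frequency.

```python
# Coefficients from Table 1 (model without co-factor group), in percentage
# points of editing frequency. All names here are illustrative assumptions.
PER_UNIT = {
    "base_content_score": 1.00,    # per unit of base content score
    "mooring_mismatches": -5.89,   # per mismatch in the mooring sequence
    "motif_d_mismatches": -2.00,   # per mismatch in regulatory motif D
}
MOTIF_B_AU_PER_10PCT = -2.41       # per 10% AU content in regulatory motif B
STRUCTURE = {"C loop": 0.0, "C stem": 1.20, "C tail": -12.19, "Non-canonical": -10.67}
MOORING_LOCATION = {"Stem-loop": 0.0, "Other": -11.56}


def partial_predictor(site: dict) -> float:
    """Sum of coefficient * feature terms.

    The intercept is deliberately omitted because Table 1 does not report one,
    so the result is meaningful only as a difference between two sites.
    """
    total = sum(beta * site[name] for name, beta in PER_UNIT.items())
    total += MOTIF_B_AU_PER_10PCT * site["motif_b_au_pct"] / 10.0
    total += STRUCTURE[site["structure"]]
    total += MOORING_LOCATION[site["mooring_location"]]
    return total


def expected_difference(site: dict, reference: dict) -> float:
    """Model-implied difference in editing frequency (percentage points)."""
    return partial_predictor(site) - partial_predictor(reference)


# Hypothetical sites: identical except for two mooring mismatches and an
# edited cytidine located in the tail instead of the loop.
reference_site = {
    "base_content_score": 10,
    "mooring_mismatches": 0,
    "motif_d_mismatches": 0,
    "motif_b_au_pct": 80,
    "structure": "C loop",
    "mooring_location": "Stem-loop",
}
variant_site = dict(reference_site, mooring_mismatches=2, structure="C tail")

DIFF = expected_difference(variant_site, reference_site)
print(round(DIFF, 2))
```

For these two hypothetical sites the difference is 2 × (-5.89) + (-12.19) = -23.97 percentage points, since all other terms cancel.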
Table 2: Characteristics of human C-to-U mRNA editing targets
Parameter | NF1 | GLYCRA3 | GLYCRA2 | TPH2B | TPH2B | APOB
Editing group | Low editing | Low editing | Low editing | High editing | High editing | High editing
Editing location | C2914 | C554 | C575 | C385 (exon 3) | C830 (exon 7) | C6666
Tissue | neural sheath / CNS tumor | hippocampus | hippocampus | amygdala | amygdala | small intestine
Editing frequency (%) | 10 | 10 | 17 | 89 | 98 | >95
Mismatches in regulatory motif A | 1 | 3 | 3 | 2 | 3 | 0
Mismatches in regulatory motif B | 2 | 4 | 5 | 4 | 5 | 0
Mismatches in regulatory motif C | 4 | 4 | 4 | 4 | 4 | 0
Mismatches in regulatory motif D | 6 | 8 | 9 | 8 | 9 | 0
AU content (%) in regulatory motif A | 100 | 33 | 33 | 100 | 0 | 100
AU content (%) in regulatory motif B | 100 | 60 | 20 | 100 | 20 | 80
AU content (%) in regulatory motif C* | 60 | 40 | 60 | 40 | 40 | 100
AU content (%) in regulatory motif D | 80 | 50 | 40 | 70 | 30 | 90
Spacer length* | 6 | 2 | 2 | 0 | 3 | 4
Spacer AU content (%) | 67 | 0 | 0 | | 33 | 100
Mismatches in spacer | 2 | 2 | 2 | | 2 | 0
Mismatches in mooring* | 3 | 4 | 2 | 1 | 5 | 0
AU content (%) of 3 downstream bases* | 67 | 33 | 33 | 100 | 33 | 100
AU content (%) of 20 downstream bases | 60 | 60 | 70 | 55 | 35 | 85
Overall secondary structure | canonical | canonical | canonical | canonical | canonical | canonical
Location of edited C* | loop | tail | tail | stem | loop | loop
Location of mooring sequence | stem-loop | stem-loop | stem-loop | stem-loop | stem-loop | stem-loop
Ratio of stem-loop bases* | 0.46 | 0.375 | 0.5 | 0.45 | 0.92 | 0.96
Free tail orientation | symmetric | symmetric | asymmetric | symmetric | asymmetric | asymmetric
Composite score | 2 | 2 | 2 | 5 | 4 | 5
CNS: central nervous system
* These items were used to calculate the composite score (total score = 6) as follows:
AU content (%) in regulatory motif C: < 50%: 1, ≥ 50%: 0
spacer length: ≤ 4: 1, > 4: 0
mismatches in mooring: < 3: 1, ≥ 3: 0
AU content (%) of 3 downstream bases: > 50%: 1, ≤ 50%: 0
location of edited C in secondary structure: stem-loop: 1, tail: 0
ratio of stem-loop bases: > 50%: 1, ≤ 50%: 0
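The composite score rule in the Table 2 footnote can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code. The class and field names are hypothetical, the six input values per target are transcribed from the starred rows of Table 2, and the rule "stem-loop: 1, tail: 0" is interpreted here as scoring 1 when the edited C lies in either the stem or the loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditingSite:
    """Starred Table 2 parameters for one target (field names are illustrative)."""
    name: str
    motif_c_au_pct: float      # AU content (%) in regulatory motif C
    spacer_length: int
    mooring_mismatches: int
    downstream3_au_pct: float  # AU content (%) of 3 downstream bases
    edited_c_location: str     # "loop", "stem" or "tail"
    stem_loop_ratio: float     # ratio of stem-loop bases (0 to 1)


def composite_score(site: EditingSite) -> int:
    """Count how many of the six footnote criteria the site satisfies (0 to 6)."""
    return sum([
        site.motif_c_au_pct < 50,
        site.spacer_length <= 4,
        site.mooring_mismatches < 3,
        site.downstream3_au_pct > 50,
        site.edited_c_location in ("stem", "loop"),  # assumed reading of "stem-loop"
        site.stem_loop_ratio > 0.5,
    ])


SITES = [
    EditingSite("NF1", 60, 6, 3, 67, "loop", 0.46),
    EditingSite("GLYCRA3", 40, 2, 4, 33, "tail", 0.375),
    EditingSite("GLYCRA2", 60, 2, 2, 33, "tail", 0.5),
    EditingSite("TPH2B C385", 40, 0, 1, 100, "stem", 0.45),
    EditingSite("TPH2B C830", 40, 3, 5, 33, "loop", 0.92),
    EditingSite("APOB", 100, 4, 0, 100, "loop", 0.96),
]

SCORES = {site.name: composite_score(site) for site in SITES}
for name, score in SCORES.items():
    print(f"{name}: {score}")
```

Under these assumptions the sketch reproduces the composite scores listed in Table 2 (2, 2, 2, 5, 4 and 5).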
FIGURE LEGENDS
Figure 1. Characteristics of murine APOBEC1-mediated C-to-U mRNA editing sites. A: schematic presentation of mRNA target, chromosomal editing location, and editing sites considered. Each mRNA target could be edited at one or more chromosomal location(s) (blue boxes). Each editing location could be edited in one or more tissues, giving rise to one or more editing site(s) per location (green boxes). Editing site(s) of each mRNA target are the sum of editing sites from all editing locations reported for that target. B: examples of canonical (ApoB chr12: 8014860, top) and two types of non-canonical (Kctd12 chr14: 103379573 and Dcn chr10: 96980535) secondary structures. C: distribution of number of chromosomal editing location(s), or targeted cytidine(s), per mRNA target. D: distribution of number of total editing sites per mRNA target considering all chromosomal location(s) edited at different tissue(s). E: distribution of location of editing sites within gene structure.
Figure 2. Base content of sequences flanking modified cytidine in RNA editing and DNA mutation targets. A: base content of 10 nucleotides upstream and 20 nucleotides downstream of edited cytidine in mouse APOBEC1-mediated C-to-U mRNA editing targets. B: base content of 10 nucleotides upstream and 10 nucleotides downstream of mutated cytidine in proposed human APOBEC-mediated DNA mutation targets in patients with breast cancer. C: comparison of AU base content (%) of nucleotides flanking modified cytidine in RNA editing targets and DNA mutation targets in mouse and human breast cancer patients, respectively.
Figure 3. Characteristics of regulatory-spacer-mooring cassette and base content of individual nucleotides flanking edited cytidine in association with editing frequency. A: schematic illustration of regulatory-spacer-mooring cassette. Four motifs were defined for regulatory sequence: motif A for nucleotides -1 to -3; motif B for nucleotides -1 to -5; motif C for nucleotides -6 to -10; motif D representative of the whole sequence. B: association of the mismatches in motif D of regulatory sequence with editing frequency. C: association between the AU content (%) of regulatory sequence (motif B) and editing frequency. D: association of the mismatches in spacer (nucleotides +1 to +4 downstream of the edited cytidine) with editing frequency. E: association of the mismatches in mooring sequence with editing frequency. F: heatmap plot illustrating the association between base content of 30 nucleotides flanking the edited cytidine and editing frequency. Red color density in each cell represents the beta coefficient value of the corresponding base in the multivariable linear regression model fit including that nucleotide. The asterisks refer to the nucleotides that were retained in the final model. Mismatches in regulatory, spacer, and mooring sequences were determined in comparison to the corresponding sequences in ApoB mRNA (as reference). r: Pearson correlation coefficient.
Figure 4. Secondary structure-related features in association with editing frequency. A: distribution of different types of overall secondary structure in editing sites. C loop, C stem, C tail are three subtypes of canonical secondary structure based on the location of the edited cytidine. B: association between type of secondary structure and editing frequency. C: distribution of the mooring sequence location in editing sites. “Other” refers to mooring sequences located in tail or stem/loop and not part of the main stem-loop structure. D: association of mooring sequence location with editing frequency. E: association between ratio of main stem-loop bases to total bases count and editing frequency. F: association of the 5’ free tail length with editing frequency. * P<.05; ** P<.001. r: Pearson correlation coefficient.
Figure 5. Dominance and tissue-specific co-factor patterns among editing sites. A: distribution of dominant co-factor in editosomes of editing sites. B: association of dominant co-factor with editing frequency. C: distribution of number of editing tissue(s) per mRNA target. D: tissue distribution of editing sites. E: average editing frequency of editing sites edited at different tissues. SI, small intestine.
Figure 6. Co-factor pattern and tissue-specific role in murine C-to-U mRNA editing sites. A: distribution of editing tissue across subgroups of editing sites with different dominant co-factor patterns. B: location of edited cytidine in secondary structure of editing sites with different dominant co-factor patterns. C: schematic presentation of factors that correlate with dominant co-factor pattern in editing sites. This graph is based on the findings derived from pairwise multinomial logistic regression models.
SUPPLEMENTAL FIGURE LEGENDS
Supplemental Figure 1. Chromosomal distribution of murine APOBEC1-mediated C-to-U mRNA editing sites. The black curve corresponds to the left Y-axis and represents average editing frequencies of editing sites related to each chromosome. The blue curve corresponds to the right Y-axis and represents the number of editing sites related to each chromosome.
Supplemental Figure 2. Association of editing frequency with characteristics of regulatory sequence in murine APOBEC1-mediated C-to-U mRNA editing sites. A-C. Association of editing frequency with number of mismatches and AU content (%). D-F. Association of editing frequency with different regulatory sequence motifs. Mismatches were determined in comparison to the same regulatory sequence motif in ApoB mRNA (as reference).
Supplemental Figure 3. Association of editing frequency with characteristics of downstream sequence in murine APOBEC1-mediated C-to-U mRNA editing sites. A. Association of editing frequency with spacer length. B. Association of editing frequency with spacer AU content (%). C-F. Association of editing frequency with AU content of successive segments downstream of the edited cytidine.
Supplemental Figure 4. Association of editing frequency with secondary structure-related characteristics in C-to-U mRNA editing sites. A: distribution of edited cytidine location in secondary structure regardless of the overall secondary structure. B: association of editing frequency with edited cytidine location in secondary structure. C: distribution of free tail orientation in editing sites. D: association of editing frequency with free tail orientation in editing sites. E: association of editing frequency with 3’ free tail length. * P<.05; *** P<.0001. r: Pearson correlation coefficient.
Supplemental Figure 5. Association of secondary structure-related characteristics with dominant co-factor pattern in APOBEC1-mediated C-to-U mRNA editing sites. A. Distribution of mooring sequence location presented in the context of different dominant co-factor patterns. B. Distribution of free tail orientation in secondary structure among editing sites, presented in the context of different dominant co-factor patterns.
Supplemental table 1. Multivariable linear regression model for individual nucleotides surrounding edited cytosine (-10 to +20) in mouse APOBEC1-dependent C-to-U mRNA editing sites.
Location of nucleotide relative to edited C | Base preference | β (95% CI) | P value
Nucleotide -8 | GU | 8.15 [3.0, 13.3] | 0.002
Nucleotide -7 | C | 12.7 [4.3, 21.0] | 0.003
Nucleotide -6 | G | 7.1 [0.6, 13.7] | 0.03
Nucleotide -5 | U | 5.2 [1.0, 9.5] | 0.02
Nucleotide -2 | AUC | 13.5 [9.0, 17.9] | <0.001
Nucleotide -1 | AU | 15.9 [4.0, 27.9] | 0.01
Nucleotide +1 | AGU | 19.5 [12.5, 26.6] | <0.001
Nucleotide +3 | G | 12.2 [7.4, 16.9] | <0.001
Nucleotide +4 | G | 15.9 [10.9, 21.0] | <0.001
Nucleotide +7 | C | 10.3 [1.5, 19.2] | 0.02
Nucleotide +9 | G | 9.7 [1.4, 18.0] | 0.02
Nucleotide +12 | AUC | 7.5 [1.0, 13.9] | 0.02
Nucleotide +16 | AC | 6.6 [2.2, 11.0] | 0.004
Nucleotide +17 | AU | 5.6 [0.5, 10.8] | 0.03
Nucleotide +18 | AU | 6.6 [1.5, 11.8] | 0.01
Nucleotide +19 | AC | 5.65 [1.3, 10.0] | 0.01
β: represents average change (%) in the editing frequency compared to the reference group (non-preferred group)
CI: confidence interval
Supplemental table 2. Descriptive data of regulatory-spacer-mooring cassette in mouse APOBEC1-dependent C-to-U mRNA editing sites.
Parameter | N | Mean | SD | Min | Max
Sequence-related features
Mismatches in regulatory (motif A) | 177 | 1.72 | 0.94 | 0 | 3
Mismatches in regulatory (motif B) | 177 | 3.35 | 1.12 | 0 | 5
Mismatches in regulatory (motif C) | 177 | 3.78 | 0.99 | 0 | 5
Mismatches in regulatory (motif D) | 177 | 7.12 | 1.76 | 0 | 10
AU content (%) of regulatory (motif A) | 177 | 75.14 | 26.00 | 0 | 100
AU content (%) of regulatory (motif B) | 177 | 73.44 | 22.10 | 0 | 100
AU content (%) of regulatory (motif C) | 177 | 63.00 | 23.40 | 0 | 100
AU content (%) of regulatory (motif D) | 177 | 68.25 | 18.40 | 10 | 100
Spacer length | 177 | 5.08 | 3.67 | 0 | 20
Mismatches in spacer | 152 | 2.54 | 1.09 | 0 | 4
AU content (%) of spacer | 172 | 72.65 | 23.39 | 0 | 100
Mismatches in mooring | 177 | 2.13 | 1.81 | 0 | 8
AU content (%) of downstream sequence +1 to +5 | 177 | 72.88 | 19.46 | 0 | 100
AU content (%) of downstream sequence +6 to +10 | 177 | 69.94 | 22.78 | 0 | 100
AU content (%) of downstream sequence +11 to +15 | 177 | 72.43 | 20.65 | 20 | 100
AU content (%) of downstream sequence +16 to +20 | 177 | 66.21 | 22.56 | 0 | 100
Secondary structure-related features
Proportion of the bases that constitute main stem-loop | 172 | 0.61 | 0.18 | 0.28 | 1
Length of 5’ free tail | 172 | 4.25 | 3.93 | 0 | 15
Length of 3’ free tail | 172 | 5.27 | 4.65 | 0 | 17
SD: standard deviation
Supplemental table 3. Comparing three subgroups of mouse APOBEC1-dependent C-to-U mRNA editing sites based on co-factor dominance.
Parameter | RBM47-dominant (N, Mean, SD) | A1CF-dominant (N, Mean, SD) | Co-dominant (N, Mean, SD) | P value
Mismatches in regulatory (motif A) | 60, 1.48, 0.93 | 5, 1.80, 0.45 | 7, 1.14, 0.69 | .4
Mismatches in regulatory (motif B) | 60, 3.05, 1.13 | 5, 3.60, 0.55 | 7, 3.00, 0.82 | .51
Mismatches in regulatory (motif C) | 60, 3.58, 1.05 | 5, 3.80, 0.45 | 7, 4.29, 1.11 | .1
Mismatches in regulatory (motif D) | 60, 6.63, 1.90 | 5, 7.40, 0.55 | 7, 7.29, 1.50 | .44
AU content (%) of regulatory (motif A) | 60, 82.22, 18.88 | 5, 80.00, 18.26 | 7, 85.71, 17.82 | .8
AU content (%) of regulatory (motif B) | 60, 76.33, 16.67 | 5, 84.00, 16.73 | 7, 82.86, 17.99 | .5
AU content (%) of regulatory (motif C) | 60, 62.67, 22.84 | 5, 72.00, 17.89 | 7, 62.86, 21.38 | .6
AU content (%) of regulatory (motif D) | 60, 69.50, 14.89 | 5, 78.00, 13.04 | 7, 72.86, 12.54 | .4
Spacer length | 60, 5.20, 3.93 | 5, 7.20, 5.45 | 7, 7.86, 5.08 | .2
Mismatches in spacer (in 4-base cassette) | 40, 2.43, 1.20 | 4, 2.75, 1.50 | 6, 3.83, 0.41 | .02
Mismatches in spacer (relative abundance (%)) | 60, 61.81, 30.89 | 5, 61.67, 36.13 | 7, 82.14, 37.40 | .2
AU content (%) of spacer | 60, 77.30, 17.83 | 5, 72.08, 18.14 | 7, 71.37, 15.24 | .5
Mismatches in mooring | 60, 1.12, 1.30 | 5, 2.00, 2.55 | 7, 2.86, 0.38 | .004
AU content (%) of downstream sequence +1 to +5 | 60, 77.33, 14.94 | 5, 80.00, 20.00 | 7, 71.43, 15.74 | .7
AU content (%) of downstream sequence +6 to +10 | 60, 77.67, 18.81 | 5, 60.00, 24.49 | 7, 57.14, 13.80 | .01
AU content (%) of downstream sequence +11 to +15 | 60, 80.33, 15.40 | 5, 72.00, 17.89 | 7, 65.71, 15.12 | 0.06
AU content (%) of downstream sequence +16 to +20 | 60, 70.33, 20.00 | 5, 72.00, 10.95 | 7, 77.14, 17.99 | .6
Proportion of the bases that constitute main stem-loop | 60, 0.62, 0.18 | 5, 0.71, 0.10 | 7, 0.59, 0.21 | .5
Length of 5’ free tail | 60, 4.08, 3.81 | 5, 2.40, 3.91 | 7, 6.86, 6.20 | .3
Length of 3’ free tail | 60, 5.35, 4.84 | 5, 6.00, 2.55 | 7, 5.00, 5.66 | .6
SD: standard deviation
Supplemental Table 4. Multinomial logistic regression model for determinant factors of co-factor dominancy in mouse APOBEC1-dependent C-to-U mRNA editing sites.
Determinant of co-factor dominancy | Subgroup | Coefficient (95% CI) | P value
A1CF-dominant vs RBM47-dominant
Tissue | Small intestine | Reference |
Tissue | Liver | 4.40 [0.34, 5.21] | .04
Location of edited cytosine | Loop | Reference |
Location of edited cytosine | Stem | -3.88 [-8.31, 0.55] | 0.08
Location of edited cytosine | Tail | -19.13 [-25.82, -12.44] | <0.001
Mismatches in mooring sequence | per unit increments | 0.30 [-0.97, 1.57] | 0.6
Mismatches in regulatory sequence motif B | per unit increments | 1.62 [0.063, 3.30] | .05
Mismatches in regulatory sequence motif C | per unit increments | 0.12 [-0.83, 1.08] | .8
AU content (%) of regulatory sequence motif D | per unit increments | 0.17 [-0.04, 0.39] | 0.1
AU content (%) of downstream sequence +1 to +5 | per unit increments | -0.02 [-0.09, 0.04] | 0.5
AU content (%) of downstream sequence +6 to +10 | per unit increments | -0.06 [-0.1, -0.02] | 0.006
AU content (%) of downstream sequence +11 to +15 | per unit increments | -0.06 [-0.18, 0.07] | 0.4
Co-dominant vs RBM47-dominant
Tissue | Small intestine | Reference |
Tissue | Liver | -1.73 [-6.00, 2.50] | 0.4
Location of edited cytosine in secondary structure | C loop | Reference |
Location of edited cytosine in secondary structure | C stem | 1.70 [-2.11, 5.51] | 0.4
Location of edited cytosine in secondary structure | C tail | 3.70 [0.72, 6.67] | 0.01
Mismatches in mooring sequence | per unit increments | 0.66 [0.01, 1.33] | .05
Mismatches in regulatory sequence motif B | per unit increments | -2.32 [-3.86, -0.79] | .003
Mismatches in regulatory sequence motif C | per unit increments | 3.16 [1.12, 5.21] | 0.002
AU content (%) of regulatory sequence motif D | per unit increments | 0.13 [0.02, 0.24] | 0.02
AU content (%) of downstream sequence +1 to +5 | per unit increments | -0.17 [-0.35, -0.01] | 0.04
AU content (%) of downstream sequence +6 to +10 | per unit increments | -0.10 [-0.28, 0.07] | 0.25
AU content (%) of downstream sequence +11 to +15 | per unit increments | -0.10 [-0.19, -0.01] | 0.03
Co-dominant vs A1CF-dominant
Tissue | Small intestine | Reference |
Tissue | Liver | -6.13 [-10.60, -0.31] | 0.04
Location of edited cytosine in secondary structure | C loop | Reference |
Location of edited cytosine in secondary structure | C stem | 5.58 [0.06, 9.22] | 0.05
Location of edited cytosine in secondary structure | C tail | 22.83 [15.53, 30.12] | <0.001
Mismatches in mooring sequence | per unit increments | 0.36 [-0.87, 1.59] | 0.6
Mismatches in regulatory sequence motif B | per unit increments | -3.94 [-6.27, -1.61] | 0.001
Mismatches in regulatory sequence motif C | per unit increments | 3.04 [0.91, 5.16] | 0.005
AU content (%) of regulatory sequence motif D | per unit increments | -0.04 [-0.29, 0.20] | 0.72
AU content (%) of downstream sequence +1 to +5 | per unit increments | -0.15 [-0.32, 0.02] | 0.09
AU content (%) of downstream sequence +6 to +10 | per unit increments | -0.04 [-0.22, 0.13] | 0.62
AU content (%) of downstream sequence +11 to +15 | per unit increments | -0.04 [-0.19, 0.11] | 0.58
Model parameters: N=72; Pseudo R2=0.59; P<.001
CI: confidence interval
10_1101-2021_01_08_425379 ---- Competitive binding of STATs to receptor phospho-Tyr motifs accounts for altered cytokine responses in autoimmune disorders
Stephan Wilmes1*, Polly-Anne Jeffrey2*, Jonathan Martinez-Fabregas1, Maximillian Hafer3, Paul Fyfe1, Elizabeth Pohler1, Silvia Gaggero4, Martín López-García2, Grant Lythe2, Thomas Guerrier5, David Launay5, Mitra Suman4, Jacob Piehler3, Carmen Molina-París2# and Ignacio Moraga1#
1 Division of Cell Signalling and Immunology, School of Life Sciences, University of Dundee, Dundee, UK.
2 Department of Applied Mathematics, School of Mathematics, University of Leeds, Leeds, UK.
3 Department of Biology and Centre of Cellular Nanoanalytics, University of Osnabrück, Osnabrück, Germany.
4 Université de Lille, INSERM UMR1277 CNRS UMR9020–CANTHER and Institut pour la Recherche sur le Cancer de Lille (IRCL), Lille, France.
5 Univ. Lille, Inserm, CHU Lille, U1286 - INFINITE - Institute for Translational Research in Inflammation, F-59000 Lille, France.
* These authors contributed equally to this work
# These authors share senior authorship
ABSTRACT
Cytokines elicit pleiotropic and non-redundant activities despite strong overlap in their usage of receptors, JAKs and STATs. We use IL-6 and IL-27 to ask how two cytokines activating the same signaling pathway have different biological roles. We found that IL-27 induces more sustained STAT1 phosphorylation than IL-6, with the two cytokines inducing comparable levels of STAT3 phosphorylation. Mathematical and statistical modelling of IL-6 and IL-27 signaling identified STAT3 binding to GP130, and STAT1 binding to IL-27Ra, as the main dynamical processes contributing to sustained pSTAT1 by IL-27. Mutation of Tyr613 on IL-27Ra decreased IL-27-induced STAT1 phosphorylation by 80% but had limited effect on STAT3 phosphorylation. Strong receptor/STAT coupling by IL-27 initiated a unique gene expression program, which required sustained STAT1 phosphorylation and IRF1 expression and was enriched in classical Interferon Stimulated Genes. Interestingly, the STAT/receptor coupling exhibited by IL-6/IL-27 was altered in patients with systemic lupus erythematosus (SLE). IL-6/IL-27 induced a more potent STAT1 activation in SLE patients than in healthy controls, which correlated with higher STAT1 expression in these patients. Partial inhibition of JAK activation by sub-saturating doses of Tofacitinib specifically lowered the levels of STAT1 activation by IL-6. Our data show that receptor and STAT concentrations critically contribute to shaping cytokine responses and generating functional pleiotropy in health and disease.
.CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425379doi: bioRxiv preprint .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425379doi: bioRxiv preprint .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425379doi: bioRxiv preprint .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425379doi: bioRxiv preprint .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425379doi: bioRxiv preprint .CC-BY 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 9, 2021. 
INTRODUCTION

IL-27 and IL-6 both have intricate functions regulating inflammatory responses (1).
IL-27 is a hetero-dimeric cytokine comprised of p28 and EBI3 subunits (2). IL-27 exerts its activities by binding GP130 and IL-27Rα receptor subunits on the surface of responsive cells, triggering the activation of the JAK1/STAT1/STAT3 signaling pathway. IL-27 elicits both pro- and anti-inflammatory responses, although the latter activity seems to be the dominant one (3). IL-27 stimulation inhibits RORγt expression, thereby suppressing Th-17 commitment and limiting subsequent production of pro-inflammatory IL-17 (4, 5). Moreover, IL-27 induces a strong production of anti-inflammatory IL-10 by (Tbet+ and FoxP3-) Tr-1 cells (6-8), further contributing to limiting the inflammatory response. IL-6 engages a hexameric receptor complex comprising two copies each of IL-6Rα, GP130 and IL-6 (9), triggering the activation, as IL-27 does, of the JAK1/STAT1/STAT3 signaling pathway. However, in contrast to IL-27, IL-6 is known as a paradigmatic pro-inflammatory cytokine (10, 11). IL-6 inhibits lineage differentiation to Treg cells (12) while promoting Th-17 (13, 14), thus supporting its pro-inflammatory role. How IL-27 and IL-6 elicit opposite immuno-modulatory activities despite activating almost identical signaling pathways is currently not completely understood.

The relative and absolute STAT activation levels seem to have intricate roles, which lead to a strong signaling and functional plasticity by cytokines. Although IL-6 robustly activates STAT3, it is capable of mounting a considerable STAT1 response as well (15). Moreover, in the absence of STAT3, IL-6 induces a strong STAT1 response comparable to that of IFNγ, a prototypic STAT1-activating cytokine (16). Likewise, the absence of STAT1 potentiates the STAT3 response for IL-27, which normally elicits a strong STAT1 response, causing it to mount an IL-6-like response (15).
Furthermore, negative feedback mechanisms like SOCSs and phosphatases have been described as critical players influencing STAT1 and STAT3 phosphorylation kinetics and thereby shaping their signal integration for GP130-utilizing cytokines (17-20). Yet, how all these molecular components are integrated by a given cell to produce the desired response is still an open question.

Among the IL-6/IL-12 cytokine family, IL-27 exhibits a unique STAT activation pattern. The majority of GP130-engaging cytokines preferentially activate STAT3, with activation of STAT1 being an accessory or balancing component (21, 22). IL-27, however, triggers STAT1 and STAT3 activation with high potency (23). Indeed, different studies have shown that IL-27 responses rely on either STAT1 (24-26) or STAT3 activation (7, 27). Moreover, recent transcriptomics studies showed that in the absence of STAT3, IL-6 and IL-27 lost more than 75% of target gene induction. Yet, STAT1 was the main factor driving the specificity of the IL-27 versus the IL-6 response, highlighting a critical interplay of STAT1 and STAT3 engagement (28).

While the biological responses induced by IL-27 and IL-6 have been extensively studied (3, 11), the very initial steps of signal activation and kinetic integration by these two cytokines have not been comprehensively analysed. Since the different biological outcomes elicited by IL-27 and IL-6 are most likely encoded in the early events of cytokine stimulation, here we specifically aimed to identify the molecular determinants underlying functional selectivity by IL-27 in human T-cells. We asked how a defined cytokine stimulus is propagated in time over multiple layers of signaling to produce the desired response.
To this end, we probed IL-27 and IL-6 signaling at different scales, ranging from cell surface receptor assembly and early STAT1/3 effector activation to an unbiased and quantitative multi-omics approach: phospho-proteomics after early cytokine stimulation, kinetics of transcriptomic changes and alteration of the T-cell proteome upon prolonged cytokine exposure.

IL-6 and IL-27 induced similar levels of assembly of their respective receptor complexes, which resulted in comparable phosphorylation of STAT3 by the two cytokines. IL-27, on the other hand, triggered a more sustained STAT1 phosphorylation. To decipher the molecular events which determine sustained STAT1 phosphorylation by IL-27, we mathematically modelled the STAT1 and STAT3 signaling kinetics induced by each of these cytokines. We identified differential binding of STAT1 and STAT3 to IL-27Rα and GP130, respectively, as the main factor contributing to a sustained STAT1 activation by IL-27. At the transcriptional level, IL-27 triggered the expression of a unique gene program, which strictly required the cooperative action between sustained pSTAT1 and IRF1 expression to drive the induction of an interferon-like gene signature that profoundly shaped the T-cell proteome. Interestingly, our mathematical models of IL-6 and IL-27 signaling predicted that changes in receptor and STAT expression could fundamentally change the magnitude and timescale of the IL-6 and IL-27 responses.
We found high levels of STAT1 expression in SLE patients when compared to healthy donors, which correlated with biased STAT1 responses induced by IL-6 and IL-27 in these patients. Strikingly, we could specifically inhibit STAT1 activation by IL-6 using suboptimal doses of the JAK inhibitor Tofacitinib. This could provide a new strategy to specifically target individual STATs engaged by cytokines.

RESULTS

IL-27 induces a more sustained STAT1 activation than HypIL-6 in human Th-1 cells

IL-6 and IL-27 are critical immuno-modulatory cytokines. While IL-6 engages a hexameric surface receptor comprised of two molecules of IL-6Rα and two molecules of GP130 to trigger the activation of STAT1 and STAT3 transcription factors (Figure 1a), IL-27 binds GP130 and IL-27Rα to trigger activation of the same STAT molecules (Figure 1a). Despite sharing a common receptor subunit, GP130, and activating similar signaling pathways, these two cytokines exhibit non-redundant immuno-modulatory activities, with IL-6 eliciting a potent pro-inflammatory response and IL-27 acting more as an anti-inflammatory cytokine. Here, we set out to investigate the molecular rules that determine the functional specificity elicited by IL-6 and IL-27, using human Th-1 cells as a model experimental system.
Because recombinant expression of human IL-27 is challenging, we recombinantly produced a murine single-chain variant of IL-27 (p28 and EBI3) which cross-reacts with the human receptors and triggers potent signaling, comparable to the signaling output produced by commercial human IL-27 (29) (Supp. Fig. 1a). In addition, we used a linker-connected single-chain fusion protein of IL-6Rα and IL-6 termed HyperIL-6 (HypIL-6) (30) to diminish IL-6 signaling variability due to changes in IL-6Rα expression during T cell activation (31). CD4+ T cells from human buffy coat samples were isolated by magnetic activated cell sorting (MACS) and grown under Th-1 polarizing conditions. Th-1 cells were then used to study in vitro signaling by IL-27 and IL-6 (Supp. Fig. 1b). We took advantage of a barcoding methodology allowing high-throughput multiparameter flow cytometry to perform detailed dose/response and kinetics studies induced by HypIL-6 and IL-27 in Th-1 cells (32) (Supp. Fig. 1b). Dose-response experiments with IL-27 and HypIL-6 on Th-1 cells showed concentration-dependent phosphorylation of STAT1 and STAT3. Phosphorylation of STAT1/3 was more sensitive to activation by IL-27, with an EC50 of ~20 pM compared to ~400 pM for HypIL-6 (Figure 1b). Despite this difference in sensitivity, both cytokines yielded the same activation amplitude for pSTAT3. For pSTAT1, however, we observed a significantly reduced maximal amplitude for HypIL-6 relative to IL-27 (Figure 1b). We next performed kinetic studies to assess whether the poor STAT1 activation by HypIL-6 was the result of different activation kinetics. For STAT3, we saw the peak of phosphorylation after ~15-30 minutes, followed by a gradual decline. Both cytokines exhibited an almost identical sustained pSTAT3 profile, with ~20% of activation still seen after 3h of continuous stimulation. Interestingly, IL-27 activated STAT1 not only with higher amplitude but also in a more sustained manner than HypIL-6 (Figure 1c).
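The ~20-fold difference in sensitivity reported above can be illustrated with a standard Hill dose-response function. This is a minimal sketch under stated assumptions, not the authors' analysis: the function name `hill_response`, the Hill coefficient of 1 and the amplitude normalised to 1 are hypothetical choices, and only the two EC50 values are taken from the text.

```python
# Illustrative Hill dose-response curves. Only the EC50 values come from the
# text; the Hill coefficient (1.0) and unit amplitude are assumptions.

EC50_PM = {"IL-27": 20.0, "HypIL-6": 400.0}


def hill_response(dose_pm, ec50_pm, top=1.0, hill=1.0):
    """Return the fraction of maximal pSTAT signal at a cytokine dose in pM."""
    if dose_pm <= 0:
        return 0.0
    return top * dose_pm ** hill / (ec50_pm ** hill + dose_pm ** hill)


def dose_response_table(doses_pm):
    """Map each cytokine name to its predicted responses over the given doses."""
    return {
        name: [hill_response(dose, ec50) for dose in doses_pm]
        for name, ec50 in EC50_PM.items()
    }


if __name__ == "__main__":
    doses = [1, 10, 20, 100, 400, 1000, 10000]
    table = dose_response_table(doses)
    for i, dose in enumerate(doses):
        row = ", ".join(f"{name}: {table[name][i]:.2f}" for name in EC50_PM)
        print(f"{dose:>6} pM -> {row}")
```

By construction each curve reaches half of its maximum at its own EC50, so at 20 pM the IL-27 curve is at 50% while the HypIL-6 curve is still below 5%. The sketch does not capture the reduced maximal pSTAT1 amplitude seen for HypIL-6, which would require a cytokine-specific `top` value.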
The more sustained STAT1 phosphorylation by IL-27 was more apparent when pSTAT1 levels were normalized to the maximal MFI for each cytokine (Supp. Fig. 1c). The same phenotype was observed in other T-cell subsets of activated PBMCs (Supp. Fig. 1d). As cell surface GP130 levels are significantly reduced upon T-cell activation (33), we next investigated whether the transient STAT1 activation profile induced by HypIL-6 resulted from limited availability of GP130. For that, we generated an RPE1 cell clone stably expressing ten-fold higher levels of GP130 on its surface (Figure 1d, right panel). Stimulation of this RPE1 clone with HypIL-6 resulted in a more sustained activation of STAT3, with very little effect on STAT1 activation kinetics when compared to RPE1 wild type cells, suggesting that GP130 receptor density does not contribute to the transient STAT1 activation kinetics elicited by HypIL-6 (Figure 1d).

Ligand-induced cell-surface receptor assembly by IL-27 and HypIL-6

We next investigated whether IL-27 and HypIL-6 elicited differential cell surface receptor engagement that could explain their distinct signaling output. For that, we measured the dynamics of receptor assembly in the plasma membrane of live cells by simultaneous dual-colour total internal reflection fluorescence (TIRF) imaging. RPE1 cells were chosen as a model experimental system since they do not express endogenous IL-27Rα (Supp. Fig. 1e). We used previously described RPE1 GP130 KO cells (Supp. Fig.
2a) (34) to transfect and express tagged variants of IL-27Rα and GP130, to allow quantitative site-specific fluorescence cell surface labelling by dye-conjugated nanobodies (NBs) (Figure 1e), as recently described (35). For both IL-27Rα and GP130 we found a random distribution and unhindered lateral diffusion of individual receptor monomers (Figure 1f). Single-molecule co-localization combined with co-tracking analysis was then used to identify correlated motion of IL-27Rα and GP130, which was taken as a readout for receptor heterodimer formation (36) (Figure 1f, Figure 1 supp. Movie 1). In the resting state, we did not observe pre-assembly of IL-27Rα and GP130. However, after stimulation with IL-27 we found substantial heterodimerization (Figure 1f & 1g, Supp. Fig. 2b, Figure 1 supp. Movie 1 & 2). At elevated laser intensities, bleaching analysis of individual complexes confirmed a one-to-one (1:1) complex stoichiometry of IL-27Rα and GP130, whereas single-molecule Förster resonance energy transfer (FRET) further corroborated close molecular proximity of the two receptor chains (Figure 1h). We also observed association and dissociation events of receptor heterodimers, pointing to a dynamic equilibrium between monomers and dimers as proposed for other heterodimeric cytokine receptor systems (37, 38) (Figure 1 supp. Movie 3). To measure homodimerization of GP130 by HypIL-6, we stochastically labelled GP130 with equal concentrations of the same NB species conjugated to either of the two dyes (39). We saw strong homodimerization of GP130 after stimulation with HypIL-6 (Figure 1g, Supp. Fig. 2b, Figure 1 supp. Movie 4). Homodimerization was confirmed either by single-color dual-step bleaching or dual-color single-step bleaching, as shown for other homodimeric cytokine receptors (Supp. Fig. 2c) (40).
For both cytokine receptor systems, we saw a cytokine-induced reduction of the diffusion mobility, which has been ascribed to increased friction of receptor dimers diffusing in the plasma membrane. However, we note that HypIL-6 stimulation impaired diffusion of GP130 more strongly than IL-27 did, possibly indicating faster receptor internalization (Supp. Fig. 2d). Based on the dimerization data, we were able to calculate the two-dimensional equilibrium dissociation constants ($K_D^{2D}$) according to the law of mass action for a dynamic monomer-dimer equilibrium: for IL-27-induced heterodimerization of IL-27Rα and GP130, we calculated a $K_D^{2D}$ of ~0.81 µm$^{-2}$. In activated T-cells with high levels and a significant excess of IL-27Rα over GP130, this $K_D^{2D}$ ensures strong receptor assembly by IL-27 (41). The $K_D^{2D}$ for GP130 homodimerization by HypIL-6 was ~0.21 µm$^{-2}$. This higher affinity is most likely due to the two high-affinity binding sites engaged in the hexameric receptor complex (9). However, in T-cells the expression of GP130 can be particularly low, thus probably limiting HypIL-6 signaling. Taken together, these experiments marked ligand-induced receptor assembly as the initial step triggering downstream signaling for both IL-27 and HypIL-6, with no obvious differences in their receptor activation mechanism which could support the observed more sustained STAT1 activation elicited by IL-27.

Mathematical and statistical analysis of HypIL-6 and IL-27 induced STAT kinetic responses

To gain further insight into the molecular rules and kinetics that define IL-27 sustained STAT1 phosphorylation, we developed two mathematical models of the initial steps of HypIL-6 and
IL-27 receptor-mediated signaling, respectively. The mathematical model for each cytokine considers the following events: i) cytokine association and dissociation to a receptor chain (Figure 2a, Supp. Fig. 3a and 3b, top panel), ii) cytokine-induced dimer association and dissociation (Supp. Fig. 3a and 3b, bottom panel), iii) STAT1 (or STAT3) binding and unbinding to dimer (Supp. Fig. 3c and 3d), iv) STAT1 (or STAT3) phosphorylation when bound to dimer (Supp. Fig. 3c and 3d), v) internalisation/degradation of complexes (Supp. Fig. 3e and 3f), and vi) dephosphorylation of free STAT1 (or STAT3) (Supp. Fig. 3g). Details of model assumptions, model parameters and parameter inference are provided in the Material and Methods under Mathematical models and Bayesian inference. We first wanted to explore whether a feedback mechanism governs the way in which receptor molecules are internalised/degraded over time. To this end, and for each cytokine model, we considered two hypotheses: hypothesis 1 assumes that receptor complexes (Supp. Fig. 3e and 3f) are internalised with a rate proportional to the concentration of the species in which they are contained (e.g., different dimer types), and hypothesis 2 assumes that receptor complexes are internalised with a rate proportional to the product of the concentration of the species in which they are contained and the sum of the concentrations of free phosphorylated STAT1 and STAT3. Hypothesis 2 is consistent with a negative feedback mechanism in which pSTAT molecules translocate to the nucleus, where they increase the production of negative feedback proteins such as SOCS3.
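The reaction scheme and the two internalisation hypotheses can be sketched as a small system of ordinary differential equations. The code below is a simplified illustration, not the authors' model: it starts from pre-formed ligand-bound dimers (skipping events i and ii), assumes that a STAT bound to an internalised complex is lost with it, integrates with a plain forward-Euler step, and uses made-up parameter values and arbitrary concentration units. All names (`PARAMS`, `INITIAL`, `simulate`) are hypothetical.

```python
# Toy ODE sketch of events iii-vi with two internalisation hypotheses.
# All rate constants and initial amounts are hypothetical illustration values.

PARAMS = {
    "kon1": 0.01, "koff1": 0.1,   # STAT1 binding / unbinding to dimer
    "kon3": 0.05, "koff3": 0.1,   # STAT3 binding / unbinding to dimer
    "q": 1.0,                     # phosphorylation rate of bound STAT
    "d1": 0.05, "d3": 0.05,       # dephosphorylation of free pSTAT1 / pSTAT3
    "beta": 0.02,                 # internalisation constant
}

# D: free dimer, C1/C3: dimer-STAT complexes, S1/S3: free STAT, P1/P3: free pSTAT
INITIAL = {"D": 10.0, "C1": 0.0, "C3": 0.0,
           "S1": 100.0, "S3": 100.0, "P1": 0.0, "P3": 0.0}


def simulate(hypothesis, t_end=180.0, dt=0.01):
    """Integrate the toy model; return [(time_in_min, state_dict), ...] per minute."""
    if hypothesis not in (1, 2):
        raise ValueError("hypothesis must be 1 or 2")
    p = PARAMS
    s = dict(INITIAL)
    n_steps = int(round(t_end / dt))
    sample_every = int(round(1.0 / dt))
    trajectory = [(0.0, dict(s))]
    for step in range(1, n_steps + 1):
        # Net binding fluxes of STAT1 and STAT3 to the dimer (event iii).
        bind1 = p["kon1"] * s["D"] * s["S1"] - p["koff1"] * s["C1"]
        bind3 = p["kon3"] * s["D"] * s["S3"] - p["koff3"] * s["C3"]
        # Event v: constant rate (hypothesis 1) or pSTAT-dependent (hypothesis 2).
        if hypothesis == 1:
            r = p["beta"]
        else:
            r = p["beta"] * (s["P1"] + s["P3"])
        deriv = {
            "D": -bind1 - bind3 + p["q"] * (s["C1"] + s["C3"]) - r * s["D"],
            "C1": bind1 - p["q"] * s["C1"] - r * s["C1"],
            "C3": bind3 - p["q"] * s["C3"] - r * s["C3"],
            "S1": -bind1 + p["d1"] * s["P1"],   # event vi returns pSTAT1 to STAT1
            "S3": -bind3 + p["d3"] * s["P3"],
            "P1": p["q"] * s["C1"] - p["d1"] * s["P1"],  # event iv releases pSTAT
            "P3": p["q"] * s["C3"] - p["d3"] * s["P3"],
        }
        for key in s:
            s[key] += dt * deriv[key]
        if step % sample_every == 0:
            trajectory.append((step * dt, dict(s)))
    return trajectory


if __name__ == "__main__":
    for h in (1, 2):
        t, final = simulate(h)[-1]
        print(f"H{h}: t={t:.0f} min, pSTAT1={final['P1']:.2f}, pSTAT3={final['P3']:.2f}")
```

Under hypothesis 1 the total receptor pool decays exponentially regardless of signaling, whereas under hypothesis 2 internalisation only begins once free pSTAT has accumulated. For a real analysis, an adaptive ODE solver would be preferable to the fixed-step Euler scheme used here for self-containment.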
As described in the Material and Methods (Mathematical models and Bayesian inference), we made use of the RPE1 experimental data set to carry out mathematical model selection for the two different hypotheses. We found that hypothesis 1 could explain the data better than hypothesis 2, with a probability of 70%. This result can be seen in Figure 2b, in which we plot the relative probability of each hypothesis for different values of the distance threshold between the mathematical model output and the data (see Mathematical models and Bayesian inference in Material and Methods for details); hypothesis 1 is denoted $H_1$ and hypothesis 2 is denoted $H_2$. It can be observed that for smaller values of the distance threshold, which indicate better support from the data for the mathematical model, the relative probability of hypothesis 1 is higher than that of hypothesis 2. We then made use of this result to explore the mathematical models for both cytokines under hypothesis 1; in particular, we performed parameter calibration. To this end (and as described in Material and Methods under Mathematical models and Bayesian inference), we carried out Bayesian inference together with the mathematical models (hypothesis 1) and the experimental data sets to quantify the reaction rates (see Supp. Fig. 3) and initial molecular concentrations (see Table 1 and Table 2). The Bayesian parameter calibration of the two models of cytokine signaling allows one to quantify the observed kinetics of STAT1/3 phosphorylation induced by HypIL-6 and IL-27 in RPE1 and Th-1 cells (Figure 2c). Substantial differences in STAT association rates to and dissociation rates from the dimeric complexes were inferred to critically contribute to defining pSTAT1/3 kinetics. Figure 2d shows the kernel density estimates (KDEs) for the posterior distributions of the rate constants and initial concentrations in the models.
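The idea of comparing hypotheses across distance thresholds can be illustrated with a rejection-style approximate Bayesian computation (ABC) scheme. Reading the procedure as ABC is an assumption based on the distance-threshold description; the inference algorithm itself is specified in the Material and Methods, not here. The two decay curves below are toy stand-ins for the two hypotheses, the data are synthetic, and every name and number is hypothetical.

```python
# Toy ABC rejection sketch for model selection between two hypotheses.
# Models, prior range, data and thresholds are all illustrative assumptions.
import math
import random

TIMES = [0, 30, 60, 90, 120, 150, 180]  # minutes


def model_h1(beta, t):
    """Stand-in for hypothesis 1: constant-rate loss of receptor."""
    return math.exp(-beta * t)


def model_h2(beta, t):
    """Stand-in for hypothesis 2: loss rate that grows with time (feedback-like)."""
    return math.exp(-beta * t * t / 100.0)


def rmsd(simulated, observed):
    """Root-mean-square distance between model output and data."""
    n = len(observed)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(simulated, observed)) / n)


def abc_model_probabilities(data, thresholds, n_draws=5000, seed=0):
    """Return {threshold: relative probability of H1 among accepted draws}."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        model = rng.choice((1, 2))          # uniform prior over the hypotheses
        beta = rng.uniform(0.0, 0.1)        # uniform prior over the rate
        func = model_h1 if model == 1 else model_h2
        simulated = [func(beta, t) for t in TIMES]
        draws.append((model, rmsd(simulated, data)))
    result = {}
    for eps in thresholds:
        accepted = [model for model, dist in draws if dist <= eps]
        result[eps] = accepted.count(1) / len(accepted) if accepted else None
    return result


# Synthetic, noise-free data generated from the H1 stand-in.
DATA = [model_h1(0.02, t) for t in TIMES]
PROBS = abc_model_probabilities(DATA, thresholds=(0.5, 0.2, 0.1, 0.05))

if __name__ == "__main__":
    for eps, prob in PROBS.items():
        print(f"threshold {eps}: P(H1 | accepted) = {prob}")
```

As the threshold tightens, only draws from the model that can actually reproduce the data survive, so the relative probability of that model rises. This mirrors the qualitative trend described for Figure 2b, although the numbers here carry no biological meaning.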
$k_{i,\mathrm{GP130}}$ denotes the rate at which STAT$i$ binds to GP130 and $k_{i,\mathrm{IL27R\alpha}}$ denotes the rate at which STAT$i$ binds to IL-27Rα, for $i \in \{1,3\}$. Our results indicate that STAT1 and STAT3 exhibit different binding preferences towards IL-27Rα and GP130, respectively. While STAT1 exhibits stronger binding to IL-27Rα than to GP130 ($k_{1,\mathrm{IL27R\alpha}} > k_{1,\mathrm{GP130}}$), STAT3 exhibits stronger binding to GP130 than to IL-27Rα ($k_{3,\mathrm{GP130}} > k_{3,\mathrm{IL27R\alpha}}$), in agreement with previous observations (42).

IL-27Rα cytoplasmic domain is required for sustained pSTAT1 kinetics

The Bayesian inference carried out with the experimental data and the mathematical models clearly indicated statistically significant differences in the binding rates of STAT1/STAT3 to GP130 and IL-27Rα, which account for the different phosphorylation kinetics exhibited by HypIL-6 and IL-27. Thus, we next investigated whether the more sustained STAT1 activation by IL-27 resulted from its specific engagement of IL-27Rα. For that, we used RPE1 cells, which do not express IL-27Rα (Supp. Fig. 1e), to systematically dissect the contribution of the IL-27Rα cytoplasmic domain to the differential pSTAT activation by IL-27. The intracellular domain of IL-27Rα is very short and encodes only two Tyr residues that can be phosphorylated in response to IL-27 stimulation, i.e., Tyr543 and Tyr613 (Figure 3a). We mutated these two Tyr residues to Phe to analyse their contribution to IL-27 induced signaling. We stably expressed WT IL-27Rα as well as different IL-27Rα Tyr mutants in RPE1 cells with comparable cell surface expression levels (Figure 3b).
Importantly, this reconstituted experimental system mimicked the pSTAT1/3 activation kinetics of T-cells (Supp. Fig. 4a). As the endogenous GP130 expression levels remain unaltered, all generated clones exhibited very comparable responses to HypIL-6 (Figure 3b, bottom panels). IL-27 triggered comparable levels of STAT1 and STAT3 activation in RPE1 cells reconstituted with IL-27Rα WT and the IL-27Rα Y543F mutant, suggesting that this Tyr residue does not contribute to signaling by this cytokine (Figure 3b and Supp. Fig. 4b). In RPE1 cells reconstituted with the IL-27Rα Y613F or Y543F-Y613F mutants, IL-27 stimulation resulted in 80% of the STAT3 activation but only 20% of the STAT1 activation observed with IL-27Rα WT (Figure 3b) (43). These observations suggest a tight coupling of STAT phosphorylation to one of the receptor chains; namely, IL-27Rα with pSTAT1 and GP130 with pSTAT3.

We next tested how the cytoplasmic domains of GP130 and IL-27Rα shape the pSTAT kinetic profiles. Thus, we generated a stable RPE1 clone expressing a chimeric construct comprised of the extracellular and transmembrane domains of IL-27Rα but the cytoplasmic domain of GP130 (Figure 3c, Supp. Fig. 5a). Again, as both cell lines express unaltered endogenous GP130 levels, they exhibited comparable responses to HypIL-6 (Figure 3c). Strikingly, this domain swap resulted in a transient pSTAT1 kinetic response by IL-27 comparable to HypIL-6 stimulation. STAT3 activation, on the other hand, remained unaltered, suggesting that the cytoplasmic domain of IL-27Rα is essential for a sustained pSTAT1 response but not for pSTAT3. Two plausible scenarios could explain the observed pSTAT1/3 activation differential by HypIL-6 and IL-27: i) the IL-27Rα-JAK2 complex phosphorylates STAT1 faster than the GP130-JAK1 complex, or ii) pSTAT1 is more quickly dephosphorylated in the IL-6/GP130 receptor homodimer.
In the latter case, pSTAT deactivation by constitutively expressed phosphatases could be an additional factor of regulation. Indeed, SHP-2 has been described to bind to GP130 and shape IL-6 responses (44). However, our Bayesian inference results (together with the mathematical models and the experimental data) identified the STAT/receptor association rates as the only rates that could account for the greater and more sustained activation of STAT1 by IL-27. We note (as described in the Material and Methods) that the phosphorylation rate, denoted by $q$, of STAT1 and STAT3 when bound to a dimer (homo- or hetero-) has been assumed to be independent of the STAT type and the receptor chain. Moreover, the model also included dephosphorylation of free pSTAT molecules, and predicted that the rates at which these reactions occur ($d_1$ and $d_3$) had rather similar posterior distributions, hence arguing against a potential role of phosphatases in specifically targeting STAT1 upon HypIL-6 stimulation. To distinguish between the two plausible scenarios, we next determined the rates of pSTAT1/3 dephosphorylation by blocking JAK activity upon cytokine stimulation, making use of the JAK inhibitor Tofacitinib in RPE1 cells. Tofacitinib was added 15 minutes after stimulation with either cytokine, and pSTAT1 and pSTAT3 levels were measured at the indicated times. JAK inhibition markedly shortened the pSTAT1/3 activation profiles induced by both cytokines (Figure 3d, Supp. Fig. 5b).
The relative dephosphorylation rates could then be determined from the ratio of signal intensities with and without Tofacitinib. Even though pSTAT1 levels were more affected by JAK inhibition than those of pSTAT3, the observed relative changes were nearly identical for IL-27 and HypIL-6. These findings were also confirmed for Th-1 cells (Supp. Fig. 5c & 5d) and indicate that selective phosphatase activity cannot serve as an explanation for the pSTAT1/3 differential by HypIL-6 and IL-27, in agreement with our mathematical modelling predictions. Similarly, we tested whether neosynthesis of feedback inhibitors such as SOCS3 (19) would selectively impair signaling by HypIL-6 but not by IL-27. To this end, we pre-treated cells with Cycloheximide (CHX) and followed the pSTAT1/3 kinetics induced by the two cytokines (Supp. Fig. 6a & 6b). CHX treatment resulted in more sustained pSTAT3 activity for both cytokines. To our surprise, STAT1 phosphorylation by IL-27 was even more sustained, while pSTAT1 levels induced by IL-6 remained unaffected. These observations rule out the possibility that feedback inhibitors selectively impair STAT1 activation kinetics by HypIL-6; such inhibitors therefore do not account for the faster STAT1 dephosphorylation kinetics observed under HypIL-6 stimulation. Overall, our data from the chimera and mutant experiments, which were not used in the Bayesian calibration, provide strong and independent support, as well as validation, to the mathematical models of HypIL-6 and IL-27 signaling, and point to the differential association/dissociation of STAT1 and STAT3 to IL-27Rα and GP130, respectively, as the main factor defining STAT phosphorylation kinetics in response to HypIL-6 and IL-27 stimulation.

Unique and overlapping effects of IL-27 and HypIL-6 on the Th-1 phosphoproteome

Thus far, we have investigated the differential activation of STAT1/STAT3 induced by HypIL-6 and IL-27.
Next, we asked whether IL-27 and IL-6 induced the activation of additional and specific intracellular signaling programs that could contribute to their unique biological profiles. To this end, we investigated the IL-27 and HypIL-6 activated signalosome using quantitative mass-spectrometry-based phospho-proteomics. MACS-isolated CD4+ T cells were polarized into Th-1 cells and expanded in vitro for stable isotope labelling by amino acids in cell culture (SILAC). Cells were then stimulated for 15 min with saturating concentrations of IL-27 or HypIL-6, or left untreated. Samples were enriched for phosphopeptides (Ti-IMAC) and subjected to mass spectrometry, and raw files were analysed with MaxQuant software (Supp. Fig. 7a). In total, we could quantify ~6400 phosphopeptides from 2600 proteins, identified across all conditions (unstimulated, IL-27, HypIL-6) for at least two out of three tested donors. For IL-27 and HypIL-6 we detected similar numbers of significantly upregulated (87 vs. 78) and downregulated (155 vs. 140) phosphorylation events (Figure 4a) and systematically categorized them by cellular location and ascribed biological function (Supp. Fig. 7b & 7c) (45). The two cytokines shared approximately half of the upregulated and one third of the downregulated phosphopeptides (Supp. Fig. 8a) but also exhibited differential target phosphorylation (Figure 4b and Supp. Fig. 8b). As expected, we found multiple members of the STAT protein family among the top phosphorylation hits by the two cytokines, validating our study (Figure 4b & 4c). In line with our previous observations, we detected the same relative amplitudes for tyrosine-phosphorylated STAT3 and STAT1. In addition to tyrosine phosphorylation, we detected robust serine phosphorylation on S727 for STAT1 and STAT3 (Figure 4c).
While pS-STAT1 activity correlated with pY-STAT1, with IL-27 being more potent than HypIL-6, this was not the case for STAT3. Despite an identical pY-STAT3 phosphorylation profile, HypIL-6 induced ~50% higher pS-STAT3 levels relative to IL-27 (Figure 4c). These results were corroborated by following the phosphorylation kinetics of pS-STAT1 and pS-STAT3 by flow cytometry (Figure 4d). Given the overlapping phospho-proteomic changes, gene ontology (GO) analysis associated several sets of phosphopeptides with biological processes that were mostly shared between both cytokines (Figure 4e, Supp. Fig. 8c). A large set of phosphopeptides was linked to transcription initiation (including JAK/STAT signaling) or mRNA modification (Figure 5e). Interestingly, IL-27 stimulation was associated with negative regulation of RNA polymerase II, whereas positive regulation was detected for HypIL-6. A closer look into the functional regulation of RNA-pol II activity by the two cytokines revealed that multiple proteins involved in this process were differentially regulated by HypIL-6 and IL-27 (Figure 5f). While positive regulators of RNA-pol II transcription, such as Negative Elongation Factor A (NELFA), PPM1G, RCHY1 and POL2RA, were much more phosphorylated in response to HypIL-6 than to IL-27, negative regulators of RNA-pol II transcription, such as LARP7, were much more engaged by IL-27 treatment than by HypIL-6 (Figure 4f).
Interestingly, in a previous study we linked RNA-pol II regulation with the levels of STAT3 S727 phosphorylation induced by HypIL-6 via recruitment of CDK8 to STAT3-dependent genes (46). Our phospho-proteomic analysis thus suggests that IL-27 and HypIL-6 recruit different transcriptional complexes that could ultimately contribute to the gene expression specificity of the two cytokines. Additionally, we identified several interesting IL-27-specific phosphorylation targets. One example was Ubiquitin Protein Ligase E3 Component N-Recognin 5 (UBR5). Phosphorylated UBR5 leads to ubiquitination and subsequent degradation of Rorgc (47), the key transcription factor required for Th-17 lineage commitment, thus limiting Th-17 differentiation (Supp. Fig. 8d). A second example is PAK2, which phosphorylates and stabilizes FoxP3, leading to higher levels of TReg cells (Supp. Fig. 8d) (48). Moreover, IL-27 stimulation led to a very strong phosphorylation of BCL2-associated agonist of cell death (BAD), a critical regulator of T-cell survival and a well-known substrate of the PAK2 kinase (49). Overall, our data show a large overlap between the IL-6 and IL-27 signaling programs, with a strong focus on JAK/STAT signaling. However, IL-27 engages additional signaling intermediaries that could contribute to its unique immuno-modulatory activities. Further studies will be required to assess how these IL-27-specific signaling pockets contribute to shaping IL-27 responses.

Kinetic decoupling of gene induction programs depends on sustained STAT1 activation and IRF1 expression by IL-27

Next, we investigated how the different kinetics of STAT activation induced by HypIL-6 and IL-27 ultimately modulated gene expression by these two cytokines. To this end, we performed RNA-seq analysis of Th-1 cells stimulated with HypIL-6 or IL-27 for 1h, 6h and 24h to obtain a dynamic perspective of gene regulation.
We identified ~12500 shared genes that could be quantified for all three donors and throughout all tested experimental conditions. In a first step, we compared how similar the gene programs induced by HypIL-6 and IL-27 were. Principal component analysis (PCA) was run for the subset of genes found to be significantly upregulated (total ~250) or downregulated (total ~950) in any of the experimental conditions (p value ≤ 0.05, fold change ≥ +2 or ≤ -2). At one hour of stimulation, HypIL-6 and IL-27 induced very similar gene programs, with the two cytokines clustering together in the PCA regardless of whether we focused on the subsets of upregulated or downregulated genes (Figure 5a). However, the similarities between the two cytokines changed dramatically in the course of continuous stimulation. While the two cytokines induced the downregulation of comparable gene programs at 6h and 24h of stimulation, as denoted by the close clustering in the PCA (Figure 5a, right panel) and the fraction of shared genes (~40%, Figure 5b, Supp. Fig. 9a-c, Supp. Fig. 10a), this was not observed for upregulated genes. Although the two cytokines induced comparable gene upregulation programs after 1h of stimulation (~80% shared genes), this trend almost completely disappeared at later stimulation times (Figure 5a & 5b, Supp. Fig. 10b). This is well reflected by the absolute numbers of up- or downregulated genes observed for IL-27 and HypIL-6 (Figure 5c). Stimulation with both cytokines yielded a similar trend of gene downregulation (Figure 5c, right panel).
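The gene selection step described above (p value ≤ 0.05, fold change ≥ +2 or ≤ -2) and the comparison of shared genes can be illustrated with a minimal Python sketch. This is not the analysis code used in this study. The function names, the toy gene identifiers and values, the reading of a fold change of ±2 as a log2 fold change of ±1, and the definition of the shared fraction as intersection over union are all assumptions made for illustration.

```python
import math


def select_regulated(genes, p_max=0.05, min_fold=2.0):
    """Split genes into upregulated and downregulated lists.

    genes: iterable of (name, log2_fold_change, p_value) tuples.
    Assumption: a fold change of +2 / -2 corresponds to a log2 fold
    change of +1 / -1 (two-fold increase / decrease).
    """
    cutoff = math.log2(min_fold)
    up, down = [], []
    for name, log2_fc, p_value in genes:
        if p_value > p_max:
            continue  # fails the significance threshold
        if log2_fc >= cutoff:
            up.append(name)
        elif log2_fc <= -cutoff:
            down.append(name)
    return up, down


def shared_fraction(genes_a, genes_b):
    """Shared genes as a fraction of all genes in either list.

    Assumption: intersection over union; the study may define the
    shared fraction differently.
    """
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy input, not data from this study.
toy_il27 = [("gene_a", 1.8, 0.01), ("gene_b", 1.2, 0.04),
            ("gene_c", -1.5, 0.02), ("gene_d", 2.5, 0.30)]
toy_hypil6 = [("gene_a", 1.1, 0.03), ("gene_b", 0.4, 0.01),
              ("gene_c", -1.0, 0.05), ("gene_d", 0.1, 0.90)]

up_il27, down_il27 = select_regulated(toy_il27)
up_hypil6, down_hypil6 = select_regulated(toy_hypil6)
print("shared upregulated fraction:", shared_fraction(up_il27, up_hypil6))
print("shared downregulated fraction:", shared_fraction(down_il27, down_hypil6))
```

Applying the same filter per time point (1h, 6h, 24h) would give the kind of per-condition gene lists that are compared in the text.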
However, while HypIL-6 stimulation resulted in a spike of gene upregulation at 1h that quickly disappeared at later stimulation times, IL-27 stimulation increased the number of upregulated genes beyond 6h of stimulation and maintained it even after 24h (Figure 5c, left panel). This “kinetic decoupling” of gene induction seems to have a striking functional relevance. Gene set enrichment analysis (GSEA) (50) identified several Reactome pathways enriched for IL-27 over the course of stimulation, most of them linked with interferon signaling and immune responses (Figure 5d). In contrast, no pathway enrichment was detected for HypIL-6 stimulation. Most importantly, the vast majority of IL-27-induced genes associated with these pathways were genes upregulated by IL-27 treatment that have previously been linked to STAT1 activation (51, 52) (Supp. Fig. 10c). Although HypIL-6 treatment resulted in the induction of some of these genes, their expression was very transient, in agreement with the short STAT1 activation kinetic profile exhibited by HypIL-6 (Supp. Fig. 10b & 10c). Next, we performed cluster analysis to find further similarities and discrepancies between the gene expression programs engaged by HypIL-6 and IL-27 (Figure 5e). Since genes downregulated by IL-27 and HypIL-6 showed overall good similarity throughout the whole kinetic series, we mainly focused on differences in upregulated gene induction. We identified three functionally relevant gene clusters. The first gene cluster corresponds to genes that are transiently and equally induced by HypIL-6 and IL-27. These genes peak after one hour and return to basal levels after 6h and 24h of stimulation (Figure 5e).
Interestingly, this cluster contains classical IL-6-induced and STAT3-dependent genes, such as members of the NF-κB and Jun/Fos transcriptional complex (53), as well as the feedback inhibitor Suppressor Of Cytokine Signaling 3 (SOCS3) (54) and the T-cell early activation marker CD69 (Figure 5e). A second cluster corresponded to genes that were persistently activated by IL-27 but only transiently by HypIL-6 (Figure 5e). Among these genes we found classical STAT1-dependent genes, such as SOCS1, Programmed Cell Death Ligand 1 (PDL1 = CD274) (55) and members of the interferon-induced protein with tetratricopeptide repeats (IFIT) family. The third cluster corresponded to genes exhibiting strong and sustained activation by IL-27 after 6h and 24h of stimulation but no activation by HypIL-6 at all. This “2nd wave” of gene induction by IL-27 was almost exclusively comprised of classical Interferon Stimulated Genes (ISGs) (Supp. Fig. 10c), such as STAT1 & 2, Guanylate Binding Protein 1 (GBP1), GBP2, 4 & 5, and IRF8 & 9. It is worth mentioning that genes in the third cluster appear to require persistent STAT1 activation (56, 57) and were the basis for the IFN signature identified in our Reactome pathway analysis. Still, we were surprised by the magnitude of this 2nd gene wave. Even though IL-27 exerts a sustained pSTAT1 kinetic profile, pSTAT1 levels were down to ~10% of maximal amplitude after 3h of stimulation. We reasoned that additional factors could further amplify the STAT1 response for IL-27 but not for HypIL-6. Within the 1st wave of STAT1-dependent genes, we also spotted the transcription factor Interferon Regulatory Factor 1 (IRF1), which was continuously induced throughout the kinetic series in response to IL-27 but only transiently spiked after 1h of HypIL-6 stimulation (Figure 5e). IRF1 expression was shown to prolong pSTAT1 kinetics (58) and to be required for IL-27-dependent Tr-1 differentiation and function (59). We confirmed the kinetics of IRF1 protein expression by flow cytometry and showed higher and more sustained protein levels after IL-27 stimulation relative to HypIL-6 (Figure 6a). Next, we tested in our RPE1 cell system whether siRNA-mediated knockdown of IRF1 would alter the induction profiles of selected STAT1- or STAT3-dependent marker genes. In RPE1 cells reconstituted with IL-27Ra, IRF1 protein levels peaked around 6h after stimulation with IL-27, and transfection with IRF1-targeting siRNA knocked down expression by >80% (Figure 6b). Importantly, knockdown of IRF1 did not alter the overall kinetics of pSTAT1 and pSTAT3 activation (Figure 6c). Induction of the STAT1-dependent genes STAT1, GBP5 and OAS1, as well as the STAT3-dependent gene SOCS3, was followed by RT-qPCR (Figure 6d). Interestingly, up to 6h of stimulation, the gene induction curves were identical for control- and IRF1-siRNA treated cells. Later than 6h, that is, when IRF1 protein levels peak, gene induction was decreased by 40-70% in the absence of IRF1. Strikingly, expression of SOCS3, a classical STAT3-dependent reporter gene, was transient and independent of IRF1 levels, highlighting that IRF1 selectively amplifies STAT1-dependent gene induction.
Taken together, our data support a scenario whereby IL-27, by exhibiting a kinetic decoupling of STAT1 and STAT3 activation, is capable of triggering independent gene expression waves, which ultimately contribute to shaping its distinct biology.

IL-27-induced STAT1 response drives global proteomic changes in Th-1 cells

Next, we aimed to uncover how the distinct gene expression programs engaged by HypIL-6 and IL-27 ultimately relate to alterations of the Th-1 cell proteome. For that, we continuously stimulated SILAC-labelled Th-1 cells for 24h with saturating doses of IL-27 and HypIL-6 and compared quantitative proteomic changes to unstimulated controls (Figure 7a). We quantified ~3600 proteins present in all three biological replicates and in all tested conditions (unstimulated/IL-27/HypIL-6). Both cytokines downregulated a similar number of proteins (IL-27: 57, HypIL-6: 52) (Figure 7b), with approximately half of them being shared by the two cytokines, mimicking our observations in the RNA-seq studies (Figure 7c, Supp. Fig. 11a). With 68 upregulated proteins, IL-27 was almost twice as potent as HypIL-6 (35 proteins), with very little overlap. Among the proteins upregulated by IL-27 but not by HypIL-6, we detected several with described immune-modulatory functions on T-cells. One of these proteins was Transforming Growth Factor β (TGF-β), which is a key regulator with pleiotropic functions on T-cells (60). TGF-β has been identified to act synergistically with IL-27 to induce IL-10 secretion from Tr-1 cells, thus accounting for one of the key anti-inflammatory functions of IL-27 (61). On the other hand, we also found the SELPLG-encoded protein PSGL-1, which is critically required for efficient migration and adhesion of Th-1 cells to inflamed intestines (62, 63). Interestingly, we found LARP7 moderately upregulated by IL-27. This negative regulator of RNA pol II was also identified in our phospho-target screening and selectively engaged by IL-27 (Figure 4f).
IL-27 and HypIL-6 share ~60% of downregulated proteins, but without strong functional patterns. Both cytokines downregulated several proteins related to the mitotic cell cycle (LIG1, CSNK2B, PSMB1) and to mRNA processing and splicing (NCBP2, PCBP2, NUDT21) (64). Strikingly, a significant number (~40%) of proteins upregulated by IL-27 belong to the group of ISGs (Figure 7b & 7c, Supp. Fig. 11b). This particular set of proteins, including STAT1, STAT2, MX Dynamin like GTPase 1 (MX1), Interferon Stimulated Gene 20 (ISG20) and Poly(ADP-Ribose) Polymerase Family Member 9 (PARP9), was not markedly altered by HypIL-6. Of note, the overall expression patterns of the most significantly altered proteins are congruent with the gene induction patterns observed after 6h and 24h (Figure 7d & 7e, Supp. Fig. 10b). Similarly, GSEA Reactome analysis again identified pathways associated with interferon signaling and cytokine/immune system for IL-27 but failed to detect any significant functional enrichment by HypIL-6 (Figure 7e, Supp. Fig. 11b & 11c). Finally, we correlated RNA-seq-based gene induction patterns with detected proteomic changes. To our surprise, we only found a relatively low number of shared hits. However, the identified proteins belong exclusively to a group upregulated by IL-27 (Figure 7f). They are all located in the “2nd gene wave” cluster and all of them are regulated by ISGs (Figure 5e).
Taken together, these results provide compelling evidence that sustained pSTAT1 activation by IL-27 accounts for its gene induction and proteomic profiles, thus giving a mechanistic explanation for the diverse biological outcomes of IL-27 and IL-6. Our observations are in good agreement with previous findings in cancer cells, showing that STAT1 activation in particular is responsible for proteomic remodeling by IL-27 (65).

Receptor and STAT concentrations determine the nature of the IL-6/IL-27 response

Our data suggest that STAT molecules compete for binding to a limited number of phospho-Tyr motifs in the intracellular domains of cytokine receptors. A direct consequence of this hypothesis is that cells can adjust and change their responses to cytokines by altering their concentrations of specific STATs or receptor molecules. To assess to what degree immune cells differ in their expression of cytokine receptors and STATs, we investigated levels of IL-6Ra, GP130, IL-27Ra, STAT1 and STAT3 protein expression across different immune cell populations, making use of the Immunological Proteomic Resource (ImmPRes - http://immpres.co.uk) database. Strikingly, the expression levels of these proteins change dramatically across the populations studied (Figure 8a), suggesting that these cells could potentially produce very different responses to HypIL-6 and IL-27 stimulation. In order to quantify (and predict) how changes in expression levels of different proteins modify the kinetics of pSTAT, we made use of the two mathematical models of HypIL-6 and IL-27 stimulation and the parameters inferred with Bayesian methods. Using the posterior parameter distributions generated from the Bayesian parameter calibration, our mathematical models could accurately reproduce the experimental results generated across our study, i.e., signaling by the IL-27Ra chimeric and IL-27Ra-Y616F mutant receptors and dose/response studies (Supp. Fig. 12a-c).
Having developed mathematical models which are able to accurately explain the experimental data (Supp. Fig. 5b and 5c) and reproduce independent experiments (Fig. 3b and 3c), we then sought to use the models to predict pSTAT signaling kinetics under different concentration regimes of receptors and STATs. To simplify the simulations, we focused our analysis on GP130 and STAT1, two of the proteins that vary greatly across the different immune populations (Figure 8a). As baseline values for the concentrations [GP130(0)], [IL27Ra(0)], [STAT1(0)] and [STAT3(0)] we used approximately the median values from the posterior distributions for each parameter: [GP130(0)] = 25 nM, [IL27Ra(0)] = 50 nM and [STAT1(0)] = [STAT3(0)] = 500 nM. To see the effect of varying GP130 concentrations on pSTAT signaling, we decreased the initial concentration of GP130 and simulated the model using the accepted parameter sets from the ABC-SMC to inform the other parameter values. A tenfold reduction in GP130 concentration ([GP130(0)] = 2.5 nM) resulted in a striking loss in pSTAT1 levels induced by HypIL-6, with very little effect on pSTAT3 levels induced by this cytokine (Figure 8b). In contrast, pSTAT1/3 kinetics induced by IL-27 were not affected by this decrease in GP130 concentration (Figure 8b).
Interestingly, the HypIL-6 signaling profile predicted by our model at low GP130 concentrations strongly resembles the one induced by HypIL-6 in Th-1 cells (Figure 1c), where very low levels of GP130 are found, further confirming the robustness of the predictions generated by our mathematical models. When the concentration of STAT1 was increased by a factor of ten ([STAT1(0)] = 5000 nM), both HypIL-6 and IL-27 induced significantly higher levels of pSTAT1 activation (Figure 8b). pSTAT3 levels were not affected for HypIL-6 stimulation but were decreased for IL-27 stimulation (Figure 8b), further indicating the competitive nature of the binding of STAT1 and STAT3 to IL-27Ra and GP130. Overall, our mathematical model predicts that changes in GP130 and STAT1 expression produce a substantial remodeling of the HypIL-6 and IL-27 signalosome, which ultimately could lead to aberrant responses.

STAT1 protein levels in SLE patients modify HypIL-6 and IL-27 signaling responses

STAT1 is a classical IFN-responsive gene and STAT1 levels are highly increased in environments rich in IFNs (66). Thus, we next asked whether STAT1 levels would be increased in SLE patients, an example of a disease in which IFNs have been shown to correlate with a poor prognosis, making use of available gene expression datasets (67). We did not find differences in the expression of GP130, IL-6Ra or IL-27Ra in SLE patients (Figure 8c). However, we detected a significant increase in the levels of STAT1 and STAT3 transcripts in these patients when compared to healthy controls, with the increase in STAT1 expression being significantly more pronounced (Figure 8c). Since our mathematical model predicted that increases in STAT1 expression could significantly change cytokine-induced cellular responses by HypIL-6 and IL-27, we next tested this prediction experimentally. For that, we primed Th-1 cells with IFNα2 overnight to increase total STAT1 levels (and, to a lesser extent, STAT3 levels) in these cells (Supp. Fig. 13a).
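The competition between STAT1 and STAT3 for a limited number of phospho-Tyr motifs can be illustrated with a minimal equilibrium sketch in Python. This is not the ODE model calibrated by ABC-SMC in this study: it only evaluates the textbook competitive-binding occupancy of one shared site, it ignores kinetics and receptor-specific differences, and the dissociation constants are hypothetical placeholders chosen to be equal for both STATs. Only the STAT concentrations (500 nM baseline, 5000 nM after a tenfold STAT1 increase) are taken from the text.

```python
def site_occupancy(concentrations_nm, kd_nm):
    """Equilibrium occupancy of one shared binding site by competing binders.

    theta_i = (c_i / Kd_i) / (1 + sum_j c_j / Kd_j)

    Assumption: binders are in large excess over the site, so free
    concentrations are approximated by total concentrations.
    """
    ratios = {name: concentrations_nm[name] / kd_nm[name]
              for name in concentrations_nm}
    denominator = 1.0 + sum(ratios.values())
    return {name: ratio / denominator for name, ratio in ratios.items()}


# Hypothetical dissociation constants (NOT values inferred in this study).
KD_NM = {"STAT1": 1000.0, "STAT3": 1000.0}

# STAT concentrations as stated in the text.
baseline = site_occupancy({"STAT1": 500.0, "STAT3": 500.0}, KD_NM)
high_stat1 = site_occupancy({"STAT1": 5000.0, "STAT3": 500.0}, KD_NM)

for label, occupancy in (("baseline", baseline), ("10x STAT1", high_stat1)):
    print(label, {name: round(value, 3) for name, value in occupancy.items()})
```

In this toy setting, raising STAT1 increases its share of the site at the expense of STAT3, which is the qualitative signature of competition. Whether and how strongly this appears for a given receptor depends on the actual affinities and kinetics captured by the full models.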
While both HypIL-6 and IL-27 induced comparable levels of pSTAT3 in primed and non-primed Th-1 cells, levels of pSTAT1 induced by the two cytokines were significantly upregulated in primed Th-1 cells, resulting in a biased STAT1 response and confirming our model predictions (Figure 8d). We next investigated whether this biased STAT1 activation by HypIL-6 and IL-27 observed in IFNα2-primed Th-1 cells was also present in SLE patients. For that, we collected PBMCs from six SLE patients and five age-matched healthy controls and measured STAT1 and STAT3 expression, as well as pSTAT1 and pSTAT3 induction by HypIL-6 and IL-27 after 15 min treatments, in CD4+ T cells. Importantly, results were comparable to those obtained with IFN-primed Th-1 cells, with signaling biased towards pSTAT1 in CD4+ T cells from SLE patients stimulated with HypIL-6 and IL-27 (Figure 8e, Supp. Fig. 13b & c), further supporting the notion that STAT concentrations play a critical role in defining cytokine responses in autoimmune disorders. Our data show that STAT1 and STAT3 compete for phospho-Tyr motifs in GP130, with STAT3 having an advantage resulting from its tighter affinity to GP130. Finally, we asked whether crippling JAK activity by using sub-saturating doses of JAK inhibitors could differentially affect STAT1 and STAT3 activation by HypIL-6 and therefore rescue the altered cytokine responses found in SLE patients. To test this, RPE1 and Th-1 cells were stimulated with saturating concentrations of HypIL-6 while titrating the concentration of Tofacitinib, a clinically approved JAK inhibitor. Strikingly, Tofacitinib inhibited HypIL-6-induced pSTAT1 more efficiently than pSTAT3 in both RPE1 cells and Th-1 cells (Figure 8f). At a concentration of 50 nM, Tofacitinib inhibited pSTAT1 levels induced by HypIL-6 by 60%, while it inhibited pSTAT3 levels by only 30% (Figure 8f), an effect that we did not observe for IL-27 stimulation (Supp. Fig. 13d).
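To put the 60% versus 30% inhibition at 50 nM Tofacitinib into perspective, the following Python sketch converts a single-point inhibition value into an apparent IC50. It assumes a simple hyperbolic dose-response (Hill coefficient of 1, complete inhibition at saturating inhibitor), which is an assumption made here for illustration and not a model fitted in this study. The resulting numbers are back-of-the-envelope estimates, not measured IC50 values.

```python
def apparent_ic50(inhibitor_nm, fraction_inhibited):
    """Apparent IC50 from one inhibition measurement.

    Assumes f = [I] / ([I] + IC50), i.e. a Hill coefficient of 1 and
    full inhibition at saturating inhibitor, which rearranges to
    IC50 = [I] * (1 - f) / f.
    """
    if not 0.0 < fraction_inhibited < 1.0:
        raise ValueError("fraction_inhibited must lie strictly between 0 and 1")
    return inhibitor_nm * (1.0 - fraction_inhibited) / fraction_inhibited


# Inhibition values at 50 nM Tofacitinib as stated in the text.
ic50_pstat1 = apparent_ic50(50.0, 0.60)
ic50_pstat3 = apparent_ic50(50.0, 0.30)

print(f"apparent IC50 for pSTAT1: {ic50_pstat1:.1f} nM")
print(f"apparent IC50 for pSTAT3: {ic50_pstat3:.1f} nM")
print(f"pSTAT3/pSTAT1 ratio: {ic50_pstat3 / ic50_pstat1:.1f}")
```

Under these assumptions the apparent IC50 is roughly 33 nM for pSTAT1 and 117 nM for pSTAT3, a 3.5-fold difference in sensitivity. A fit to the full titration curve would be needed for reliable values.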
Overall, our results show that the changes in STAT concentrations found in autoimmune disorders shape cytokine signaling responses and could contribute to disease progression.

DISCUSSION

Cytokine pleiotropy is the ability of a cytokine to exert a wide range of biological responses in different cell types. This functional pleiotropy has made the study of cytokine biology extremely challenging given the strong cross-talk and shared usage of key components of their signaling pathways, leading to a high degree of signaling plasticity, yet still allowing functional selectivity (68, 69). Here we aimed to identify the underlying determinants that define cytokine functional selectivity by comparing IL-27 and IL-6 at multiple scales, ranging from cell surface receptors to proteomic changes. We show that IL-27 triggers a more sustained STAT1 phosphorylation than IL-6, via a high affinity STAT1/IL-27Ra interaction centered around Tyr613 on IL-27Ra. This in turn results in a more sustained IRF1 expression induced by IL-27, which leads to the upregulation of a second wave of gene expression unique to IL-27 and comprised of classical ISGs.
We go one step further and show that this strong receptor/STAT coupling is altered in autoimmune disorders, where STAT concentrations are often dysregulated. Increased expression of STAT1 in SLE patients biases HypIL-6 and IL-27 responses towards STAT1 activation, further contributing to the worsening of the disease. By using suboptimal doses of the JAK inhibitor Tofacitinib, we show that specific STAT proteins engaged by a given cytokine can be targeted. Overall, our study highlights a new layer of cytokine signaling regulation, whereby STAT affinity to specific cytokine receptor phospho-Tyr motifs controls STAT phosphorylation kinetics and the identity of the gene expression program engaged, ultimately ensuring the generation of functional diversity through the use of a limited set of signaling intermediaries. The tight coupling of one receptor subunit to one particular STAT that we have identified in our study is a rather unusual phenomenon for heterodimeric cytokine receptor complexes, and was first suggested by Owaki et al. (27). Generally, the entire signaling output driven by a cytokine-receptor complex emanates from a dominant receptor subunit, which carries several Tyr residues that can be phosphorylated (70, 71). This in turn results in competition between different STATs for binding to shared phospho-Tyr motifs in the dominant receptor chain, leading to different kinetics of STAT phosphorylation, as observed for IL-6 stimulation (15) (Figure 1b). Moreover, this localized signaling quantum allows phosphatases and feedback regulators, induced upon cytokine stimulation, to act in synergy to reset the system to its basal state, generating a very synchronous and coordinated signaling wave. Although very effective, this molecular paradigm has its limitations. STAT competition for the same pool of phospho-Tyr motifs makes the system very sensitive to changes in STAT concentration.
IFNγ-primed cells, which exhibit increased STAT1 levels, trigger an IFNγ-like STAT1 response upon IL-6 stimulation (16). The anti-inflammatory properties of IL-10 are lost in cells with high levels of STAT1 expression, as a result of a pro-inflammatory environment rich in IFNs (72). Indeed, we show that STAT1 transcript levels are increased in Crohn’s disease and SLE patients and that they contribute to altering IL-6 responses. Strikingly, IL-27 appears to have evolved away from this general model of cytokine signaling activation. Our results show that STAT1 activation by IL-27 is tightly coupled to IL-27Ra, while STAT3 activation by this cytokine mostly depends on GP130. This decoupled STAT1 and STAT3 activation by IL-27 is possible thanks to the presence of a putative high affinity STAT1 binding site on IL-27Ra that resembles the one present in IFNγR1 (41). As a result, IL-27 can trigger sustained and independent phosphorylation of both STAT1 and STAT3. This unique feature of IL-27 allows it to induce robust responses in dynamic immune environments. Indeed, our mathematical models of cytokine signaling and Bayesian inference, together with the experimental observations, show that changes in receptor concentration minimally affect pSTAT1/3 induced by IL-27, whereas they fundamentally alter IL-6 responses.
Overall, our data show that cytokine responses are versatile and adapt to the continuously changing cell proteome, highlighting the need to measure cytokine receptor and STAT expression levels, in addition to cytokine levels, in disease environments to better understand and predict altered responses elicited by dysregulated cytokines. In recent years, it has become apparent that the stability of the cytokine-receptor complex influences signaling identity by cytokines (73). Short-lived complexes activate less efficiently those STAT molecules that bind with low affinity to phospho-Tyr motifs in a given cytokine receptor (34). Our current results further support this kinetic discrimination mechanism for STAT activation. Our statistical inference identified differences in STAT recognition of the cytokine receptor phospho-Tyr motifs as one of the major determinants of STAT phosphorylation kinetics. This parameter alone was sufficient to explain the transient and sustained STAT1 phosphorylation induced by IL-6 and IL-27, respectively, without the need to invoke the action of phosphatases or negative feedback regulators such as SOCSs. Indeed, our results indicate that the rate of STAT1 dephosphorylation is similar between the IL-6 and IL-27 systems, suggesting that phosphatases do not contribute to these early kinetic differences. Moreover, blocking protein translation, and therefore the upregulation of negative feedback regulators by IL-6 treatment, did not result in a more sustained STAT1 phosphorylation by IL-6, again indicating that the transient kinetics of STAT1 phosphorylation by IL-6 is encoded at the receptor level and does not require further regulation. However, recent reports have found that the amplitude of STAT1 phosphorylation in response to IL-6 is regulated by levels of PTPN2 expression, suggesting that phosphatases can play additional roles in shaping IL-6 responses beyond controlling the kinetics of STAT activation (74).
STAT1 phosphorylation levels by IL-27, on the other hand, were significantly more sustained in the absence of protein translation, suggesting that negative feedback mechanisms are required to downmodulate signaling emanating from high affinity STAT-receptor interactions. Overall, our results suggest that while phosphatases and negative feedback regulators play an important role in maintaining cytokine signaling homeostasis (75), the kinetics of STAT activation appear to be already encoded at the level of receptor engagement, thus ensuring maximal efficiency and signal robustness. Cytokine signaling plasticity can occur at the level of receptor activation. In the past years, a scenario has emerged suggesting that the absolute number of signaling-active receptor complexes is a critical determinant for signal output integration. Accordingly, specific biological responses were shown to be tuned either by the abundance of cell surface receptors (76, 77) or by the level of receptor assembly (34, 38, 78). Here, we show for the first time IL-27-induced dimerization of IL-27Ra and GP130 at the cell surface of live cells, in good agreement with previous studies on heterodimeric cytokine receptor systems (38, 73). For IL-27, the receptor subunits IL-27Ra and GP130 can be expressed at different ratios, as seen for naïve vs. activated T-cells (79) as well as intestinal cells (80). On T-cells, particularly after activation, IL-27Ra is expressed in strong excess over GP130, rendering GP130 the limiting factor for receptor complex assembly (41). Interestingly, we observe that in addition to faster kinetics of STAT1 phosphorylation, HypIL-6 treatment induces a lower maximal amplitude of pSTAT1 activation in T cells. This is in stark contrast to our results in RPE1 cells, where a high abundance of GP130 (~3000-4000 copies of cell surface GP130) is found. In these cells both cytokines elicited similar amplitudes of STAT1 phosphorylation.
Our results suggest that surface receptor density, in synergy with the dynamics of STAT binding to phospho-Tyr motifs on cytokine receptors, acts to define the amplitude and kinetics of STAT activation in response to cytokine stimulation. The distinct STAT1 and STAT3 kinetic profiles induced by IL-6 and IL-27 are the prerequisite for time-correlated decoupling of genetic programs: a “shared GP130/STAT3-dependent wave” and an IL-27-“unique IL-27Ra/STAT1-dependent wave”. However, pSTAT1 levels induced by IL-27 at 3h were down to ~10% of maximal amplitude, suggesting that additional factors would be required to amplify the initial STAT1 response elicited by IL-27. We observed that IL-27 induces the expression of an early wave of classical STAT1-dependent genes, which is also shared by IL-6. However, while IL-27 induces the upregulation of these genes throughout the entire duration of the experiment, IL-6 only resulted in a transient spike. We reasoned that the additional factor required for IL-27 signal amplification would be among these early STAT1-dependent genes. Among this set of genes we found the transcription factor IRF1, which has been shown to act as a feedback amplifier for pSTAT1 activity (58). Importantly, IRF1 protein levels have been shown to be upregulated in response to IL-27 and IFNg but not to IL-6 stimulation in hepatocytes (81). IRF1 plays a key role in chromatin accessibility, which is critically required for IL-27-induced differentiation of Tr1 cells and subsequent IL-10 secretion (59).
Here, we show that the contribution of IRF1 to STAT1- but not STAT3-dependent genes is a generic feature of IL-27 signaling. This readily explains the significant transcriptomic overlap of IL-27 with type I (82) or type II interferons (15) after long-term stimulation with these cytokines. Along these lines, it is not surprising that IL-27, beyond its well-described effects on T-cell development, can also mount a considerable antiviral response as shown in hepatic cells and PBMCs (83, 84). Our results suggest that by modulating the kinetics of STAT phosphorylation, cytokines can modulate the expression of accessory transcription factors, such as IRF1, that act in synergy with STATs to fine-tune gene expression and provide functional diversity.

ACKNOWLEDGMENTS

We thank members of the Moraga, Molina-París, Piehler and Mitra laboratories for helpful advice and discussion. We thank G. Hikade and H. Kenneweg for technical support, C. P. Richter for providing software for single-molecule image analysis, R. Kurre (Integrated Bioimaging Facility Osnabrück) for support with fluorescence microscopy and the FingerPrints Proteomics facility (Dundee) for support with the mass spectrometry data. This work was supported by the StG, LS6, Wellcome-Trust-202323/Z/16/Z (IM EP), ERC-206-STG grant (IM JMF EP PKF), EMBO (SW 454–2017), DFG (SFB 944, P8/Z, JP), National Heart, Lung and Blood Institute (K22HL125593, MK) and Contrat de Plan Etat Région Hauts de France and Institut pour la Recherche sur le Cancer de Lille (SM SG).
CMP and GL were supported by H2020, QuanTII. PJ is supported by the EPSRC, AstraZeneca and Smith Institute (Smith Institute CASE studentship, award reference 1969354). Numerical work was undertaken on ARC3, which is part of the High Performance Computing facilities at the University of Leeds, UK.

COMPETING INTERESTS

The authors declare that they have no competing interests.

MATERIAL AND METHODS

Protein expression and purification: Murine IL-27 was cloned as a linker-connected single-chain variant (p28+EBI3) as described in (29). Human HyperIL-6 (HypIL-6) and murine single-chain IL-27 were cloned into the pAcGP67-A vector (BD Biosciences) in frame with an N-terminal gp67 signal sequence and a C-terminal hexahistidine tag, and produced using the baculovirus expression system, as described in (85). Baculovirus stocks were prepared by transfection and amplification in Spodoptera frugiperda (Sf9) cells grown in SF900II media (Invitrogen) and protein expression was carried out in suspension Trichoplusia ni (High Five) cells grown in InsectXpress media (Lonza). Purification was performed using the method described in (86). For IL-27, the cells were pelleted by centrifugation at 2000 rpm, prior to a precipitation step through addition of Tris pH 8.0, CaCl2 and NiCl2 to final concentrations of 200mM, 50mM and 1mM respectively. The precipitate formed was then removed by centrifugation at 6000 rpm.
Nickel-NTA agarose beads (Qiagen) were added and the target proteins purified through batch binding followed by column washing in HBS-Hi buffer (HBS buffer supplemented to 500mM NaCl and 5% glycerol, pH 7.2). Elution was performed using HBS-Hi buffer plus 200mM imidazole. Final purification was performed by size exclusion chromatography on an ENrich SEC 650 300 column (Biorad), again equilibrated in HBS-Hi. Concentration of the purified sample was carried out using 10kDa Millipore Amicon-Ultra spin concentrators. For HypIL-6, proteins were purified likewise, but in 10 mM HEPES (pH 7.2) containing 150 mM NaCl. Recombinant cytokines were purified to greater than 98% homogeneity. For cell surface labeling, the anti-GFP nanobodies (NB) “enhancer” and “minimizer” were used, which bind mEGFP with subnanomolar binding affinity (87). NB was cloned into pET-21a with an additional cysteine at the C-terminus for site-specific fluorophore conjugation in a 1:1 fluorophore:nanobody stoichiometry. Furthermore, a (PAS)5 sequence to increase protein stability and a His-tag for purification were fused at the C-terminus. Protein expression in E. coli Rosetta (DE3) and purification by immobilized metal ion affinity chromatography were carried out by standard protocols. Purified protein was dialyzed against HEPES pH 7.5 and reacted with a two-fold molar excess of DY647 maleimide (Dyomics), ATTO 643 maleimide (AT643) or ATTO Rho11 maleimide (Rho11) (ATTO-TEC GmbH), respectively. After 1 h, a 3-fold molar excess (with respect to the maleimide) of cysteine was added to quench excess dye. Protein aggregates and free dye were subsequently removed by size exclusion chromatography (SEC). A labeling degree of 0.9-1:1 fluorophore:protein was achieved as determined by UV/Vis spectrophotometry.
CD4+ T cell purification and Th-1 differentiation: Human buffy coats were obtained from the Scottish Blood Transfusion Service and peripheral blood mononuclear cells (PBMCs) of healthy donors were isolated from buffy coat samples by density gradient centrifugation according to manufacturer’s protocols (Lymphoprep, STEMCELL Technologies). From each donor, 100×10^6 PBMCs were used for isolation of CD4+ T-cells. Cells were decorated with anti-CD4-FITC antibodies (Biolegend, #357406) and isolated by magnetic separation according to manufacturer’s protocols (MACS Miltenyi) to a purity >98% CD4+. Freshly isolated resting CD4+ T cells (3×10^7 per donor) were activated under Th-1 polarizing conditions using ImmunoCult™ Human CD3/CD28 T Cell Activator (StemCell, Cat#10971) following manufacturer instructions for 3 days in RPMI-1640, 10% v/v FBS, 100 U/ml penicillin-streptomycin (Gibco) in the presence of IL-2 (Novartis, #709421, 20 ng/ml), anti-IL-4 antibody (10 ng/ml, BD Biosciences, #554481) and IL-12 (20 ng/ml, BioLegend, #573002). After three days of priming, cells were expanded for another 5 days in the presence of IL-2 (20 ng/ml).

Human SLE patient samples: This study was authorized by the French Competent Authority dealing with Research on Human Biological Samples, namely the French Ministry of Research. The Authorization number is ECH 19/04.
To issue such authorization, the Ministry of Research has sought the advice of an independent ethics committee, namely the “Comité de Protection des Personnes,” which voted positively, and all patients gave their written informed consent. Healthy volunteers were recruited to serve as healthy controls. Blood samples from healthy controls and patients were collected in heparinized tubes (BD Vacutainer 368886, BD Biosciences San Jose, CA, USA) and PBMC samples were isolated using Ficoll (Pancoll, Pan Biotech #P04-60500) density gradient centrifugation. The isolated PBMCs were washed with PBS and the remaining red blood cells were lysed using RBC lysis buffer (ACK lysing buffer, Gibco #A10492-01) with a 3 min incubation at room temperature. Cells were washed in PBS, resuspended in 1 ml of freezing medium (with DMSO, PAN Biotech, #P07-90050) and transferred to a cryotube. Cryotubes were placed in a freezing container (Nalgene) at -80°C and then transferred into a liquid nitrogen container for long-term storage.

Classification and demographic information about SLE patients and healthy controls: SLE patients were included if they fulfilled the American College of Rheumatology (ACR) Classification Criteria (Hochberg MC, Updating the American College of Rheumatology revised criteria for the classification of systemic lupus erythematosus; 88). Exclusion criteria were current intake of 10 mg or more of prednisone or equivalent and/or use of immunosuppressants within the previous 6 months before inclusion. Use of hydroxychloroquine was not an exclusion criterion. Patients were mostly in clinical remission, half with biological remission, half with persistent anti-native DNA autoantibodies. All SLE patients and healthy controls were females between 41 and 58 years old.
(Phospho-) Proteomics: For (phospho-) proteomic experiments, Th-1 cells from each donor were split into three different conditions after initial expansion: light SILAC media (40 mg/ml L-Lysine K0 (Sigma, #L8662) and 84 mg/ml L-Arginine R0 (Sigma, #A8094)), medium SILAC media (49 mg/ml L-Lysine U-13C6 K6 (CKGAS, #CLM-2247-0.25) and 103 mg/ml L-Arginine U-13C6 R6 (CKGAS, #CLM-2265-0.25)) and heavy SILAC media (49.7 mg/ml L-Lysine U-13C6,U-15N2 K8 (CKGAS, #CNLM-291-H-0.25) and 105.8 mg/ml L-Arginine U-13C6,U-15N2 R10 (CKGAS, #CNLM-539-H-0.25)) prepared in RPMI SILAC media (Thermo Scientific, #88365) supplemented with 10% dialyzed FBS (HyClone, #SH30079.03), 5 ml L-Glutamine (Invitrogen, #25030024), 5 ml Pen/Strep (Invitrogen, #15140122), 5 ml MEM vitamin solution (Thermo Scientific, #11120052), 5 ml Selenium-Transferrin-Insulin (Thermo Scientific, #41400045) and expanded in the presence of 20 ng/ml IL-2 and 10 ng/ml anti-IL4 for another 10 days in order to achieve complete labelling. Media was exchanged every two days. Incorporation of the medium and heavy versions of Lysine and Arginine was checked by mass spectrometry and samples with an incorporation greater than 95% were used. After expansion, cells were starved without IL-2 for 24 hours before stimulation with 10 nM IL-27 or 20 nM HypIL-6 for 15 minutes (phosphoproteomics) or 24 h (global proteomic changes). Cells were then washed three times in ice-cold PBS, mixed in a 1:1:1 ratio, resuspended in SDS-containing lysis buffer (1% SDS in 100 mM Triethylammonium Bicarbonate buffer (TEAB)) and incubated on ice for 10 min to ensure cell lysis. Then, cell lysates were centrifuged at 20000 g for 10 minutes at +4°C and the supernatant was transferred to a clean tube. Protein concentration was determined using the BCA Protein Assay Kit (Thermo, #23227), and 10 mg of protein per experiment was reduced with 10mM dithiothreitol (DTT, Sigma, #D0632) for 1 h at 55°C and alkylated with 20mM iodoacetamide (IAA, Sigma, #I6125) for 30 min at RT. Protein was then precipitated using six volumes of chilled (-20°C) acetone overnight. After precipitation, the protein pellet was resuspended in 1 ml of 100 mM TEAB and digested with Trypsin (1:100 w/w, Thermo, #90058) overnight at 37°C. Then, samples were cleared by centrifugation at 20000 g for 30 min at +4°C, and peptide concentration was quantified with the Quantitative Colorimetric Peptide Assay (Thermo, #23275). Phosphopeptide enrichment in the peptide fractions generated as described above was carried out using MagResyn Ti-IMAC following manufacturer instructions (2BScientific, MRTIM002).

High pH reverse phase fractionation for phosphoproteomics: Samples were dissolved in 200 μL of 10 mM ammonium formate buffer pH 9.5 and peptides were fractionated using high pH RP chromatography. A C18 column from Waters (XBridge peptide BEH, 130Å, 3.5 µm, 4.6 X 150 mm, Ireland) with a guard column (XBridge, C18, 3.5 µm, 4.6 X 20mm, Waters) was used on an Ultimate 3000 HPLC (Thermo-Scientific).
Buffers A and B used for fractionation consisted of 10 mM ammonium formate in milliQ water (Buffer A) and 10 mM ammonium formate in 90% acetonitrile (Buffer B), respectively; both buffers were adjusted to pH 9.5 with ammonia. Fractions were collected using a WPS-3000FC autosampler (Thermo-Scientific) at 1 min intervals. Column and guard column were equilibrated with 2% buffer B for 20 min at a constant flow rate of 0.8 ml/min and a constant temperature of 21°C. Samples (193 µl) were loaded onto the column at 0.8 ml/min, and the separation gradient ran from 2% buffer B to 8% B in 6 min, then from 8% B to 45% B within 54 min and finally from 45% B to 100% B in 5 min. The column was washed for 15 min at 100% buffer B and equilibrated at 2% buffer B for 20 min as mentioned above. The fraction collection started 1 min after injection and stopped after 80 min (total of 80 fractions, 800 µl each). Each peptide fraction was acidified immediately after elution from the column by adding 20 to 30 µl 10% formic acid to each tube in the autosampler. The total number of fractions concatenated was set to 10. The content of fractions from each set was dried prior to further analysis.

LC-MS/MS Analysis: LC-MS analysis was done at the FingerPrints Proteomics Facility (University of Dundee). Analysis of peptide readout was performed on a Q Exactive™ Plus Mass Spectrometer (Thermo Scientific) coupled with a Dionex Ultimate 3000 RS (Thermo Scientific). The LC buffers used were the following: buffer A (0.1% formic acid in Milli-Q water (v/v)) and buffer B (80% acetonitrile and 0.1% formic acid in Milli-Q water (v/v)). Dried fractions were resuspended in 35 µl of 1% formic acid and aliquots of 15 μL of each fraction were loaded at 10 μL/min onto a trap column (100 μm × 2 cm, PepMap nanoViper C18 column, 5 μm, 100 Å, Thermo Scientific) equilibrated in 0.1% TFA.
The trap column was washed for 5 min at the same flow rate with 0.1% TFA and then switched in-line with a Thermo Scientific resolving C18 column (75 μm × 50 cm, PepMap RSLC C18 column, 2 μm, 100 Å). The peptides were eluted from the column at a constant flow rate of 300 nl/min with a linear gradient from 2% buffer B to 5% buffer B in 5 min, then from 5% buffer B to 35% buffer B in 125 min, and finally from 35% buffer B to 98% buffer B in 2 min. The column was then washed with 98% buffer B for 20 min and re-equilibrated in 2% buffer B for 17 min. The column was kept at a constant temperature of 50°C. The Q Exactive Plus was operated in data dependent positive ionization mode. The source voltage was set to 2.5 kV and the capillary temperature was 250°C. A scan cycle comprised an MS1 scan (m/z range from 350-1600, ion injection time of 20 ms, resolution 70 000 and automatic gain control (AGC) 1×10^6) acquired in profile mode, followed by 15 sequential dependent MS2 scans (resolution 17500) of the most intense ions fulfilling predefined selection criteria (AGC 2×10^5, maximum ion injection time 100 ms, isolation window of 1.4 m/z, fixed first mass of 100 m/z, spectrum data type: centroid, intensity threshold 2×10^4, exclusion of unassigned, singly and >7 charged precursors, peptide match preferred, exclude isotopes on, dynamic exclusion time 45 s). The HCD collision energy was set to 27% of the normalized collision energy. Mass accuracy was checked before the start of sample analysis.
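The analytical LC gradient described above is a piecewise-linear programme, which can be written down compactly for checking run times or re-entering the method. The following is a minimal sketch only: the segment durations and buffer B percentages are taken from the text, while the names `SEGMENTS` and `percent_b` are hypothetical, and the drop from 98% to 2% buffer B at the start of re-equilibration is assumed to be instantaneous because the text gives no ramp time.

```python
# Gradient segments as (duration in min, %B at start, %B at end).
# Values are from the LC-MS/MS description; the names are illustrative.
SEGMENTS = [
    (5.0, 2.0, 5.0),     # 2% -> 5% buffer B
    (125.0, 5.0, 35.0),  # 5% -> 35% buffer B
    (2.0, 35.0, 98.0),   # 35% -> 98% buffer B
    (20.0, 98.0, 98.0),  # wash at 98% buffer B
    (17.0, 2.0, 2.0),    # re-equilibration at 2% (step change assumed)
]

TOTAL_RUN_MIN = sum(duration for duration, _, _ in SEGMENTS)


def percent_b(t_min):
    """Return the programmed %B at time t_min by linear interpolation."""
    if t_min < 0 or t_min > TOTAL_RUN_MIN:
        raise ValueError("time outside the programmed gradient")
    elapsed = 0.0
    for duration, start, end in SEGMENTS:
        if t_min <= elapsed + duration:
            fraction = (t_min - elapsed) / duration
            return start + fraction * (end - start)
        elapsed += duration
    return SEGMENTS[-1][2]


if __name__ == "__main__":
    for t in (0, 5, 67.5, 131, 140, 160):
        print(f"{t:6.1f} min: {percent_b(t):5.1f} %B")
```

Under these assumptions the programme spans 169 min per injection at 300 nl/min.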
Mass spectrometry data analysis: Q Exactive Plus Mass Spectrometer .RAW files were analyzed, and peptides and proteins quantified, using MaxQuant (89) with the built-in search engine Andromeda (90). All settings were left at their defaults, except for the minimal peptide length, which was set to 5, and the Andromeda search engine was configured for the UniProt Homo sapiens protein database (release date: 2018_09). Only peptide and protein ratios quantified in at least two out of the three replicates were considered, and the p-values were determined by Student’s t test and corrected for multiple testing using the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995).

Plasmid constructs: For single molecule fluorescence microscopy, a monomeric non-fluorescent (Y67F) variant of eGFP was N-terminally fused to GP130. This tag (mXFPm) was engineered to specifically bind anti-GFP nanobody “minimizer” (aGFP-miNB). This construct was inserted into a modified version of pSems-26 m (Covalys) using a signal peptide of Igk. The ORF was linked to a neomycin resistance cassette via an IRES site. A mXFPe-IL-27Ra construct was designed likewise but is recognized by aGFP nanobody “enhancer” (mXFPe). The chimeric construct mXFP-IL-27Ra (ECD & TMD)-GP130(ICD) was a fusion construct of IL-27Ra (aa 33-540) and GP130 (aa 645-918).

Cell lines and media: HeLa cells were grown in DMEM containing 10% v/v FBS, penicillin-streptomycin, and L-glutamine (2 mM). RPE1 cells were grown in DMEM/F12 containing 10% v/v FBS, penicillin-streptomycin, and L-glutamine (2 mM). RPE1 cells were stably transfected with mXFPe-IL-27Ra, mutants and the chimeric construct by the PEI method according to standard protocols. Using G418 selection (0.6 mg/ml), individual clones were selected, proliferated and characterized.
For comparing receptor cell surface expression levels of stable clones expressing variants of IL-27Ra, cells were detached using PBS+2mM EDTA, spun down (300g, 5 min) and incubated with “enhancer” aGFP-enNBDy647 (10 nM, 15 min on ice). After incubation, cells were washed with PBS and run on the cytometer.

Flow cytometry staining and antibodies: For measuring dose-response curves of STAT1/3 phosphorylation (either Th-1 cells or RPE1 clones), 96-well plates were prepared with 50µl of cell suspensions at 2×10^6 cells/ml/well for Th-1 and 2×10^5 cells/ml/well for RPE1. The latter were detached using Accutase (Sigma). Cells were stimulated with a set of different concentrations to obtain dose-response curves. To this end, cells were stimulated for 15 min at 37°C with the respective cytokines followed by PFA fixation (2%) for 15 min at RT. For kinetic experiments, cell suspensions were stimulated with a defined, saturating concentration of cytokines (10 nM IL-27, 20 nM HypIL-6, 100 nM wt-IL-6) in reverse order so that all cell suspensions were PFA-fixed (2%) simultaneously. For pSTAT1/3 kinetic experiments under JAK inhibition, Tofacitinib (2 μM, Stratech, #S2789-SEL) was added after 15 min of stimulation and cells were PFA-fixed in the correct order. After fixation (15 min at RT), cells were spun down at 300g for 6 min at 4°C. Cell pellets were resuspended and permeabilized in ice-cold methanol and kept for 30 min on ice. After permeabilization, cells were fluorescently barcoded according to (91).
In brief: using two NHS-dyes (PacificBlue, #10163, DyLight800, #46421, Thermo Scientific), individual wells were stained with a combination of different concentrations of these dyes. After barcoding, cells were pooled and stained with anti-pSTAT1-Alexa647 (Cell Signaling Technologies, #8009) and anti-pSTAT3-Alexa488 (Biolegend, #651006) at a 1:100 dilution in PBS+0.5%BSA for 1h at RT. T-cells were also stained with anti-CD8-AlexaFluor700 (1:120, Biolegend, #300920), anti-CD4-PE (1:120, Biolegend, #357404) and anti-CD3-BrilliantViolet510 (1:100, Biolegend, #300448). Cells were analyzed at the flow cytometer (Beckman Coulter, Cytoflex S) and individual cell populations were identified by their barcoding pattern. Mean fluorescence intensity (MFI) of pSTAT1-Alexa647 and pSTAT3-Alexa488 was measured for all individual cell populations. For measuring total STAT levels, methanol-permeabilized cells were stained with anti-STAT1-Alexa647 (1:70, Biolegend, #558560) or anti-STAT3-APC (1:50, Biolegend, #560392). For measuring total IRF1 levels, methanol-permeabilized cells were stained with anti-IRF1-Alexa647 (1:50, Biolegend, #14105). For measuring cell surface levels of GP130, cells were detached with Accutase (Sigma) and stained with anti-GP130-APC (1:100, Biolegend, #362006) for 1h on ice.

RNA Transcriptome Sequencing: Human Th-1 cells from three donors each (StemCell Technologies) were cultivated and stimulated as described above. Cells were washed in Hank’s balanced salt solution (HBSS, Gibco) and snap frozen for storage. RNA was isolated using the RNeasy Kit (Qiagen) according to manufacturer’s protocol. All RNA 260/280 ratios were above 1.9. Of each sample, 1 μg of RNA was used. Transcriptomic analysis was done by Novogene as follows. Sequencing libraries were generated using NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA) following manufacturer’s recommendations and index codes were added to attribute sequences to each sample.
Briefly, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. Fragmentation was carried out using divalent cations under elevated temperature in NEBNext First Strand Synthesis Reaction Buffer (5X). First strand cDNA was synthesized using random hexamer primer and M-MuLV Reverse Transcriptase (RNase H-). Second strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. After adenylation of 3’ ends of DNA fragments, NEBNext Adaptors with hairpin loop structure were ligated to prepare for hybridization. In order to select cDNA fragments of preferentially 150~200 bp in length, the library fragments were purified with the AMPure XP system (Beckman Coulter, Beverly, USA). Then 3 μl USER Enzyme (NEB, USA) was used with size-selected, adaptor-ligated cDNA at 37 °C for 15 min followed by 5 min at 95 °C before PCR. Then PCR was performed with Phusion High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. Finally, PCR products were purified (AMPure XP system) and library quality was assessed on the Agilent Bioanalyzer 2100 system.

RNA Sequencing Data Analysis: Primary data analysis for quality control, mapping to reference genome and quantification was conducted by Novogene as outlined below.

Quality control: Raw data (raw reads) in FASTQ format were first processed through in-house scripts.
In this step, clean data (clean reads) were obtained by removing reads containing adapter and poly-N sequences and reads with low quality from raw data. At the same time, Q20, Q30 and GC content of the clean data were calculated. All the downstream analyses were based on the clean data with high quality.

Mapping to reference genome: Reference genome and gene model annotation files were downloaded from genome website browser (NCBI/UCSC/Ensembl) directly. Paired-end clean reads were mapped to the reference genome using HISAT2 software. HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome. These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads.

Quantification: HTSeq was used to count the number of reads mapped to each gene, including known and novel genes. The RPKM of each gene was then calculated based on the length of the gene and the read count mapped to this gene. RPKM (Reads Per Kilobase of exon model per Million mapped reads) considers the effect of sequencing depth and gene length on the read count at the same time and is currently the most commonly used method for estimating gene expression levels. For each identified gene, the fold change was calculated as the ratio of cytokine stimulated/unstimulated expression levels within each donor and an unpaired, two-tailed t test was applied to calculate p values. Genes were considered to be significantly altered if: p value ≤ 0.05, and log2 fold change ≥ +1 or ≤ -1. Genes with an RPKM of less than 1 in two or more donors were excluded from analysis so as to remove genes with abundance near the detection limit. Genes without annotated function were also removed. Functional annotation of genes (KEGG pathways, GO terms) was done using the DAVID Bioinformatics Resource functional annotation tool (92, 93). The clustered heatmap was generated using the R Studio Pheatmap package.
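The quantification and filtering rules above can be illustrated with a short Python sketch. It is not the pipeline run by Novogene or the authors: the function names are hypothetical, the RPKM formula is the standard definition (reads × 10^9 / (gene length in bp × total mapped reads)), and the p value is taken as an input rather than recomputed, since the t test itself is not reproduced here.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(stimulated, unstimulated):
    """log2 ratio of stimulated over unstimulated expression (one donor)."""
    return math.log2(stimulated / unstimulated)


def passes_abundance_filter(rpkm_per_donor, min_rpkm=1.0):
    """Exclude genes with RPKM < 1 in two or more donors."""
    low = sum(1 for value in rpkm_per_donor if value < min_rpkm)
    return low < 2


def is_significant(p_value, log2_fc, p_cutoff=0.05, lfc_cutoff=1.0):
    """Thresholds from the text: p <= 0.05 and |log2 fold change| >= 1."""
    return p_value <= p_cutoff and abs(log2_fc) >= lfc_cutoff


if __name__ == "__main__":
    # Illustrative numbers only, not data from the study.
    expression = rpkm(read_count=500, gene_length_bp=2000,
                      total_mapped_reads=20_000_000)
    print(expression)
    print(is_significant(p_value=0.01, log2_fc=log2_fold_change(8.0, 2.0)))
```

Keeping the abundance filter separate from the significance test mirrors the order in the text, where low-abundance genes are excluded independently of their p value.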
siRNA-mediated knockdown of IRF1 in RPE1 cells: A set of four IRF1-siRNAs was purchased from Dharmacon and tested individually to determine the level of knockdown achieved. The siRNA providing the highest level of IRF1 knockdown (Horizon, LQ-011704-00-0005, siRNA #2: UGAACUCCCUGCCAGAUAU) was subsequently used in all experiments. RPE1-IL27Ra cells were plated in 6-well dishes (0.4×10^6 cells per well) and transfected the next day with IRF1-siRNA or control-GAPDH siRNA (Horizon, D-001830-10-05) (Dharmacon) using DharmaFect 1 transfection reagent (Dharmacon) following the manufacturer’s instructions for 24h. At different timepoints of IL-27 (2nM) or HypIL-6 (10nM) stimulation, samples were collected from one well each. Cells were trypsinized and each sample was spun down and pellets snap-frozen in liquid nitrogen for subsequent RNA isolation (90%) or PFA-fixed for total IRF1 staining (10%) by flow cytometry.

Real-time quantitative PCR: Cells were subjected to RNA isolation using the Qiagen RNeasy kit. RNA (100 ng) was reverse transcribed to complementary DNA (cDNA) using an iScript cDNA synthesis kit (BioRad, #1708890), which was used as template for quantitative PCR. PowerTrack™ SYBR Green Master Mix (Takara, #A46109) was used for the reaction with the following primers:

Target     Forward (For)             Reverse (Rev)             Size (bp)
b-actin    CATGTACGTTGCTATCCAGGC     CTCCTTAATGTCACGCACGAT     250
STAT1      CTAGTGGAGTGGAAGCGGAG      CACCACAAACGAGCTCTGAA      252
GBP5       TCCTCGGATTATTGCTCGGC      CCTTTGCGCTTCAGCCTTTT      309
OAS1       GAAGGCAGCTCACGAAACC       AGGCCTCAGCCTCTTGTG        114
SOCS3      GTCCCCCCAGAAGAGCCTATTA    TTGACGGTCTTCCGACAGAGAT    118

b-actin was used as the housekeeping gene for normalization. Each siRNA knockdown experiment was performed in three replicates, with each qPCR sample measured in two technical replicates.

Mathematical models and Bayesian inference: We developed two new mathematical models, making use of ordinary differential equations (ODEs), for the initial steps of cytokine-receptor binding, dimer formation and signal activation by HypIL-6 and IL-27, respectively; namely, a set of ODEs for the HypIL-6 system and a separate set of ODEs for the IL-27 system (see end of this section for the set of ODEs included in each model). These ODEs describe the rate of change of the concentration of each molecular species considered in the receptor-ligand systems (HypIL-6 and IL-27) over time. By solving these ODEs, a time-course for the concentration of total (free and bound) phosphorylated STAT1 and STAT3 can be obtained and compared to the experimental data (Supp. Fig. 5b & c). The HypIL-6 and IL-27 mathematical models differ in the reactions involved in the formation of the signaling dimer for each cytokine. Under stimulation with HypIL-6, two HypIL-6 bound GP130 monomers are required to form the homodimer (Supp. Fig. 3a), whereas under IL-27 stimulation, we assume that IL-27 binds to the IL-27Ra chain and not to GP130 (Supp. Fig. 3b), and hence the heterodimer is comprised of an IL-27 molecule bound to an IL-27Ra monomer and one GP130 chain. In the mathematical models, we assume that upon formation of the dimers (homo- or heterodimer), these receptor chains become immediately phosphorylated. The models do not consider JAK molecules explicitly. We assume that these molecules are constitutively bound to their corresponding receptor chains and that they phosphorylate immediately upon receptor phosphorylation (dimer formation). After the formation of the dimer, which we denote by $D_{6}$ or $D_{27}$ when formed by HypIL-6 or IL-27 respectively, the biochemical reactions included in each mathematical model are similar, and are summarized as follows. Table 1 provides a description of the rates for each reaction considered in each (and both) mathematical model(s). In what follows we assume mass action kinetics for all the reactions. A free cytoplasmic unphosphorylated STAT1 or STAT3 molecule can bind to either receptor chain in the dimer, provided that the intracellular tyrosine residue of the receptor in the dimer is free (Supp. Fig. 3c & d). The STAT1 or STAT3 molecule can subsequently dissociate from the receptor chain in the dimer or can become phosphorylated (with rate $q$) whilst bound to the dimer. We have assumed that the rate of STAT1 or STAT3 phosphorylation when bound does not depend on the STAT type (1 or 3) or on the receptor chain (Supp. Fig. 3c & d). Phosphorylated STAT1 (pSTAT1) and STAT3 (pSTAT3) molecules can dissociate from the dimer. Once free in the cytoplasm, they can then dephosphorylate (Supp. Fig. 3g). We have assumed that this rate of STAT dephosphorylation only depends on the concentration of the respective pSTAT type, free in the cytoplasm. We note that no allostery has been considered in the models and hence, phosphorylated and unphosphorylated STAT molecules dissociate from the receptor with the same rate (Supp. Fig. 3c & d).
Finally, any molecular species containing receptor molecules can be removed from the system, due to internalisation or degradation, via one of two hypothesised mechanisms (Supp. Fig. 3e & f):
• hypothesis 1 (H1): receptors (free or bound, phosphorylated or unphosphorylated) are internalised/degraded with a rate proportional to the concentration of the species in which they are contained, or
• hypothesis 2 (H2): receptors (free or bound, phosphorylated or unphosphorylated) are internalised/degraded with a rate proportional to the product of the concentration of the species in which they are contained and the sum of the concentrations of free cytoplasmic phosphorylated STAT1 and STAT3.

We note that hypothesis 1 assumes that receptor molecules (free or bound, phosphorylated or unphosphorylated) are being internalised/degraded as part of the natural cellular trafficking cycle. Hypothesis 2 is consistent with a potential feedback mechanism, whereby the free cytoplasmic pSTAT molecules would migrate to the nucleus and increase the production of negative feedback proteins, such as SOCS3, which down-regulate cytokine signaling. Thus, the internalisation/degradation rate of receptor molecules (free or bound, phosphorylated or unphosphorylated) under hypothesis 2 increases with the total amount of free cytoplasmic phosphorylated STAT1 and STAT3, to account for this surface receptor down-regulation. A depiction of the reactions in both the HypIL-6 and IL-27 mathematical models and under each hypothesis is given in Supp. Fig. 3, where panels a), c), e) and g) describe the HypIL-6 model and panels b), d), f) and g) describe the IL-27 model. In this figure, $i \in \{1,3\}$, so that the reactions shown can involve either STAT1 or STAT3. Above or below each reaction arrow is a symbol which represents the rate at which the reaction occurs (under the assumption of mass action kinetics).
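As a rough illustration of how the two hypotheses differ, the sketch below computes the removal (internalisation/degradation) flux of a single receptor-containing species under each hypothesis. The function and argument names are hypothetical, and the numerical values are placeholders chosen for illustration, not fitted parameters.

```python
def removal_flux(species, p_stat1, p_stat3, beta, gamma, hypothesis):
    """Internalisation/degradation flux (nM/s) of one receptor-containing species.

    species          -- concentration of the species (nM)
    p_stat1, p_stat3 -- free cytoplasmic pSTAT1 / pSTAT3 concentrations (nM)
    beta             -- rate constant used under hypothesis 1 (1/s)
    gamma            -- rate constant used under hypothesis 2 (1/(nM s))
    """
    if hypothesis == 1:
        # H1: flux proportional to the species concentration only.
        return beta * species
    if hypothesis == 2:
        # H2: flux also scales with the total free pSTAT1 + pSTAT3 (feedback).
        return gamma * (p_stat1 + p_stat3) * species
    raise ValueError("hypothesis must be 1 or 2")


# Placeholder values, for illustration only.
h1 = removal_flux(10.0, 0.0, 0.0, beta=1e-3, gamma=1e-4, hypothesis=1)
h2_early = removal_flux(10.0, 0.0, 0.0, beta=1e-3, gamma=1e-4, hypothesis=2)
h2_late = removal_flux(10.0, 30.0, 20.0, beta=1e-3, gamma=1e-4, hypothesis=2)
```

Under hypothesis 2 the flux is zero before any pSTAT has accumulated and grows as signaling proceeds, whereas under hypothesis 1 it is independent of the signaling state.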
The notation for the rate constants and initial concentrations in the models, along with their descriptions and units, is given in Table 1.

| Parameter | Description | Unit |
|---|---|---|
| $r_{1,6}^{+}$, $r_{1,27}^{+}$ | Rate of receptor-ligand binding | nM$^{-1}$ s$^{-1}$ |
| $r_{1,6}^{-}$, $r_{1,27}^{-}$ | Rate of receptor-ligand dissociation | s$^{-1}$ |
| $r_{2,6}^{+}$, $r_{2,27}^{+}$ | Rate of monomers binding to form a dimer | nM$^{-1}$ s$^{-1}$ |
| $r_{2,6}^{-}$, $r_{2,27}^{-}$ | Rate of dissociation of the dimer | s$^{-1}$ |
| $k_{ia}^{+}$ | Rate of STAT$i$ binding to GP130 | nM$^{-1}$ s$^{-1}$ |
| $k_{ib}^{+}$ | Rate of STAT$i$ binding to IL-27Rα | nM$^{-1}$ s$^{-1}$ |
| $k_{ia}^{-}$ | Rate of STAT$i$ dissociating from GP130 | s$^{-1}$ |
| $k_{ib}^{-}$ | Rate of STAT$i$ dissociating from IL-27Rα | s$^{-1}$ |
| $q$ | Rate of STAT phosphorylation on the dimer | s$^{-1}$ |
| $d_i$ | Rate of free pSTAT$i$ dephosphorylation | s$^{-1}$ |
| $\beta_6$, $\beta_{27}$ | Rate of receptor internalisation/degradation under hypothesis 1 | s$^{-1}$ |
| $\gamma_6$, $\gamma_{27}$ | Rate of receptor internalisation/degradation under hypothesis 2 | nM$^{-1}$ s$^{-1}$ |
| $[R_1(0)]$ | Initial concentration of GP130 | nM |
| $[R_2(0)]$ | Initial concentration of IL-27Rα | nM |
| $[S_i(0)]$ | Initial concentration of STAT$i$ | nM |

Table 1: Notation, definitions and units for the parameter values used in the mathematical models, where $i \in \{1,3\}$ so that STAT$i$ corresponds to STAT1 or STAT3.
The HypIL-6 mathematical model was formulated based on reactions involving the following species:
• $L_6$ = HypIL-6,
• $R_1$ = GP130,
• $C_1$ = GP130 - HypIL-6 monomer,
• $D_6$ = Phosphorylated GP130 - HypIL-6 - HypIL-6 - GP130 homodimer,
• $S_1$ = Unbound cytoplasmic unphosphorylated STAT1,
• $S_3$ = Unbound cytoplasmic unphosphorylated STAT3,
• $D_6 \cdot S_1$ = Dimer bound to STAT1,
• $D_6 \cdot S_3$ = Dimer bound to STAT3,
• $D_6 \cdot pS_1$ = Dimer bound to pSTAT1,
• $D_6 \cdot pS_3$ = Dimer bound to pSTAT3,
• $S_1 \cdot D_6 \cdot S_1$ = Dimer bound to two molecules of STAT1,
• $pS_1 \cdot D_6 \cdot S_1$ = Dimer bound to two molecules of STAT1, one of which is phosphorylated,
• $pS_1 \cdot D_6 \cdot pS_1$ = Dimer bound to two molecules of pSTAT1,
• $S_3 \cdot D_6 \cdot S_3$ = Dimer bound to two molecules of STAT3,
• $pS_3 \cdot D_6 \cdot S_3$ = Dimer bound to two molecules of STAT3, one of which is phosphorylated,
• $pS_3 \cdot D_6 \cdot pS_3$ = Dimer bound to two molecules of pSTAT3,
• $S_1 \cdot D_6 \cdot S_3$ = Dimer bound to one molecule of STAT1 and one of STAT3,
• $pS_1 \cdot D_6 \cdot S_3$ = Dimer bound to one molecule of pSTAT1 and one of STAT3,
• $S_1 \cdot D_6 \cdot pS_3$ = Dimer bound to one molecule of STAT1 and one of pSTAT3,
• $pS_1 \cdot D_6 \cdot pS_3$ = Dimer bound to one molecule of pSTAT1 and one of pSTAT3,
• $pS_1$ = Unbound cytoplasmic phosphorylated STAT1,
• $pS_3$ = Unbound cytoplasmic phosphorylated STAT3.

The initial reactions in the HypIL-6 signaling pathway can then be described by the ODEs (1) – (22), under the law of mass action, where the terms involving the parameter $\beta_6$ apply only to the model under hypothesis 1 and the terms involving the parameter $\gamma_6$ apply only to the model under hypothesis 2. Square brackets around a species denote the concentration of that species in nM, and "$\cdot$" denotes a bond between two molecules/species. The ODEs are valid for any time $t \geq 0$, but the time argument has been omitted from the species concentrations for ease of notation; for example, $[R_1] = [R_1](t)$ for all $t \geq 0$.
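To make the mass-action bookkeeping concrete, the sketch below integrates only the receptor-ligand binding and dimerisation part of the HypIL-6 model (the $[R_1]$, $[L_6]$, $[C_1]$ and $[D_6]$ equations with all STAT terms dropped, under hypothesis 1). It is a reduced illustration, not the full 22-equation system: the fixed-step Runge-Kutta integrator, the dictionary keys and all numerical values are assumptions made for this example.

```python
def rhs(state, p):
    """Rates of change (nM/s) of [R1], [L6], [C1], [D6]; hypothesis 1, no STATs."""
    r1, l6, c1, d6 = state
    binding = p["r1_plus"] * r1 * l6 - p["r1_minus"] * c1       # R1 + L6 <-> C1
    dimerisation = p["r2_plus"] * c1 * c1 - p["r2_minus"] * d6  # C1 + C1 <-> D6
    return (
        -binding - p["beta"] * r1,
        -binding,
        binding - 2.0 * dimerisation - p["beta"] * c1,  # two monomers per dimer
        dimerisation - p["beta"] * d6,
    )


def rk4_step(state, p, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))

    k1 = rhs(state, p)
    k2 = rhs(shifted(k1, 0.5 * dt), p)
    k3 = rhs(shifted(k2, 0.5 * dt), p)
    k4 = rhs(shifted(k3, dt), p)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, p, dt, n_steps):
    for _ in range(n_steps):
        state = rk4_step(state, p, dt)
    return state


# Placeholder rate constants; beta = 0 switches off internalisation/degradation.
params = {"r1_plus": 1e-3, "r1_minus": 1e-4,
          "r2_plus": 1e-1, "r2_minus": 1e-2, "beta": 0.0}
initial = (12.7, 10.0, 0.0, 0.0)  # [R1], [L6], [C1], [D6] in nM
r1, l6, c1, d6 = simulate(initial, params, dt=0.1, n_steps=6000)
total_gp130 = r1 + c1 + 2.0 * d6
```

With `beta` set to zero, total GP130 ($[R_1] + [C_1] + 2[D_6]$) is conserved, which is a convenient sanity check on the factor of 2 in the dimerisation terms.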
$$\frac{d[R_1]}{dt} = -r_{1,6}^{+}[R_1][L_6] + r_{1,6}^{-}[C_1] - \beta_6[R_1] - \gamma_6([pS_1]+[pS_3])[R_1] \tag{1}$$

$$\frac{d[L_6]}{dt} = -r_{1,6}^{+}[R_1][L_6] + r_{1,6}^{-}[C_1] \tag{2}$$

$$\frac{d[C_1]}{dt} = r_{1,6}^{+}[R_1][L_6] - r_{1,6}^{-}[C_1] - 2r_{2,6}^{+}[C_1]^2 + 2r_{2,6}^{-}[D_6] - \beta_6[C_1] - \gamma_6([pS_1]+[pS_3])[C_1] \tag{3}$$

$$\begin{aligned}\frac{d[D_6]}{dt} ={}& r_{2,6}^{+}[C_1]^2 - r_{2,6}^{-}[D_6] - 2k_{1a}^{+}[D_6][S_1] + k_{1a}^{-}([D_6 \cdot S_1] + [D_6 \cdot pS_1]) \\ & - 2k_{3a}^{+}[D_6][S_3] + k_{3a}^{-}([D_6 \cdot S_3] + [D_6 \cdot pS_3]) - \beta_6[D_6] - \gamma_6([pS_1]+[pS_3])[D_6]\end{aligned} \tag{4}$$

$$\begin{aligned}\frac{d[S_1]}{dt} ={}& -k_{1a}^{+}[S_1](2[D_6] + [D_6 \cdot S_1] + [D_6 \cdot S_3] + [D_6 \cdot pS_1] + [D_6 \cdot pS_3]) \\ & + k_{1a}^{-}([D_6 \cdot S_1] + 2[S_1 \cdot D_6 \cdot S_1] + [S_1 \cdot D_6 \cdot S_3] + [S_1 \cdot D_6 \cdot pS_1] + [S_1 \cdot D_6 \cdot pS_3]) + d_1[pS_1]\end{aligned} \tag{5}$$

$$\begin{aligned}\frac{d[S_3]}{dt} ={}& -k_{3a}^{+}[S_3](2[D_6] + [D_6 \cdot S_3] + [D_6 \cdot S_1] + [D_6 \cdot pS_3] + [D_6 \cdot pS_1]) \\ & + k_{3a}^{-}([D_6 \cdot S_3] + 2[S_3 \cdot D_6 \cdot S_3] + [S_1 \cdot D_6 \cdot S_3] + [S_3 \cdot D_6 \cdot pS_3] + [pS_1 \cdot D_6 \cdot S_3]) + d_3[pS_3]\end{aligned} \tag{6}$$

$$\begin{aligned}\frac{d[D_6 \cdot S_1]}{dt} ={}& 2k_{1a}^{+}[S_1][D_6] - k_{1a}^{-}[D_6 \cdot S_1] - k_{1a}^{+}[D_6 \cdot S_1][S_1] + 2k_{1a}^{-}[S_1 \cdot D_6 \cdot S_1] \\ & - k_{3a}^{+}[D_6 \cdot S_1][S_3] + k_{3a}^{-}[S_1 \cdot D_6 \cdot S_3] - q[D_6 \cdot S_1] + k_{1a}^{-}[S_1 \cdot D_6 \cdot pS_1] \\ & + k_{3a}^{-}[S_1 \cdot D_6 \cdot pS_3] - \beta_6[D_6 \cdot S_1] - \gamma_6([pS_1]+[pS_3])[D_6 \cdot S_1]\end{aligned} \tag{7}$$

$$\begin{aligned}\frac{d[D_6 \cdot S_3]}{dt} ={}& 2k_{3a}^{+}[S_3][D_6] - k_{3a}^{-}[D_6 \cdot S_3] - k_{3a}^{+}[D_6 \cdot S_3][S_3] + 2k_{3a}^{-}[S_3 \cdot D_6 \cdot S_3] \\ & - k_{1a}^{+}[D_6 \cdot S_3][S_1] + k_{1a}^{-}[S_1 \cdot D_6 \cdot S_3] - q[D_6 \cdot S_3] + k_{1a}^{-}[pS_1 \cdot D_6 \cdot S_3] \\ & + k_{3a}^{-}[S_3 \cdot D_6 \cdot pS_3] - \beta_6[D_6 \cdot S_3] - \gamma_6([pS_1]+[pS_3])[D_6 \cdot S_3]\end{aligned} \tag{8}$$

$$\begin{aligned}\frac{d[D_6 \cdot pS_1]}{dt} ={}& -k_{1a}^{+}[S_1][D_6 \cdot pS_1] + k_{1a}^{-}[S_1 \cdot D_6 \cdot pS_1] - k_{3a}^{+}[S_3][D_6 \cdot pS_1] + k_{3a}^{-}[pS_1 \cdot D_6 \cdot S_3] \\ & + q[D_6 \cdot S_1] - k_{1a}^{-}[D_6 \cdot pS_1] + 2k_{1a}^{-}[pS_1 \cdot D_6 \cdot pS_1] + k_{3a}^{-}[pS_1 \cdot D_6 \cdot pS_3] \\ & - \beta_6[D_6 \cdot pS_1] - \gamma_6([pS_1]+[pS_3])[D_6 \cdot pS_1]\end{aligned} \tag{9}$$

$$\begin{aligned}\frac{d[D_6 \cdot pS_3]}{dt} ={}& -k_{3a}^{+}[S_3][D_6 \cdot pS_3] + k_{3a}^{-}[S_3 \cdot D_6 \cdot pS_3] - k_{1a}^{+}[S_1][D_6 \cdot pS_3] + k_{1a}^{-}[S_1 \cdot D_6 \cdot pS_3] \\ & + q[D_6 \cdot S_3] - k_{3a}^{-}[D_6 \cdot pS_3] + 2k_{3a}^{-}[pS_3 \cdot D_6 \cdot pS_3] + k_{1a}^{-}[pS_1 \cdot D_6 \cdot pS_3] \\ & - \beta_6[D_6 \cdot pS_3] - \gamma_6([pS_1]+[pS_3])[D_6 \cdot pS_3]\end{aligned} \tag{10}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_6 \cdot S_1]}{dt} ={}& k_{1a}^{+}[S_1][D_6 \cdot S_1] - 2k_{1a}^{-}[S_1 \cdot D_6 \cdot S_1] - 2q[S_1 \cdot D_6 \cdot S_1] \\ & - \beta_6[S_1 \cdot D_6 \cdot S_1] - \gamma_6([pS_1]+[pS_3])[S_1 \cdot D_6 \cdot S_1]\end{aligned} \tag{11}$$

$$\begin{aligned}\frac{d[S_3 \cdot D_6 \cdot S_3]}{dt} ={}& k_{3a}^{+}[S_3][D_6 \cdot S_3] - 2k_{3a}^{-}[S_3 \cdot D_6 \cdot S_3] - 2q[S_3 \cdot D_6 \cdot S_3] \\ & - \beta_6[S_3 \cdot D_6 \cdot S_3] - \gamma_6([pS_1]+[pS_3])[S_3 \cdot D_6 \cdot S_3]\end{aligned} \tag{12}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_6 \cdot S_1]}{dt} ={}& k_{1a}^{+}[pS_1 \cdot D_6][S_1] - 2k_{1a}^{-}[pS_1 \cdot D_6 \cdot S_1] + 2q[S_1 \cdot D_6 \cdot S_1] - q[pS_1 \cdot D_6 \cdot S_1] \\ & - \beta_6[pS_1 \cdot D_6 \cdot S_1] - \gamma_6([pS_1]+[pS_3])[pS_1 \cdot D_6 \cdot S_1]\end{aligned} \tag{13}$$

$$\begin{aligned}\frac{d[pS_3 \cdot D_6 \cdot S_3]}{dt} ={}& k_{3a}^{+}[pS_3 \cdot D_6][S_3] - 2k_{3a}^{-}[pS_3 \cdot D_6 \cdot S_3] + 2q[S_3 \cdot D_6 \cdot S_3] - q[pS_3 \cdot D_6 \cdot S_3] \\ & - \beta_6[pS_3 \cdot D_6 \cdot S_3] - \gamma_6([pS_1]+[pS_3])[pS_3 \cdot D_6 \cdot S_3]\end{aligned} \tag{14}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_6 \cdot pS_1]}{dt} ={}& q[pS_1 \cdot D_6 \cdot S_1] - 2k_{1a}^{-}[pS_1 \cdot D_6 \cdot pS_1] \\ & - \beta_6[pS_1 \cdot D_6 \cdot pS_1] - \gamma_6([pS_1]+[pS_3])[pS_1 \cdot D_6 \cdot pS_1]\end{aligned} \tag{15}$$

$$\begin{aligned}\frac{d[pS_3 \cdot D_6 \cdot pS_3]}{dt} ={}& q[pS_3 \cdot D_6 \cdot S_3] - 2k_{3a}^{-}[pS_3 \cdot D_6 \cdot pS_3] \\ & - \beta_6[pS_3 \cdot D_6 \cdot pS_3] - \gamma_6([pS_1]+[pS_3])[pS_3 \cdot D_6 \cdot pS_3]\end{aligned} \tag{16}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_6 \cdot S_3]}{dt} ={}& k_{1a}^{+}[S_1][D_6 \cdot S_3] - k_{1a}^{-}[S_1 \cdot D_6 \cdot S_3] + k_{3a}^{+}[S_1 \cdot D_6][S_3] - k_{3a}^{-}[S_1 \cdot D_6 \cdot S_3] \\ & - 2q[S_1 \cdot D_6 \cdot S_3] - \beta_6[S_1 \cdot D_6 \cdot S_3] - \gamma_6([pS_1]+[pS_3])[S_1 \cdot D_6 \cdot S_3]\end{aligned} \tag{17}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_6 \cdot S_3]}{dt} ={}& q[S_1 \cdot D_6 \cdot S_3] + k_{3a}^{+}[pS_1 \cdot D_6][S_3] - k_{3a}^{-}[pS_1 \cdot D_6 \cdot S_3] - q[pS_1 \cdot D_6 \cdot S_3] \\ & - k_{1a}^{-}[pS_1 \cdot D_6 \cdot S_3] - \beta_6[pS_1 \cdot D_6 \cdot S_3] - \gamma_6([pS_1]+[pS_3])[pS_1 \cdot D_6 \cdot S_3]\end{aligned} \tag{18}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_6 \cdot pS_3]}{dt} ={}& q[S_1 \cdot D_6 \cdot S_3] + k_{1a}^{+}[S_1][D_6 \cdot pS_3] - k_{1a}^{-}[S_1 \cdot D_6 \cdot pS_3] - q[S_1 \cdot D_6 \cdot pS_3] \\ & - k_{3a}^{-}[S_1 \cdot D_6 \cdot pS_3] - \beta_6[S_1 \cdot D_6 \cdot pS_3] - \gamma_6([pS_1]+[pS_3])[S_1 \cdot D_6 \cdot pS_3]\end{aligned} \tag{19}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_6 \cdot pS_3]}{dt} ={}& q([S_1 \cdot D_6 \cdot pS_3] + [pS_1 \cdot D_6 \cdot S_3]) - [pS_1 \cdot D_6 \cdot pS_3](k_{1a}^{-} + k_{3a}^{-}) \\ & - \beta_6[pS_1 \cdot D_6 \cdot pS_3] - \gamma_6([pS_1]+[pS_3])[pS_1 \cdot D_6 \cdot pS_3]\end{aligned} \tag{20}$$

$$\frac{d[pS_1]}{dt} = k_{1a}^{-}([D_6 \cdot pS_1] + [S_1 \cdot D_6 \cdot pS_1] + [S_3 \cdot D_6 \cdot pS_1] + [pS_3 \cdot D_6 \cdot pS_1] + 2[pS_1 \cdot D_6 \cdot pS_1]) - d_1[pS_1] \tag{21}$$

$$\frac{d[pS_3]}{dt} = k_{3a}^{-}([D_6 \cdot pS_3] + [S_3 \cdot D_6 \cdot pS_3] + [S_1 \cdot D_6 \cdot pS_3] + [pS_1 \cdot D_6 \cdot pS_3] + 2[pS_3 \cdot D_6 \cdot pS_3]) - d_3[pS_3] \tag{22}$$

Similarly, and with some species in common with the HypIL-6 model, the IL-27 model has been formulated based on reactions involving the following species:
• $L_{27}$ = IL-27,
• $R_1$ = GP130,
• $R_2$ = IL-27Rα,
• $C_2$ = IL-27Rα - IL-27 monomer,
• $D_{27}$ = Phosphorylated IL-27Rα - IL-27 - GP130 heterodimer,
• $S_1$ = Unbound cytoplasmic unphosphorylated STAT1,
• $S_3$ = Unbound cytoplasmic unphosphorylated STAT3,
• $S_1 \cdot D_{27}$ = Dimer bound to STAT1 via $R_1$,
• $S_3 \cdot D_{27}$ = Dimer bound to STAT3 via $R_1$,
• $pS_1 \cdot D_{27}$ = Dimer bound to pSTAT1 via $R_1$,
• $pS_3 \cdot D_{27}$ = Dimer bound to pSTAT3 via $R_1$,
• $D_{27} \cdot S_1$ = Dimer bound to STAT1 via $R_2$,
• $D_{27} \cdot S_3$ = Dimer bound to STAT3 via $R_2$,
• $D_{27} \cdot pS_1$ = Dimer bound to pSTAT1 via $R_2$,
• $D_{27} \cdot pS_3$ = Dimer bound to pSTAT3 via $R_2$,
• $S_1 \cdot D_{27} \cdot S_1$ = Dimer bound to two molecules of STAT1,
• $pS_1 \cdot D_{27} \cdot S_1$ = Dimer bound to two molecules of STAT1, one of them phosphorylated on $R_1$,
• $S_1 \cdot D_{27} \cdot pS_1$ = Dimer bound to two molecules of STAT1, one of them phosphorylated on $R_2$,
• $pS_1 \cdot D_{27} \cdot pS_1$ = Dimer bound to two molecules of pSTAT1,
• $S_3 \cdot D_{27} \cdot S_3$ = Dimer bound to two molecules of STAT3,
• $pS_3 \cdot D_{27} \cdot S_3$ = Dimer bound to two molecules of STAT3, one of them phosphorylated on $R_1$,
• $S_3 \cdot D_{27} \cdot pS_3$ = Dimer bound to two molecules of STAT3, one of them phosphorylated on $R_2$,
• $pS_3 \cdot D_{27} \cdot pS_3$ = Dimer bound to two molecules of pSTAT3,
• $S_1 \cdot D_{27} \cdot S_3$ = Dimer bound to STAT1 via $R_1$ and STAT3 via $R_2$,
• $S_3 \cdot D_{27} \cdot S_1$ = Dimer bound to STAT1 via $R_2$ and STAT3 via $R_1$,
• $pS_1 \cdot D_{27} \cdot S_3$ = Dimer bound to pSTAT1 via $R_1$ and STAT3 via $R_2$,
• $S_3 \cdot D_{27} \cdot pS_1$ = Dimer bound to pSTAT1 via $R_2$ and STAT3 via $R_1$,
• $S_1 \cdot D_{27} \cdot pS_3$ = Dimer bound to STAT1 via $R_1$ and pSTAT3 via $R_2$,
• $pS_3 \cdot D_{27} \cdot S_1$ = Dimer bound to STAT1 via $R_2$ and pSTAT3 via $R_1$,
• $pS_1 \cdot D_{27} \cdot pS_3$ = Dimer bound to pSTAT1 via $R_1$ and pSTAT3 via $R_2$,
• $pS_3 \cdot D_{27} \cdot pS_1$ = Dimer bound to pSTAT3 via $R_1$ and pSTAT1 via $R_2$,
• $pS_1$ = Unbound cytoplasmic phosphorylated STAT1,
• $pS_3$ = Unbound cytoplasmic phosphorylated STAT3.

Again, under the law of mass action, the initial reactions in the IL-27 signaling pathway can be described by the ODEs (23) – (55).
$$\frac{d[R_1]}{dt} = -r_{2,27}^{+}[C_2][R_1] + r_{2,27}^{-}[D_{27}] - \beta_{27}[R_1] - \gamma_{27}([pS_1]+[pS_3])[R_1] \tag{23}$$

$$\frac{d[R_2]}{dt} = -r_{1,27}^{+}[R_2][L_{27}] + r_{1,27}^{-}[C_2] - \beta_{27}[R_2] - \gamma_{27}([pS_1]+[pS_3])[R_2] \tag{24}$$

$$\frac{d[L_{27}]}{dt} = -r_{1,27}^{+}[R_2][L_{27}] + r_{1,27}^{-}[C_2] \tag{25}$$

$$\frac{d[C_2]}{dt} = r_{1,27}^{+}[R_2][L_{27}] - r_{1,27}^{-}[C_2] - r_{2,27}^{+}[C_2][R_1] + r_{2,27}^{-}[D_{27}] - \beta_{27}[C_2] - \gamma_{27}([pS_1]+[pS_3])[C_2] \tag{26}$$

$$\begin{aligned}\frac{d[D_{27}]}{dt} ={}& r_{2,27}^{+}[C_2][R_1] - r_{2,27}^{-}[D_{27}] - (k_{1a}^{+} + k_{1b}^{+})[D_{27}][S_1] + k_{1a}^{-}([S_1 \cdot D_{27}] + [pS_1 \cdot D_{27}]) \\ & + k_{1b}^{-}([D_{27} \cdot S_1] + [D_{27} \cdot pS_1]) - (k_{3a}^{+} + k_{3b}^{+})[D_{27}][S_3] + k_{3a}^{-}([S_3 \cdot D_{27}] + [pS_3 \cdot D_{27}]) \\ & + k_{3b}^{-}([D_{27} \cdot S_3] + [D_{27} \cdot pS_3]) - \beta_{27}[D_{27}] - \gamma_{27}([pS_1]+[pS_3])[D_{27}]\end{aligned} \tag{27}$$

$$\begin{aligned}\frac{d[S_1]}{dt} ={}& -k_{1a}^{+}[S_1]([D_{27}] + [D_{27} \cdot S_1] + [D_{27} \cdot pS_1] + [D_{27} \cdot S_3] + [D_{27} \cdot pS_3]) \\ & + k_{1a}^{-}([S_1 \cdot D_{27}] + [S_1 \cdot D_{27} \cdot S_1] + [S_1 \cdot D_{27} \cdot pS_1] + [S_1 \cdot D_{27} \cdot S_3] + [S_1 \cdot D_{27} \cdot pS_3]) \\ & - k_{1b}^{+}[S_1]([D_{27}] + [S_1 \cdot D_{27}] + [pS_1 \cdot D_{27}] + [S_3 \cdot D_{27}] + [pS_3 \cdot D_{27}]) \\ & + k_{1b}^{-}([D_{27} \cdot S_1] + [S_1 \cdot D_{27} \cdot S_1] + [pS_1 \cdot D_{27} \cdot S_1] + [S_3 \cdot D_{27} \cdot S_1] + [pS_3 \cdot D_{27} \cdot S_1]) + d_1[pS_1]\end{aligned} \tag{28}$$

$$\begin{aligned}\frac{d[S_3]}{dt} ={}& -k_{3a}^{+}[S_3]([D_{27}] + [D_{27} \cdot S_1] + [D_{27} \cdot pS_1] + [D_{27} \cdot S_3] + [D_{27} \cdot pS_3]) \\ & + k_{3a}^{-}([S_3 \cdot D_{27}] + [S_3 \cdot D_{27} \cdot S_1] + [S_3 \cdot D_{27} \cdot pS_1] + [S_3 \cdot D_{27} \cdot S_3] + [S_3 \cdot D_{27} \cdot pS_3]) \\ & - k_{3b}^{+}[S_3]([D_{27}] + [S_1 \cdot D_{27}] + [pS_1 \cdot D_{27}] + [S_3 \cdot D_{27}] + [pS_3 \cdot D_{27}]) \\ & + k_{3b}^{-}([D_{27} \cdot S_3] + [S_1 \cdot D_{27} \cdot S_3] + [pS_1 \cdot D_{27} \cdot S_3] + [S_3 \cdot D_{27} \cdot S_3] + [pS_3 \cdot D_{27} \cdot S_3]) + d_3[pS_3]\end{aligned} \tag{29}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_{27}]}{dt} ={}& k_{1a}^{+}[S_1][D_{27}] - k_{1a}^{-}[S_1 \cdot D_{27}] - q[S_1 \cdot D_{27}] - k_{1b}^{+}[S_1][S_1 \cdot D_{27}] + k_{1b}^{-}[S_1 \cdot D_{27} \cdot S_1] \\ & - k_{3b}^{+}[S_3][S_1 \cdot D_{27}] + k_{3b}^{-}[S_1 \cdot D_{27} \cdot S_3] + k_{1b}^{-}[S_1 \cdot D_{27} \cdot pS_1] + k_{3b}^{-}[S_1 \cdot D_{27} \cdot pS_3] \\ & - \beta_{27}[S_1 \cdot D_{27}] - \gamma_{27}([pS_1]+[pS_3])[S_1 \cdot D_{27}]\end{aligned} \tag{30}$$

$$\begin{aligned}\frac{d[D_{27} \cdot S_1]}{dt} ={}& k_{1b}^{+}[S_1][D_{27}] - k_{1b}^{-}[D_{27} \cdot S_1] - q[D_{27} \cdot S_1] - k_{1a}^{+}[S_1][D_{27} \cdot S_1] + k_{1a}^{-}[S_1 \cdot D_{27} \cdot S_1] \\ & - k_{3a}^{+}[S_3][D_{27} \cdot S_1] + k_{3a}^{-}[S_3 \cdot D_{27} \cdot S_1] + k_{1a}^{-}[pS_1 \cdot D_{27} \cdot S_1] + k_{3a}^{-}[pS_3 \cdot D_{27} \cdot S_1] \\ & - \beta_{27}[D_{27} \cdot S_1] - \gamma_{27}([pS_1]+[pS_3])[D_{27} \cdot S_1]\end{aligned} \tag{31}$$

$$\begin{aligned}\frac{d[S_3 \cdot D_{27}]}{dt} ={}& k_{3a}^{+}[S_3][D_{27}] - k_{3a}^{-}[S_3 \cdot D_{27}] - q[S_3 \cdot D_{27}] - k_{3b}^{+}[S_3][S_3 \cdot D_{27}] + k_{3b}^{-}[S_3 \cdot D_{27} \cdot S_3] \\ & - k_{1b}^{+}[S_1][S_3 \cdot D_{27}] + k_{1b}^{-}[S_3 \cdot D_{27} \cdot S_1] + k_{3b}^{-}[S_3 \cdot D_{27} \cdot pS_3] + k_{1b}^{-}[S_3 \cdot D_{27} \cdot pS_1] \\ & - \beta_{27}[S_3 \cdot D_{27}] - \gamma_{27}([pS_1]+[pS_3])[S_3 \cdot D_{27}]\end{aligned} \tag{32}$$

$$\begin{aligned}\frac{d[D_{27} \cdot S_3]}{dt} ={}& k_{3b}^{+}[S_3][D_{27}] - k_{3b}^{-}[D_{27} \cdot S_3] - q[D_{27} \cdot S_3] - k_{3a}^{+}[S_3][D_{27} \cdot S_3] + k_{3a}^{-}[S_3 \cdot D_{27} \cdot S_3] \\ & - k_{1a}^{+}[S_1][D_{27} \cdot S_3] + k_{1a}^{-}[S_1 \cdot D_{27} \cdot S_3] + k_{3a}^{-}[pS_3 \cdot D_{27} \cdot S_3] + k_{1a}^{-}[pS_1 \cdot D_{27} \cdot S_3] \\ & - \beta_{27}[D_{27} \cdot S_3] - \gamma_{27}([pS_1]+[pS_3])[D_{27} \cdot S_3]\end{aligned} \tag{33}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_{27}]}{dt} ={}& -k_{1b}^{+}[pS_1 \cdot D_{27}][S_1] + k_{1b}^{-}[pS_1 \cdot D_{27} \cdot S_1] - k_{3b}^{+}[pS_1 \cdot D_{27}][S_3] + k_{3b}^{-}[pS_1 \cdot D_{27} \cdot S_3] \\ & + q[S_1 \cdot D_{27}] - k_{1a}^{-}[pS_1 \cdot D_{27}] + k_{1b}^{-}[pS_1 \cdot D_{27} \cdot pS_1] + k_{3b}^{-}[pS_1 \cdot D_{27} \cdot pS_3] \\ & - \beta_{27}[pS_1 \cdot D_{27}] - \gamma_{27}([pS_1]+[pS_3])[pS_1 \cdot D_{27}]\end{aligned} \tag{34}$$

$$\begin{aligned}\frac{d[D_{27} \cdot pS_1]}{dt} ={}& -k_{1a}^{+}[D_{27} \cdot pS_1][S_1] + k_{1a}^{-}[S_1 \cdot D_{27} \cdot pS_1] - k_{3a}^{+}[D_{27} \cdot pS_1][S_3] + k_{3a}^{-}[S_3 \cdot D_{27} \cdot pS_1] \\ & + q[D_{27} \cdot S_1] - k_{1b}^{-}[D_{27} \cdot pS_1] + k_{1a}^{-}[pS_1 \cdot D_{27} \cdot pS_1] + k_{3a}^{-}[pS_3 \cdot D_{27} \cdot pS_1] \\ & - \beta_{27}[D_{27} \cdot pS_1] - \gamma_{27}([pS_1]+[pS_3])[D_{27} \cdot pS_1]\end{aligned} \tag{35}$$

$$\begin{aligned}\frac{d[pS_3 \cdot D_{27}]}{dt} ={}& -k_{3b}^{+}[pS_3 \cdot D_{27}][S_3] + k_{3b}^{-}[pS_3 \cdot D_{27} \cdot S_3] - k_{1b}^{+}[pS_3 \cdot D_{27}][S_1] + k_{1b}^{-}[pS_3 \cdot D_{27} \cdot S_1] \\ & + q[S_3 \cdot D_{27}] - k_{3a}^{-}[pS_3 \cdot D_{27}] + k_{3b}^{-}[pS_3 \cdot D_{27} \cdot pS_3] + k_{1b}^{-}[pS_3 \cdot D_{27} \cdot pS_1] \\ & - \beta_{27}[pS_3 \cdot D_{27}] - \gamma_{27}([pS_1]+[pS_3])[pS_3 \cdot D_{27}]\end{aligned} \tag{36}$$

$$\begin{aligned}\frac{d[D_{27} \cdot pS_3]}{dt} ={}& -k_{3a}^{+}[D_{27} \cdot pS_3][S_3] + k_{3a}^{-}[S_3 \cdot D_{27} \cdot pS_3] - k_{1a}^{+}[D_{27} \cdot pS_3][S_1] + k_{1a}^{-}[S_1 \cdot D_{27} \cdot pS_3] \\ & + q[D_{27} \cdot S_3] - k_{3b}^{-}[D_{27} \cdot pS_3] + k_{3a}^{-}[pS_3 \cdot D_{27} \cdot pS_3] + k_{1a}^{-}[pS_1 \cdot D_{27} \cdot pS_3] \\ & - \beta_{27}[D_{27} \cdot pS_3] - \gamma_{27}([pS_1]+[pS_3])[D_{27} \cdot pS_3]\end{aligned} \tag{37}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_{27} \cdot S_1]}{dt} ={}& k_{1a}^{+}[S_1][D_{27} \cdot S_1] - k_{1a}^{-}[S_1 \cdot D_{27} \cdot S_1] + k_{1b}^{+}[S_1 \cdot D_{27}][S_1] - k_{1b}^{-}[S_1 \cdot D_{27} \cdot S_1] \\ & - 2q[S_1 \cdot D_{27} \cdot S_1] - \beta_{27}[S_1 \cdot D_{27} \cdot S_1] - \gamma_{27}([pS_1]+[pS_3])[S_1 \cdot D_{27} \cdot S_1]\end{aligned} \tag{38}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_{27} \cdot S_1]}{dt} ={}& k_{1b}^{+}[pS_1 \cdot D_{27}][S_1] - k_{1b}^{-}[pS_1 \cdot D_{27} \cdot S_1] + q[S_1 \cdot D_{27} \cdot S_1] - q[pS_1 \cdot D_{27} \cdot S_1] \\ & - k_{1a}^{-}[pS_1 \cdot D_{27} \cdot S_1] - \beta_{27}[pS_1 \cdot D_{27} \cdot S_1] - \gamma_{27}([pS_1]+[pS_3])[pS_1 \cdot D_{27} \cdot S_1]\end{aligned} \tag{39}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_{27} \cdot pS_1]}{dt} ={}& k_{1a}^{+}[S_1][D_{27} \cdot pS_1] - k_{1a}^{-}[S_1 \cdot D_{27} \cdot pS_1] + q[S_1 \cdot D_{27} \cdot S_1] - q[S_1 \cdot D_{27} \cdot pS_1] \\ & - k_{1b}^{-}[S_1 \cdot D_{27} \cdot pS_1] - \beta_{27}[S_1 \cdot D_{27} \cdot pS_1] - \gamma_{27}([pS_1]+[pS_3])[S_1 \cdot D_{27} \cdot pS_1]\end{aligned} \tag{40}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_{27} \cdot pS_1]}{dt} ={}& q([S_1 \cdot D_{27} \cdot pS_1] + [pS_1 \cdot D_{27} \cdot S_1]) - [pS_1 \cdot D_{27} \cdot pS_1](k_{1a}^{-} + k_{1b}^{-}) \\ & - \beta_{27}[pS_1 \cdot D_{27} \cdot pS_1] - \gamma_{27}([pS_1]+[pS_3])[pS_1 \cdot D_{27} \cdot pS_1]\end{aligned} \tag{41}$$

$$\begin{aligned}\frac{d[S_3 \cdot D_{27} \cdot S_3]}{dt} ={}& k_{3a}^{+}[S_3][D_{27} \cdot S_3] - k_{3a}^{-}[S_3 \cdot D_{27} \cdot S_3] + k_{3b}^{+}[S_3 \cdot D_{27}][S_3] - k_{3b}^{-}[S_3 \cdot D_{27} \cdot S_3] \\ & - 2q[S_3 \cdot D_{27} \cdot S_3] - \beta_{27}[S_3 \cdot D_{27} \cdot S_3] - \gamma_{27}([pS_1]+[pS_3])[S_3 \cdot D_{27} \cdot S_3]\end{aligned} \tag{42}$$

$$\begin{aligned}\frac{d[pS_3 \cdot D_{27} \cdot S_3]}{dt} ={}& k_{3b}^{+}[pS_3 \cdot D_{27}][S_3] - k_{3b}^{-}[pS_3 \cdot D_{27} \cdot S_3] + q[S_3 \cdot D_{27} \cdot S_3] - q[pS_3 \cdot D_{27} \cdot S_3] \\ & - k_{3a}^{-}[pS_3 \cdot D_{27} \cdot S_3] - \beta_{27}[pS_3 \cdot D_{27} \cdot S_3] - \gamma_{27}([pS_1]+[pS_3])[pS_3 \cdot D_{27} \cdot S_3]\end{aligned} \tag{43}$$

$$\begin{aligned}\frac{d[S_3 \cdot D_{27} \cdot pS_3]}{dt} ={}& k_{3a}^{+}[S_3][D_{27} \cdot pS_3] - k_{3a}^{-}[S_3 \cdot D_{27} \cdot pS_3] + q[S_3 \cdot D_{27} \cdot S_3] - q[S_3 \cdot D_{27} \cdot pS_3] \\ & - k_{3b}^{-}[S_3 \cdot D_{27} \cdot pS_3] - \beta_{27}[S_3 \cdot D_{27} \cdot pS_3] - \gamma_{27}([pS_1]+[pS_3])[S_3 \cdot D_{27} \cdot pS_3]\end{aligned} \tag{44}$$

$$\begin{aligned}\frac{d[pS_3 \cdot D_{27} \cdot pS_3]}{dt} ={}& q([S_3 \cdot D_{27} \cdot pS_3] + [pS_3 \cdot D_{27} \cdot S_3]) - [pS_3 \cdot D_{27} \cdot pS_3](k_{3a}^{-} + k_{3b}^{-}) \\ & - \beta_{27}[pS_3 \cdot D_{27} \cdot pS_3] - \gamma_{27}([pS_1]+[pS_3])[pS_3 \cdot D_{27} \cdot pS_3]\end{aligned} \tag{45}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_{27} \cdot S_3]}{dt} ={}& k_{1a}^{+}[S_1][D_{27} \cdot S_3] - k_{1a}^{-}[S_1 \cdot D_{27} \cdot S_3] + k_{3b}^{+}[S_1 \cdot D_{27}][S_3] - k_{3b}^{-}[S_1 \cdot D_{27} \cdot S_3] \\ & - 2q[S_1 \cdot D_{27} \cdot S_3] - \beta_{27}[S_1 \cdot D_{27} \cdot S_3] - \gamma_{27}([pS_1]+[pS_3])[S_1 \cdot D_{27} \cdot S_3]\end{aligned} \tag{46}$$

$$\begin{aligned}\frac{d[S_3 \cdot D_{27} \cdot S_1]}{dt} ={}& k_{3a}^{+}[S_3][D_{27} \cdot S_1] - k_{3a}^{-}[S_3 \cdot D_{27} \cdot S_1] + k_{1b}^{+}[S_3 \cdot D_{27}][S_1] - k_{1b}^{-}[S_3 \cdot D_{27} \cdot S_1] \\ & - 2q[S_3 \cdot D_{27} \cdot S_1] - \beta_{27}[S_3 \cdot D_{27} \cdot S_1] - \gamma_{27}([pS_1]+[pS_3])[S_3 \cdot D_{27} \cdot S_1]\end{aligned} \tag{47}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_{27} \cdot S_3]}{dt} ={}& k_{3b}^{+}[pS_1 \cdot D_{27}][S_3] - k_{3b}^{-}[pS_1 \cdot D_{27} \cdot S_3] + q[S_1 \cdot D_{27} \cdot S_3] - q[pS_1 \cdot D_{27} \cdot S_3] \\ & - k_{1a}^{-}[pS_1 \cdot D_{27} \cdot S_3] - \beta_{27}[pS_1 \cdot D_{27} \cdot S_3] - \gamma_{27}([pS_1]+[pS_3])[pS_1 \cdot D_{27} \cdot S_3]\end{aligned} \tag{48}$$

$$\begin{aligned}\frac{d[pS_3 \cdot D_{27} \cdot S_1]}{dt} ={}& k_{1b}^{+}[pS_3 \cdot D_{27}][S_1] - k_{1b}^{-}[pS_3 \cdot D_{27} \cdot S_1] + q[S_3 \cdot D_{27} \cdot S_1] - q[pS_3 \cdot D_{27} \cdot S_1] \\ & - k_{3a}^{-}[pS_3 \cdot D_{27} \cdot S_1] - \beta_{27}[pS_3 \cdot D_{27} \cdot S_1] - \gamma_{27}([pS_1]+[pS_3])[pS_3 \cdot D_{27} \cdot S_1]\end{aligned} \tag{49}$$

$$\begin{aligned}\frac{d[S_1 \cdot D_{27} \cdot pS_3]}{dt} ={}& k_{1a}^{+}[S_1][D_{27} \cdot pS_3] - k_{1a}^{-}[S_1 \cdot D_{27} \cdot pS_3] + q[S_1 \cdot D_{27} \cdot S_3] - q[S_1 \cdot D_{27} \cdot pS_3] \\ & - k_{3b}^{-}[S_1 \cdot D_{27} \cdot pS_3] - \beta_{27}[S_1 \cdot D_{27} \cdot pS_3] - \gamma_{27}([pS_1]+[pS_3])[S_1 \cdot D_{27} \cdot pS_3]\end{aligned} \tag{50}$$

$$\begin{aligned}\frac{d[S_3 \cdot D_{27} \cdot pS_1]}{dt} ={}& k_{3a}^{+}[S_3][D_{27} \cdot pS_1] - k_{3a}^{-}[S_3 \cdot D_{27} \cdot pS_1] + q[S_3 \cdot D_{27} \cdot S_1] - q[S_3 \cdot D_{27} \cdot pS_1] \\ & - k_{1b}^{-}[S_3 \cdot D_{27} \cdot pS_1] - \beta_{27}[S_3 \cdot D_{27} \cdot pS_1] - \gamma_{27}([pS_1]+[pS_3])[S_3 \cdot D_{27} \cdot pS_1]\end{aligned} \tag{51}$$

$$\begin{aligned}\frac{d[pS_1 \cdot D_{27} \cdot pS_3]}{dt} ={}& q([S_1 \cdot D_{27} \cdot pS_3] + [pS_1 \cdot D_{27} \cdot S_3]) - [pS_1 \cdot D_{27} \cdot pS_3](k_{1a}^{-} + k_{3b}^{-}) \\ & - \beta_{27}[pS_1 \cdot D_{27} \cdot pS_3] - \gamma_{27}([pS_1]+[pS_3])[pS_1 \cdot D_{27} \cdot pS_3]\end{aligned} \tag{52}$$

$$\begin{aligned}\frac{d[pS_3 \cdot D_{27} \cdot pS_1]}{dt} ={}& q([S_3 \cdot D_{27} \cdot pS_1] + [pS_3 \cdot D_{27} \cdot S_1]) - [pS_3 \cdot D_{27} \cdot pS_1](k_{3a}^{-} + k_{1b}^{-}) \\ & - \beta_{27}[pS_3 \cdot D_{27} \cdot pS_1] - \gamma_{27}([pS_1]+[pS_3])[pS_3 \cdot D_{27} \cdot pS_1]\end{aligned} \tag{53}$$

$$\begin{aligned}\frac{d[pS_1]}{dt} ={}& k_{1a}^{-}([pS_1 \cdot D_{27}] + [pS_1 \cdot D_{27} \cdot S_1] + [pS_1 \cdot D_{27} \cdot pS_1] + [pS_1 \cdot D_{27} \cdot S_3] + [pS_1 \cdot D_{27} \cdot pS_3]) \\ & + k_{1b}^{-}([D_{27} \cdot pS_1] + [S_1 \cdot D_{27} \cdot pS_1] + [pS_1 \cdot D_{27} \cdot pS_1] + [S_3 \cdot D_{27} \cdot pS_1] + [pS_3 \cdot D_{27} \cdot pS_1]) - d_1[pS_1]\end{aligned} \tag{54}$$

$$\begin{aligned}\frac{d[pS_3]}{dt} ={}& k_{3a}^{-}([pS_3 \cdot D_{27}] + [pS_3 \cdot D_{27} \cdot S_3] + [pS_3 \cdot D_{27} \cdot pS_3] + [pS_3 \cdot D_{27} \cdot S_1] + [pS_3 \cdot D_{27} \cdot pS_1]) \\ & + k_{3b}^{-}([D_{27} \cdot pS_3] + [S_3 \cdot D_{27} \cdot pS_3] + [pS_3 \cdot D_{27} \cdot pS_3] + [S_1 \cdot D_{27} \cdot pS_3] + [pS_1 \cdot D_{27} \cdot pS_3]) - d_3[pS_3]\end{aligned} \tag{55}$$

Similarly to the HypIL-6 model, the terms in Equations (23) – (55) involving the parameter $\beta_{27}$ apply only to the model under hypothesis 1 and the terms involving the parameter $\gamma_{27}$ apply only to the model under hypothesis 2. We now describe how we have made use of the experimental data (Supp. Fig. 6b and 6c) to parameterise the mathematical models described above. Since the experimental outputs are levels of pSTAT1 and pSTAT3 as a function of time under HypIL-6 and IL-27 stimulation (Supp. Fig. 6b and 6c), we consider two model outputs of interest for the HypIL-6 and IL-27 mathematical models, which are proportional to the experimental data in Supp.
Figure 6b and 6c; namely, the sum of all molecular species (variables) containing phosphorylated STAT1 (free or bound), $[pS_1]_{T,j}$ for $j \in \{6,27\}$, and the sum of all species (variables) containing phosphorylated STAT3 (free or bound), $[pS_3]_{T,j}$ for $j \in \{6,27\}$. The concentrations of the two model outputs of interest at any time $t$ are given by

$$[pS_1]_{T,6}(t) = [D_6 \cdot pS_1](t) + [pS_1 \cdot D_6 \cdot S_1](t) + 2[pS_1 \cdot D_6 \cdot pS_1](t) + [pS_1 \cdot D_6 \cdot S_3](t) + [pS_1 \cdot D_6 \cdot pS_3](t) + [pS_1](t), \tag{56}$$

$$[pS_3]_{T,6}(t) = [D_6 \cdot pS_3](t) + [pS_3 \cdot D_6 \cdot S_3](t) + 2[pS_3 \cdot D_6 \cdot pS_3](t) + [pS_3 \cdot D_6 \cdot S_1](t) + [pS_3 \cdot D_6 \cdot pS_1](t) + [pS_3](t), \tag{57}$$

for the HypIL-6 model, and by

$$\begin{aligned}[pS_1]_{T,27}(t) ={}& [pS_1 \cdot D_{27}](t) + [D_{27} \cdot pS_1](t) + [pS_1 \cdot D_{27} \cdot S_1](t) + [S_1 \cdot D_{27} \cdot pS_1](t) + 2[pS_1 \cdot D_{27} \cdot pS_1](t) \\ & + [pS_1 \cdot D_{27} \cdot S_3](t) + [S_3 \cdot D_{27} \cdot pS_1](t) + [pS_1 \cdot D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot pS_1](t) + [pS_1](t),\end{aligned} \tag{58}$$

$$\begin{aligned}[pS_3]_{T,27}(t) ={}& [pS_3 \cdot D_{27}](t) + [D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot S_3](t) + [S_3 \cdot D_{27} \cdot pS_3](t) + 2[pS_3 \cdot D_{27} \cdot pS_3](t) \\ & + [pS_3 \cdot D_{27} \cdot S_1](t) + [S_1 \cdot D_{27} \cdot pS_3](t) + [pS_1 \cdot D_{27} \cdot pS_3](t) + [pS_3 \cdot D_{27} \cdot pS_1](t) + [pS_3](t),\end{aligned} \tag{59}$$

for the IL-27 model. Having developed two mathematical models for the stimulation of the experimental system with HypIL-6 and IL-27, our objective was then to parameterise these models making use of approximate Bayesian computation sequential Monte Carlo (ABC-SMC). Firstly, a Bayesian model selection was carried out to determine which hypothesis (mechanism) of internalisation/degradation of receptor molecules is most likely given the data. Once a hypothesis was selected, the ABC-SMC method, together with the experimental data, allows one to obtain posterior distributions for each of the parameter values and initial concentrations in the mathematical models. In this way, we can learn which reactions and parameters in the models cause the differential signaling by pSTAT1 observed when stimulating with HypIL-6 and IL-27.
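The total pSTAT outputs in Equations (56) to (59) count every species that contains a phosphorylated STAT, with doubly phosphorylated complexes of the same STAT counted twice. A minimal sketch of that bookkeeping follows, assuming a hypothetical representation in which each species is a tuple of component labels; the concentrations are made-up numbers.

```python
def total_pstat(concentrations, stat):
    """Total phosphorylated STAT`stat` (free or bound), in nM.

    concentrations -- dict mapping a species, written as a tuple of component
                      labels such as ("pS1", "D6", "S3"), to its concentration
    stat           -- 1 or 3
    """
    label = f"pS{stat}"
    # A species contributes once per phosphorylated STAT of this type it carries,
    # so ("pS1", "D6", "pS1") is weighted by 2, as in Equation (56).
    return sum(species.count(label) * conc
               for species, conc in concentrations.items())


# Made-up concentrations (nM) for a few HypIL-6 species.
state = {
    ("D6", "pS1"): 1.0,
    ("pS1", "D6", "pS1"): 0.5,
    ("S1", "D6", "pS3"): 3.0,
    ("pS1",): 2.0,
    ("S1",): 40.0,
}
total_ps1 = total_pstat(state, 1)
total_ps3 = total_pstat(state, 3)
```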
The experimental data we used to compare with the mathematical model outputs was the mean relative fluorescence intensity of total phosphorylated STAT1 and total phosphorylated STAT3 in both RPE1 and Th-1 cells (Supp. Figure 5b and 5c). We normalised the data to obtain dimensionless values, which can be compared with the mathematical model outputs. Firstly, we constructed a linear model for the fluorescence intensity (background fluorescence) of antibodies for phosphorylated STAT1 and STAT3 in unstimulated cells. We subtracted the value of this linear model at each time point from the corresponding fluorescence intensity in HypIL-6 and IL-27 stimulated cells, for each repeat of the experiment and each cell type. Denoting by $f$ the experimental fluorescence intensity, $f(r,i,tp,j,d)$ corresponds to the fluorescence intensity for the $r$th repeat, $r \in R = \{1,2,3,4\}$, with antibody for STAT$i$, $i \in I = \{1,3\}$, at time point $tp \in TP = \{0, 5, 15, 30, 60, 90, 120, 180\}$ min, under stimulation by cytokine IL-$j$ (HypIL-$j$ when $j = 6$), with $j \in J = \{6,27\}$, and in cell type $d \in D = \{\text{RPE1}, \text{Th-1}\}$. Each data point $\mathit{data}(r,i,tp,j,d)$ to be used in the Bayesian inference and Bayesian model selection was then computed as

$$\mathit{data}(r,i,tp,j,d) = \frac{f(r,i,tp,j,d)}{f(r,i,tp = 30\ \text{min},j = 27,d)}.$$
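The normalisation above can be sketched in a few lines. The example assumes the intensities have already been background-subtracted and are stored in a dictionary keyed by $(r, i, tp, j, d)$ with time in minutes; this data layout and the numbers are hypothetical.

```python
def normalise(f):
    """Divide each intensity by the IL-27, 30 min value of the same repeat,
    STAT type and cell type, giving dimensionless data points."""
    data = {}
    for (r, i, tp, j, d), value in f.items():
        reference = f[(r, i, 30, 27, d)]  # tp = 30 min, j = 27
        data[(r, i, tp, j, d)] = value / reference
    return data


# Made-up background-subtracted intensities for repeat 1, pSTAT1, RPE1 cells.
f = {
    (1, 1, 30, 27, "RPE1"): 200.0,
    (1, 1, 15, 6, "RPE1"): 50.0,
    (1, 1, 30, 6, "RPE1"): 100.0,
}
data = normalise(f)
```

By construction, the IL-27 value at 30 min is 1 for every repeat, STAT type and cell type, so all other points are expressed relative to it.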
To compare the model output, $\mathit{sim}$, with the data, the output was normalised in the same way as the data, i.e.,

$$\mathit{sim}(i,tp,j,d) = \frac{[pS_i]_{T,j}(tp,d)}{[pS_i]_{T,27}(30\ \text{min},d)},$$

where $[pS_i]_{T,j}(tp,d)$ denotes the total concentration of phosphorylated STAT$i$ at time $tp$ (see Equations 56-59) when considering cell type $d$. In this way, experimental data and the mathematical model outputs are comparable. The similarity between the model output and the data points is then computed by the introduction of a distance measure $\delta(\mathit{sim},\mathit{data})$. Here, this distance measure was chosen as a generalisation of the Euclidean distance, where

$$\delta_d(\mathit{sim},\mathit{data})^2 = \sum_{j \in J} \sum_{tp \in TP} \sum_{i \in I} \big(\mathit{sim}(i,tp,j,d) - \mu_{\mathit{data}}(i,tp,j,d)\big)^2,$$

for $d \in D = \{\text{RPE1}, \text{Th-1}\}$, where $\mu_{\mathit{data}}(i,tp,j,d)$ is the mean of the four repeats of the data and is given by

$$\mu_{\mathit{data}}(i,tp,j,d) = \frac{1}{4} \sum_{r=1}^{4} \mathit{data}(r,i,tp,j,d).$$

To carry out the Bayesian model selection and Bayesian parameter inference, prior beliefs about the parameters were firstly defined. Each of the parameters (reaction rates) and initial concentrations in the model were sampled from a prior distribution, where the distribution was informed by experimental data or values from the literature, when possible. The choice of prior distributions is given in Table 2.

| Parameter | Prior distribution | Reference |
|---|---|---|
| $r_{1,6}^{+}$ | $10^{r}$ for $r \sim N(-3, 1.5)$ | * |
| $r_{1,6}^{-}$ | $10^{r}$ for $r \sim N(-3.9, 1.96)$ | * |
| $r_{1,27}^{+}$ | $10^{r}$ for $r \sim N(-2.34, 1.17)$ | * |
| $r_{1,27}^{-}$ | $10^{r}$ for $r \sim N(-2.82, 1.41)$ | * |
| $r_{2,j}^{+}$ for $j \in \{6,27\}$ | $10^{r}$ for $r \sim \mathit{Unif}(-2, 3)$ | (94) |
| $r_{2,j}^{-}$ for $j \in \{6,27\}$ | $10^{r}$ for $r \sim \mathit{Unif}(-3, 1)$ | (94) |
| $k_{ia}^{+}$, $k_{ib}^{+}$ for $i \in \{1,3\}$ | $10^{r}$ for $r \sim \mathit{Unif}(-7, 1)$ | ** |
| $k_{ia}^{-}$, $k_{ib}^{-}$ for $i \in \{1,3\}$ | $10^{r}$ for $r \sim \mathit{Unif}(-2, 1)$ | ** |
| $q$ | $10^{r}$ for $r \sim \mathit{Unif}(-3, 2)$ | Assumed |
| $d_i$ for $i \in \{1,3\}$ | $10^{r}$ for $r \sim \mathit{Unif}(-5, -2)$ | *** |
| $\beta_j$ for $j \in \{6,27\}$ | $10^{r}$ for $r \sim \mathit{Unif}(-5, -1)$ | † |
| $[R_1(0)]$ | $N(12.7, 6.35)$ | ‡ |
| $[R_2(0)]$ | $N(33.8, 16.9)$ | ‡ |
| $[S_1(0)]$ | $N(300, 100)$ | (95) |
| $[S_3(0)]$ | $N(400, 100)$ | (95) |

Table 2: Prior distributions assigned to each parameter and initial concentration in the model.
* These distributions are centred around measurements obtained from cell surface receptor quantification experiments.
** These distributions were derived based on $K_d$ values obtained from the literature (42).
*** These distributions are based on values derived from experimental data in which the cells were treated with Tofacitinib.
† These distributions were based on values derived from experimental data in which the cells were treated with cycloheximide.
‡ These distributions were based on computations involving approximate cell sizes and average numbers of molecules per cell.

We made use of the prior distributions from Table 2 to then carry out a Bayesian model selection to determine which hypothesis is most likely given the RPE1 data for both HypIL-6 and IL-27 signaling. We ran $10^{6}$ simulations for each mathematical model (HypIL-6 and IL-27) and for each hypothesis, sampling model parameters from their prior distributions. We then computed a summary statistic for varying values of $\delta_{\text{RPE1},*}$, the distance threshold between the mathematical model and data at which parameters are accepted (or rejected) in the ABC.
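The prior sampling and the distance computation described so far can be sketched as follows. This is an illustration under stated assumptions: the second argument of $N(\cdot,\cdot)$ is treated as a standard deviation (the text does not say whether it is a standard deviation or a variance), data and simulation outputs are stored in hypothetical dictionaries keyed by $(r, i, tp, j, d)$ and $(i, tp, j, d)$, and the numbers are made up.

```python
import random


def sample_log10_normal(mean, spread, rng):
    """Draw 10**r with r ~ N(mean, spread); spread is taken to be a
    standard deviation, which is an assumption of this sketch."""
    return 10.0 ** rng.gauss(mean, spread)


def sample_log10_uniform(low, high, rng):
    """Draw 10**r with r ~ Unif(low, high)."""
    return 10.0 ** rng.uniform(low, high)


def mean_over_repeats(data, repeats=(1, 2, 3, 4)):
    """mu_data(i, tp, j, d): average of the normalised data over the repeats."""
    keys = {key[1:] for key in data}
    return {k: sum(data[(r,) + k] for r in repeats) / len(repeats) for k in keys}


def distance_squared(sim, mu_data, cell_type):
    """Squared distance for one cell type d, summed over i, tp and j."""
    return sum((sim[k] - mu) ** 2
               for k, mu in mu_data.items() if k[3] == cell_type)


rng = random.Random(0)
q = sample_log10_uniform(-3, 2, rng)          # prior for q in Table 2
r1_plus_6 = sample_log10_normal(-3, 1.5, rng)  # prior for r_{1,6}^+ in Table 2

# Made-up normalised data: four repeats at two (i, tp, j, d) combinations.
data = {(r, 1, 30, 27, "RPE1"): 1.0 for r in (1, 2, 3, 4)}
data.update({(r, 1, 15, 6, "RPE1"): 0.2 + 0.1 * r for r in (1, 2, 3, 4)})
mu = mean_over_repeats(data)
sim = {(1, 30, 27, "RPE1"): 1.0, (1, 15, 6, "RPE1"): 0.25}
dist2 = distance_squared(sim, mu, "RPE1")
```

A parameter set would then be accepted when the square root of `dist2` does not exceed the chosen threshold.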
Finally, we computed $f(H_k)$, the number of accepted parameter sets for hypothesis $k$, where a parameter set is accepted if it results in a distance value less than or equal to the distance threshold $\delta_{\text{RPE1},*}$. This allowed us to compute the relative probability, $p(H_k)$, of each hypothesis, as defined by the following equation

$$p(H_k \mid \delta_{\text{RPE1},*}) = \frac{f(H_k \mid \delta_{\text{RPE1},*})}{f(H_1 \mid \delta_{\text{RPE1},*}) + f(H_2 \mid \delta_{\text{RPE1},*})},$$

for $k \in \{1,2\}$. The results of the model selection analysis for RPE1 are shown in Figure 2d, where the relative probability of hypothesis 1 increases as $\delta_{\text{RPE1},*}$ tends to 0, whilst the relative probability of hypothesis 2 correspondingly decreases. We hence concluded that the experimental data together with the mathematical models for HypIL-6 and IL-27 signaling provide greater support to hypothesis 1 (around 70%) than to hypothesis 2 (around 30%). We note that as the distance threshold, $\delta_{\text{RPE1},*}$, is increased, both hypotheses become equally likely, as is to be expected. Given the results of the model selection, the Bayesian parameter inference for the mathematical models of HypIL-6 and IL-27 signaling was only carried out for hypothesis 1. We used the ABC sequential Monte Carlo (ABC-SMC) approach (96) to obtain posterior distributions for the parameters in Table 1, making use of the prior distributions in Table 2. All model parameters in Table 1 were estimated for the RPE1 data set. A subset of the parameters, which we would expect may vary with cell type, were then estimated for the Th-1 data set.
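The acceptance counting behind the relative probabilities $p(H_k \mid \delta)$ used in the model selection can be sketched as below. The function name and the toy distance values are hypothetical; the sketch assumes, as in the text, that the same number of prior simulations is run for each hypothesis.

```python
def relative_probabilities(distances_h1, distances_h2, threshold):
    """Return (p(H1 | threshold), p(H2 | threshold)) from simulated distances.

    Each list holds one distance per prior simulation of that hypothesis;
    a simulation is accepted if its distance is <= threshold.
    """
    f_h1 = sum(1 for dist in distances_h1 if dist <= threshold)
    f_h2 = sum(1 for dist in distances_h2 if dist <= threshold)
    total = f_h1 + f_h2
    if total == 0:
        raise ValueError("no parameter set accepted at this threshold")
    return f_h1 / total, f_h2 / total


# Toy distances from five simulations per hypothesis.
d_h1 = [0.5, 0.8, 1.2, 2.0, 3.5]
d_h2 = [0.9, 1.5, 2.5, 3.0, 4.0]
tight = relative_probabilities(d_h1, d_h2, 1.0)
loose = relative_probabilities(d_h1, d_h2, 5.0)
```

With a loose threshold every simulation is accepted and both hypotheses receive probability 0.5, which mirrors the remark that the hypotheses become equally likely as the threshold increases.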
In particular, the parameters not being estimated for Th-1 were sampled from the posterior distributions obtained via the ABC-SMC for RPE1, and those parameters estimated separately for Th-1 were: $q$, $d_1$, $d_3$, $\beta_6$, $\beta_{27}$, $[R_1(0)]$, $[R_2(0)]$, $[S_1(0)]$ and $[S_3(0)]$. To further validate the two mathematical models of cytokine signaling, we aimed to reproduce additional experimental results making use of the posterior parameter predictions from the RPE1 data ABC-SMC. Firstly, in order to replicate the experimental dose response curve seen in Supp. Fig. 2a, we ran both models using the $10^{4}$ accepted parameter sets from the ABC-SMC for 18 different values of cytokine concentration, within the range $10^{-4}$ – $10^{2}$ nM (logarithmically spaced). The results of this analysis are seen in Supp. Fig. 12b. We also modified the mathematical models to allow them to describe the IL-27Rα-GP130 chimera experiments (Fig. 3c). In particular, a new mathematical model for the chimera experiments was developed as follows: it consisted of the ODEs from the IL-27 model which are involved in the formation of the dimer (Equations (23) – (26)) and the ODEs from the HypIL-6 model post-dimer formation (Equations (5) – (22)), in which $D_6$ was replaced by $D_{27}$. The ODE for the IL-27 induced dimer in the chimera model was as follows

$$\begin{aligned}\frac{d[D_{27}]}{dt} ={}& r_{2,27}^{+}[C_2][R_1] - r_{2,27}^{-}[D_{27}] - 2k_{1a}^{+}[D_{27}][S_1] + k_{1a}^{-}([S_1 \cdot D_{27}] + [pS_1 \cdot D_{27}]) \\ & - 2k_{3a}^{+}[D_{27}][S_3] + k_{3a}^{-}([S_3 \cdot D_{27}] + [pS_3 \cdot D_{27}]) - \beta_{27}[D_{27}].\end{aligned}$$

We simulated both the original mathematical model of IL-27 and the chimera model using the accepted parameter sets from the ABC-SMC. The results can be seen in Supp. Fig. 12a. Finally, we focussed on one of the mutant varieties of IL-27Rα, Y613F, and sought to reproduce the results of Fig. 3b making use of the mathematical model of IL-27 signaling.
Since the mutation decreases the affinity of STAT1 to IL-27Rα, we fixed the association and dissociation rates of STAT1 to the IL-27Rα chain, $k_{1b}^{+}$ and $k_{1b}^{-}$, at values which resulted in a high µM affinity. The specific values chosen were $k_{1b}^{+} = 10^{-4}$ nM$^{-1}$ s$^{-1}$ and $k_{1b}^{-} = 10^{1}$ s$^{-1}$, which yields an affinity of $k_{1b}^{-}/k_{1b}^{+} = 10^{5}$ nM $= 10^{2}$ µM. The rate $k_{1b}^{-}$ was chosen as approximately the median of the posterior distribution for this parameter from the ABC-SMC, and the rate $k_{1b}^{+}$ was then significantly decreased in order to increase the affinity value (i.e. weaken binding). We simulated the mathematical model of IL-27 signaling using the 106 accepted parameter sets from the ABC-SMC, but with the rates $k_{1b}^{+}$ and $k_{1b}^{-}$ fixed as described above. The pointwise medians and 95% credible intervals of these simulations are plotted in Supp. Fig. 12c, as well as the simulations for the WT, without altering any of the parameter values from the posterior distributions. Altering the binding affinity of STAT1 to IL-27Rα in this way in the mathematical model allows us to generate results which replicate reasonably well the experimental observations for the Y613F mutant in Figure 3b.

Live-cell dual-color single-molecule imaging studies: Single molecule imaging experiments were carried out by total internal reflection fluorescence (TIRF) microscopy with an inverted microscope (Olympus IX71) equipped with a triple-line total internal reflection (TIR) illumination condenser (Olympus) and a back-illuminated electron-multiplying (EM) CCD camera (iXon DU897D, 512 × 512 pixels, Andor Technology) as recently described (38-40). A 150× magnification objective with a numerical aperture of 1.45 (UAPO 150×/1.45 TIRFM, Olympus) was used for TIR illumination. All experiments were carried out at room temperature in medium without phenol red supplemented with an oxygen scavenger and a redox-active photoprotectant to minimize photobleaching (97).
For heterodimerization experiments of IL-27Rα and GP130, cell surface labeling of RPE1 GP130 KO cells co-transfected with mXFPe-IL-27Rα and mXFPm-GP130 was achieved by adding aGFP-enNBRHO11 and aGFP-miNBDY647 to the medium at equal concentrations (5 nM) and incubating for at least 5 min prior to stimulation with IL-27 (20 nM) or HypIL-6 (20 nM). For homodimerization experiments with mXFPm-GP130, aGFP-miNBDY647 and aGFP-miNBRHO11 (98) were used for cell surface receptor labelling as described above. The nanobodies were kept in the bulk solution during the whole experiment in order to ensure high equilibrium binding to mXFP-GP130. For simultaneous dual-color acquisition, aGFP-NBRHO11 was excited by a 561 nm diode-pumped solid-state laser at 0.95 mW (~32 W/cm²) and aGFP-NBDY647 by a 642 nm laser diode at 0.65 mW (~22 W/cm²). Fluorescence was detected using a spectral image splitter (DualView, Optical Insight) with a 640 DCXR dichroic beam splitter (Chroma) in combination with the bandpass filter 585/40 (Semrock) for detection of RHO11 and 690/70 (Chroma) for detection of DY647, dividing each emission channel into 512 × 256 pixels. Image stacks of 150 frames were recorded at 32 ms/frame.

Single molecule localization and single molecule tracking were carried out using the multiple-target tracing (MTT) algorithm (99) as described previously (100). Step-length histograms were obtained from single molecule trajectories and fitted by a two-fraction mixture model of Brownian diffusion.
Average diffusion constants were determined from the slope (2-10 steps) of the mean square displacement versus time lapse diagrams. Immobile molecules were identified by the density-based spatial clustering of applications with noise (DBSCAN) algorithm as described recently (101). For comparing diffusion properties and for co-tracking analysis, immobile particles were excluded from the data set.

Prior to co-localization analysis, imaging channels were aligned with sub-pixel precision by using a spatial transformation. To this end, a transformation matrix was calculated based on a calibration measurement with multicolour fluorescent beads (TetraSpeck microspheres 0.1 µm, Invitrogen) visible in both spectral channels (cp2tform of type ‘affine’, The MathWorks MATLAB 2009a). Individual molecules were regarded as co-localized if a particle was detected in both spectral channels of a single frame within a distance threshold of 100 nm radius. For single molecule co-tracking analysis, the MTT algorithm was applied to this dataset of co-localized molecules to reconstruct co-locomotion trajectories (co-trajectories) from the identified population of co-localizations. For the co-tracking analysis, only trajectories with a minimum of 10 steps (~320 ms) were considered in order to robustly remove random receptor co-localizations (39).

For heterodimerization experiments of mXFPe-IL-27Rα and mXFPm-GP130, the relative fraction of dimerized receptors was calculated from the number of co-trajectories relative to the number of IL-27Rα trajectories. GP130 was expressed in moderate excess (~1.5-2 fold), so that maximal receptor assembly was not limited by abundance of the low-affinity subunit GP130.
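The single-frame co-localization criterion described above (a particle in each channel within 100 nm) can be illustrated with a short sketch. This is not the authors' MATLAB/MTT pipeline: the function name `colocalized_pairs`, the greedy nearest-first pairing rule and the toy coordinates are assumptions made purely for illustration, and the positions are assumed to be already registered into a common coordinate system in nm.

```python
import numpy as np


def colocalized_pairs(xy_a, xy_b, max_dist_nm=100.0):
    """Greedily pair localizations from two channels of one frame.

    xy_a, xy_b: (N, 2) and (M, 2) positions in nm, assumed to be already
    aligned. Returns (index_a, index_b) pairs whose separation is
    <= max_dist_nm; each particle is used at most once.
    """
    a = np.asarray(xy_a, dtype=float).reshape(-1, 2)
    b = np.asarray(xy_b, dtype=float).reshape(-1, 2)
    if len(a) == 0 or len(b) == 0:
        return []
    # Pairwise Euclidean distances, shape (N, M).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    used_a, used_b, pairs = set(), set(), []
    # Visit candidate pairs from closest to farthest (hypothetical rule).
    for flat in np.argsort(dist, axis=None):
        i, j = (int(k) for k in np.unravel_index(flat, dist.shape))
        if dist[i, j] > max_dist_nm:
            break  # every remaining pair is farther apart
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j))
    return pairs


# Toy frame: one particle pair 50 nm apart, the others far away.
channel_a = [(0.0, 0.0), (1000.0, 1000.0)]
channel_b = [(30.0, 40.0), (5000.0, 5000.0)]
print(colocalized_pairs(channel_a, channel_b))
```

In the workflow described in the text, co-localizations found in this way would subsequently be linked across frames, and only co-trajectories of at least 10 steps retained.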
For homodimerization experiments with GP130, the relative fraction of co-tracked molecules was determined with respect to the absolute number of trajectories and corrected for GP130 stochastically double-labelled with the same fluorophore species as follows:

$$AB^{*} = \frac{AB}{2 \times \left[\left(\frac{A}{A+B}\right) \times \left(\frac{B}{A+B}\right)\right]}, \qquad \text{rel. co-locomotion} = \frac{2 \times AB^{*}}{A+B}$$

where A, B, AB and AB* are the numbers of trajectories observed for Rho11, DY647, co-trajectories and corrected co-trajectories, respectively.

The two-dimensional equilibrium dissociation constants ($K_D^{2D}$) were calculated according to the law of mass action for a monomer-dimer equilibrium:

Heterodimerization (IL-27Rα+GP130):

$$K_D^{2D} = \frac{\left([GP130] - (\alpha \times [IL27Ra])\right) \times \left([IL27Ra] - (\alpha \times [IL27Ra])\right)}{(\alpha \times [IL27Ra])}$$

or

$$K_D^{2D} = [GP130] \times \left(\frac{1}{\alpha} - 1\right) + [IL27Ra] \times (\alpha - 1)$$

with: $\alpha$ = fraction of IL-27-bound IL-27Rα in complex with GP130

Homodimerization (GP130+GP130):

$$K_D^{2D} = \frac{[M]^2}{[D]} = \frac{\left([M]_0 - 2[D]\right)^2}{[D]}$$

$$K_D^{2D} = \frac{\left([GP130] - 2 \times (\alpha \times [GP130])\right)^2}{2 \times (\alpha \times [GP130])}$$

with: $\alpha$ = fraction of GP130 homodimers relative to [GP130]/2

where [M] and [D] are the concentrations of the monomer and the dimer, respectively, and [M]$_0$ is the total receptor concentration.
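As a worked illustration of the double-labelling correction and the heterodimer dissociation constant given above, the following minimal sketch evaluates both with toy numbers. The function names, the example trajectory counts and the receptor densities are hypothetical and not taken from the paper; densities are assumed to be in a common unit such as molecules/µm². The sketch also checks numerically that the two forms of the heterodimerization formula agree.

```python
import math


def corrected_cotrajectories(a, b, ab):
    """AB* = AB / (2 * (A/(A+B)) * (B/(A+B)))."""
    total = a + b
    return ab / (2.0 * (a / total) * (b / total))


def relative_colocomotion(a, b, ab):
    """rel. co-locomotion = 2 * AB* / (A + B)."""
    return 2.0 * corrected_cotrajectories(a, b, ab) / (a + b)


def kd_2d_heterodimer(gp130, il27ra, alpha):
    """Law-of-mass-action form of the heterodimer K_D^2D.

    gp130, il27ra: surface densities (assumed unit: molecules/um^2).
    alpha: fraction of IL-27Ra in complex with GP130 (0 < alpha <= 1).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    bound = alpha * il27ra  # density of heterodimers
    return (gp130 - bound) * (il27ra - bound) / bound


def kd_2d_heterodimer_expanded(gp130, il27ra, alpha):
    """Equivalent expanded form: [GP130](1/alpha - 1) + [IL27Ra](alpha - 1)."""
    return gp130 * (1.0 / alpha - 1.0) + il27ra * (alpha - 1.0)


# Toy numbers: 100 trajectories per channel, 20 observed co-trajectories.
rel = relative_colocomotion(100, 100, 20)
kd = kd_2d_heterodimer(1.5, 1.0, 0.5)
kd_alt = kd_2d_heterodimer_expanded(1.5, 1.0, 0.5)
print(rel, kd, kd_alt)
```

With equal labelling in both channels the correction doubles the observed number of co-trajectories, reflecting that half of all dimers are expected to carry two identical fluorophores and are therefore invisible to dual-colour co-tracking.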
References:

1. J. J. O'Shea, R. Plenge, JAK and STAT signaling molecules in immunoregulation and immune-mediated disease. Immunity 36, 542-550 (2012).
2. S. Pflanz et al., IL-27, a heterodimeric cytokine composed of EBI3 and p28 protein, induces proliferation of naive CD4+ T cells. Immunity 16, 779-790 (2002).
3. H. Yoshida, C. A. Hunter, The immunobiology of interleukin-27. Annu Rev Immunol 33, 417-443 (2015).
4. J. S. Stumhofer et al., Interleukin 27 negatively regulates the development of interleukin 17-producing T helper cells during chronic inflammation of the central nervous system. Nat Immunol 7, 937-945 (2006).
5. C. Diveu et al., IL-27 blocks RORc expression to inhibit lineage commitment of Th17 cells. J Immunol 182, 5748-5756 (2009).
6. D. C. Fitzgerald et al., Suppression of autoimmune inflammation of the central nervous system by interleukin 10 secreted by interleukin 27-stimulated T cells. Nat Immunol 8, 1372-1379 (2007).
7. J. S. Stumhofer et al., Interleukins 27 and 6 induce STAT3-mediated T cell production of interleukin 10. Nat Immunol 8, 1363-1371 (2007).
8. C. Pot, L. Apetoh, A. Awasthi, V. K. Kuchroo, Induction of regulatory Tr1 cells and inhibition of T(H)17 cells by IL-27. Semin Immunol 23, 438-445 (2011).
9. M. J. Boulanger, D. C. Chow, E. E. Brevnova, K. C. Garcia, Hexameric structure and assembly of the interleukin-6/IL-6 alpha-receptor/gp130 complex. Science 300, 2101-2104 (2003).
10. S. Rose-John, Interleukin-6 Family Cytokines. Cold Spring Harb Perspect Biol 10, (2018).
11. C. A. Hunter, S. A. Jones, IL-6 as a keystone cytokine in health and disease. Nature Immunology 16, 448-457 (2015).
12. T. Korn et al., IL-6 controls Th17 immunity in vivo by inhibiting the conversion of conventional T cells into Foxp3+ regulatory T cells. Proc Natl Acad Sci U S A 105, 18460-18465 (2008).
13. A.
Kimura, T. Kishimoto, IL-6: regulator of Treg/Th17 balance. Eur J Immunol 40, 1830-1835 (2010).
14. G. W. Jones et al., Loss of CD4+ T cell IL-6R expression during inflammation underlines a role for IL-6 trans signaling in the local maintenance of Th17 cells. J Immunol 184, 2130-2139 (2010).
15. C. Rolvering et al., Crosstalk between different family members: IL27 recapitulates IFN gamma responses in HCC cells, but is inhibited by IL6-type cytokines. Bba-Mol Cell Res 1864, 516-526 (2017).
16. A. P. Costa-Pereira et al., Mutational switch of an IL-6 response to an interferon-gamma-like response. P Natl Acad Sci USA 99, 8043-8047 (2002).
17. J. Schmitz, M. Weissenbach, S. Haan, P. C. Heinrich, F. Schaper, SOCS3 exerts its inhibitory function on interleukin-6 signal transduction through the SHP2 recruitment site of gp130. Journal of Biological Chemistry 275, 12848-12856 (2000).
18. H. Yasukawa et al., IL-6 induces an anti-inflammatory response in the absence of SOCS3 in macrophages. Nat Immunol 4, 551-556 (2003).
19. B. A. Croker et al., SOCS3 negatively regulates IL-6 signaling in vivo. Nat Immunol 4, 540-545 (2003).
20. C. Brender et al., Suppressor of cytokine signaling 3 regulates CD8 T-cell proliferation by inhibition of interleukins 6 and 27. Blood 110, 2528-2536 (2007).
21. A. Camporeale, V. Poli, IL-6, IL-17 and STAT3: a holy trinity in auto-immunity? Front Biosci (Landmark Ed) 17, 2306-2326 (2012).
22. G. Regis, S. Pensa, D. Boselli, F. Novelli, V. Poli, Ups and downs: the STAT1:STAT3 seesaw of Interferon and gp130 receptor signalling.
Semin Cell Dev Biol 19, 351-359 (2008).
23. S. Lucas, N. Ghilardi, J. Li, F. J. de Sauvage, IL-27 regulates IL-12 responsiveness of naive CD4(+) T cells through Stat1-dependent and -independent mechanisms. P Natl Acad Sci USA 100, 15047-15052 (2003).
24. S. Kamiya et al., An indispensable role for STAT1 in IL-27-induced T-bet expression but not proliferation of naive CD4(+) T cells. Journal of Immunology 173, 3871-3877 (2004).
25. A. Takeda et al., Cutting edge: Role of IL-27/WSX-1 signaling for induction of T-Bet through activation of STAT1 during initial Th1 commitment. Journal of Immunology 170, 4886-4890 (2003).
26. C. Neufert et al., IL-27 controls the development of inducible regulatory T cells and Th17 cells via differential effects on STAT1. Eur J Immunol 37, 1809-1816 (2007).
27. T. Owaki et al., STAT3 is indispensable to IL-27-mediated cell proliferation but not to IL-27-induced Th1 differentiation and suppression of proinflammatory cytokine production. Journal of Immunology 180, 2903-2911 (2008).
28. K. Hirahara et al., Asymmetric Action of STAT Transcription Factors Drives Transcriptional Outputs and Cytokine Specificity. Immunity 42, 877-889 (2015).
29. S. Oniki et al., Interleukin-23 and interleukin-27 exert quite different antitumor and vaccine effects on poorly immunogenic melanoma. Cancer Res 66, 6395-6404 (2006).
30. M. Fischer et al., I. A bioactive designer cytokine for human hematopoietic progenitor cell expansion. Nat Biotechnol 15, 142-145 (1997).
31. H. H. Oberg, D. Wesch, S. Grussel, S. Rose-John, D. Kabelitz, Differential expression of CD126 and CD130 mediates different STAT-3 phosphorylation in CD4+CD25- and CD25high regulatory T cells. Int Immunol 18, 555-563 (2006).
32. P. O. Krutzik, M. R. Clutter, A. Trejo, G. P. Nolan, Fluorescent cell barcoding for multiplex flow cytometry. Curr Protoc Cytom Chapter 6, Unit 6 31 (2011).
33. U. A. Betz, W.
Muller, Regulated expression of gp130 and IL-6 receptor alpha chain in T cell maturation and activation. Int Immunol 10, 1175-1184 (1998).
34. J. Martinez-Fabregas et al., Kinetics of cytokine receptor trafficking determine signaling and functional selectivity. Elife 8, (2019).
35. C. Gorby et al., Engineered IL-10 variants elicit potent immunomodulatory effects at low ligand doses. Sci Signal 13, (2020).
36. V. Ruprecht, J. Weghuber, S. Wieser, G. J. Schütz, in Advances in Planar Lipid Bilayers and Liposomes. (2010), vol. 12, pp. 21-40.
37. I. Moraga et al., Instructive roles for cytokine-receptor binding parameters in determining signaling and functional potency. Science Signaling 8, (2015).
38. S. Wilmes et al., Receptor dimerization dynamics as a regulatory valve for plasticity of type I interferon signaling. J Cell Biol 209, 579-593 (2015).
39. S. Wilmes et al., Mechanism of homodimeric cytokine receptor activation and dysregulation by oncogenic mutations. Science 367, 643-652 (2020).
40. I. Moraga et al., Tuning Cytokine Receptor Signaling by Re-orienting Dimer Geometry with Surrogate Ligands. Cell 160, 1196-1208 (2015).
41. S. Pflanz et al., WSX-1 and glycoprotein 130 constitute a signal-transducing receptor for IL-27. J Immunol 172, 2225-2231 (2004).
42. M. Wiederkehr-Adam et al., Characterization of phosphopeptide motifs specific for the Src homology 2 domains of signal transducer and activator of transcription 1 (STAT1) and STAT3. J Biol Chem 278, 16117-16128 (2003).
43. A. Pradhan, Q. T. Lambert, L. N. Griner, G. W.
Reuther, Activation of JAK2-V617F by components of heterodimeric cytokine receptors. J Biol Chem 285, 16651-16663 (2010).
44. H. Kim, T. S. Hawley, R. G. Hawley, H. Baumann, Protein tyrosine phosphatase 2 (SHP-2) moderates signaling by gp130 but is not required for the induction of acute-phase plasma protein genes in hepatic cells. Mol Cell Biol 18, 1525-1533 (1998).
45. D. W. Huang, B. T. Sherman, R. A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57 (2009).
46. J. Bancerek et al., CDK8 kinase phosphorylates transcription factor STAT1 to selectively regulate the interferon response. Immunity 38, 250-262 (2013).
47. S. Rutz et al., Deubiquitinase DUBA is a post-translational brake on interleukin-17 production in T cells. Nature 518, 417-421 (2015).
48. K. L. O'Hagan, S. D. Miller, H. Phee, Pak2 is essential for the function of Foxp3+ regulatory T cells through maintaining a suppressive Treg phenotype. Sci Rep-Uk 7, (2017).
49. D. Z. Ye, J. Field, PAK signaling in cancer. Cell Logist 2, 105-116 (2012).
50. Y. Liao, J. Wang, E. J. Jaehnig, Z. Shi, B. Zhang, WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res 47, W199-W205 (2019).
51. J. Satoh, H. Tabunoki, A Comprehensive Profile of ChIP-Seq-Based STAT1 Target Genes Suggests the Complexity of STAT1-Mediated Gene Regulatory Mechanisms. Gene Regul Syst Bio 7, 41-56 (2013).
52. I. Rusinova et al., Interferome v2.0: an updated database of annotated interferon-regulated genes. Nucleic Acids Res 41, D1040-1046 (2013).
53. H. N. Suh et al., Role of interleukin-6 in the control of DNA synthesis of hepatocytes: involvement of PKC, p44/42 MAPKs, and PPARdelta. Cell Physiol Biochem 22, 673-684 (2008).
54. A. V. Villarino et al., IL-27 limits IL-2 production during Th1 differentiation. J Immunol 176, 237-247 (2006).
55. K.
Hirahara et al., Interleukin-27 Priming of T Cells Controls IL-17 Production In trans via Induction of the Ligand PD-L1. Immunity 36, 1017-1030 (2012).
56. X. Hu et al., Sensitization of IFN-gamma Jak-STAT signaling during macrophage activation. Nat Immunol 3, 859-866 (2002).
57. V. Francois-Newton, M. Livingstone, B. Payelle-Brogard, G. Uze, S. Pellegrini, USP18 establishes the transcriptional and anti-proliferative interferon alpha/beta differential. Biochem J 446, 509-516 (2012).
58. K. Zenke, M. Muroi, K. I. Tanamoto, IRF1 supports DNA binding of STAT1 by promoting its phosphorylation. Immunol Cell Biol 96, 1095-1103 (2018).
59. K. Karwacz et al., Critical role of IRF1 and BATF in forming chromatin landscape during type 1 regulatory cell differentiation. Nat Immunol 18, 412-421 (2017).
60. A. Yoshimura, Y. Wakabayashi, T. Mori, Cellular and molecular basis for the regulation of inflammation by TGF-beta. J Biochem 147, 781-792 (2010).
61. A. Awasthi et al., A dominant function for interleukin 27 in generating interleukin 10-producing anti-inflammatory T cells. Nat Immunol 8, 1380-1389 (2007).
62. J. B. Brown et al., P-selectin glycoprotein ligand-1 is needed for sequential recruitment of T-helper 1 (Th1) and local generation of Th17 T cells in dextran sodium sulfate (DSS) colitis. Inflamm Bowel Dis 18, 323-332 (2012).
63. M. Matsumoto et al., CD43 collaborates with P-selectin glycoprotein ligand-1 to mediate E-selectin-dependent T cell migration into inflamed skin. J Immunol 178, 2499-2506 (2007).
64. D. N.
Slenter et al., WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res 46, D661-D667 (2018).
65. A. Petretto et al., Proteomic analysis uncovers common effects of IFN-gamma and IL-27 on the HLA class I antigen presentation machinery in human cancer cells. Oncotarget 7, 72518-72536 (2016).
66. L. H. Wong, I. Hatzinisiriou, R. J. Devenish, S. J. Ralph, IFN-gamma priming up-regulates IFN-stimulated gene factor 3 (ISGF3) components, augmenting responsiveness of IFN-resistant melanoma cells to type I IFNs. J Immunol 160, 5475-5484 (1998).
67. M. Tokuyama et al., ERVmap analysis reveals genome-wide transcription of human endogenous retroviruses. Proc Natl Acad Sci U S A 115, 12565-12572 (2018).
68. C. Garbers et al., Plasticity and cross-talk of interleukin 6-type cytokines. Cytokine Growth Factor Rev 23, 85-97 (2012).
69. S. Kang, M. Narazaki, H. Metwally, T. Kishimoto, Historical overview of the interleukin-6 family cytokine. J Exp Med 217, (2020).
70. R. Umeshita-Suyama et al., Characterization of IL-4 and IL-13 signals dependent on the human IL-13 receptor alpha chain 1: redundancy of requirement of tyrosine residue for STAT3 activation. Int Immunol 12, 1499-1509 (2000).
71. O. W. Nadeau et al., The proximal tyrosines of the cytoplasmic domain of the beta chain of the type I interferon receptor are essential for signal transducer and activator of transcription (Stat) 2 activation. Evidence that two Stat2 sites are required to reach a threshold of interferon alpha-induced Stat2 tyrosine phosphorylation that allows normal formation of interferon-stimulated gene factor 3. J Biol Chem 274, 4045-4052 (1999).
72. M. N. Sharif et al., IFN-alpha priming results in a gain of proinflammatory function by IL-10: implications for systemic lupus erythematosus pathogenesis. J Immunol 172, 6476-6481 (2004).
73. D.
Richter et al., Ligand-induced type II interleukin-4 receptor dimers are sustained by rapid re-association within plasma membrane microcompartments. Nat Commun 8, 15976 (2017).
74. J. P. Twohig et al., Activation of naive CD4(+) T cells re-tunes STAT1 signaling to deliver unique cytokine responses in memory CD4(+) T cells. Nat Immunol 20, 458-470 (2019).
75. P. C. Heinrich et al., Principles of interleukin (IL)-6-type cytokine signalling and its regulation. Biochem J 374, 1-20 (2003).
76. D. Levin, D. Harari, G. Schreiber, Stochastic receptor expression determines cell fate upon interferon treatment. Mol Cell Biol 31, 3252-3266 (2011).
77. I. Moraga, D. Harari, G. Schreiber, G. Uze, S. Pellegrini, Receptor density is key to the alpha2/beta interferon differential activities. Mol Cell Biol 29, 4778-4787 (2009).
78. C. C. M. Ho et al., Decoupling the Functional Pleiotropy of Stem Cell Factor by Tuning c-Kit Signaling. Cell 168, 1041-1052 e1018 (2017).
79. P. Charlot-Rabiega, E. Bardel, C. Dietrich, R. Kastelein, O. Devergne, Signaling events involved in interleukin 27 (IL-27)-induced proliferation of human naive CD4+ T cells and B cells. J Biol Chem 286, 27350-27362 (2011).
80. J. Diegelmann, T. Olszak, B. Goke, R. S. Blumberg, S. Brand, A Novel Role for Interleukin-27 (IL-27) as Mediator of Intestinal Epithelial Barrier Protection Mediated via Differential Signal Transducer and Activator of Transcription (STAT) Protein Signaling and Induction of Antibacterial and Anti-inflammatory Proteins. Journal of Biological Chemistry 287, 286-298 (2012).
81.
H. Bender et al., Interleukin-27 displays interferon-gamma-like functions in human hepatoma cells and hepatocytes. Hepatology 50, 585-591 (2009).
82. T. Imamichi, J. Yang, D. W. Huang, B. Sherman, R. A. Lempicki, Interleukin-27 induces interferon-inducible genes: analysis of gene expression profiles using Affymetrix microarray and DAVID. Methods Mol Biol 820, 25-53 (2012).
83. J. M. Fakruddin et al., Noninfectious papilloma virus-like particles inhibit HIV-1 replication: implications for immune control of HIV-1 infection by IL-27. Blood 109, 1841-1849 (2007).
84. A. C. Frank et al., Interleukin-27, an anti-HIV-1 cytokine, inhibits replication of hepatitis C virus. J Interferon Cytokine Res 30, 427-431 (2010).
85. S. L. LaPorte et al., Molecular and structural basis of cytokine receptor pleiotropy in the interleukin-4/13 system. Cell 132, 259-272 (2008).
86. J. B. Spangler, I. Moraga, K. M. Jude, C. S. Savvides, K. C. Garcia, A strategy for the selection of monovalent antibodies that span protein dimer interfaces. J Biol Chem 294, 13876-13886 (2019).
87. A. Kirchhofer et al., Modulation of protein properties in living cells using nanobodies. Nat Struct Mol Biol 17, 133-138 (2010).
88. M. C. Hochberg, Updating the American College of Rheumatology revised criteria for the classification of systemic lupus erythematosus. Arthritis Rheum 40, 1725 (1997).
89. J. Cox, M. Mann, MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26, 1367-1372 (2008).
90. J. Cox et al., Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res 10, 1794-1805 (2011).
91. P. O. Krutzik, G. P. Nolan, Fluorescent cell barcoding in flow cytometry allows high-throughput drug screening and signaling profiling. Nat Methods 3, 361-368 (2006).
92. D. W. Huang, B. T. Sherman, R. A.
Lempicki, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37, 1-13 (2009).
93. D. W. Huang, B. T. Sherman, R. A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57 (2009).
94. N. Kozer et al., Exploring higher-order EGFR oligomerisation and phosphorylation: a combined experimental and theoretical approach. Mol Biosyst 9, 1849-1863 (2013).
95. D. N. Itzhak, S. Tyanova, J. Cox, G. H. Borner, Global, quantitative and dynamic mapping of protein subcellular localization. Elife 5, (2016).
96. T. Toni, D. Welch, N. Strelkowa, A. Ipsen, M. P. Stumpf, Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface 6, 187-202 (2009).
97. J. Vogelsang et al., A reducing and oxidizing system minimizes photobleaching and blinking of fluorescent dyes. Angew Chem Int Ed Engl 47, 5465-5469 (2008).
98. A. Kirchhofer et al., Modulation of protein properties in living cells using nanobodies. Nat Struct Mol Biol 17, 133-U162 (2010).
99. A. Serge, N. Bertaux, H. Rigneault, D. Marguet, Dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. Nat Methods 5, 687-694 (2008).
100. C. You et al., Receptor dimer stabilization by hierarchical plasma membrane microcompartments regulates cytokine signaling. Sci Adv 2, e1600452 (2016).
101. F. Roder, A. Lubk, D. Wolf, T. Niermann, Noise estimation for off-axis electron holography. Ultramicroscopy 144, 32-42 (2014).
FIGURE LEGENDS:

Figure 1: Cytokine receptor activation by IL-27 and (Hyp)IL-6.
a) Cartoon model of stepwise assembly of the IL-27 and HypIL-6-induced receptor complex and subsequent activation of STAT1 and STAT3.
b) Dose-dependent phosphorylation of STAT1 and STAT3 as a response to IL-27 and HypIL-6 stimulation in Th-1 cells, normalized to maximal IL-27 stimulation. Data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev.
c) Phosphorylation kinetics of STAT1 and STAT3 followed after stimulation with saturating concentrations of IL-27 (2 nM) and HypIL-6 (20 nM) or unstimulated Th-1 cells, normalized to maximal IL-27 stimulation. Data was obtained from five biological replicates with two technical replicates each, showing mean ± std dev.
d) Top: Phosphorylation kinetics of STAT1 and STAT3 followed after stimulation with HypIL-6 (20 nM) or left unstimulated, comparing wt RPE1 and RPE1 GP130KO reconstituted with high levels of mXFPm-GP130 (=10x [GP130]). Data was normalized to maximal stimulation levels of wt RPE1. Left: cell surface GP130 levels comparing RPE1 GP130KO, wt RPE1 and RPE1 GP130KO stably expressing mXFPm-GP130 measured by flow cytometry. Data was obtained from one biological replicate with two technical replicates each, showing mean ± std dev. Bottom right: cell surface levels of GP130 measured by flow cytometry for indicated cell lines.
e) Cartoon model of cell surface labeling of mXFP-tagged receptors by dye-conjugated anti-GFP nanobodies (NB) and identification of receptor dimers by single molecule dual-colour co-localization.
f) Raw data of dual-colour single-molecule TIRF imaging of mXFPe-IL-27RαNB-RHO11 and GP130NB-DY649 after stimulation with IL-27. Particles from the insets (IL-27Rα: red & GP130: blue) were followed by single molecule tracking (150 frames ~ 4.8 s) and trajectories >10 steps (320 ms) are displayed. Receptor heterodimerization was detected by co-localization/co-tracking analysis.
g) Relative number of co-trajectories observed for heterodimerization of IL-27Rα and GP130 as well as homodimerization of GP130 for unstimulated cells or after indicated cytokine stimulation. Each data point represents the analysis from one cell with a minimum of 23 cells measured for each condition. *P < 0.05, **P ≤ 0.01, ***P ≤ 0.001; n.s., not significant.
h) Stoichiometry of the IL-27-induced receptor complex revealed by bleaching analysis. Left: Intensity traces of mXFPe-IL27RαNB-RHO11 and GP130NB-DY649 were followed until fluorophore bleaching. Middle: Merged imaging raw data for selected timepoints. Right: overlay of the trajectories for IL-27Rα (red) and GP130 (blue).

Figure 2: Mathematical modelling results in RPE1 and Th-1 cells.
a) Simplified cartoon model of IL-27/HypIL-6 signal propagation layers and coverage of the mathematical modelling approach.
b) Model selection results showing the relative probabilities of each hypothesis, for different values of the distance threshold, $\delta^{*}$, in RPE1 cells.
c) Pointwise median and 95% credible intervals of the predictions from the mathematical model, calibrated with the experimental data, using the posterior distributions for the parameters from the ABC-SMC.
d) Kernel density estimates of the posterior distributions for the parameters $p \in \{r_{1,j}^{+}, r_{1,j}^{-}, r_{2,j}^{+}, r_{2,j}^{-}, k_{ia}^{+}, k_{ia}^{-}, k_{ib}^{+}, k_{ib}^{-}, q, d_i, \beta_j, [R_1(0)], [R_2(0)], [S_1(0)], [S_3(0)]\}$ in the mathematical models, where $j \in \{6, 27\}$ and $i \in \{1, 3\}$.

Figure 3: IL-27Rα cytoplasmic domain is required for sustained pSTAT1 kinetics.
a) Representation of the cytoplasmic domain of IL-27Rα with its highlighted tyrosine residues Y543 and Y613.
b) STAT1 and STAT3 phosphorylation kinetics of RPE1 clones stably expressing wt and mutant IL-27Rα after stimulation with IL-27 (10 nM, top panels) or after stimulation with HypIL-6 (20 nM, bottom panels), normalized to maximal levels of wt IL-27Rα stimulated with IL-27 (top) or HypIL-6 (bottom). Data was obtained from three experiments with two technical replicates each, showing mean ± std dev. Bottom right: cell surface levels of the IL-27Rα variants measured by flow cytometry for the indicated cell lines.
c) Cytoplasmic domain of IL-27Rα is required for sustained pSTAT1 activation. Left: Cartoon representation of receptor complexes. Right: STAT1 and STAT3 phosphorylation kinetics of RPE1 clones stably expressing wt IL-27Rα and IL-27Rα-GP130 chimera after stimulation with IL-27 (10 nM, top panels) or after stimulation with HypIL-6 (20 nM, bottom panels). Data was normalized to maximal levels for each cytokine and cell line. Data was obtained from two experiments with two technical replicates each, showing mean ± std dev.
d) Phosphatases do not account for differential pSTAT1/3 activity induced by IL-27 and HypIL-6. Left: Schematic representation of workflow using JAK inhibitor Tofacitinib.
Right: MFI ratio of Tofacitinib-treated and non-treated RPE1 mXFPe-IL-27Rα cells for pSTAT1 and pSTAT3 after stimulation with IL-27 (10 nM) and HypIL-6 (20 nM). Data was obtained from two experiments with two technical replicates each, showing mean ± std dev.

Figure 4: Unique and overlapping effects of IL-27 and HypIL-6 on the phosphoproteome of Th-1 cells. a) Volcano plot of the phospho-sites regulated (p value ≤ 0.05, fold change ≥ +1.5 or ≤ −1.5) by IL-27 (left) and HypIL-6 (right). Data was obtained from three biological replicates. b) Heatmap representation (examples) of shared and differentially up- (left) and downregulated (right) phospho-sites after IL-27 and HypIL-6 stimulation. Data represents the mean (log2) fold change of three biological replicates. c) Tyrosine and Serine phosphorylation of selected STAT proteins after stimulation with IL-27 (red) and HypIL-6 (blue). *P < 0.05, **P ≤ 0.01, ***P ≤ 0.001; n.s., not significant. d) pS727-STAT1 and pS727-STAT3 phosphorylation kinetics in Th-1 cells after stimulation with IL-27 or HypIL-6, normalized to maximal IL-27 stimulation. Data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev. e) GO analysis “biological processes” of the phospho-sites regulated by IL-27 (red) and HypIL-6 (blue) represented as bubble-plots. f) Phosphorylation of target proteins associated with STAT3/CDK transcription initiation complex after stimulation with IL-27 (blue) and HypIL-6 (red) and schematic representation of transcription regulation of RNA polymerase II with identified phospho-sites (red flags).

Figure 5: Kinetic decoupling of gene induction programs depends on sustained STAT1 activation by IL-27. a) Principal component analysis for genes found to be significantly upregulated (left) or downregulated (right) for at least one of the tested conditions (time & cytokine). Data was obtained from three biological replicates. b) Kinetics of gene induction shared between IL-27 and HypIL-6 (relative to IL-27) for upregulated genes (red) or downregulated genes (green). c) Kinetics of gene numbers induced after IL-27 and HypIL-6 stimulation for upregulated genes (left) and downregulated genes (right). d) GSEA reactome analysis of selected pathways with significantly altered gene induction in response to IL-27 or HypIL-6 stimulation. Data represents the mean (log2) fold change of three biological replicates. e) Cluster analysis comparing the gene induction kinetics after IL-27 or HypIL-6 stimulation. Gene induction heatmaps for example genes as well as induction kinetics (mean) are shown for highlighted gene clusters. Data represents the mean (log2) fold change of three biological replicates.

Figure 6: IL-27-induced upregulation of IRF1 amplifies induction of STAT1-dependent genes. a) Kinetics of IRF1 protein expression as a response to continuous IL-27 and HypIL-6 stimulation in Th-1 cells. Data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev. Dotted line indicates baseline level. b) Kinetics of IRF1 protein expression and siRNA-mediated IRF1 knockdown in RPE1 IL-27Rα cells stimulated with IL-27 (2 nM). Data was obtained from one representative experiment with two technical replicates, normalized to maximal IRF1 induction (6 h), showing mean ± std dev.
c) Kinetics of STAT1 (left) and STAT3 (right) phosphorylation after siRNA-mediated IRF1 knockdown in RPE1 IL-27Rα cells stimulated with IL-27 (2 nM). Data was obtained from one representative experiment with two technical replicates, showing mean ± std dev. d) Kinetics of gene induction (STAT1, GBP5, OAS1, SOCS3) followed by RT-qPCR in RPE1 IL-27Rα cells stimulated with IL-27 (2 nM) with and without siRNA-mediated knockdown of IRF1. Data was obtained from three experiments with two technical replicates each, showing mean ± SEM.

Figure 7: IL-27-induced STAT1 response drives global proteomic changes in Th-1 cells. a) Workflow for quantitative SILAC proteomic analysis of Th-1 cells continuously stimulated (24 h) with IL-27 (10 nM), HypIL-6 (20 nM) or left untreated. b) Global proteomic changes in Th-1 cells induced by IL-27 (left) or HypIL-6 (right) represented as volcano plots. Proteins significantly up- or downregulated are highlighted in red (p value ≤ 0.05, fold change ≥ +1.5 or ≤ −1.5). ISG-encoded proteins significantly altered by IL-27 are highlighted in yellow. Data was obtained from three biological replicates. c) Venn diagrams comparing unique upregulated (left) and downregulated (right) proteins by IL-27 (blue) and HypIL-6 (red) as well as shared altered proteins. ISG-encoded proteins are highlighted in yellow. d) Heatmaps of the top 30 proteins up- and downregulated by IL-27 compared to HypIL-6. Data represents the mean (log2) fold change of three biological replicates.
e) Heatmap representation and enrichment plot of proteins identified by GSEA reactome pathway enrichment analysis “Cytokine signaling and immune system” induced by IL-27. Data represents the mean (log2) fold change of three biological replicates. f) Correlation of IL-27 and HypIL-6-induced RNA-seq transcript levels (≥ +2 or ≤ −2 fc) with quantitative proteomic data (≥ +1.5 or ≤ −1.5 fc). Data represents the mean (log2) fold change of three biological replicates.

Figure 8: Receptor and STAT concentrations determine the nature of the cytokine response. a) Copy numbers of indicated proteins determined for different T-cell subsets using mass-spectrometry based proteomics (ImmPRes - http://immpres.co.uk). b) Model predictions for varying levels of STAT1 and STAT3 (left panel) or IL-27Rα and GP130 (right panel) for phosphorylation kinetics of STAT1 and STAT3. c) Gene expression profiles determined by RNAseq analysis comparing indicated genes of a cohort of SLE risk patients with a cohort of healthy controls. Data obtained from: Proc Natl Acad Sci U S A 115, 12565-12572. *P < 0.05, **P ≤ 0.01, ***P ≤ 0.001; n.s., not significant. d) Dose-dependent phosphorylation of STAT1 and STAT3 as a response to IL-27 and HypIL-6 stimulation in naive and IFNα2-primed (2 nM, 24 h) Th-1 cells, normalized to maximal IL-27 stimulation (ctrl). Data was obtained from four biological replicates with two technical replicates each, showing mean ± std dev. e) Phosphorylation of STAT1 (left) and STAT3 (right) as a response to IL-27 (2 nM, 15 min) and HypIL-6 (10 nM, 15 min) stimulation in healthy control (ctrl) and SLE patient CD4+ T-cells. Data was obtained from five healthy control donors (5) and six SLE patients. *P < 0.05, **P ≤ 0.01, ***P ≤ 0.001; n.s., not significant. f) Tofacitinib titration to inhibit STAT1 and STAT3 phosphorylation by HypIL-6 (10 nM, 15 min) in Th-1 cells (left) and RPE1 cells stably expressing wt IL-27Rα (right).

Supp. Figure 1: a) Comparison of dose-dependent phosphorylation (STAT1/3) of purchased IL-27 and mIL-27sc in activated CD4+ cells, normalized to maximal MFI levels. Data was obtained from one (purchased) or two (mIL-27sc) biological replicates with two technical replicates each, showing mean ± std dev. b) Schematic workflow of T-cell isolation, Th-1 differentiation, fluorescence barcoding and gating strategy for high throughput flow cytometry. c) Phosphorylation kinetics of STAT1 and STAT3 in Th-1 cells after stimulation with IL-27 (10 nM) or HypIL-6 (20 nM), or left unstimulated. Data (from Fig. 1c) was normalized to maximal MFI levels for each cytokine. Data was obtained from five biological replicates with two technical replicates each, showing mean ± std dev. d) Phosphorylation kinetics of STAT1 and STAT3 in activated PBMCs (CD4+, CD8+) after stimulation with IL-27 (2 nM) or HypIL-6 (20 nM), or left unstimulated. Data was normalized to maximal IL-27 stimulation. Data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev. e) Dose-response experiments in wt RPE1 cells for pSTAT1 (left) and pSTAT3 (right), stimulated with IL-27 or HypIL-6, normalized to maximal HypIL-6 stimulation. Data was obtained from one representative experiment with two technical replicates, showing mean ± std dev.

Supp. Figure 2: a) Dose-response experiments for pSTAT1 and pSTAT3 comparing RPE1 GP130 KO cells (left), wt RPE1 (middle) and RPE1 mXFPe-IL-27Rα (right) after stimulation with IL-27 or HypIL-6, normalized to maximal HypIL-6 stimulation. Data was obtained from one representative experiment with two technical replicates, showing mean ± std dev. b) Ligand-induced receptor dimerization: Top panel: Dual-colour co-tracking of IL-27Rα and GP130 in the absence (top) and presence (bottom) of IL-27 (20 nM). Trajectories (150 frames, ~4.8 s) of individual mXFPe-IL-27RαNB-RHO11 (red) and GP130NB-DY649 (blue) and co-trajectories (magenta) are shown for a representative cell. Bottom panel: Dual-colour co-tracking of GP130 in the absence (top) and presence (bottom) of HypIL-6 (20 nM). Trajectories (150 frames, ~4.8 s) of individual mXFPe-IL-27RαNB-RHO11 (red) and GP130NB-DY649 (blue) and co-trajectories (magenta) are shown for a representative cell. c) Top: Cartoon model of cell surface labeling of mXFP-tagged GP130 by dye-conjugated anti-GFP nanobodies (NB) and formation of single-colour homodimers (left) or dual-colour homodimers (right). Below: Examples for intensity traces of single-colour dual-step bleaching (left) or dual-colour single-step bleaching (right). Insets show raw data for selected timepoints and corresponding trajectories. d) Top: comparison of diffusion coefficients (D) for mXFPe-IL-27RαNB-RHO11 (red) and mXFPm-GP130NB-DY649 (blue) in presence and absence of IL-27 stimulation (20 nM), as well as co-trajectories after IL-27 stimulation (magenta). Bottom: comparison of diffusion coefficients for mXFPm-GP130NB-RHO11 (red) in presence and absence of HypIL-6 stimulation (20 nM), as well as co-trajectories after HypIL-6 stimulation (magenta). Each data point represents the analysis from one cell with a minimum of 23 cells measured for each condition. *P < 0.05, **P ≤ 0.01, ***P ≤ 0.001; n.s., not significant.

Supp. Figure 3: a) Reactions involving ligand binding and dimerization in the HypIL-6 model. b) Reactions involving ligand binding and dimerization in the IL-27 model. c) Reactions involving the STAT molecules (S_j for j ∈ {1,3}) in the HypIL-6 model. d) Reactions involving the STAT molecules (S_j for j ∈ {1,3}) in the IL-27 model. e) Reactions involving receptor internalisation/degradation in the HypIL-6 model. Here H1 = β_6 and H2 = γ_6([pS1] + [pS1]). f) Reactions involving receptor internalisation/degradation in the IL-27 model. Here H1 = β_27 and H2 = γ_27([pS1] + [pS1]). g) Dephosphorylation of S_j (for j ∈ {1,3}) in the cytoplasm. This reaction occurs in both models. h) Key for the molecules in the reactions.

Supp. Figure 4: a) STAT1 (left) and STAT3 (right) phosphorylation kinetics of RPE1 clones stably expressing wt IL-27Rα after stimulation with IL-27 or after stimulation with HypIL-6, normalized to maximal IL-27 stimulation. Data was obtained from three experiments with two technical replicates each, showing mean ± std dev. b) Dose-response experiments for pSTAT1 (left) and pSTAT3 (right) in RPE1 cells stably expressing wt IL-27Rα or tyrosine-mutants after stimulation with IL-27, normalized to maximal stimulation of wt IL-27Rα. Data was obtained from one representative experiment with two technical replicates, showing mean ± std dev.

Supp. Figure 5: a) Dose-response experiments for pSTAT1 (left) and pSTAT3 (right) in RPE1 cells stably expressing wt IL-27Rα or IL-27Rα-GP130 chimera after stimulation with IL-27.
Data was normalized to maximal stimulation of wt IL-27Rα. Data was obtained from one representative experiment with two technical replicates, showing mean ± std dev. b) STAT1 (left) and STAT3 (right) phosphorylation kinetics in RPE1 IL-27Rα cells stimulated with IL-27 or HypIL-6 with and without JAK inhibition by Tofacitinib. Data was normalized to maximal IL-27 stimulation. Data was obtained from two experiments with two technical replicates each, showing mean ± std dev. c) STAT1 (left) and STAT3 (right) phosphorylation kinetics in Th-1 cells stimulated with IL-27 or HypIL-6 with and without JAK inhibition by Tofacitinib. Data was normalized to maximal IL-27 stimulation. Data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev. d) MFI ratio of Tofacitinib-treated and non-treated Th-1 cells for pSTAT1 (left) and pSTAT3 (right) after stimulation with IL-27 (10 nM) and HypIL-6 (20 nM). Data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev.

Supp. Figure 6: a) STAT1 (left) and STAT3 (right) phosphorylation kinetics in RPE1 IL-27Rα cells stimulated with IL-27 or HypIL-6 with and without pretreatment with cycloheximide (CHX). Data was normalized to maximal IL-27 stimulation. Data was obtained from two experiments with two technical replicates each, showing mean ± std dev. b) STAT1 (left) and STAT3 (right) phosphorylation kinetics in Th-1 cells stimulated with IL-27 or HypIL-6 with and without pretreatment with cycloheximide (CHX). Data was normalized to maximal IL-27 stimulation. Data was obtained from two biological replicates with two technical replicates each, showing mean ± std dev.

Supp. Figure 7: a) Workflow for quantitative SILAC phospho-proteomic analysis of Th-1 cells stimulated (15 min) with IL-27 (10 nM), HypIL-6 (20 nM) or left untreated. b) Schematic representation of the main GO terms regulated by IL-27 as inferred from our phospho-proteomics studies.
Red represents downregulated p-sites and blue represents upregulated p-sites upon IL-27 stimulation of human primary Th-1 cells. c) Schematic representation of the main GO terms regulated by HypIL-6 as inferred from our phospho-proteomics studies. Red represents downregulated p-sites and blue represents upregulated p-sites upon HypIL-6 stimulation of human primary Th-1 cells.

Supp. Figure 8: a) Venn diagrams comparing the numbers of unique upregulated (left) and downregulated (right) phospho-sites by IL-27 (blue) and HypIL-6 (red) as well as the number of shared phospho-sites. b) List of most strongly altered phospho-sites (downregulated: green; upregulated: red) in response to IL-27 (left) or HypIL-6 (right). c) GO analysis “cellular location” and “UP keywords” of the phospho-sites regulated by IL-27 (red) and HypIL-6 (blue) represented as bubble-plots. d) Phosphorylation of target proteins related to Treg functions and schematic representation of their activity on T-cells.

Supp. Figure 9: a) Kinetics of gene induction in Th-1 cells induced by IL-27 represented as volcano plots. Genes significantly up- or downregulated are highlighted in red (p value ≤ 0.05, fold change ≥ +2 or ≤ −2). Data was obtained from three biological replicates. b) Kinetics of gene induction in Th-1 cells induced by HypIL-6 represented as volcano plots. Genes significantly up- or downregulated are highlighted in red (p value ≤ 0.05, fold change ≥ +2 or ≤ −2). Data was obtained from three biological replicates.
c) Kinetics of gene induction in Th-1 cells induced by HypIL-6 represented as volcano plots. Genes identified to be significantly up- or downregulated by IL-27 are highlighted in red (p value ≤ 0.05, fold change ≥ +2 or ≤ −2). Data was obtained from three biological replicates.

Supp. Figure 10: a) Gene induction kinetics represented as pie-charts, separated for upregulated genes (top panel) and downregulated genes (bottom panel). b) Kinetics of ISG induction (examples) as heatmap representation comparing IL-27 with HypIL-6 (top) and GSEA reactome pathway enrichment “IFN signaling” for genes induced by IL-27 after 6 h (bottom). Data represents the mean (log2) fold change of three biological replicates. c) Heatmaps of the top 30 genes up- and downregulated by IL-27 compared to HypIL-6 for 1 h, 6 h and 24 h. Data represents the mean (log2) fold change of three biological replicates. d) Kinetics of IRF1 protein expression as a response to continuous IL-27 and HypIL-6 stimulation in Th-1 cells. Data was obtained from three biological replicates with two technical replicates each, showing mean ± std dev.

Supp. Figure 11: a) Pie charts of proteomic changes (unique & shared) for upregulated (left) and downregulated (right) proteins in response to IL-27 or HypIL-6 stimulation in Th-1 cells. b) Left: GSEA reactome pathway enrichment analysis “Interferon signaling” for proteins induced by IL-27. Middle: heatmap representation of pathway-associated proteins comparing IL-27 with HypIL-6 stimulation. Data represents the mean (log2) fold change of three biological replicates. Right: Localization of the identified proteins in the context of the data distribution of IL-27-induced proteomic changes. Pathway-associated proteins are highlighted for IL-27 (blue) and HypIL-6 (red) as well as non-significant data distribution (grey). c) Left: GSEA reactome pathway enrichment analysis “Cytokine signaling and immune system” for proteins induced by IL-27. Middle: heatmap representation of pathway-associated proteins comparing IL-27 with HypIL-6 stimulation. Data represents the mean (log2) fold change of three biological replicates. Right: Localization of the identified proteins in the context of the data distribution of IL-27-induced proteomic changes. Pathway-associated proteins are highlighted for IL-27 (blue) and HypIL-6 (red) as well as non-significant data distribution (grey). d) Average intensity distribution of untreated proteomic data. Top up- and downregulated proteins (≥ +4x or ≤ −4x change) altered by IL-27 (left) or HypIL-6 (right) stimulation are indicated.

Supp. Figure 12: a) Pointwise median and 95% credible intervals of the WT and chimera mathematical models, using the posterior distributions for the parameters from the ABC-SMC. b) Dose response curve in RPE1 using the posterior distributions from the ABC-SMC and varying the concentrations of HypIL-6 and IL-27 in the model. c) Pointwise median and 95% credible intervals of the WT mathematical model and simulations of a mutant model with 𝑘#' & = 10,> nM⁻¹ s⁻¹ and 𝑘#' , = 10M s⁻¹, using the posterior distributions from the ABC-SMC for the other parameters.

Supp. Figure 13: a) Fold induction of total STAT1 and STAT3 levels in Th-1 cells measured by flow cytometry. Data was obtained from two biological replicates.
b) Total levels of STAT1 and STAT3 measured in CD4+ cells by flow cytometry for healthy control (ctrl) and Lupus patients (SLE). Data was obtained from five (ctrl) and six (SLE) biological replicates. *P < 0.05, **P ≤ 0.01, ***P ≤ 0.001; n.s., not significant. c) Ratio of pSTAT1 and pSTAT3 after IL-27 (15 min, 2 nM) or HypIL-6 (15 min, 10 nM) stimulation measured in CD4+ cells by flow cytometry for healthy control (ctrl) and Lupus patients (SLE). Data was obtained from five (ctrl) and six (SLE) biological replicates normalized to mean ratio of healthy control samples. d) Tofacitinib titration to inhibit STAT1 and STAT3 phosphorylation by IL-27 (2 nM) in Th-1 cells (left) and RPE1 cells stably expressing wt IL-27Rα (right).

Supp. Movie 1: Single-molecule co-tracking as a readout for dimerization of cytokine receptors. Cell surface labelling of mXFPe-IL-27Rα by eNBRHO11 (left, top) and mXFPm-GP130 by mNBDY649 (left, bottom) after stimulation with IL-27 (20 nM). In the overlay of the zoomed section of both spectral channels (mXFPe-IL-27RαRHO11: Red, mXFPm-GP130DY649: Blue), yellow lines indicate co-locomotion of IL-27Rα and GP130 (≥ 10 steps). Acquisition frame rate: 30 Hz, Playback: real time.

Supp. Movie 2: Dynamics of IL-27-induced receptor assembly. Formation of a single-molecule heterodimer of mXFPe-IL-27RαRHO11 (Red) and mXFPm-GP130DY649 (Blue) in presence of IL-27. Yellow lines indicate co-locomotion of IL-27Rα and GP130 (≥ 10 steps). Acquisition frame rate: 30 Hz, Playback: real time with break at time of receptor dimerization.

Supp. Movie 3: Ligand-induced heterodimerization of IL-27Rα and GP130. Overlay of the two spectral channels (mXFPe-IL-27RαRHO11: Red, mXFPm-GP130DY649: Blue) in absence (left) or presence (right) of IL-27 (20 nM). Yellow lines indicate co-locomotion of IL-27Rα and GP130 (≥ 10 steps). Acquisition frame rate: 30 Hz, Playback: real time.

Supp. Movie 4: Ligand-induced homodimerization of GP130. Overlay of the two spectral channels (mXFPm-GP130RHO11: Red, mXFPm-GP130DY649: Blue) in absence (left) or presence (right) of HypIL-6 (20 nM). Yellow lines indicate co-locomotion of IL-27Rα and GP130 (≥ 10 steps). Acquisition frame rate: 30 Hz, Playback: real time.

[Figure panels for Fig. 1 to Fig. 8 and supp. Fig. 1: graphics not recoverable from the extracted text. Surviving axis labels: time / min, time / h, c / log nM, pSTAT1 and pSTAT3 / rel. MFI, fold change / log2, p value / −log10. Fig. 3a shows the IL-27Rα cytoplasmic domain sequence with Y543 and Y613 marked: TSGRCYHLRHKVLPRWVWEKVPDPANSSSGQPHMEQVPEAQPLGDLPILEVEEMEPPPVMESS QPAQATAPLDSGYEKHFLPTPEELGLLGPPRPQVLA*]
IL-27 HypIL-6 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 unstim. IL-27 HypIL-6 Heterodimerization IL-27Rα GP130 Trajectories Rho11 Trajectories DY647 Co-Trajectories Homodimerization GP130 GP130 unstim. +IL-27 unstim. +HypIL-6 5 µm c) 0.0 0.5 1.0 1.5 2.0 0 2000 4000 6000 8000 10000 0.0 0.5 1.0 1.5 2.0 0 5000 10000 15000 20000 500 nm500 nm Fl uo re sc en ce in t. / a .u . time / s Fl uo re sc en ce in t. / a .u . time / s Dual-color dimerSingle-color dimer Single-color dual-step bleaching Dual-color single-step bleaching 2 labels 1 label 𝚫FRET DY649 bleached label 1 bleached label 2 bleached Rho11 bleached HypIL-6 0.0 s 0.9 s 1.6 s 2.1 s 0.0 s 0.9 s 1.9 s 2.1 s 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.00 0.02 0.04 0.06 0.08 0.10 0.12 D / µm 2 s -1 GP130IL-27Rα Dimer +IL-27 +IL-27 +IL-27 D / µm 2 s -1 GP130 Dimer +HypIL-6 d) +HypIL-6 ** n.s. *** *** *** supp. Fig. 2 b) -4 -3 -2 -1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 -4 -3 -2 -1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 𝚫GP130 𝚫IL-27Rα +GP130 𝚫IL-27Rα +GP130 +IL-27Rα -4 -3 -2 -1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 IL-27 pSTAT1 IL-27 pSTAT3 HypIL-6 pSTAT1 HypIL6 pSTAT3 c / log nM pS TA T / r el . M FI c / log nM pS TA T / r el . M FI c / log nM pS TA T / r el . M FI a) a) b) c) d) e) f) g) h) supp. Fig. 3 b) IL-27 / log nM pS TA T1 / re l. M FI IL-27 / log nM pS TA T3 / re l. M FI -4 -3 -2 -1 0 1 0.0 0.2 0.4 0.6 0.8 1.0 1.2 -4 -3 -2 -1 0 1 0.0 0.2 0.4 0.6 0.8 1.0 1.2 - wt Y543F Y613F Y543F-Y613F 𝚫Y613F 𝚫Y613F 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 unstim. IL-27 HypIL-6 pS TA T3 / re l. M FI pS TA T1 / re l. M FI time / min time / min 𝚫 𝚫 𝚫 𝚫 a) 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 unstim. IL-27 HypIL-6 pSTAT1 pSTAT3 pSTAT1 pSTAT3 supp. Fig. 4 TH1 cells (ratio +/- Tofacitinib) 0 15 30 45 60 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 IL-27 HypIL-6 0 15 30 45 60 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 IL-27 HypIL-6 time / min R at io p S TA T1 + /- To f. +Tofacitinib +Tofacitinib R at io p S TA T3 + /- To f. 
time / min d) -4 -3 -2 -1 0 1 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 IL-27Rα(wt) IL-27Rα-GP130 pS TA T / r el . M FI IL-27 / log nM a) -4 -3 -2 -1 0 1 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 IL-27Rα(wt) IL-27Rα-GP130 pS TA T / r el . M FI IL-27 / log nM c) 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 IL-27 HypIL-6 IL-27 + Tof. HypIL-6 + Tof. 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 IL-27 HypIL-6 IL-27 + Tof. HypIL-6 + Tof. time / min pS TA T3 / re l. M FI RPE1 IL-27Rα cells TH1 cells time / min pS TA T3 / re l. M FI b) +Tofac. +Tofac. 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 IL-27 HypIL-6 IL-27 + Tof. HypIL-6 + Tof. 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 IL-27 HypIL-6 IL-27 + Tof. HypIL-6 + Tof. time / min pS TA T1 / re l. M FI time / min pS TA T1 / re l. M FI +Tofac. +Tofac. supp. Fig. 5 supp. Fig. 6 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 IL-27 HypIL-6 IL-27 + CHX HypIL-6 + CHX 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 IL-27 HypIL-6 IL-27 + CHX HypIL-6 + CHX 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 IL-27 HypIL-6 IL-27 + CHX HypIL-6 + CHX 0 30 60 90 120 150 180 0.0 0.2 0.4 0.6 0.8 1.0 1.2 IL-27 HypIL-6 IL-27 + CHX HypIL-6 + CHX b) time / min pS TA T3 / re l. M FI RPE1 IL-27Rα cells TH1 cells time / min pS TA T3 / re l. M FI a) time / min pS TA T1 / re l. M FI time / min pS TA T1 / re l. 
M FI IL-27 GP130 IL-27Rα p-S485 PIAS1 p-Y701 S727 STAT1 p-Y705 S727 STAT3 p-Y693 STAT4 p-Y694 STAT5A p-Y699 STAT5B JAK/STAT Cascade Cell-cell adhesion p-T38 S41 AHNAK p-S540 PPFIBP1 p-S141 PAK2 p-Y701 S727 STAT1 p-S490 LIMA1 p-S16 S521 LRRFIP1 p-S578 S621 MICALL1 p-S385 ADD1 p-S36 S39 ALDOA p-T508 EIF4G2 p-S334 SEPT07 p-S277 SNX2 p-S168 TMPO Actin cytoskeleton p-T38 S41 AHNAK p-S490 LIMA1p-S36 S39 ALDOA p-S334 SEPT07 p-S463 CD2AP p-S573 FYB p-S3 CFL1 Pre-autophagosomal structures p-T658 NBR1p-S755 ATG9A p-S272 S366 SQSTM1 Regulation of RNA Pol II Negative Regulation of RNA Pol II p-S184 ETV6 p-S2 HIST1H1C p-S2 HIST1H1D p-S2 HIST1H1B p-S2 T3 SMARCA4 p-S183 RFX5 p-S255 DNMT3A p-S465 SAP130 p-S485 PIAS1 p-Y701 S727 STAT1 p-Y705 S727 STAT3 p-S272 S366 SQSTM1 p-S2120 S2124 S1259 SPEN p-S183 T185 ZNF280C p-S1425 SPEN AAA mRNA Processing p-S239 ARL6IP4 p-S109 RBM15B p-S1359 PHRF1 p-S388 S766 SCAF11 p-S573 SUGP2 p-T414 ACIN1 p-T601 ADAR p-S627 CCAR2 p-S50 METTL3 p-S653 S797 SRRM1 mRNA Splicing p-S13 NCBP2 p-S109 RBM15B p-S1542 SRRM2 p-S239 ALYREF p-S1425 SPEN p-S1910 S1913 S1920 POLR2A p-S271 HNRNPUp-S50 METTL3 p-S653 S797 SRRM1p-S95 PABPN1 p-S876 SRRM2 p-S2120 S2124 S2159 SPEN mRNA Nuclear export p-S239 ALYREF p-S633 NUP153 p-S653 S797 SRRM1 p-S13 NCBP2 p-S1023 NUP214 p-S221 NUP50 Histone H3-K4 methylation p-S2 HIST1H1D p-S161 KMT2A p-S2 HIST1H1C DNA methylation p-S496 BAZ2A p-S161 KMT2A p-S255 DNMT3A Transcription p-S1591 DENND4Ap-T190 BCLAF1 p-S16 S521 LRRFIP1p-S191 MRGBP p-S218 MYSM1 p-S183 NFKBIB p-S295 PAXBP1 p-S448 POU2F1 p-S109 RBM15B p-S2 T3 SMARCA2 p-S1342 BAZ1B p-S496 BAZ2A p-S627 CCAR2 p-S538 CHAF1B p-S36 CHD6p-S1856 GTF3C1 p-S206 GON4L p-S311 MSL3 p-S166 NACA p-S121 PPHLN1 p-S2 S9 PTMAp-S183 RFX5 p-S221 RPS3 p-S2120 S2124 S2159 SPEN p-S23 TFDP1 p-S56 MGA p-S5 PHF11 p-S857 PHF8 p-S1080 RBL2 p-S43 SAP30BP p-S465 SAP130 p-S34 ITGB1BP1 p-S485 PIAS1 p-Y701 S727 STAT1 p-Y705 S727 STAT3 p-Y693 STAT4 p-Y694 STAT5A p-Y699 STAT5B p-S1425 SPEN p-S183 T185 ZNF280C p-S113 
ZNF34 p-S388 ZNF507 p-S85 ZNF513 p-Y641 STAT6 p-Y701 STAT1 p-Y705 S727 STAT3 p-Y693 STAT4 p-Y694 STAT5A p-Y699 STAT5B JAK/STAT Cascade Cell-cell adhesion p-S336 NDRG1 p-S41 AHNAK p-Y701 STAT1 p-T38 AHNAK p-S127 ANXA2 p-S119 S277 SNX2 p-S578 MICALL1 p-S30 T42 SEPT9 p-S521 LRRFIP1 p-SS299 CLINT1 p-S168 TMPO Golgi apparatus HypIL-6 GP130 Actin filament p-S2398 AKAP13p-Y397 HCK p-S395 S790 S1411 AKAP13 p-S1114 FKBP15 p-S1261 MYO9B p-Y397 HCK p-S1118 LRBA p-Y397 LYN p-S42 PASK p-S553 RAB11FIP5 p-S301 RAF1 p-S5 WDR44 p-S299 CLINT1 p-S121 PPHLN1 p-S535 SLC1A5 p-T175 ARHGEF2 p-S368 ARFGAP2 p-S1874 HTT p-S172 OSBPL11 p-S341 ZDHHC2 Regulation of RNA Pol II p-S1080 RBL2 p-S191 MRGBP p-S16 S521 LRRFIP1 p-S327 RBBP8 p-S2 T3 SMARCA4 p-S103 GTF2I p-S183 RFX5 p-S23 TFDP1 p-S344 NFATC3 p-Y705 S727 STAT3 p-Y694 STAT5A p-Y699 STAT5B Positive Regulation of RNA Pol II p-S233 NELFA p-S75 S79 NUCKS1 p-S301 RAF1 p-S366 SQSTM1 p-S681 TRIM28 p-S575 THRAP3 p-S565 PML p-S11 SAFBp-S344 NFATC3 p-S208 NCOA7 p-S415 RPS6KA3 p-S176 YBX1p-S41 PKNOX1 p-S771 TP53BP1 p-S175 ARHGEF2 AAA mRNA Processing p-S392 TFIP11 p-S627 CCAR2 p-S35 CASC3 p-S388 S766 SCAF11 p-S573 SUGP2 p-S337 RBM39 p-S772 RBBP6 p-S109 RBM15B p-S471 XRN2 p-S653 SRRM1 mRNA Splicing p-S392 TFIP11 p-S187 HNRNPF p-S35 CASC3 p-S2124 S2159 SPEN p-S43 CDC40 p-S21 RNPC3 p-S5 SRSF3p-S2 SRSF2 p-S653 SRRM1p-S95 PABPN1 p-S82 HNRNPD p-S176 YBX1 mRNA Nuclear export p-S633 NUP153 p-S2 POM121p-S653 SRRM1 p-S43 CDC40 p-S2 SRSF2 p-S35 CASC3 Transcription p-S1591 DENND4A p-S135 GATAD2Bp-T190 BCLAF1 p-S565 PML p-S109 RBM15B p-S337 RBM39 p-S1342 BAZ1B p-S627 CCAR2 p-S1856 GTF3C1 p-S82 HNRNPD p-S2234 NCOR2 p-S121 PPHLN1 p-S771 TP53BP1 p-S2124 S2159 SPEN p-S183 T185 ZNF280C p-S388 ZNF507 p-S113 ZNF34p-S521 LRRFIP1 p-S56 MGA p-S5 PHF11 p-S372 MIER1 p-Y641 STAT6 p-S795 ZNF217 p-S261 CDCA7L p-S34 ITGB1BP1 p-S208 NCOA7 p-Y701 STAT1 p-Y705 S727 STAT3 p-Y693 STAT4 p-Y694 STAT5A p-Y699 STAT5B p-S233 ACTL6A p-S183 NFKBIB Rho signaling p-S301 RAF1 p-S395 S790 S1411 
AKAP13 p-S24 ARHGDIA p-S1261 MYO9B p-T175 ARHGEF2 p-S2398 AKAP13 p-S327 RBBP8 p-Y641 STAT6 p-S103 GTF2I p-S521 LRRFIP1 p-S75 S79 NUCKS1 p-S382 ARID1A p-S344 NFATC3 p-S233 ACTL6A p-Y699 STAT5B p-Y705 S727 STAT3 p-Y694 STAT5A p-S11 SAFB p-Y705 S727 STAT3 p-Y641 STAT6 p-Y693 STAT4 p-Y694 STAT5A p-Y699 STAT5B p-Y701 STAT1 p-S575 THRAP3 p-S2 SRSF2 p-S5 SRSF3 p-S1838 TPR Nuclear Pore Assembly p-S1838 TPR p-S509 AHCTF1 p-S633 NUP153 p-S382 ARID1A p-S11 SAFB Differentiate to Th-1 In SILAC media Light (R0K0) Medium (R6K6) High (R10K8) Stimulation: 15min Isolate PBMCs From buffy coat & CD4+ isolation Mix 1:1 cell numbers Fractionation LC-MS/MS MaxQuant peptide quantification Lyse Reduce Alkylate Digest unstim. IL-27 HypIL-6 Phosphopeptide Enrichment (TiO2) a) b) c) supp. Fig. 7 0 2 4 6 8 10 0 2 4 Nucleus Membrane Cytoplasm Pre-autophagosomal struct. Actin cytoskeleton Actin filament Golgi apparatus IL-27 HypIL-6 0 5 10 15 20 25 0 2 4 Nucleus Methylation Cytoplasm Transcription mRNA processing Chromatin regulator mRNA transport Actin cytoskeleton Actin filament Golgi apparatus Golgi apparatus IL-27 HypIL-6 Cellular location UP keywords peptide Fold change / log2 peptide Fold change / log2 CHD12 S144 -6.33 LGALSL S4 9.05 MAP1B S2271 -3.66 RNASE9 S53 T54 5.73 ZNF280C S183 T185 -3.16 AHNAK S41 T38 4.00 ADGRF2 T601 Y588 -3.11 BAD S25 3.99 ZC2HC1A S223 -2.39 CLK3 S157 3.74 BOLA1 S81 -2.30 STAT4 Y693 3.67 GTF2I S103 -2.25 DCP1B S283 3.47 TACC1 S689 Y695 -2.17 STAT3 Y705 2.81 SCAF11 S776 -2.08 STAT1 Y701 2.63 ABCC1 S915 -1.97 STAT5A/B Y694/Y699 2.18 WRNIP1 S151 -1.95 PTPN11 Y546 1.93 SEC23IP S737 -1.92 BAD S134 1.84 RBM15B S109 -1.81 ARL6IP4 S239 1.78 MECP2 S25 -1.65 UBR5 S1549 1.77 PSMD11 S14 -1.63 PIEZO1 S1646 1.70 OSPBL8 S68 -1.40 PPM1G T122 1.69 peptide Fold change / log2 peptide Fold change / log2 TACC1 S689 Y695 -4.88 LGALSL S4 6.49 CDH12 S144 -4.16 STAT4 Y693 5.74 MAP1B S2271 -4.01 MYO9B S1261 4.34 ZNF280C S183 T185 -3.42 ANKRD36C T828 4.30 ADGFR2 T601 Y588 -3.37 CDCA7L S261 
3.54 ZC2HC1A S223 -2.46 STAT3 Y705 3.40 BOLA1 S81 -2.44 NELFA S233 2.92 WRNIP1 S151 -2.40 PPM1G T122 2.90 FAM47E T158 Y161 -2.17 BAD S25 2.84 SCAF11 S776 -2.15 NDRG1 S336 2.79 ABCC1 S915 -2.07 STAT1 Y701 2.69 NUDT19 S4 -1.97 SUGP2 S573 2.18 GTF2I S103 -1.85 PRR12 S44 1.98 ZC3H3 S408 -1.69 STAT3 S727 1.97 SEC23IP S737 -1.64 PTPN11 Y546 1.73 PSMD11 S14 -1.60 RCHY1 S257 1.72 b) c) d) IL-27 HypIL-6 UBR 5 S 154 9 BAD S1 34 PAK 2 S 141 0 1 2 3 4 5 6 * IL-27 HypIL-6 88 67 73 62 25 53 Downregulated phospho-sites Upregulated phospho-sites IL-27 HypIL-6 TH17 Treg p-UBR5 p-PAK2 p-BAD a) Fo ld c ha ng e supp. Fig. 8 a) b) c) -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 unchanged regulated 7327 23219 112631h 6h 24h -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 unchanged regulated -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 unchanged regulated -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 unchanged regulated IL-27 6036 111304 1265321h 6h 24h -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 unchanged regulated -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 unchanged regulated 1h 6h 24h -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 10 11 12 fold induction / log2 p v al u e / - lg 10 HypIL-6 HypIL-6 (IL-27 regulated genes highlighted) supp. Fig. 
9 IL-27 top 30 up & downregulated genes FOSB RGS1 IFIT3 FOS IFIT2 C5orf58 SOCS1 SOCS3 CD69 NFKBIZ PTCHD3P2 PRR25 RGS16 CMPK2 C10orf10 PMAIP1 DUSP5 CCL3 IFNG EGR1 SGK1 IFIT1 CFL2 GRM2 KLF6 NFKBIA DNAJB13 KLF5 JUN ZNF888 BCDIN3D PLEKHF1 ZKSCAN4 SENP8 TNFSF14 ALG1L2 HIST1H4J B3GALT2 PARS2 AJUBA KBTBD7 EFNA3 ID3 DUSP2 TRGV5P IGIP ADRB2 ZNF396 ZSWIM3 SOWAHD hsa-mir-146a GUSBP9 CEBPE CDK5R1 ARL4D NUAK2 NOG SERTAD3 ZFP36L2 DDIT4 -1 0 1 2 3 4 5 IFIT3 CTSL1 IFI44L RGS1 RSAD2 GBP1P1 SLC6A9 SLAMF8 LAMP3 ETV7 CHAC1 GBP1 FAM157B GTF2IRD1 GBP5 LRRC2 GBP4 SEMA3G PTCHD3P2 CETP SOCS1 SLC7A11 STAT1 CMPK2 WARS HAPLN3 SMTNL1 BCL2L14 IFIT2 EPSTI1 GAS2L1 RASSF4 IGFBP4 HBEGF ADORA1 CGN FGF11 TNFRSF10D P4HA2 DDIT4 NEK11 TMEM213 NPTX1 MT1DP DUSP6 P4HA1 IL10 MATN2 PDE7B HSPG2 CD248 AK4 DTX4 PPFIA4 CFD DHDH EGR1 FOS PFKFB4 MIR210HG -5 -4 -3 -2 -1 0 1 2 3 IFI44L C1orf61 GBP1P1 IFI27 SPAG6 IFIT3 IFIT1 RSAD2 SLAMF8 FCRL6 GBP1 RGS1 GBP5 ETV7 LAMP3 USP18 STAT1 CMPK2 NFIX RUFY4 CETP GBP4 IFIT2 WARS ALG13-AS1 IFI44 LRRN2 FRMD3 TNFSF13B BCL2L14 MAP7 CDC42EP4 ITGAX HSPG2 AICDA HIST1H2BO APBA1 VLDLR C2orf48 RIMKLA SDK2 ATOH8 KISS1R HIST1H2BL DTX4 EMP1 WNT1 CCDC74B AK4 OSCP1 PFKFB4 STC2 S100A9 SPON1 EGR1 FOS VEGFA ADORA1 MIR210HG PPFIA4 -6 -5 -4 -3 -2 -1 0 1 2 3 IL -2 7 Hy pI L- 6 IL -2 7 Hy pI L- 6 IL -2 7 Hy pI L- 6 Total=80 IL-27 HypIL-6 shared Total=119 IL-27 HypIL-6 shared Total=132 IL-27 HypIL-6 shared Total=49 IL-27 HypIL-6 shared Total=387 IL-27 HypIL-6 shared Total=590 IL-27 HypIL-6 shared Upregulated genes Downregulated genes Time 1h 6h 24h IL-27 HypIL-6 Interferon Stimulated Genes (ISGs) 1h 6h 24h 1h 6h 24h GBP1 GBP4 GBP5 IFIT1 IFIT2 IFIT3 IFNG IRF1 IRF8 IRF9 MX1 OAS1 PARP9 RGS1 SOCS1 SOCS3 STAT1 STAT2 USP18 -1 0 1 2 3 a) b) c) 1h 6h 24h GSEA pathway enrichment: IFN Signalling Rank in ordered dataset 0 100 200 300 400 En ric hm en t Sc or e 0. 0 0. 4 lis t m et ric 0 -4 4 Upregulated genes Downregulated genes fc / lo g 2 fc / lo g 2 fc / lo g 2 fc / lo g 2 supp. Fig. 
10 GSEA pathway reactome: Interferon signalling 0 1000 2000 3000 -5 0 5 10 protein ID fo ld c h an g e / l o g 2 data distribution IL-27 HypIL-6 E nr ic hm en t s co re R an ke d lis t m et ri c IL-27 HypIL-6 GBP5 UBE2L6 GBP4 STAT2 STAT1 MX1 ISG20 GBP1 IFITM1 HLA-C BST2 IFI35 TRIM22 B2M OAS2 0 0.5 1.0 1.5 fc/ log2 a) b) c) E nr ic hm en t s co re R an ke d lis t m et ri c Rank in ordered dataset GSEA pathway reactome: Cytokine signalling and immune system IL-27 HypIL-6 TGFB1 GBP5 RALA UBE2L6 GBP4 STAT2 STAT1 MX1 ISG20 GBP1 MAPK14 IFITM1 HLA-C 0 1 2 0 1000 2000 3000 -5 0 5 10 protein ID fo ld c h an g e / l o g 2 data distribution IL-27 HypIL-6 Upregulated proteins Downregulated proteins Total=92 61.96% IL-27 26.09% HypIL-6 11.96% shared Total=75 30.67% IL-27 24.00% HypIL-6 45.33% shared fc/ log2 supp. Fig. 11 Rank in ordered dataset a) b) c) supp. Fig. 12 time / min pS TA T1 / re l. M FI time / min pS TA T1 / re l. M FI time / min pS TA T3 / re l. M FI time / min pS TA T1 3/ r el . M FI c / log nM pS TA T3 / re l. M FI time / min pS TA T1 / re l. M FI time / min pS TA T1 / re l. M FI time / min pS TA T3 / re l. M FI time / min pS TA T1 3/ r el . M FI pS TA T (n or m al iz ed ) c / log μM pS TA T (n or m al iz ed ) c / log μM -3 -2 -1 0 1 0.0 0.2 0.4 0.6 0.8 1.0 1.2 pSTAT1 pSTAT3 -3 -2 -1 0 1 0.0 0.2 0.4 0.6 0.8 1.0 1.2 pSTAT1 pSTAT3 Th-1 RPE1 Tofacitinib titration – IL-27 signaling supp. Fig. 13 a) d) 0 8 16 24 1.0 1.1 1.2 1.3 1.4 1.5 STAT1 STAT3 fo ld in du ct io n time / h 0 500 1000 1500 2000 2500 ctrl SLE 0 100 200 300 ctrl SLE S TA T1 / M FI S TA T3 / M FI total STAT1 total STAT3 b) p: 0.067 p: 0.009 0.8 1.0 1.2 1.4 1.6 1.8 2.0 IL-27 ctrl IL-27 SLE HypIL-6 ctrl HypIL-6 SLE ra tio p S TA T1 /p S TA T3 p: 0.023 p: 0.009 c)